					The Magical Art of Extracting
    Meaning From Data
                  Data Mining For The Web

        Luis Rei

• Introduction
• Recommender Systems
• Classification
• Clustering
“The greatest problem of today is how to teach people to ignore the
irrelevant, how to refuse to know things, before they are suffocated. For
too many facts are as bad as none at all.”
(W.H. Auden)

      “The key in business is to know something that nobody else knows.”
                                                    (Aristotle Onassis)
                      DATA → MEANING

[Figure: raw values (Luis Rei, 25, <a href="">) on the data side; the same values labelled NAME, AGE and WEBSITE on the meaning side]
•   Python vs C or C++

•   feedparser, Beautiful
    Soup (scrape web pages)

•   NumPy, SciPy

•   Weka

•   R

•   Libraries

        Down The Rabbit Hole
•   In 2006, Google's search crawler used
    850TB of data. Total web history is
    around 3PB

    •   Think of all the audio, photos & videos

    •   That’s a lot of data

•   Open formats (HTML, RSS, PDF, ...)

•   Everyone + their dog has an API

    •   facebook, twitter, flickr,
        delicious, digg, gowalla, ...

•   Think about:

    •   news articles published every day

    •   status updates / day
               The Netflix Prize

•   In October 2006 Netflix launched an open competition for the best
    collaborative filtering algorithm

    •   at least 10% improvement over Netflix’s own algorithm

•   Predict user ratings for films based on previous ratings (by all users)

•   US$1,000,000 prize won in Sep 2009
The Three Acts
                      I: The Pledge
 The magician shows you something ordinary. But of course... it
                       probably isn't.

                       II: The Turn
  The magician takes the ordinary something and makes it do
 something extraordinary. Now you're looking for the secret...

                    III: The Prestige
 But you wouldn't clap yet. Because making something disappear
            isn't enough; you have to bring it back.
       Collaborative Filtering

I. Collect Preferences

II. Find Similar Users
   or Items

III. Recommend
       I. Collecting Preferences
•   yes/no votes

•   Ratings in stars

•   Purchase history

•   Who you follow/who’s your

•   The music you listen to or the
    movies you watch

•   Comments (“Bad”, “Great”, “Lousy”, ...)
                   II. Similarity
•    Euclidean Distance

                                            $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$

•    Pearson Correlation

Olsen Twins - notice the similarity!
     > 0.0 (positive correlation)
          < 1.0 (not equal)
         Same eyes, nose, ...
Different hair color, dress, earrings, ...
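
The two similarity measures named above can be sketched in a few lines of Python. This is a minimal illustration, assuming each user (or item) is represented as an equal-length list of ratings; the function names are made up for this example, and the `1 / (1 + distance)` mapping is just one common way to turn a distance into a similarity score.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length rating vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_similarity(a, b):
    """Map the distance into (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

def pearson(a, b):
    """Pearson correlation: +1 for a perfect positive linear relation, -1 for a negative one."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector has no defined correlation; treat as "no similarity"
    return cov / sqrt(var_a * var_b)
```

Pearson tends to be the more forgiving choice when one user rates everything a star higher than another, because it compares the shape of the ratings rather than their absolute values.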
III. Recommend
             Users Vs Items
•   Find similar items instead of similar users!
•   Same recommendation process:
    •  just switch users with items & vice versa (conceptually)
•   Why?
    •  Works for new users
    •  Might be more accurate (might not)
    •  It can be useful to have both
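
Switching users with items can be as simple as flipping the preference table. The sketch below assumes preferences are stored as a nested dict `{user: {item: rating}}`; that layout and the function name are illustrative assumptions, not something the slides specify.

```python
def transpose_prefs(prefs):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    flipped = {}
    for user, items in prefs.items():
        for item, rating in items.items():
            flipped.setdefault(item, {})[user] = rating
    return flipped
```

Any similarity function written for users can then be applied to the flipped table to compare items.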
•   How good are the recommendations?
•   Partitioning the data: Training set vs Test set
    •   Size of the sets? 95/5
•   Variance
•   Multiple rounds with different partitions
    •   How many rounds? 1? 2? 100?
•   Measure of “goodness” (or rather, the error): Root
    Mean Square Error
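
A rough sketch of that evaluation loop, assuming ratings are `(user, item, rating)` tuples and that `train_and_predict` is a hypothetical callback that trains on one partition and returns one prediction per test rating. The 95/5 split comes from the slide; the number of rounds is arbitrary here.

```python
import random
from math import sqrt

def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length lists of ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def split(ratings, test_fraction=0.05, rng=random):
    """Shuffle a copy of the ratings and cut it into (training set, test set)."""
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def evaluate(ratings, train_and_predict, rounds=10, seed=0):
    """Repeat the split several times; return the mean and variance of the RMSE."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        train, test = split(ratings, rng=rng)
        predicted = train_and_predict(train, test)
        actual = [rating for (_, _, rating) in test]
        scores.append(rmse(predicted, actual))
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance
```

A high variance across rounds suggests the score depends heavily on which ratings happened to land in the test set, which is the reason for running more than one round.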
Case Study:

•   Django project by 1 programmer

•   Users give ratings to restaurants

    •   0 to 5 stars (0-100 internally)

•   Challenge: recommend users
    restaurants they will probably like
•   User Similarity

•   Restaurant Similarity

    •   Allows you to show similar restaurants on a restaurant's page

•   Restaurant recommendations

    •   can be based on user or restaurant similarity (here it’s restaurant)

[Screenshots: one list based on user similarity, one based on restaurant similarity]
 Case Study: Twitter Follow
• Recommend users to follow
• Users don’t have ratings
  • implied rating:
    “follow” (binary)
• Recommend users that the people the target user
  follows also follow (but that the target user doesn’t)

(this was stuff I presented @codebits in 2008, before twitter had follow recommendations; code was rewritten)
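
The follow recommendation described above can be sketched as a simple count. This is not the code from the original presentation; it assumes the follow graph is available as a dict mapping each user to the set of users they follow, and the function name is hypothetical.

```python
from collections import Counter

def recommend_follows(follows, user, top=5):
    """Suggest accounts followed by the people `user` follows, most common first."""
    already = follows.get(user, set())
    counts = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            # Skip the target user and anyone they already follow.
            if candidate != user and candidate not in already:
                counts[candidate] += 1
    return [name for name, _ in counts.most_common(top)]
```

Each "follow" acts as a binary rating, so the count is effectively the number of neighbors who "rated" the candidate positively.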

A KNN in 1 minute
•   Calculate the nearest neighbors (similarity)

    •   e.g. the other users with the highest number of equal ratings
        to the customer

•   For the k nearest neighbors:

    •   neighbor base predictor (e.g. avg rating for neighbor)

    •   s += sim * (rating - nbp)

    •   d += sim

•   prediction = cbp + s/d (cbp = customer base predictor, e.g. average customer rating)
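
The steps above translate almost line by line into Python. The sketch below assumes ratings are stored as `{user: {item: rating}}`, uses "number of equal ratings" as the similarity (the example given on the slide), and uses each user's average rating as the base predictor. All names are illustrative.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def similarity(ratings, a, b):
    """Number of items both users rated identically."""
    shared = set(ratings[a]) & set(ratings[b])
    return sum(1 for item in shared if ratings[a][item] == ratings[b][item])

def predict(ratings, customer, item, k=3):
    """Predict the customer's rating for an item from the k nearest neighbors."""
    cbp = mean(ratings[customer].values())  # customer base predictor
    candidates = [(similarity(ratings, customer, other), other)
                  for other in ratings
                  if other != customer and item in ratings[other]]
    neighbors = sorted(candidates, reverse=True)[:k]
    s = d = 0.0
    for sim, other in neighbors:
        nbp = mean(ratings[other].values())  # neighbor base predictor
        s += sim * (ratings[other][item] - nbp)
        d += sim
    if d == 0:
        return cbp  # no usable neighbors: fall back to the customer's average
    return cbp + s / d
```

Subtracting the neighbor's own average before weighting means a harsh rater's 3 stars and a generous rater's 5 stars can both count as "above their usual".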
               Item                   Item

• Assign an item into a category
 • An email as spam (document classification)
 • A set of symptoms to a particular disease
 • A signature to an individual (biometric identification)
 • An individual as credit worthy (credit scoring)
 • An image as a particular letter (Optical Character Recognition)
     Common Algorithms
• Supervised
 • Neural Networks
 • Support Vector Machines
 • Genetic Algorithms
 • Naive Bayes Classifier
• Unsupervised:
 • Usually done via Clustering (clustering hypothesis)
   • i.e. similar contents => similar classification
         Naive Bayes Classifier

I. Train

II. Calculate Probabilities

III. Classify
Case Study: A Spam Filter

      • The item (document) is an email message
      • 2 Categories: Spam and Ham
      • What do we need?

fc:     {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}

cc:     {'ham': 6, 'spam': 6}
        Feature Extraction
• Input data can be way too large
 • Think every pixel of an image
• It can also be mostly useless
 • A signature is the same regardless of color (B&W
    will suffice)
• And incredibly redundant (lots of data, little info)
• The solution is to transform the input into a
  smaller representation - a feature vector!
• A feature is either present or not
               Get Features
• Word Vector: Features are words                    (basic for doc classification)

• An item (document) is an email message and can:
 • contain a word (feature is present)
 • not contain a word (feature is absent)

    ['date', 'don', 'mortgage', 'taint', 'you', 'how', 'delay', ...]
    Other ideas: use capitalization, stemming, tf-idf
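
A word-based feature extractor along these lines might look like the sketch below. The length limits are an assumption made for this example (they drop fragments like the "t" in "don't" and very long tokens); the function name is hypothetical.

```python
import re

def get_features(text, min_len=3, max_len=19):
    """Return the set of distinct lowercase words in the text.

    A feature is either present or not, so a set is enough:
    repeated words count once.
    """
    words = re.split(r'\W+', text.lower())
    return {w for w in words if min_len <= len(w) <= max_len}
```

For example, `get_features("Don't delay, DELAY you?")` yields the features `don`, `delay` and `you`.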
                                I. Training
For every training example (item, category):
 1. Extract the item’s features
 2. For each feature:
      •   Increment the count for this (feature, category) pair
          fc: {'feature': {'category': count, ...}}
 3. Increment the category count (+1 example)
          cc: {'category': count, ...}
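
The training loop above only needs two counters. Here is a minimal sketch that keeps the `fc` and `cc` names from the slide; the class name is made up, and features are assumed to arrive as an already-extracted collection of words.

```python
from collections import defaultdict

class Counts:
    """Holds the two tables a Naive Bayes classifier is trained into."""

    def __init__(self):
        self.fc = defaultdict(lambda: defaultdict(int))  # {'feature': {'category': count}}
        self.cc = defaultdict(int)                       # {'category': count}

    def train(self, features, category):
        for feature in features:
            self.fc[feature][category] += 1  # one more (feature, category) sighting
        self.cc[category] += 1               # one more example in this category
```

Usage: call `train` once per labelled example, e.g. `counts.train({'python', 'the'}, 'ham')`.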
                      II. Probabilities
P(word | category)        the probability that a word appears in a document of a particular category

$P(w \mid c) = \frac{P(c \cap w)}{P(c)}$

                           Assumed Probability
    Using only the information it has seen so far makes the classifier incredibly sensitive to words
    that appear very rarely.
    It would be much more realistic for the value to gradually change as a word is
    found in more and more documents with the same category.

       A weight of 1 means the assumed probability is weighted the same as one word
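
One common way to blend an assumed probability with the observed counts is a weighted average, sketched below. The formula, the default assumed probability of 0.5 and the function names are assumptions for illustration; the slides do not spell them out. The `fc` and `cc` tables are the ones from the spam filter case study.

```python
def word_prob(fc, cc, word, category):
    """P(w | c): fraction of the category's documents that contain the word."""
    if cc.get(category, 0) == 0:
        return 0.0
    return fc.get(word, {}).get(category, 0) / cc[category]

def weighted_prob(fc, cc, word, category, weight=1.0, assumed=0.5):
    """Start at the assumed probability and move toward the observed one
    as the word is seen in more documents."""
    basic = word_prob(fc, cc, word, category)
    total = sum(fc.get(word, {}).values())  # sightings of the word in all categories
    return (weight * assumed + total * basic) / (weight + total)

fc = {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc = {'ham': 6, 'spam': 6}
```

With these counts, a word never seen before gets exactly the assumed 0.5, while 'python' in spam moves from a hard 0 to a small positive value, so a single rare word can no longer veto a category outright.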
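P(Document | Category)      the probability of seeing a given doc, given a particular category

$P(d \mid c) = P(w_1 \mid c) \times P(w_2 \mid c) \times \dots \times P(w_n \mid c)$ for every word in the document
(assuming the words are independent given the category: the "naive" part)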

                                                      *note: Decimal vs float

            Yeah that’s nice... but what we want is
                      P(Category | Document)!
III. Bayes’ Theorem
$P(d \mid c) = P(w_1 \mid c) \times P(w_2 \mid c) \times \dots \times P(w_n \mid c)$

$P(c \mid d) = \frac{P(d \mid c) \times P(c)}{P(d)}$

P(d) can be ignored: it is the same for every category
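
Putting Bayes' theorem to work, a classifier only has to compute P(d | c) x P(c) for each category and pick the largest. The sketch below is self-contained and reuses the `fc`/`cc` counts from the spam filter slide; the smoothing formula and all function names are assumptions made for this example.

```python
def weighted_prob(fc, cc, word, category, weight=1.0, assumed=0.5):
    """P(w | c), smoothed toward an assumed probability for rare words."""
    total = sum(fc.get(word, {}).values())
    basic = fc.get(word, {}).get(category, 0) / cc[category]
    return (weight * assumed + total * basic) / (weight + total)

def score(fc, cc, words, category):
    """P(d | c) * P(c), which is proportional to P(c | d)."""
    p_category = cc[category] / sum(cc.values())
    p_doc = 1.0
    for word in words:
        p_doc *= weighted_prob(fc, cc, word, category)
    return p_doc * p_category

def classify(fc, cc, words):
    """Return the category with the highest score; P(d) is never computed."""
    return max(cc, key=lambda category: score(fc, cc, words, category))

fc = {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc = {'ham': 6, 'spam': 6}
```

For long documents the product of many small floats can underflow, which is presumably what the "Decimal vs float" note is about; summing logarithms is another common workaround.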
• If you’re thinking of filtering spam, go with Akismet
• If you really want to do your own Bayesian spam filter,
  a good start is Wikipedia
• Training datasets are available online - for spam and
  pretty much everything else

                    Clustering

[Figure: the items A, B, C, D, F, G, I, J grouped into the clusters {A, C}, {F}, {B, D, G} and {I, J}]

    • Find structure in datasets:
     • Groups of things, people, concepts
    • Unsupervised (i.e. there is no training)
    • Common algorithms:
     • Hierarchical clustering
     • K-means
     • Non Negative Matrix Approximation
     Non Negative Matrix Approximation (or Factorization)

I. Get the data

  • in matrix form!
II. Factorize the matrix

III. Present the results

                           yeah the matrix is kind of magic
Case Study: News Clustering
                 I. The Data
article vector                   ['A', 'B', 'C', 'D', ...]

word vector (property = word)    ['sapo', 'codebits', 'haiti', 'iraq', ...]

Matrix (word frequency/article):
    [[7, 8, 1, 10, ...]
     [2, 0, 16, 1, ...]
     [22, 3, 0, 0, ...]
     [9, 12, 5, 4, ...]
     ...]

Article D contains the word 'iraq' 4 times
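
Building such a matrix from raw text can be sketched as follows. The orientation is an assumption for this example: one row per article and one column per word, with the vocabulary sorted alphabetically. The function name and the `(title, text)` input format are also illustrative.

```python
import re
from collections import Counter

def build_matrix(articles):
    """articles: list of (title, text) pairs.

    Returns (titles, words, matrix) where matrix[i][j] is how many times
    words[j] appears in the article titled titles[i].
    """
    counts = [Counter(re.findall(r'[a-z]+', text.lower())) for _, text in articles]
    words = sorted(set().union(*counts))  # the word vector
    matrix = [[count[word] for word in words] for count in counts]
    titles = [title for title, _ in articles]  # the article vector
    return titles, words, matrix
```

In practice very common words ("the", "and") and very rare ones are usually dropped before factorizing, since they add columns without helping to separate topics.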
II. Factorize
data matrix = features matrix  x  weights matrix

[[23, 24]      [[7, 8]        [[1, 0]
 [2, 0]]    =   [2, 0]]   x    [2, 3]]

features matrix (word, feature): importance of the word to the feature

weights matrix (feature, article): how much the feature applies to the article
k - the number of features to find (i.e. number of clusters)
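
One well-known way to find such a factorization is the multiplicative update rule, sketched below in plain Python so it runs without NumPy. This is an illustration, not the code behind the slides. It assumes the data matrix `v` has one row per article and one column per word, and factors it as `w` (article by feature) times `h` (feature by word); flip the roles if your matrix is laid out the other way.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(a, b):
    """Sum of squared differences between two same-shaped matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def update(m, numer, denom, eps=1e-9):
    """Element-wise m * numer / denom; eps guards against division by zero."""
    return [[v * n / (d + eps) for v, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(m, numer, denom)]

def factorize(v, k, iterations=200, seed=0):
    """Approximate v by w x h, with every entry of w and h non-negative."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n_rows)]
    h = [[rng.random() for _ in range(n_cols)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        h = update(h, matmul(wt, v), matmul(matmul(wt, w), h))
        ht = transpose(h)
        w = update(w, matmul(v, ht), matmul(w, matmul(h, ht)))
    return w, h
```

Because the starting matrices are random, different seeds can give different (equally valid) factorizations; `cost(v, matmul(w, h))` tells you how close a given run got.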

            III. The Results

• For every feature:
 • Display the top X words
    (from the features matrix)
 • Display the top Y articles
    for this feature (from the
    weights matrix)
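
Reading the results out of the two matrices is mostly sorting. The sketch below assumes `h` has one row per feature and one column per word, and `w` has one row per article and one column per feature; that layout and the function name are assumptions for this example.

```python
def show_features(w, h, titles, words, top_words=6, top_articles=3):
    """For each feature, return (top words, top (weight, title) pairs)."""
    results = []
    for f, row in enumerate(h):
        # Words that matter most to this feature.
        best_words = [word for _, word in
                      sorted(zip(row, words), reverse=True)[:top_words]]
        # Articles this feature applies to most strongly.
        best_articles = sorted(((w[a][f], titles[a]) for a in range(len(titles))),
                               reverse=True)[:top_articles]
        results.append((best_words, best_articles))
    return results
```

Each entry pairs a list of words with a list of `(weight, title)` tuples, one entry per feature.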
*note: this was created using an OPML file exported from my Google Reader (260 subscriptions)

['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple, Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')

['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")

['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights \u2013 10/29/2010')

['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend\u2019s News: Afghanistan\u2019s Long-Term Stability')
Food for the Brain
•   Programming Collective Intelligence: Building Smart Web 2.0 Applications (Toby Segaran)
•   Neural Networks: A Comprehensive Foundation (Simon Haykin)
•   Data Mining: Practical Machine Learning Tools and Techniques (Ian H. Witten, Eibe Frank)
•   Machine Learning (Tom Mitchell)
