Information Filtering

LBSC 878
Douglas W. Oard and Dagobert Soergel
Week 10: April 12, 1999
                  Agenda
• Information filtering
• Profile learning
• Social filtering
Information Access Problems
[Diagram: a 2x2 grid crossing the information need (stable vs. different each time) with the collection (stable vs. different each time)]
• Retrieval: the information need is different each time; the collection is stable
• Filtering: the information need is stable; the collection is different each time
         Information Filtering
• An abstract problem in which:
  – The information need is stable
     • Characterized by a “profile”
  – A stream of documents is arriving
     • Each must either be presented to the user or not
• Introduced by Luhn in 1958
  – As “Selective Dissemination of Information”
• Named “Filtering” by Denning in 1982
    A Simple Filtering Strategy
• Use any information retrieval system
  – Boolean, vector space, probabilistic, …
• Have the user specify a “standing query”
  – This will be the profile
• Limit the standing query by date
  – Each use, show what arrived since the last use
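A minimal sketch of this strategy in Python, assuming a simple in-memory list of documents; the field names and the any-term Boolean match are illustrative, not from the slides:

from datetime import datetime

class StandingQueryFilter:
    def __init__(self, profile_terms):
        self.profile = set(profile_terms)   # the standing query is the profile
        self.last_use = datetime.min        # date limit: nothing seen yet

    def check(self, documents):
        """Show only documents that arrived since the last use and match."""
        now = datetime.now()
        new_docs = [d for d in documents if d["arrived"] > self.last_use]
        self.last_use = now
        # Crude Boolean match: present a document if any profile term appears.
        return [d for d in new_docs
                if self.profile & set(d["text"].lower().split())]

f = StandingQueryFilter(["filtering", "profiles"])
docs = [{"arrived": datetime(1999, 4, 12), "text": "Information filtering news"}]
print(f.check(docs))   # first use: everything to date that matches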
       What’s Wrong With That?
• Unnecessary indexing overhead
  – Indexing only speeds up retrospective searches
• Every profile is treated separately
  – The same work might be done repeatedly
• Forming effective queries by hand is hard
  – The computer might be able to help
• It is OK for text, but what about audio, video, …
  – Are words the only possible basis for filtering?
        The Fast Data Finder
• Fast Boolean filtering using custom hardware
  – Up to 10,000 documents per second
• Words pass through a pipeline architecture
  – Each element looks for one word
[Diagram: a pipeline of one-word matchers ("good", "great", "party", "aid") whose outputs feed OR, NOT, and AND elements]
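A rough software analogue of the pipeline, under one plausible reading of the diagram as (good OR great) AND NOT aid; the wiring of the "party" element is not recoverable, so it is omitted. The real Fast Data Finder evaluated such expressions in custom hardware:

def matches(token_stream):
    # One pipeline element per word: each just remembers whether it fired.
    seen = {"good": False, "great": False, "aid": False}
    for token in token_stream:
        if token in seen:
            seen[token] = True
    # Combine the element outputs with the Boolean logic from the diagram.
    return (seen["good"] or seen["great"]) and not seen["aid"]

print(matches("a good party".split()))      # True
print(matches("great aid effort".split()))  # False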
      Profile Indexing in SIFT
• Build an inverted file of profiles
  – Postings are profiles that contain each term
• RAM can hold 5,000 profiles/megabyte
  – And several machines can be run in parallel
• Both Boolean and vector space matching
  – User-selected threshold for each ranked profile
     • Hand-tuned on a web page using today’s news
• Delivered by email
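A minimal sketch of the inverted-file idea: postings map each term to the profiles that contain it, so one pass over an arriving document finds every profile it might match. The profiles, term-count matching, and threshold are illustrative simplifications of SIFT's Boolean and vector-space modes:

from collections import defaultdict

profiles = {
    "alice@example.com": {"nuclear", "fallout"},
    "bob@example.com":   {"retrieval", "filtering"},
}

index = defaultdict(set)            # term -> profiles containing that term
for owner, terms in profiles.items():
    for term in terms:
        index[term].add(owner)

def route(document_text, threshold=1):
    """Count matching terms per profile; deliver if the count meets the threshold."""
    hits = defaultdict(int)
    for term in set(document_text.lower().split()):
        for owner in index.get(term, ()):
            hits[owner] += 1
    return [owner for owner, n in hits.items() if n >= threshold]

print(route("Nuclear fallout reported"))  # ['alice@example.com']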
              SIFT Limitations
• Centralization
  – Requires some form of cost sharing
• Privacy
  – Centrally registered profiles
  – Each profile associated with an email address
• Usability
  – Manually specified profiles
  – Thresholds based on obscure retrieval status value
            Profile Learning
• Automatic learning for stable profiles
  – Based on observations of user behavior
• Explicit ratings are easy to use
  – Binary ratings are simply relevance feedback
• Unobtrusive observations are easier to get
  – Reading time, save/delete, …
  – But we only have a few clues on how to use them
                      Relevance Feedback
[Diagram: the relevance feedback loop. Initial profile terms are made into a profile vector, and new documents are made into document vectors. Similarity between each document vector and the profile is computed, and the documents are presented in rank order. The user selects and examines documents and assigns ratings; each rating, together with its document vector, is used to update the user model.]
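A minimal sketch of the update step in this loop, using the Rocchio formula (the "hill climbing" entry on the next slide); the alpha, beta, and gamma weights are conventional defaults, not values from the slides:

def rocchio_update(profile, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Move the profile vector toward rated-relevant document vectors
    and away from rated-nonrelevant ones."""
    terms = set(profile) | {t for d in relevant + nonrelevant for t in d}
    new = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        new[t] = alpha * profile.get(t, 0.0) + beta * pos - gamma * neg
    return new

profile = {"filtering": 1.0}
rated_relevant = [{"filtering": 0.8, "profiles": 0.6}]
print(rocchio_update(profile, rated_relevant, []))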
           Machine Learning
• Given a set of vectors with associated values
  – e.g., document vector + relevance judgments
• Predict the values associated with new vectors
  – i.e., learn a mapping from vectors to values
• All learning systems share two problems
  – They need some basis for making predictions
     • This is called an “inductive bias”
  – They must balance adaptation with generalization
    Machine Learning Techniques
•   Hill climbing (Rocchio Feedback)
•   Instance-based learning
•   Rule induction
•   Regression
•   Neural networks
•   Genetic algorithms
•   Statistical classification
     Instance-Based Learning
• Remember examples of desired documents
  – Dissimilar examples capture range of possibilities
• Compare new documents to each example
  – Vector similarity, probabilistic, …
• Base rank order on the most similar example
• May be useful for broader interests
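A minimal sketch of instance-based matching; the stored examples and the cosine measure are illustrative choices:

import math

def cosine(a, b):
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

examples = [                        # dissimilar examples capture a range
    {"nuclear": 1.0, "fallout": 0.8},
    {"movies": 1.0, "reviews": 0.7},
]

def score(doc):
    """Base the rank order on the most similar stored example."""
    return max(cosine(doc, ex) for ex in examples)

docs = [{"nuclear": 0.9}, {"movies": 0.5, "nuclear": 0.1}]
print(sorted(docs, key=score, reverse=True))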
             Rule Induction
• Automatically derived Boolean profiles
  – (Hopefully) effective and easily explained
• Specificity from the “perfect query”
  – AND terms in a document, OR the documents
• Generality from a bias favoring short profiles
  – e.g., penalize rules with more Boolean operators
  – Balanced by rewards for precision, recall, …
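A minimal sketch of the "perfect query" step; real rule induction would then shorten the rule, trading this specificity for generality under the short-profile bias:

relevant_docs = [                   # term sets of rated-relevant documents
    {"nuclear", "fallout", "siberia"},
    {"nuclear", "retrieval"},
]

def perfect_query(docs):
    """AND the terms within each document, OR the documents together."""
    return " OR ".join("(" + " AND ".join(sorted(d)) + ")" for d in docs)

def matches(rule_docs, document_terms):
    # A document matches if it contains every term of some conjunct.
    return any(d <= document_terms for d in rule_docs)

print(perfect_query(relevant_docs))
print(matches(relevant_docs, {"nuclear", "retrieval", "extra"}))  # True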
                  Regression
• Fit a simple function to the observations
  – Straight line, s-curve, …
• Logistic regression matches binary relevance
  – S-curves fit better than straight lines
  – Same calculation as a 3-layer neural network
     • But with a different training technique
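A minimal sketch of logistic regression trained by gradient descent; the two-feature documents, learning rate, and epoch count are illustrative:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))     # the s-curve

def train(xs, ys, lr=0.5, epochs=200):
    """xs: feature vectors; ys: binary relevance judgments (0 or 1)."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y                    # predict-observe difference
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

xs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
ys = [1, 1, 0, 0]
w, b = train(xs, ys)
print(sigmoid(sum(wi * xi for wi, xi in zip(w, [0.85, 0.2])) + b))  # near 1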
             Neural Networks
• Design inspired by the human brain
  – Neurons, connected by synapses
     • Organized into layers for simplicity
  – Synapse strengths are learned from experience
     • Seek to minimize predict-observe differences
• Can be fast with special purpose hardware
  – Biological neurons are far slower than computers
• Work best with application-specific structures
  – In biology this is achieved through evolution
           Genetic Algorithms
• Design inspired by evolutionary biology
  – Parents exchange parts of genes (crossover)
  – Random variations in each generation (mutation)
  – Fittest genes dominate the population (selection)
• Genes are vectors
  – Term weights, source weights, module weights, …
• Selection is based on a fitness function
  – Tested repeatedly, so it must be informative and cheap
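A minimal sketch of the three operators applied to term-weight genes; the fitness function here is a cheap stand-in for a real measure of filtering effectiveness:

import random

random.seed(0)
TARGET = [0.9, 0.1, 0.5, 0.3]               # hypothetical ideal weights

def fitness(gene):                          # must be informative and cheap
    return -sum((g - t) ** 2 for g, t in zip(gene, TARGET))

def crossover(a, b):                        # parents exchange parts of genes
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(gene, rate=0.1):                 # random variation each generation
    return [min(1.0, max(0.0, g + random.uniform(-0.2, 0.2)))
            if random.random() < rate else g for g in gene]

population = [[random.random() for _ in range(4)] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]               # fittest genes dominate (selection)
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(10)]
print(max(population, key=fitness))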
      Statistical Classification
• Represent relevant docs as one random vector
  – And nonrelevant docs as another
• Build a statistical model for each
  – e.g., a normal distribution
• Find the surface separating the distributions
  – e.g., a hyperplane
• Rank documents by distance from that surface
  – Possibly distorted by the shape of the distributions
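A minimal sketch with a one-dimensional feature and one normal distribution per class; the difference of log-likelihoods serves as the (possibly distorted) distance from the separating surface:

import math

def fit(scores):
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores) or 1e-9
    return mean, var

def log_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

relevant = [0.8, 0.9, 0.7]          # illustrative 1-D document features
nonrelevant = [0.2, 0.1, 0.3]
rel_model, non_model = fit(relevant), fit(nonrelevant)

def rank_score(x):
    """Positive: relevant side of the boundary; negative: nonrelevant side."""
    return log_pdf(x, *rel_model) - log_pdf(x, *non_model)

print(sorted([0.75, 0.15, 0.5], key=rank_score, reverse=True))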
            Training Strategies
• Overtraining can hurt performance
   – Performance on training data rises and plateaus
   – Performance on new data rises, then falls
• One strategy is to learn less each time
   – But it is hard to guess the right learning rate
• Splitting the training set is a useful alternative
   – Part for training, part for finding “new data” peak
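A minimal sketch of the split strategy as early stopping: train in small steps and stop when performance on the held-out part peaks. The patience parameter and the toy evaluation scores are assumptions:

def train_with_early_stopping(train_step, evaluate, max_epochs=100, patience=5):
    """train_step(): one incremental pass on the training part.
    evaluate(): performance on the held-out "new data" part."""
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(max_epochs):
        train_step()
        score = evaluate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break                           # held-out performance has peaked
    return best_epoch, best_score

scores = iter([0.5, 0.6, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4])
print(train_with_early_stopping(lambda: None, lambda: next(scores)))  # (2, 0.7)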
             Social Filtering

• Exploit ratings from other users as features
  – Like personal recommendations, peer review, …
• Reaches beyond topicality to:
  – Accuracy, coherence, depth, novelty, style, …
• Applies equally well to other modalities
  – Movies, recorded music, …
• Sometimes called “collaborative” filtering
Social Filtering Example
[Table: a sparse movie-rating matrix. Amy, Norina, Paul, and Skip rate individual movies, grouped under the genres Comedy, Drama, Action, and Mystery, on a 1-9 scale; most cells are empty, and Skip has rated only a few movies.]
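A minimal sketch of rating-based prediction in the GroupLens style: weight each other user's opinion by the Pearson correlation of their past ratings with mine. The ratings below are illustrative, loosely modeled on the example above:

ratings = {                     # user -> {movie: rating on a 1-9 scale}
    "Amy":    {"m1": 3, "m2": 5, "m3": 7, "m4": 9},
    "Norina": {"m1": 9, "m2": 7, "m3": 2, "m4": 9},
    "Skip":   {"m1": 9, "m3": 8, "m5": 2},
}

def pearson(a, b):
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    da = sum((a[i] - ma) ** 2 for i in common) ** 0.5
    db = sum((b[i] - mb) ** 2 for i in common) ** 0.5
    return num / (da * db) if da and db else 0.0

def predict(user, item):
    """Correlation-weighted average of other users' deviations from their means."""
    me = ratings[user]
    my_mean = sum(me.values()) / len(me)
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        w = pearson(me, theirs)
        num += w * (theirs[item] - sum(theirs.values()) / len(theirs))
        den += abs(w)
    return my_mean + num / den if den else my_mean

print(predict("Skip", "m2"))    # predict Skip's rating from his peers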
Some Things We (Sort of) Know
• Treating each genre separately can be useful
  – Separate predictions for separate tastes
• Negative information might be useful
  – “I hate everything my parents like”
• People like to know who provided ratings
• Popularity provides a useful fallback
• People don’t like to provide ratings
  – Few experiments have achieved sufficient scale
              Implicit Feedback
• Observe user behavior to infer a set of ratings
  – Examine (reading time, scrolling behavior, …)
  – Retain (bookmark, save, save & annotate, print, …)
  – Refer to (reply, forward, include link, cut & paste, …)
• Some measurements are directly useful
  – e.g., use reading time to predict reading time
• Others require some inference
  – Should you treat cut & paste as an endorsement?
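A minimal sketch of turning such observations into inferred ratings; the behavior weights and the reading-time cap are pure assumptions, since (as the slide notes) we have only a few clues about how to use these signals:

BEHAVIOR_WEIGHTS = {            # observation -> assumed rating on [0, 1]
    "print": 0.9,
    "bookmark": 0.8,
    "forward": 0.7,
    "delete": 0.1,
}

def infer_rating(events, reading_seconds):
    """Combine the strongest observed behavior with reading time (capped at 5 min)."""
    behavior = max((BEHAVIOR_WEIGHTS.get(e, 0.5) for e in events), default=0.5)
    time_signal = min(reading_seconds / 300.0, 1.0)
    return 0.5 * behavior + 0.5 * time_signal

print(infer_rating(["bookmark", "print"], reading_seconds=120))  # 0.65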
        Constructing a Rating Matrix
[Diagram: layered evidence about each rater-document pair (examination, retention, reference, and explicit ratings for raters Mary, Jack, Joe, Jill, Ken, Marty, Sam, and Tim) is combined into a single rater-by-document matrix of estimated ratings:]

                 Document
Rater       1     2     3     4
Mary             0.13  0.57  0.69
Jack      0.29         0.14
Joe       0.37         0.19  0.44
Jill
Ken              0.62
Marty     0.53         0.79
Sam              0.77  0.05  0.57
Tim       0.71
     Some Research Questions
• How readers can pay raters efficiently
• How to use implicit feedback
  – No large scale integrated experiments yet
• How to protect privacy
  – Pseudonyms can be defeated fairly easily
• How to merge social and content-based filtering
  – Social filtering never finds anything new
• How to evaluate social filtering effectiveness
   Combining Content and Ratings
• Two sources of features
  – Terms and raters
• Each is used differently
  – Find terms in documents
  – Find informative raters

[Diagram: two matrices over the same four documents. One is indexed by terms (complicated, contaminated, fallout, information, interesting, nuclear, retrieval, siberia), the other by raters (Mary, Jack, Joe, Jill, Ken, Marty, Sam, Tim); the parallel layout shows term occurrences and ratings playing the same role as features.]
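A minimal sketch of one way to blend the two feature sources: score a document by its terms, then mix in a prediction from informative raters. The profile, rater weights, and blend parameter are all illustrative:

term_profile = {"nuclear": 0.8, "fallout": 0.6}     # learned from content
rater_weights = {"Mary": 0.9, "Sam": 0.4}           # learned informativeness

def combined_score(doc_terms, doc_ratings, blend=0.5):
    content = sum(term_profile.get(t, 0.0) for t in doc_terms)
    num = sum(rater_weights.get(r, 0.0) * v for r, v in doc_ratings.items())
    den = sum(rater_weights.get(r, 0.0) for r in doc_ratings)
    social = num / den if den else 0.0   # no ratings yet: content only
    return blend * content + (1 - blend) * social

print(combined_score({"nuclear", "siberia"}, {"Mary": 0.7, "Sam": 0.2}))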
   Variations on Social Filtering
• Citation indexing
  – A special case of refer-to implicit feedback
• Search for people based on their behavior
  – Discover potential collaborators
  – Focus advertising on interested consumers
• Collaborative exploration
  – Explore a large, fairly stable collection
  – Discoveries migrate to people with similar interests
