CS276A Text Information Retrieval_ Mining_ and Exploitation

Shared by: wuyunqing
-
Stats
views:
1
posted:
9/5/2011
language:
English
pages:
49
Document Sample
scope of work template
							            CS276A
Text Information Retrieval, Mining, and
             Exploitation




             Lecture 8
            31 Oct 2002
Recap: IR based on Language
Model

Information
    need
                                         P (Q | M d )        M d1   d1

                                        generation
                                                             M d2   d2
                          query




                                                              …

                                                                     …
   One night in a hotel, I saw this late night             M dn    dn
    talk show where Sergey Brin popped on
    suggesting the web search tip that you
    should think of some words that would               document collection
    likely appear on pages that would answer
    your question and use those as your
    search terms – let’s exploit that idea!
Recap: Query generation
probability (unigrams)
   Ranking formula
           p (Q, d )  p (d ) p (Q | d )
                      p (d ) p (Q | M d )
   The probability of producing the query given the
    language model of document d using MLE is:
         p (Q | M d )   pml (t | M d )
         ˆ                ˆ
                              tQ

                       tf ( t ,d )                  Unigram assumption:
                                          Given a particular language model,
                         dld                the query terms occur independently
                 tQ


            M d : language model of document d
           tf ( t ,d ) : raw tf of term t in document d
           dld : total number of tokens in document d
Recap: Query generation
probability (mixture model)
   P(w|d) = Pmle(w|Md) + (1 – )Pmle(w|Mc)
   Mixes the probability from the document
    with the general collection frequency of the
    word.
   Correctly setting  is very important
   A high value of lambda makes the search
    “conjunctive-like” – suitable for short queries
   A low value is more suitable for long queries
   Can tune  to optimize performance
       Perhaps make it dependent on document size
        (cf. Dirichlet prior or Witten-Bell smoothing)
Today’s topics

   Relevance Feedback
   Query Expansion
Relevance Feedback

   Relevance feedback = user feedback on
    relevance of initial set of results
   The user marks returned documents as
    relevant or non-relevant.
   The system computes a better representation
    of the information need based on feedback.
   Relevance feedback can go through one or
    more iterations.
Relevance Feedback: Example

   Image search engine (couldn’t find relevance
    feedback engine for text!)
   url:
    http://nayana.ece.ucsb.edu/imsearch/imsearch.ht
    ml
Initial Query
Results for Initial Query
Relevance Feedback
Results after Relevance
Feedback
Rocchio Algorithm

   The Rocchio algorithm incorporates relevance
    feedback information into the vector space
    model.
   The optimal query vector for separating
    relevant and non-relevant documents:




   Unrealistic: we don’t know all relevant
Rocchio Algorithm
   Used in practice:




   Typical weights: alpha = 8, beta = 64, gamma = 64
   Tradeoff alpha vs beta/gamma: If we have a lot of
    judged documents, we want a higher beta/gamma.
   But we usually don’t …
Relevance Feedback in
Probabilistic Information Retrieval

   How?
Relevance Feedback in
Probabilistic Information Retrieval

   We can modify the query based on relevance
    feedback and apply standard model.
   Examples:
       Binary independence model
       Language model
Binary Independence Model
                                      n
                                             p ( xi | R, q )
 O ( R | q, d )  O ( R | q )  
                                     i 1   p ( xi | NR, q )
 • Since xi is either 0 or 1:
                                    p( xi  1 | R, q)        p( xi  0 | R, q)
 O( R | q, d )  O( R | q)                          
                             xi 1 p( xi  1 | NR, q) xi 0 p( xi  0 | NR, q)
Binary Independence Model
                                   p( xi  1 | R, q)        p( xi  0 | R, q)
O( R | q, d )  O( R | q)                          
                            xi 1 p( xi  1 | NR, q) xi 0 p( xi  0 | NR, q)




                      Used as before.                  We assume this is =1
                                                       in simple retrieval.
                                                       In relevance
                                                       feedback, we
                                                       have data to estimate
                                                       probability of
                                                       non-occurrence.
  Binary Independence Model

                               p( xi  1 | R, q)    p( xi  0 | NR, q)
O( R | q, d )  O( R | q)                      
                            qi p( xi  0 | R, q) qi p( xi  1 | NR, q)


      Note that we have 3 relevance “states” now:
       relevant, non-relevant and unjudged.
Positive vs Negative Feedback

   Positive feedback is more valuable than
    negative feedback.
   Many systems only allow positive feedback.
Relevance Feedback: Assumptions

   A1: User has sufficient knowledge for initial
    query.
   A2: Relevance prototypes are “well-behaved”.
       Either: All relevant documents are similar to a
        single prototype.
       Or: There are different prototypes, but they
        have significant vocabulary overlap.
Violation of A1

   User does not have sufficient initial
    knowledge.
   Examples:
       Misspellings (Brittany Speers)
       Cross-language information retrieval (hígado)
       Mismatch of searcher’s vocabulary vs collection
        vocabulary (e.g., regional differences, different
        fields of scientific study: genetics vs medicine,
        bernoulli naïve bayes vs binary independence
        model)
Violation of A2

   There are several relevance prototypes.
   Examples:
       Burma/Myanmar
       Contradictory government policies
       Pop stars that worked at Burger King
   Often: instances of a general concept
   Good editorial content can address problem
       Report on contradictory government policies
Relevance Feedback on the Web

   Some search engines offer a similar/related pages
    feature (simplest form of relevance feedback)
       Google (link-based)
       Altavista
       Stanford web
   But some don’t because it’s hard to explain to average
    user:
       Alltheweb
       msn
       Yahoo
   Excite initially had true relevance feedback, but
    abandoned it due to lack of use.
Relevance Feedback: Cost

   Long queries are inefficient for typical IR
    engine.
   Long response times for user.
   High cost for retrieval system.
Other Uses of Relevance Feedback

   Following a changing information need
   Maintaining an information filter (e.g., for a
    news feed)
   Active learning
   Topics for next quarter
Active Learning

   Goal: create a training set for a text classifier
    (or some other classification problem)
   One approach: uncertainty sampling
   At any point in learning, present document
    with highest uncertainty to user and get
    relevant/non-relevant judgment. (closest to p(
    R) = 0.5 )
   Scarce resource is user’s time: maximize
    benefit of each decision.
   Active learning significantly reduces the
    number of labeled documents needed to learn
 Active Learning
 somehow Build
 Initial classifier
                       retrain classifier

  classifier selects
   most uncertain                           user labels
     document                               document



                       most uncertain
documents               document
Active Learning
   Could we use active learning for relevance
    feedback?
Relevance Feedback
Summary
   Relevance feedback has been shown to be
    effective at improving relevance of results.
   Full relevance feedback is painful for the user.
   Full relevance feedback is not very efficient in
    most IR systems.
   Other types of interactive retrieval may
    improve relevance by as much with less work.
Forward Pointer: DirectHit
   DirectHit uses indirect relevance feedback.
   DirectHit ranks documents higher that users
    look at more often.
   Not user or query specific.
Pseudo Feedback

   Pseudo feedback attempts to automate the
    manual part of relevance feedback.
   Retrieve an initial set of relevant documents.
   Assume that top-ranked documents are
    relevant.
 Pseudo Feedback
  initial
  query
                 apply relevance
                    feedback

      retrieve
     documents                      label top k
                                   docs relevant



                 highest ranked
documents          documents
Pseudo Feedback

   Not surprisingly, the success of pseudo feedback
    depends on whether relevance assumption is true.
   If it is true, pseudo feedback can improve precision
    and recall dramatically.
   If it is not true, then the results will become less
    relevant in each iteration. (concept drift)
   Example: ambiguous words (“jaguar”)
   Bimodal distribution: depending on query,
    performance is dramatically better or dramatically
    worse.
   Unfortunately, hard to predict which will be the case.
Pseudo-Feedback: Performance
Query Expansion

   In relevance feedback, users give additional
    input (relevant/non-relevant) on documents.
   In query expansion, users give additional input
    (good/bad search term) on words or phrases.
Query Expansion: Example




 Also: see altavista, teoma
Types of Query Expansion

   Refinements based on query log mining
       Common on the web
   Global Analysis: Thesaurus-based
       Controlled vocabulary
            Maintained by editors (e.g., medline)
       Automatically derived thesaurus
            (co-occurrence statistics)
   Local Analysis:
       Analysis of documents in result set
Controlled Vocabulary
Automatic Thesaurus
Generation

   High cost of manually producing a thesaurus
   Attempt to generate a thesaurus automatically
    by analyzing a collection of documents
   Two main approaches
       Co-occurrence based (co-occurring words are
        more likely to be similar)
       Shallow analysis of grammatical relations
            Entities that are grown, cooked, eaten, and digested
             are more likely to be food items.
   Co-occurrence based is more robust,
    grammatical relations are more accurate.
                                 Why?
Co-occurrence Thesaurus
   Simplest way to compute one is based on term-term similarities
    in C = AAT where A is term-document matrix.
   wi,j = (normalized) weighted count (ti , dj)

                                     dj           n


     ti



          m
Automatic Thesaurus Generation
Example
Automatic Thesaurus Generation
Discussion

   Quality of associations is usually a problem.
   Very similar to LSI (why?)
   Problems:
       “False positives”
            Words deemed similar that are not
       “False negatives”
            Words deemed dissimilar that are similar
Query Expansion: Summary
   Query expansion is very effective in increasing
    recall.
   In most cases, precision is decreased, often
    significantly.
Sense-Based Retrieval

   In query expansion, new words are added to
    the query (disjunctively). Increase of matches.
   In sense-based retrieval, term matches are
    only counted if the same sense is used in
    query and document. Decrease of matches.
   Example: In sense-based retrieval, “jaguar” is
    only a match if it’s used in the “animal” sense
    in both query and document.
Sense-Based Retrieval: Results
Expansion vs. Sense-Based
Retrieval

   Same type of information is used in pseudo
    relevance feedback and sense-based retrieval.
   But: disambiguation is expensive
   Indexing with senses is complicated
   Automatic sense-based retrieval only makes
    sense for long queries                    Why?
   If senses are supplied in interactive loop, then
    it’s easier to add words rather than senses
Exercise

   What is the factor of increase of the index
    size?
   How could you integrate confidence numbers
    for senses?
Interactive Retrieval

   Query expansion and relevance feedback are
    examples of interactive retrieval.
   Others:
       Query “editing” (adding and removing terms)
       Visualization-based, e.g., interaction with visual
        map
       Parametric search
Resources
 MG Ch. 4.7
 MIR Ch. 5.2 – 5.4
 Singhal, Mitra, Buckley: Learning routing queries in a query
    zone, ACM SIGIR, 1997.
 Yonggang Qiu , Hans-Peter Frei, Concept based query
    expansion, Sigir, 1993.
 Schuetze: Automatic Word Sense Discrimination,
    Computational Linguistics, 1998.
 Buckley, Singhal, Mitra, Salton, New retrieval approaches
    using smart: trec4, nist, 1996.
 Gerard Salton and Chris Buckley. Improving retrieval
    performance by relevance feedback. Journal of the
    American Society ]or In:formation Science, 41(4):288-297,
    1990.

						
Other docs by wuyunqing
GRADUATE HANDBOOK
Views: 181  |  Downloads: 0
Send completed form to PHSSNC_usa.redcross.org
Views: 124  |  Downloads: 0
What To Know About Cumberland County
Views: 166  |  Downloads: 0
Working together on Monitoring Care Homes
Views: 1  |  Downloads: 0
Megastudy IR
Views: 171  |  Downloads: 0