NLP for Web Search: a summary of recent work

					Fuchun Peng
Microsoft Bing
7/23/2010




       Query is often treated as a bag of words
       But when people formulate queries, they use “concepts” as building
        blocks, e.g. Simmons College’s “sports psychology” course:

                               Q: simmons college sports psychology

                               A1: “simmons college”, “sports psychology”
                               A2: “college sports”



   • Can we automatically segment the query to
     recover the concepts?


   Summary of Segmentation approaches
   Use for Improving Search Relevance
    ◦ Query rewriting
    ◦ Ranking features
   Conclusions




    Supervised learning (Bergsma et al., EMNLP-CoNLL 2007)
     ◦ Binary decision at each possible segmentation point (sketch below)
     ◦ Features: POS tags, web counts, “the”, “and”, …
          w1 w2 | w3 w4 | w5        (gap labels: N, Y, N, Y)
 • Problems:
     – Limited-range context
     – Features specifically designed for noun phrases
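A minimal sketch of this boundary-classification setup in Python, assuming scikit-learn is available; the gap features (lexical identity plus a crude bigram web-count signal) and the toy training data are illustrative stand-ins, not Bergsma et al.'s actual feature set.

# One binary decision per gap between adjacent query tokens.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def gap_features(tokens, i, web_count):
    """Features for the gap between tokens[i] and tokens[i+1]."""
    left, right = tokens[i], tokens[i + 1]
    return {
        "left=" + left: 1.0,
        "right=" + right: 1.0,
        # Hypothetical association signal: raw web count of the bigram.
        "bigram_count": float(web_count.get((left, right), 0)),
    }

# Toy training data for one query: "simmons college | sports psychology".
web_count = {("simmons", "college"): 900, ("college", "sports"): 400,
             ("sports", "psychology"): 700}
tokens = ["simmons", "college", "sports", "psychology"]
labels = [False, True, False]          # labels[i]: segment break after token i?
X = [gap_features(tokens, i, web_count) for i in range(len(tokens) - 1)]
vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X), labels)

def segment(query_tokens):
    feats = [gap_features(query_tokens, i, web_count)
             for i in range(len(query_tokens) - 1)]
    breaks = clf.predict(vec.transform(feats))
    segments, cur = [], [query_tokens[0]]
    for tok, brk in zip(query_tokens[1:], breaks):
        if brk:
            segments.append(" ".join(cur))
            cur = []
        cur.append(tok)
    segments.append(" ".join(cur))
    return segments

print(segment(tokens))   # expected: ['simmons college', 'sports psychology']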




   Manual Data Preparation
     ◦ Linguistically driven
      [San jose international airport]
    ◦ Relevance driven
      [San jose] [international airport]




    Mutual information (MI) between adjacent words:
         MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))
     ◦ Compute MI for each adjacent pair: (w1,w2), (w2,w3), (w3,w4), (w4,w5), …
     ◦ Where MI falls below a threshold, insert a segment boundary,
       e.g. w1 w2 | w3 w4 w5; update iteratively (sketch below)
 • Problems:
   – Only captures short-range correlation (between adjacent words)
   – What about “my heart will go on”?
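A minimal sketch of MI-threshold segmentation, with probabilities estimated from a toy corpus; the corpus, query, and threshold value are all made up for illustration.

# Put a boundary between adjacent words whose association falls below a cutoff.
from collections import Counter

corpus = ("new york city " * 6 + "cheap hotel deals " * 6).split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def mi(w1, w2):
    """MI(w1, w2) = P(w1 w2) / (P(w1) * P(w2)), as on the slide (no log)."""
    p12 = bigrams[(w1, w2)] / (N - 1)
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return p12 / (p1 * p2) if p1 and p2 else 0.0

def segment(tokens, threshold=2.0):
    segments, cur = [], [tokens[0]]
    for w1, w2 in zip(tokens, tokens[1:]):
        if mi(w1, w2) < threshold:       # weak association -> segment boundary
            segments.append(" ".join(cur))
            cur = []
        cur.append(w2)
    segments.append(" ".join(cur))
    return segments

print(segment("new york hotel deals".split()))   # -> ['new york', 'hotel deals']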

       Assume the query is generated by independent sampling from a
        probability distribution over concepts:

           [simmons college] [sports psychology]
             P(simmons college) = 0.000016,  P(sports psychology) = 0.000002
             P = 0.000016 × 0.000002

           is more probable under the unigram model than

           [simmons] [college sports] [psychology]
             P(simmons) = 0.000007,  P(college sports) = 0.000006,  P(psychology) = 0.000024
             P = 0.000007 × 0.000006 × 0.000024

 • Enumerate all possible segmentations; rank them by the probability of
   being generated by the unigram model (sketch below)
 • How do we estimate the parameters P(w) of the unigram model?
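A minimal sketch of this ranking step, reusing the concept probabilities from the slide; the extra table entries and the floor for unseen concepts are made-up filler.

# Enumerate every segmentation of the query and rank by the product of
# concept probabilities under the unigram model.
P = {
    "simmons college": 0.000016, "sports psychology": 0.000002,
    "simmons": 0.000007, "college sports": 0.000006, "psychology": 0.000024,
    "college": 0.00001, "sports": 0.00002,            # made-up filler entries
}

def segmentations(tokens):
    """Yield every way to split tokens into contiguous segments."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        for rest in segmentations(tokens[i:]):
            yield [" ".join(tokens[:i])] + rest

def score(segmentation):
    prob = 1.0
    for seg in segmentation:
        prob *= P.get(seg, 1e-12)     # tiny floor for unseen concepts
    return prob

query = "simmons college sports psychology".split()
for seg in sorted(segmentations(query), key=score, reverse=True)[:3]:
    print(score(seg), seg)
# Top-ranked: ['simmons college', 'sports psychology'] with P = 3.2e-11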


    We have ngram (n = 1..5) counts from a web corpus
     ◦ 464M documents; L = 33B tokens
     ◦ Approximate counts for longer ngrams are often computable,
       e.g. #(harry potter and the goblet of fire) is in [5783, 6399]:
         #(ABC) = #(AB) + #(BC) − #(AB OR BC)       (AB and BC overlap in B)
                ≥ #(AB) + #(BC) − #(B)
       Solved by dynamic programming (sketch below)
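A minimal sketch of that dynamic program, assuming an in-memory dict of stored 1..5-gram counts (the toy numbers below are invented); in a real web ngram store every needed sub-ngram of length <= 5 would be present.

# Bound a long ngram's count from stored short-ngram counts using the overlap
# identity #(ABC) >= #(AB) + #(BC) - #(B), applied recursively with memoization.
from functools import lru_cache

MAX_N = 5   # longest ngram with an exact stored count

def make_bounds(counts):
    @lru_cache(maxsize=None)
    def upper(t):
        """A long ngram occurs no more often than any ngram it contains."""
        if len(t) <= MAX_N:
            return counts.get(t, 0)
        return min(upper(t[:-1]), upper(t[1:]))

    @lru_cache(maxsize=None)
    def lower(t):
        """#(t) >= #(t minus last word) + #(t minus first word) - #(middle)."""
        if len(t) <= MAX_N:
            return counts.get(t, 0)
        return max(0, lower(t[:-1]) + lower(t[1:]) - upper(t[1:-1]))

    return lower, upper

counts = {   # toy 4- and 5-gram counts (token tuple -> raw frequency)
    ("harry", "potter", "and", "the", "goblet"): 6502,
    ("potter", "and", "the", "goblet", "of"): 6451,
    ("and", "the", "goblet", "of", "fire"): 6399,
    ("potter", "and", "the", "goblet"): 6700,
    ("and", "the", "goblet", "of"): 6600,
}
lower, upper = make_bounds(counts)
q = ("harry", "potter", "and", "the", "goblet", "of", "fire")
print(lower(q), upper(q))   # a [lower, upper] interval: 6052 6399 on this toy data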




    Maximum Likelihood Estimate:
      PMLE(t) = #(t) / N
    Problem:
     ◦ #(potter and the goblet of) = 6765
     ◦ So PMLE(potter and the goblet of) > PMLE(harry potter and the goblet of fire)? Wrong!
     ◦ What we want is not the probability of seeing t in text, but the
       probability of seeing t as a self-contained concept in text




    Query-relevant web corpus: for each query substring, its longest-matching
     count and its raw frequency

      ngram                                  longest matching count   raw frequency
      harry                                  1657108                  2003112
      harry potter                           277736                   346004
      harry potter and                       10436                    68268
      harry potter and the                   51330                    57832
      harry potter and the goblet            101                      6502
      harry potter and the goblet of         618                      6401
      harry potter and the goblet of fire    5783                     5783
      …                                      …                        …
      fire                                   4200957                  4478774

    Choose parameters to maximize the posterior probability given the
     query-relevant corpus (equivalently, minimize the total description
     length) (sketch below)

      t: a query substring
      C(t): longest matching count of t
      D = {(t, C(t))}: query-relevant corpus
      s(t): a segmentation of t
      θ: unigram model parameters (ngram probabilities)

      θ = argmax P(D|θ) P(θ)                     (posterior probability)
        = argmax [ log P(D|θ) + log P(θ) ]       (DL of corpus + DL of parameters)

      log P(D|θ) = Σt C(t) log P(t|θ)
      P(t|θ) = Σs(t) P(s(t)|θ)
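One way to optimize the likelihood term log P(D|θ) = Σt C(t) log P(t|θ) is EM over the latent segmentations s(t). A minimal sketch, assuming segmentations are enumerated exhaustively and ignoring the prior / description-length term P(θ); the toy (substring, count) pairs below are loosely based on the table above.

# EM for the unigram concept model: E-step computes expected concept counts
# under the current theta; M-step re-normalizes them into probabilities.
from collections import defaultdict
from math import prod

def segmentations(tokens):
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        for rest in segmentations(tokens[i:]):
            yield [" ".join(tokens[:i])] + rest

def em(corpus, iterations=20, floor=1e-9):
    """corpus: list of (query substring t, longest-matching count C(t))."""
    vocab = {seg for t, _ in corpus
                 for s in segmentations(t.split()) for seg in s}
    theta = {c: 1.0 / len(vocab) for c in vocab}        # uniform initialization
    for _ in range(iterations):
        expected = defaultdict(float)
        for t, c_t in corpus:
            segs = list(segmentations(t.split()))
            weights = [prod(theta.get(seg, floor) for seg in s) for s in segs]
            z = sum(weights) or floor
            for s, w in zip(segs, weights):     # E-step: posterior over s(t)
                for seg in s:
                    expected[seg] += c_t * w / z
        total = sum(expected.values())          # M-step: re-normalize
        theta = {c: n / total for c, n in expected.items()}
    return theta

corpus = [("harry potter", 277736), ("goblet of fire", 5783), ("harry", 1657108)]
theta = em(corpus)
print(sorted(theta.items(), key=lambda kv: -kv[1])[:5])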
    Three human-segmented datasets
     ◦ Training, validation, and test sets, 500 queries each
     ◦ Each query segmented by three editors: A, B, C




    Evaluation metrics:
     ◦ Boundary classification accuracy

           w1 w2 | w3 w4 | w5        (gap labels: N, Y, N, Y)

     ◦ Whole-query accuracy: the percentage of queries with all boundaries
       classified correctly

     ◦ Segment accuracy: precision/recall of predicted segments against the
       gold segments (sketch below)
        Truth: [abc] [de] [fg]
        Prediction: [abc] [de fg]  →  precision 1/2, recall 1/3
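A minimal sketch of the three metrics, assuming each segmentation is given as a list of segments (each segment a list of tokens); the function names are illustrative.

def boundaries(segmentation):
    """Gap labels between adjacent tokens: True where a segment break falls."""
    breaks, pos = set(), 0
    for seg in segmentation[:-1]:
        pos += len(seg)
        breaks.add(pos)
    total = sum(len(seg) for seg in segmentation)
    return [i in breaks for i in range(1, total)]

def boundary_accuracy(gold, pred):
    g, p = boundaries(gold), boundaries(pred)
    return sum(a == b for a, b in zip(g, p)) / len(g)

def whole_query_accuracy(pairs):
    """pairs: list of (gold, pred) segmentations; exact boundary match required."""
    return sum(boundaries(g) == boundaries(p) for g, p in pairs) / len(pairs)

def segment_precision_recall(gold, pred):
    gold_set = {tuple(s) for s in gold}
    pred_set = {tuple(s) for s in pred}
    hit = len(gold_set & pred_set)
    return hit / len(pred_set), hit / len(gold_set)

gold = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
pred = [["a", "b", "c"], ["d", "e", "f", "g"]]
print(boundary_accuracy(gold, pred))           # 5 of 6 gaps correct -> 0.833...
print(segment_precision_recall(gold, pred))    # (0.5, 0.333...)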




   Summary of Segmentation approaches
   Use for Improving Search Relevance
    ◦ Query rewriting
    ◦ Ranking features
   Conclusions




   Phrase Proximity Boosting
   Phrase Level Query Expansion




   Classifying a segment into one of three
    categories
    ◦ Strong concept: no word reordering, no word
      insertion/deletion
      Treat the whole segment as a single unit in matching
       and ranking
    ◦ Weak concept: allow word reordering or
      deletion/insertion
      Boost documents matching the weak concepts
    ◦ Not a concept
      Do nothing



    Concept-based BM25 (sketch below)
     ◦ Weighted by the confidence of the concepts
    Concept-based min coverage
     ◦ Weighted by the confidence of the concepts
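The deck does not spell out the exact weighting, so this sketch simply multiplies each concept's BM25 contribution by its confidence; the function name, parameter values, and toy data are all illustrative assumptions.

# Concept-weighted BM25: sum a BM25 contribution per concept (segment),
# scaled by the segmenter's confidence in that concept.
import math

def bm25_concept_score(doc_tokens, concepts, df, n_docs, avg_dl, k1=1.2, b=0.75):
    """concepts: list of (phrase, confidence); df: document frequency per phrase."""
    text = " ".join(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for phrase, confidence in concepts:
        tf = text.count(phrase)              # crude exact-phrase match count
        if tf == 0:
            continue
        idf = math.log((n_docs - df.get(phrase, 0) + 0.5)
                       / (df.get(phrase, 0) + 0.5) + 1.0)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))
        score += confidence * idf * norm
    return score

doc = "simmons college offers a sports psychology course".split()
concepts = [("simmons college", 0.9), ("sports psychology", 0.8)]
df = {"simmons college": 120, "sports psychology": 300}
print(bm25_concept_score(doc, concepts, df, n_docs=10_000, avg_dl=500))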




   Phrase level replacement
    ◦ [San Francisco] -> [sf]
     ◦ [red eye flight] -> [late night flight]




    Significant relevance boosting
     ◦ Affects 40% of query traffic
     ◦ Significant DCG gain (1.5% on affected queries)
     ◦ Significant online CTR gain (0.5% overall)




   Summary of Segmentation approaches
   Use for Improving Search Relevance
    ◦ Query rewriting
    ◦ Ranking features
   Conclusions




    Data preparation is important for query segmentation
   Phrases are important for improving
    relevance




    Bergsma et al., EMNLP-CoNLL 2007
    Risvik et al., WWW 2003
    Hagen et al., SIGIR 2010
    Tan & Peng, WWW 2008




       Solution 1: Segment the web corpus offline, then collect counts for
        ngrams that appear as segments

          | Harry Potter and the Goblet of Fire | is | the | fourth | novel |
          in | the | Harry Potter series | written by | J.K. Rowling | …

            harry potter and the goblet of fire += 1
            potter and the goblet of += 0

          (C. G. de Marcken, Unsupervised Language Acquisition, 1996;
           Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA 2001)

 • Technical difficulty: requires segmenting and re-counting the entire
   web corpus offline
       Solution 2: Compute counts online, considering only the parts of the
        web corpus that overlap with the query (longest matches)

        Q = harry potter and the goblet of fire

        Corpus text: “Harry Potter and the Goblet of Fire is the fourth
        novel in the Harry Potter series written by J.K. Rowling …”

          harry potter and the goblet of fire += 1
          the += 2
          harry potter += 1


       Solution 2 (continued): the same corpus text contributes different
        longest-match counts for a different query

        Q = potter and the goblet

        Corpus text: “Harry Potter and the Goblet of Fire is the fourth
        novel in the Harry Potter series written by J.K. Rowling …”

          potter and the goblet += 1
          the += 2
          potter += 1

        Longest-matching counts can be computed directly from raw ngram
        frequencies in O(|Q|²) time (sketch below)
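A minimal sketch of that O(|Q|²) computation, assuming a raw-frequency lookup is available (here a toy dictionary reusing numbers from the earlier table): an occurrence of a substring counts toward its longest-matching count only if it cannot be extended left or right to a longer substring of the query, which inclusion-exclusion captures.

# Longest-matching counts for every query substring from raw ngram frequencies.
def longest_matching_counts(query_tokens, raw):
    n = len(query_tokens)
    counts = {}
    for i in range(n):
        for j in range(i + 1, n + 1):          # substring q[i:j]
            t = tuple(query_tokens[i:j])
            c = raw(t)
            if i > 0:                          # subtract left-extendable occurrences
                c -= raw(tuple(query_tokens[i - 1:j]))
            if j < n:                          # subtract right-extendable occurrences
                c -= raw(tuple(query_tokens[i:j + 1]))
            if i > 0 and j < n:                # add back doubly-extendable ones
                c += raw(tuple(query_tokens[i - 1:j + 1]))
            counts[" ".join(t)] = max(c, 0)
    return counts

RAW = {   # toy raw frequencies (missing ngrams default to 0)
    ("harry",): 2003112, ("harry", "potter"): 346004,
    ("harry", "potter", "and"): 68268,
    ("harry", "potter", "and", "the"): 57832,
    ("harry", "potter", "and", "the", "goblet"): 6502,
    ("harry", "potter", "and", "the", "goblet", "of"): 6401,
    ("harry", "potter", "and", "the", "goblet", "of", "fire"): 5783,
}
raw = lambda t: RAW.get(t, 0)
c = longest_matching_counts("harry potter and the goblet of fire".split(), raw)
print(c["harry potter"])                  # 277736, matching the table above
print(c["harry potter and the goblet"])   # 101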

