Language Model

CSC4170 Web Intelligence and Social Computing
Tutorial 8

Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk

Outline

   Language models
       Finite automata and language models
       Types of language models
       Multinomial distributions over words
   Query likelihood model
   Application
   Q&A
   Reference




Language Models (LMs)

   How can we come up with good queries?
       Think of words that would likely appear in a relevant document.
   Idea of LM:
       A document is a good match to a query if the document model is
        likely to generate the query.




Language Models (LMs)

   Generative Model:
       Recognize or generate strings.




       The full set of strings that can be generated is called the language of the
        automaton.
   Language Model:
       A function that puts a probability measure over strings drawn from some
        vocabulary.



Language Models (LMs)

   Example 1:
       Calculate the probability of a word sequence.
            Multiply the probabilities that the model gives to each word in the
             sequence, together with the probability of continuing or stopping
             after producing each word.
             P(frog said that toad likes frog)
               = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01)
                 × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
               ≈ 0.000000000001573
             (a continue probability of 0.8 after each of the first five words,
              and a STOP probability of 0.2 after the last word)




       Most of the time, we will omit the STOP and (1 − STOP) probabilities.
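The calculation above can be sketched in Python (a minimal sketch; the word probabilities are the ones from the example):

```python
# Probability of a word sequence under a unigram model with an explicit
# STOP decision: after each word the model continues with probability 0.8
# and stops with probability 0.2.

WORD_PROB = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}
P_CONTINUE, P_STOP = 0.8, 0.2

def sequence_probability(words):
    p = 1.0
    for i, w in enumerate(words):
        p *= WORD_PROB[w]
        # continue after every word except the last, then stop
        p *= P_CONTINUE if i < len(words) - 1 else P_STOP
    return p

p = sequence_probability("frog said that toad likes frog".split())
print(p)  # ≈ 1.573e-12
```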

Language Models (LMs)

   Example 2:
       Given a string s and two language models M1 and M2, compute the
        likelihood of s under each model.
       Here s is more likely to have been generated by M1:

      P(s|M1) > P(s|M2)




Language Models (LMs)

   Basic LM using chain rule:

      P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)

   Unigram language model:
       Throws away all conditioning context:

      P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

       Most used in Information Retrieval.
   Bigram language model:
       Conditions on the previous term:

      P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
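The unigram and bigram models can be illustrated as follows (a sketch with a toy corpus and simple relative-frequency estimates; the corpus and function names are mine, not from the slides):

```python
from collections import Counter

# Toy corpus used to estimate the models by relative frequency.
corpus = "the frog likes the toad and the toad likes the frog".split()
unigrams = Counter(corpus)                  # term counts
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent-pair counts
N = len(corpus)

def p_unigram(seq):
    # Unigram model: P(t1..tn) = product of P(ti), no conditioning context.
    p = 1.0
    for t in seq:
        p *= unigrams[t] / N
    return p

def p_bigram(seq):
    # Bigram model: P(t1..tn) = P(t1) * product of P(ti | t_{i-1}).
    p = unigrams[seq[0]] / N
    for prev, t in zip(seq, seq[1:]):
        p *= bigrams[(prev, t)] / unigrams[prev]
    return p
```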




Language Models (LMs)

   Unigram LM:
     Bag-of-words model.
     Multinomial distribution over words:

      P(d) = (L_d! / (tf_{t1,d}! tf_{t2,d}! ... tf_{tM,d}!))
             × P(t1)^{tf_{t1,d}} P(t2)^{tf_{t2,d}} ... P(tM)^{tf_{tM,d}}

     The first factor is the multinomial coefficient; it can be left out in
     practical calculations.
     L_d = Σ_{1≤i≤M} tf_{ti,d} is the length of document d; M is
     the size of the vocabulary.
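A sketch of this multinomial probability (the helper name is mine, not from the slides):

```python
from collections import Counter
from math import factorial

def multinomial_doc_prob(doc_tokens, lm):
    # P(d) = (L_d! / prod of tf!) * prod of P(t)^tf; the leading
    # multinomial coefficient is constant per document and is usually
    # dropped in practical ranking calculations.
    tf = Counter(doc_tokens)
    L_d = sum(tf.values())      # L_d = sum of term frequencies
    coeff = factorial(L_d)
    prob = 1.0
    for t, f in tf.items():
        coeff //= factorial(f)
        prob *= lm[t] ** f
    return coeff * prob
```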


Query Likelihood Model

   Query likelihood model:
     Rank documents by P(d|q), the likelihood that document d is
      relevant to the query q.
     Using Bayes' rule:

      P(d|q) = P(q|d) P(d) / P(q)

           P(q) is the same for all documents.
           P(d) is treated as uniform across all d.
     Hence we rank by P(d|q) ∝ P(q|d).




Query Likelihood Model

   Multinomial + Unigram:

      P(q|M_d) = K_q ∏_{t in V} P(t|M_d)^{tf_{t,q}}

     K_q = L_q! / (tf_{t1,q}! ... tf_{tM,q}!) is the multinomial
     coefficient for the query q. It is constant for a given query
     and can be ignored.

   Retrieve based on a language model:
       Infer a LM M_{d_i} for each document d_i.
       Estimate P(q|M_{d_i}).
       Rank the documents according to these probabilities.
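The three retrieval steps above can be sketched as (a minimal sketch using unsmoothed MLE unigram models; function names are mine):

```python
def mle_lm(tokens):
    # Step 1: infer a unigram LM for the document (relative frequencies).
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {t: c / len(tokens) for t, c in counts.items()}

def query_likelihood(query_tokens, lm):
    # Step 2: P(q|M_d) = product over query terms of P(t|M_d).
    p = 1.0
    for t in query_tokens:
        p *= lm.get(t, 0.0)
    return p

def rank(query, docs):
    # Step 3: order documents by descending query likelihood.
    scores = {name: query_likelihood(query.split(), mle_lm(text.split()))
              for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```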



Query Likelihood Model

   Estimating the query generation probability:
       Maximum Likelihood Estimation (MLE) + unigram LM:

      P(q|M_d) = ∏_{t in q} P_mle(t|M_d) = ∏_{t in q} tf_{t,d} / L_d

   Limitations:
       If we estimate P(t|M_d) = 0 for some query term t, the whole
        product is zero: a document gives a query nonzero probability
        only if all of the query terms appear in the document.
       Words that do occur are also poorly estimated: the probability of
        a word occurring once in the document is overestimated, because
        its one occurrence was partly by chance.
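The zero-probability limitation can be demonstrated directly (a small sketch; the query terms are chosen for illustration):

```python
# The document's unigram MLE model assigns zero probability to any term
# that does not occur in the document, so one unseen query term zeroes
# out the whole query likelihood.
doc = "frog said that toad likes frog".split()
lm = {t: doc.count(t) / len(doc) for t in set(doc)}

def p_query(query_terms):
    p = 1.0
    for t in query_terms:
        p *= lm.get(t, 0.0)   # unseen term -> factor of 0
    return p

p_seen = p_query(["frog", "toad"])    # every term occurs in the document
p_unseen = p_query(["frog", "dog"])   # "dog" is unseen, so the score is 0
```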



Query Likelihood Model

   Estimating the query generation probability:
       Maximum Likelihood Estimation (MLE) + unigram LM




   Smoothing:
       Use the whole collection to smooth.
             Linear interpolation (Jelinek-Mercer smoothing):

              P(t|d) = λ P_mle(t|M_d) + (1 − λ) P(t|M_C)

             Bayesian smoothing (Dirichlet prior):

              P(t|d) = (tf_{t,d} + μ P(t|M_C)) / (L_d + μ)
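Both smoothing schemes can be sketched as small functions (the default μ value below is my assumption, not from the slides):

```python
def jelinek_mercer(tf, doc_len, p_coll, lam=0.5):
    # Linear interpolation: lam * P_mle(t|M_d) + (1 - lam) * P(t|M_C).
    return lam * (tf / doc_len) + (1 - lam) * p_coll

def dirichlet(tf, doc_len, p_coll, mu=2000):
    # Bayesian smoothing with a Dirichlet prior:
    # (tf_{t,d} + mu * P(t|M_C)) / (L_d + mu).
    return (tf + mu * p_coll) / (doc_len + mu)
```

With either scheme an unseen term (tf = 0) keeps a small nonzero probability borrowed from the collection model, which removes the zero-probability problem of plain MLE.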



Query Likelihood Model

   Query likelihood model with linear interpolation:

      P(d|q) ∝ P(d) ∏_{t in q} ( λ P(t|M_d) + (1 − λ) P(t|M_C) )

   Query likelihood model with Bayesian smoothing:

      P(d|q) ∝ P(d) ∏_{t in q} ( tf_{t,d} + μ P(t|M_C) ) / ( L_d + μ )




Query Likelihood Model

   Example using unigram + MLE + linear interpolation:
       d1: Xyzzy reports a profit but revenue is down
       d2: Quorus narrows quarter loss but revenue decreases further
       λ=1/2
       query: revenue down




        P(q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
        P(q|d2) = [(1/8 + 2/16)/2] × [(0/8 + 1/16)/2] = 1/8 × 1/32 = 1/256
        ranking: d1 > d2
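The example can be reproduced in a few lines (a sketch; variable names are mine):

```python
# Unigram MLE models smoothed by linear interpolation with the collection
# model (d1 + d2), lambda = 1/2, query "revenue down".
d1 = "xyzzy reports a profit but revenue is down".split()
d2 = "quorus narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
LAM = 0.5

def p_smoothed(t, doc):
    return (LAM * doc.count(t) / len(doc)
            + (1 - LAM) * collection.count(t) / len(collection))

def score(query, doc):
    s = 1.0
    for t in query.split():
        s *= p_smoothed(t, doc)
    return s

s1, s2 = score("revenue down", d1), score("revenue down", d2)
# s1 = 3/256 and s2 = 1/256, so d1 ranks above d2
```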



Application

   Community-based Question Answering (CQA) System:
       Question Search.
             Given a queried question, find semantically equivalent
              questions.
   General Search Engine
       Given a query, rank documents.




 Questions?




Reference

   Multinomial distribution:
       http://en.wikipedia.org/wiki/Multinomial_distribution
   Likelihood function:
       http://en.wikipedia.org/wiki/Likelihood
   Maximum likelihood:
       http://en.wikipedia.org/wiki/Maximum_likelihood





				