# Language Model

CSC4170 Web Intelligence and Social Computing
Tutorial 8

Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk

## Outline

- Language models
- Finite automata and language models
- Types of language models
- Multinomial distributions over words
- Query likelihood model
- Application
- Q&A
- Reference
## Language Models (LMs)

- How can we come up with good queries? Think of words that would likely appear in a relevant document.
- Idea of LMs: a document is a good match to a query if the document's model is likely to generate the query.
- Generative model: recognizes or generates strings. The full set of strings that can be generated is called the language of the automaton.
- Language model: a function that puts a probability measure over strings drawn from some vocabulary.
- Example 1: calculate the probability of a word sequence.
  - Multiply the probabilities the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
  - P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2) ≈ 0.000000000001573
  - Here 0.8 is the probability of continuing after each of the first five words, and 0.2 the probability of stopping after the last.
- Most of the time, we will omit the STOP and (1 − STOP) probabilities.
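The calculation above can be sketched in code: a unigram model with a STOP probability scoring a word sequence. The per-word probabilities and STOP value are the example's; the code itself is only an illustration.

```python
# Per-word probabilities from the example model (illustrative values).
word_prob = {"frog": 0.01, "said": 0.03, "that": 0.04,
             "toad": 0.01, "likes": 0.02}
STOP = 0.2  # probability of stopping after emitting a word

def sequence_probability(words, probs, stop=STOP):
    """P(w1..wn) = product of word probabilities, times (1 - stop) after
    every word except the last, times stop after the last word."""
    p = 1.0
    for i, w in enumerate(words):
        p *= probs[w]
        p *= stop if i == len(words) - 1 else (1 - stop)
    return p

seq = ["frog", "said", "that", "toad", "likes", "frog"]
print(sequence_probability(seq, word_prob))  # ≈ 1.573e-12
```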
- Example 2: given a string s and two language models M1 and M2, compute the likelihood of s under each model. If P(s|M1) > P(s|M2), then M1 is the more likely model to have generated s; in retrieval terms, the document behind M1 is the better match for s.
- Basic LM using the chain rule:
  - P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3)
- Unigram language model: throws away all conditioning context.
  - P_uni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
  - The model most used in Information Retrieval.
- Bigram language model: conditions on the previous term.
  - P_bi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
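To make the unigram and bigram variants concrete, here is a small maximum-likelihood sketch over a toy corpus. The corpus and all counts are invented for illustration, not taken from the slides.

```python
from collections import Counter

corpus = "the frog likes the toad the toad likes the frog".split()

# Unigram MLE: P(t) = count(t) / N
uni = Counter(corpus)
N = len(corpus)
def p_uni(t):
    return uni[t] / N

# Bigram MLE: P(t2 | t1) = count(t1 t2) / count(t1)
bi = Counter(zip(corpus, corpus[1:]))
def p_bi(t2, t1):
    return bi[(t1, t2)] / uni[t1]

def seq_uni(ws):
    # chain rule with all context thrown away
    p = 1.0
    for w in ws:
        p *= p_uni(w)
    return p

def seq_bi(ws):
    # chain rule conditioning on the previous term only
    p = p_uni(ws[0])              # first term has no history
    for t1, t2 in zip(ws, ws[1:]):
        p *= p_bi(t2, t1)
    return p

print(seq_uni(["the", "frog"]))   # 0.4 * 0.2 = 0.08
print(seq_bi(["the", "frog"]))    # 0.4 * (2/4) = 0.2
```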
- Unigram LM:
  - Bag-of-words model.
  - Multinomial distribution over words:
    P(d) = ( L_d! / (tf_{t1,d}! tf_{t2,d}! … tf_{tM,d}!) ) × P(t1)^{tf_{t1,d}} P(t2)^{tf_{t2,d}} … P(tM)^{tf_{tM,d}}
  - The leading factor is the multinomial coefficient; it can be left out in practical calculations.
  - L_d = Σ_{1 ≤ i ≤ M} tf_{t_i,d} is the length of document d, and M is the size of the vocabulary.
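The multinomial coefficient can be computed directly from a document's term frequencies; a small sketch (the helper name is my own):

```python
from math import factorial
from collections import Counter

def multinomial_coefficient(tfs):
    # L_d! / (tf_1! * tf_2! * ... * tf_M!)
    L_d = sum(tfs)
    denom = 1
    for tf in tfs:
        denom *= factorial(tf)
    return factorial(L_d) // denom

doc = "frog likes frog".split()
tf = Counter(doc)                            # {'frog': 2, 'likes': 1}
print(multinomial_coefficient(tf.values()))  # 3! / (2! * 1!) = 3
```

Because the coefficient depends only on the document's term frequencies and not on the word probabilities, it multiplies every candidate model's score for d by the same constant, which is why it can be dropped when comparing models.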
## Query Likelihood Model

- Query likelihood model:
  - Rank documents by P(d|q), the likelihood that document d is relevant to the query q.
  - Using Bayes' rule: P(d|q) = P(q|d) P(d) / P(q)
  - P(q) is the same for all documents, and P(d) is treated as uniform across all d, so ranking by P(d|q) is equivalent to ranking by P(q|d): P(d|q) ∝ P(q|d)
- Multinomial + unigram:
  - P(q|M_d) = K_q × ∏_{t ∈ q} P(t|M_d)^{tf_{t,q}}
  - K_q is the multinomial coefficient for the query q; it is the same for every document and can be ignored when ranking.
- Retrieval based on a language model:
  - Infer a LM M_{d_i} for each document d_i.
  - Estimate P(q|M_{d_i}).
  - Rank the documents according to these probabilities.
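The three retrieval steps can be sketched end to end with plain MLE document models. No smoothing yet, so any unseen query term zeroes a document's score; the toy documents and function names are mine.

```python
from collections import Counter

def doc_model(tokens):
    # Step 1: infer an MLE unigram model for one document.
    tf, L_d = Counter(tokens), len(tokens)
    return {t: c / L_d for t, c in tf.items()}

def query_likelihood(query, model):
    # Step 2: estimate P(q|Md) as a product of unigram probabilities.
    p = 1.0
    for t in query:
        p *= model.get(t, 0.0)   # unseen term -> 0 (motivates smoothing)
    return p

docs = {"d1": "frog said that toad likes frog".split(),
        "d2": "toad said that toad likes toad".split()}
models = {d: doc_model(toks) for d, toks in docs.items()}
query = "frog likes".split()

# Step 3: rank documents by the estimated probabilities.
ranking = sorted(models, key=lambda d: query_likelihood(query, models[d]),
                 reverse=True)
print(ranking)  # ['d1', 'd2'] -- d2 scores 0 because it lacks "frog"
```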
- Estimating the query generation probability:
  - Maximum Likelihood Estimation (MLE) + unigram LM: P_mle(t|M_d) = tf_{t,d} / L_d
- Limitations:
  - Terms absent from d get P_mle(t|M_d) = 0, so a document gives the query nonzero probability only if every query term appears in the document.
  - Occurring words are also poorly estimated: the probability of a word that occurs once in the document is overestimated, because its single occurrence was partly by chance.
- Estimating the query generation probability with smoothing:
  - Use the whole collection C to smooth the MLE estimates.
  - Linear interpolation (Jelinek-Mercer smoothing): P(t|d) = λ P_mle(t|M_d) + (1 − λ) P_mle(t|M_C)
  - Bayesian smoothing (Dirichlet prior): P(t|d) = (tf_{t,d} + α P(t|M_C)) / (L_d + α)
- Query likelihood model with linear interpolation:
  - P(d|q) ∝ P(d) ∏_{t ∈ q} ( λ P_mle(t|M_d) + (1 − λ) P_mle(t|M_C) )
- Query likelihood model with Bayesian smoothing:
  - P(d|q) ∝ P(d) ∏_{t ∈ q} ( (tf_{t,d} + α P(t|M_C)) / (L_d + α) )
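A sketch of the Bayesian-smoothed score. The α default below is an arbitrary small value chosen for the toy data; treat it as an assumption (real systems tune α, often to values in the hundreds or thousands).

```python
from collections import Counter

def dirichlet_score(query, doc, collection, alpha=2.0):
    # score = prod over t in q of (tf(t,d) + alpha * P(t|Mc)) / (L_d + alpha)
    tf_d, cf = Counter(doc), Counter(collection)
    L_d, T = len(doc), len(collection)
    score = 1.0
    for t in query:
        score *= (tf_d[t] + alpha * cf[t] / T) / (L_d + alpha)
    return score

doc = "a b".split()
collection = "a b c d".split()
print(dirichlet_score(["a"], doc, collection))  # (1 + 2*0.25) / (2 + 2) = 0.375
```

Note that an unseen query term no longer zeroes the score: its numerator falls back to α × P(t|M_C), which is nonzero as long as the term occurs somewhere in the collection.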
- Example using unigram + MLE + linear interpolation:
  - d1: Xyzzy reports a profit but revenue is down
  - d2: Quorus narrows quarter loss but revenue decreases further
  - Collection model M_C: the concatenation of d1 and d2 (16 tokens); λ = 1/2
  - Query: revenue down
  - P(q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = (1/8) × (3/32) = 3/256
  - P(q|d2) = [(1/8 + 2/16)/2] × [(0/8 + 1/16)/2] = (1/8) × (1/32) = 1/256
  - Ranking: d1 > d2
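The example above can be checked in code: a small Jelinek-Mercer scorer applied to d1 and d2, with the collection taken as the concatenation of the two documents.

```python
from collections import Counter

def jm_score(query, doc, collection, lam=0.5):
    # P(q|d) = prod over t in q of lam*tf(t,d)/L_d + (1-lam)*cf(t)/T
    tf_d, cf = Counter(doc), Counter(collection)
    L_d, T = len(doc), len(collection)
    score = 1.0
    for t in query:
        score *= lam * tf_d[t] / L_d + (1 - lam) * cf[t] / T
    return score

d1 = "xyzzy reports a profit but revenue is down".split()
d2 = "quorus narrows quarter loss but revenue decreases further".split()
collection = d1 + d2
query = "revenue down".split()

print(jm_score(query, d1, collection))  # 3/256 ≈ 0.01172
print(jm_score(query, d2, collection))  # 1/256 ≈ 0.00391
```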
## Application

- Community-based Question Answering (CQA) systems:
  - Question search: given a queried question, find a semantically equivalent question.
- General search engines:
  - Given a query, rank documents.
## Q&A

Questions?
## Reference

- Multinomial distribution: http://en.wikipedia.org/wiki/Multinomial_distribution
- Likelihood function: http://en.wikipedia.org/wiki/Likelihood
- Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood
