From Grammar to N-grams
Estimating N-grams From a
Context-Free Grammar and Sparse Data
Thomas K Harris
May 16, 2002
Speech recognizers typically use n-gram language models.
Dialog systems are typically defined by CFGs.
Data collection is difficult.
Goal: To have a language model that
benefits from the grammar and the
priors of the parses.
Ignore data, use a language model
derived from the grammar alone.
Ignore grammar, use a language model
derived from the data alone.
Interpolate between these two models.
This work: train the grammar with some data.
Work in progress - available at
Written in C++
A library (API) consisting of a PCFG class and an n-gram class.
A program that uses the library to create n-grams from Phoenix grammars and data.
A make script to automate building and testing.
Read Phoenix grammar file.
Convert to Chomsky Normal Form.
Read data and train grammar.
Smooth the grammar.
Compute n-grams from the smoothed grammar.
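A minimal sketch of a driver that chains these steps together. The class and method names below are assumptions made for illustration; the slides confirm only that the library exposes a PCFG class and an n-gram class. The file name, corpus sentence, and smoothing mass are likewise invented.

    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the library's two classes; the real
    // interfaces are not given in the slides.
    class PCFG {
    public:
        void readPhoenix(const std::string& file) { /* parse a Phoenix grammar */ }
        void toCNF()                              { /* convert to Chomsky Normal Form */ }
        void train(const std::vector<std::string>& sents) { /* inside-outside EM */ }
        void smooth(double unseenMass)            { /* mass for unseen rules */ }
    };

    class NGram {
    public:
        NGram(const PCFG& grammar, int order) { /* E(w1..wn|S) ratios */ }
    };

    int main() {
        PCFG grammar;
        grammar.readPhoenix("movieline.gra");              // 1. read Phoenix grammar file
        grammar.toCNF();                                   // 2. convert to CNF
        std::vector<std::string> corpus = {"when is the matrix playing"};
        grammar.train(corpus);                             // 3. train on data
        grammar.smooth(0.05);                              // 4. smooth the grammar
        NGram bigrams(grammar, 2);                         // 5. compute n-grams
        std::cout << "bigram model built\n";
    }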
Reading Phoenix Formats
Doesn’t handle the #include directive.
Doesn’t handle +* (Kleene closure).
The net–rewrite distinction is ignored.
+ and * markers are rewritten as rules.
Conversion to CNF permanently alters the grammar.
Chomsky Normal Form
Remove unit productions.
Change all rules A->βaγ of length >1 to
A->βNγ and N->a.
Recursively shorten all rules A->βBC of
length >2 to A->βN and N->BC.
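A compact sketch of the last two steps (terminal replacement and binarization), under simplifying assumptions invented for this example: terminals are lowercase symbols, and unit-production removal is omitted.

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Rule { std::string lhs; std::vector<std::string> rhs; };

    static int counter = 0;
    std::string fresh() { return "N" + std::to_string(++counter); }

    // Simplifying assumption: terminals are lowercase symbols.
    bool isTerminal(const std::string& s) {
        return std::islower(static_cast<unsigned char>(s[0])) != 0;
    }

    void toCNF(Rule r, std::vector<Rule>& out) {
        // Rules of length > 1: replace each terminal a with a fresh N, add N -> a.
        if (r.rhs.size() > 1) {
            for (auto& sym : r.rhs) {
                if (isTerminal(sym)) {
                    std::string n = fresh();
                    out.push_back({n, {sym}});
                    sym = n;
                }
            }
        }
        // Rules of length > 2: repeatedly fold the last two symbols B C
        // into a fresh N with N -> B C, until the rule is binary.
        while (r.rhs.size() > 2) {
            std::string n = fresh();
            out.push_back({n, {r.rhs[r.rhs.size() - 2], r.rhs.back()}});
            r.rhs.pop_back();
            r.rhs.back() = n;
        }
        out.push_back(r);
    }

    int main() {
        std::vector<Rule> cnf;
        toCNF({"A", {"B", "c", "D", "E"}}, cnf);
        for (const auto& rule : cnf) {              // prints the CNF rules:
            std::cout << rule.lhs << " ->";         // N1 -> c, N2 -> D E,
            for (const auto& s : rule.rhs)          // N3 -> N1 N2, A -> B N3
                std::cout << ' ' << s;
            std::cout << '\n';
        }
    }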
Initialize rule probabilities.
For each sentence, use a CYK chart parser to compute inside and outside probabilities.
Use those probabilities to determine the expected number of times each rule is used in the sentence.
Use the expectations to get a new set of rule probabilities.
Repeat until the corpus likelihood appears to converge.
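For reference, the expectation step rests on the standard inside-outside identity (textbook EM-for-PCFG material; the slides do not spell it out). With inside probabilities \alpha, outside probabilities \beta, and a sentence w = w_1 \dots w_m:

\mathbb{E}[c(A \to B\,C) \mid w] = \frac{P(A \to B\,C)}{P(w)} \sum_{1 \le i \le k < j \le m} \beta_A(i, j)\, \alpha_B(i, k)\, \alpha_C(k+1, j)

The re-estimation step then sets each rule's probability to its total expected count over the corpus, divided by the total expected count of all rules sharing its left-hand side.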
A user-specified probability mass can
be redistributed over unseen rules.
At the bottom of the tree this
generalizes a class-based model.
This only smooths the trained grammar over other grammatical sentences.
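One concrete reading of this redistribution (an assumption; the slides do not give the exact formula): with user-specified mass \lambda and U(A) the set of unseen rules for nonterminal A,

P'(A \to \alpha) =
\begin{cases}
(1 - \lambda)\,\hat{P}(A \to \alpha) & \text{if } A \to \alpha \text{ was observed in training} \\
\lambda / |U(A)| & \text{otherwise}
\end{cases}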
Precise n-grams can be computed from the trained grammar:

P(w_n \mid w_1 \dots w_{n-1}) = \frac{E(w_1 \dots w_n \mid S)}{E(w_1 \dots w_{n-1} \mid S)}

where E(w_1 \dots w_n \mid S) is the expected number of occurrences of the word sequence w_1 \dots w_n in a sentence derived from the start symbol S.
Divide and Conquer
[Figure: three parse trees for a binary rule S → A B, showing the n-gram w_1 … w_n falling entirely within A, entirely within B, or spanning the A–B boundary.]

These three cases give a recursion on the expected counts:

E(w_1 \dots w_n \mid S) = P(S \to w_1 \dots w_n)
  + \sum_{S \to A B} P(S \to A B) \Big[ E(w_1 \dots w_n \mid A) + E(w_1 \dots w_n \mid B)
  + \sum_{1 \le k < n} P_R(A \Rightarrow \dots w_1 \dots w_k)\, P_L(B \Rightarrow w_{k+1} \dots w_n \dots) \Big]

where P_R is the probability that A derives a string ending in w_1 \dots w_k and P_L the probability that B derives a string beginning with w_{k+1} \dots w_n.
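A sketch of how this recursion might be coded, under assumptions stated here rather than in the slides: the grammar is acyclic so the naive recursion terminates (recursive grammars need a linear-system solve), memoization is omitted, and the edge probabilities P_R / P_L, which have their own analogous recursions, are left as stubs.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct BinaryRule { std::string a, b; double p; };  // X -> a b with prob p

    // Hypothetical grammar tables, to be filled from the trained PCFG.
    std::map<std::string, std::vector<BinaryRule>> binaryRules;
    std::map<std::string, std::map<std::string, double>> unaryProb;  // P(X -> terminal)

    // Stubs: P_R(A => ...w1..wk) and P_L(B => w(k+1)..wn...) need their
    // own recursions, omitted here; returning 0 drops the boundary term.
    double probSuffix(const std::string&, const std::vector<std::string>&) { return 0.0; }
    double probPrefix(const std::string&, const std::vector<std::string>&) { return 0.0; }

    // Expected number of occurrences of the n-gram w under nonterminal X.
    double expectedCount(const std::string& X, const std::vector<std::string>& w) {
        double total = 0.0;
        if (w.size() == 1) total += unaryProb[X][w[0]];       // X -> w directly
        for (const BinaryRule& r : binaryRules[X]) {
            double spanning = 0.0;                            // w straddles the A-B boundary
            for (std::size_t k = 1; k < w.size(); ++k) {
                std::vector<std::string> left(w.begin(), w.begin() + k);
                std::vector<std::string> right(w.begin() + k, w.end());
                spanning += probSuffix(r.a, left) * probPrefix(r.b, right);
            }
            total += r.p * (expectedCount(r.a, w) + expectedCount(r.b, w) + spanning);
        }
        return total;
    }

    int main() {
        // Tiny invented grammar: S -> A B; A -> "a"; B -> "b".
        binaryRules["S"] = {{"A", "B", 1.0}};
        unaryProb["A"]["a"] = 1.0;
        unaryProb["B"]["b"] = 1.0;
        std::cout << expectedCount("S", {"a"}) << '\n';  // prints 1: "a" occurs once
    }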
USI MovieLine oracle transcripts
Used only parsable sentences (85%)
Divided into 60% training, 40% test
Language Model           Perplexity (Absolute)   Perplexity (Good-Turing)
Pure-grammar bi-grams    30.9                    781
Pure-grammar tri-grams   33.5                    619
Pure-data bi-grams       5.25                    19.0
Pure-data tri-grams      5.22                    18.9
PCFG bi-grams            5.50                    5.52
[Figure: chart comparing the perplexities of the five models listed above.]
Lower perplexities than the pure-grammar method, and comparable perplexities to the pure-data method.
More flexible and cheaper than the pure-data method.
More smoothing work needs to be done:
smoothing over different classes
other smoothing methods?
Testing for word error rate.
Adapting to modified grammars.