Computational Linguistics
Lecture 3: Part of Speech Tagging
Based on Dan Jurafsky’s lecture notes for the textbook Speech and Language Processing. Additional slides by Jim Martin and Bonnie Dorr.
CS 563100NLP Spring 2008
Outline
Probability
Part of speech tagging
Parts of speech
What’s POS tagging good for anyhow?
Tag sets
Rule-based tagging
Statistical tagging
– Simple most-frequent-tag baseline
Important Ideas
– Training sets and test sets
– Unknown words
– Error analysis
HMM tagging
Big Ideas for today
Methodology
Evaluation
Gold standards
Training sets
Test sets
% Correct
Models:
Rule-based
Statistical
Part of Speech tagging
Part of speech tagging
Parts of speech
What’s POS tagging good for anyhow?
Tag sets
Rule-based tagging
Statistical tagging
– Simple most-frequent-tag baseline
Important Ideas
– Training sets and test sets
– Unknown words
HMM tagging
Parts of Speech
8 traditional parts of speech
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc
The idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.) under different names
parts-of-speech (POS)
lexical category/tags
Word/morphological classes
Lots of debate in linguistics about the number, nature, and universality of these categories (we will ignore it)
POS examples
Tag   Class        Examples
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those
POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS
the koala put the keys on the table
TAGS
N V P DET
POS Tagging example
WORD  the  koala  put  the  keys  on  the  table
tag   DET  N      V    DET  N     P   DET  N
What is POS tagging good for?
The first step of many NLP tasks
Speech synthesis, parsing, machine translation
Speech synthesis – pronunciation and stress
How to pronounce “lead”? NN or VB
Where to put stress? INsult/inSULT, OBject/obJECT, DIScount/disCOUNT, CONtent/conTENT
Parsing – Need to know if a word is an N or V before you can parse
Grammar is written using POS
NP → DT NN
Machine Translation
Different lexical translation for different POSes
China/NNP vs. china/NN
Open and closed class words
Closed class: a relatively fixed membership
Prepositions: of, in, by, …
Auxiliaries: may, can, will, had, been, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short common words which play a role in grammar)
Open class: new ones can be created all the time
English has 4: Nouns, Verbs, Adjectives, Adverbs
Many languages have all 4, but not all!
In Chinese, what English treats as adjectives act more like verbs
Open class words
Nouns
Proper nouns (Stanford University, Boulder, Neal Snider, Margaret Jacks Hall). English capitalizes these.
Common nouns (the rest). German capitalizes these.
Count nouns and mass nouns
– Count: have plurals, get counted: goat/goats, one goat, two goats
– Mass: don’t get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Verbs:
In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Compared to open classes, closed classes differ more from language to language
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX online dictionary
English particles Quirk et al. (1985)
Pronouns: CELEX online dictionary
Conjunctions
POS tagging: Choosing a tagset
There are many parts of speech and many potential distinctions we could draw
To do POS tagging, we need to choose a standard set of tags to work with
Could pick very coarse tagsets
N, V, Adj, Adv.
More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags
PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist
Penn TreeBank POS Tag set
Using the UPenn tagset
Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP..”)
Except the preposition/complementizer “to”, which is just marked “TO”.
Why?
Because it is difficult to tell whether it is a preposition or an infinitive marker
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
POS tagging is not a hard problem
3 methods for POS tagging
1. Rule-based tagging
(ENGTWOL)
2. Stochastic (=Probabilistic) tagging
HMM (Hidden Markov Model) tagging
3. Transformation-based tagging
Brill tagger
Rule-based tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags
Leaving the correct tag for each word.
Start with a dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc… for some 100,000 words of English
Use the dictionary to assign every possible tag
She: PRP
promised: VBN, VBD
to: TO
back: NN, RB, JJ, VB
the: DT
bill: VB, NN
Write rules to eliminate tags
Eliminate VBN if VBD is an option when VBN|VBD follows “PRP”
She: PRP
promised: VBN, VBD
to: TO
back: NN, RB, JJ, VB
the: DT
bill: VB, NN
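The dictionary lookup and the elimination rule can be sketched in a few lines of Python. This is only an illustration: the dictionary is the six-word one from the earlier slide, and the function names `candidates` and `eliminate_vbn_after_prp` are made up for this sketch, not taken from ENGTWOL or any other tagger.

```python
# Toy dictionary from the slides: word -> all possible tags.
DICTIONARY = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def candidates(words):
    """Step 1: assign every possible tag from the dictionary."""
    return [list(DICTIONARY[w.lower()]) for w in words]

def eliminate_vbn_after_prp(tag_sets):
    """Step 2: drop VBN when VBD is also an option and the previous word can only be PRP."""
    result = [list(tags) for tags in tag_sets]
    for i in range(1, len(result)):
        if result[i - 1] == ["PRP"] and "VBN" in result[i] and "VBD" in result[i]:
            result[i].remove("VBN")
    return result

words = "She promised to back the bill".split()
pruned = eliminate_vbn_after_prp(candidates(words))
```

One rule only resolves “promised”; “back” and “bill” stay ambiguous until further hand-written rules are added.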
Sample ENGTWOL Lexicon (Voutilainen 1995 ENGCG)
Stage 1 of ENGTWOL Tagging
First Stage: Run words through FST morphological analyzer to get all parts of speech.
Example: Pavlov had shown that salivation …
Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
Stage 2 of ENGTWOL Tagging
Second Stage: Apply NEGATIVE constraints.
Example: Adverbial “that” rule
Eliminates all readings of “that” except the one in
– “It isn’t that odd”
Given input: “that”
If
  (+1 A/ADV/QUANT)   ; if next word is adj/adv/quantifier
  (+2 SENT-LIM)      ; following which is E-O-S
  (NOT -1 SVOC/A)    ; and the previous word is not a
                     ; verb like “consider” which
                     ; allows adjective complements
                     ; in “I consider that odd”
Then eliminate non-ADV tags
Else eliminate ADV
Statistical Tagging
Based on probability theory
Model probability of
Lexical information
Contextual information
Conditional Probability and Tags
• P(Verb) is the probability of a randomly selected word being a verb.
• P(Verb|race) is “what’s the probability of a word being a verb given that it’s the word ‘race’?”
• Race can be a noun or a verb. Is it more likely to be a verb?
• P(Verb|race) can be estimated by counting related instances in an annotated corpus:
P(V | race) = Count(race is verb) / total Count(race)
• In Brown corpus, P( V | race) = 96/98 = .98
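As a rough illustration of the counting estimate, the sketch below computes P(tag | word) from a list of (word, tag) pairs. The mini-corpus is hypothetical and far too small to say anything about real English; the function name `tag_given_word` is likewise just for this example.

```python
from collections import Counter

def tag_given_word(tagged_tokens, word, tag):
    """Estimate P(tag | word) as Count(word, tag) / Count(word)."""
    pair_counts = Counter(tagged_tokens)
    word_count = sum(c for (w, _), c in pair_counts.items() if w == word)
    if word_count == 0:
        return 0.0  # word never seen: no estimate available
    return pair_counts[(word, tag)] / word_count

# Hypothetical mini-corpus: "race" appears 4 times, 3 of them as a verb.
corpus = [("race", "VB"), ("race", "VB"), ("race", "NN"), ("race", "VB"), ("the", "DT")]
p_verb = tag_given_word(corpus, "race", "VB")
```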
Most frequent tag
Some ambiguous words have a more frequent tag and a less frequent tag:
Consider the word “a” in these 2 sentences:
would/MD prohibit/VB a/DT suit/NN for/IN refund/NN
of/IN section/NN 381/CD (/( a/NN )/) ./.
Obviously DT is more frequent than NN
Counting in a corpus
We could count in a corpus
A corpus: an on-line collection of text, often linguistically annotated
The Brown Corpus: 1 million words from 1961
Part of speech tagged at U Penn
After counting in the Brown Corpus
The results:
DT 21830
NN 6
FW 3
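A most-frequent-tag baseline built from such counts might look like the following sketch. The counts for “a” are the ones reported on the slide; the function names and the choice of NN as the default for unseen words are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, default="NN"):
    """Tag each word with its most frequent tag; unseen words get `default` (an assumption)."""
    return [model.get(w, default) for w in words]

# Counts for "a" as reported on the slide: DT 21830, NN 6, FW 3.
training = [("a", "DT")] * 21830 + [("a", "NN")] * 6 + [("a", "FW")] * 3
model = train_most_frequent_tag(training)
tags = tag_baseline(["a", "koala"], model)
```

The baseline ignores context entirely, which is why it always tags “a” as DT, including in the rare “(/( a/NN )/)” case.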
Test set
We take a set of test sentences
Hand-label them for part of speech
The result is a “Gold Standard” test set
Who does this?
Get a set of sentences (e.g., Brown corpus)
More than one tagger (e.g., U Penn grad students in linguistics)
Did they agree with each other?
Most of the time (97%)
But on about 3% of tags: disagreements
If the taggers discuss the remaining 3%, they often reach agreement
Training and test sets
To test a tagging method, we need 2 things:
A hand-labeled training set: the data that we compute frequencies from, etc
A hand-labeled test set: The data that we use to compute our accuracy rate
Computing accuracy rate
Of all the words in the test set
For what percent of them did the tag chosen by the tagger equal the human-selected tag?
% correct = (# of words tagged correctly in test set) / (total # of words in test set)
Human tag set: (“Gold Standard” set)
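The accuracy computation is simple enough to sketch directly. The gold tags below are the ones from the koala example earlier in the lecture; the predicted tags are a hypothetical tagger output with one deliberate error.

```python
def percent_correct(predicted_tags, gold_tags):
    """Percentage of test-set words whose predicted tag equals the gold-standard tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("predicted and gold sequences must have the same length")
    if not gold_tags:
        raise ValueError("test set is empty")
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return 100.0 * correct / len(gold_tags)

gold = ["DET", "N", "V", "DET", "N", "P", "DET", "N"]       # the koala put the keys on the table
predicted = ["DET", "N", "V", "DET", "V", "P", "DET", "N"]  # hypothetical output, one error
score = percent_correct(predicted, gold)
```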
Training and Test sets
Often they come from the same corpus
We just use 90% of the corpus for training and save out 10% for testing
Even better: cross-validation
Take 90% training, 10% test, calculate the accuracy rate
Now take a different 10% test, 90% training, calculate the accuracy rate
Do this 10 times and average the accuracy rates
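One way to organize the 10 rounds is sketched below. The `evaluate` callback is hypothetical: it stands for "train a tagger on the training part and return its accuracy on the test part". The interleaved fold assignment is one simple choice among several.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(items, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)

sentences = list(range(100))  # stand-ins for 100 tagged sentences
splits = list(k_fold_splits(sentences, k=10))
```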
Summary
Probability
Part of speech tagging
Parts of speech
What’s POS tagging good for anyhow?
Tag sets
3 taggers
– Rule-based tagging
– Statistical tagging: simple most-frequent-tag baseline
– Transformation-based learning
Important Ideas
– Evaluation: % correct, training sets and test sets – Unknown words
What is ahead:
– TBL tagging (“Brill tagging”) and HMM Tagging
Unknown Words
What about words that don’t appear in the training set?
For example, here are some words that occur in a small Brown Corpus test set but not the training set:
Abernathy, absolution, Adrien, ajar, Alicia, all-american-boy, alligator, asparagus, azalea, baby-sitter, bantered, bare-armed, big-boned, boathouses, boxcar, boxcars, bumped
Unknown words
20+ new words added to (newspaper) language per month
Plus many proper names …
Increases error rates by 1-2%
Methods
Assume they are nouns
Assume the unknown words have a probability distribution similar to words only occurring once in the training set
Use morphological information, e.g., words ending with –ed tend to be tagged VBN
Combine several methods (probability functions)
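A very small sketch of the morphological approach follows. Only the -ed rule and the noun fallback come from the slide; the capitalization and plural rules are extra assumptions added to show how several cues could be combined.

```python
def guess_unknown_tag(word):
    """Guess a Penn Treebank tag for a word not seen in training."""
    if word[:1].isupper():
        return "NNP"   # assumption: capitalized unknown words are proper nouns
    if word.endswith("ed"):
        return "VBN"   # from the slide: -ed words tend to be tagged VBN
    if word.endswith("s"):
        return "NNS"   # assumption: plural-looking words are plural nouns
    return "NN"        # from the slide: otherwise assume a noun

guesses = {w: guess_unknown_tag(w) for w in ["Abernathy", "bantered", "boxcars", "azalea"]}
```

A statistical tagger would normally turn such cues into probabilities rather than hard decisions.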
Slide from Bonnie Dorr
Transformation-Based Tagging (Brill Tagging)
Combines rule-based and statistical approaches
Like rule-based because rules are used to specify tags in a certain environment
Like stochastic approach because machine learning is used—with tagged corpus as input
Input:
tagged corpus
dictionary (with most frequent tags)
Slide from Bonnie Dorr
Transformation-Based Tagging (cont.)
Basic Idea:
Set the most probable tag for each word as a start value
Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order
Training is done on tagged corpus:
1. Write a set of rule templates
2. Among the set of rules, find one with highest score
3. Continue from 2 until lowest score threshold is passed
4. Keep the ordered set of rules
Rules make errors that are corrected by later rules
Slide from Bonnie Dorr
TBL Rule Application
Tagger labels every word with its most-likely tag
For example: race has the following probabilities in the Brown corpus:
– P(NN|race) = .98
– P(VB|race) = .02
Transformation rules make changes to tags
“Change NN to VB when previous tag is TO”
… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
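Applying one such transformation is a single left-to-right pass. The sketch below hard-codes the rule shape “change X to Y when the previous tag is Z”; the function name `apply_rule` and its parameters are illustrative only.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous token's tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

initial = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
retagged = apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO")
```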
Slide from Bonnie Dorr
TBL: Rule Learning
2 parts to a rule
Triggering environment
Rewrite rule
The range of triggering environments of templates (from Manning & Schütze 1999:363)
[Table: schemas 1–9 against positions ti-3, ti-2, ti-1, ti, ti+1, ti+2, ti+3; ti is marked * in every schema]
Slide from Bonnie Dorr
TBL: The Tagging Algorithm
Label every word with most likely tag (from dictionary)
Check every possible transformation & select one which most improves tagging
Re-tag corpus applying the rules
Repeat rule learning and tagging until some criterion is reached, e.g., X% correct with respect to training corpus
RESULT: Sequence of transformation rules
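The learning loop can be sketched as a greedy search. To keep it short, this version assumes a single template (“change X to Y when the previous tag is Z”), treats the corpus as one tag sequence, and stops when no rule reduces the error count; real TBL systems use many templates and a score threshold. All names here are illustrative.

```python
def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag) left to right over one tag sequence."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(current, gold):
    """Number of positions where the current tag differs from the gold tag."""
    return sum(c != g for c, g in zip(current, gold))

def learn_rules(initial_tags, gold_tags, max_rules=10):
    """Greedy TBL over one template: 'change X to Y when the previous tag is Z'."""
    current, learned = list(initial_tags), []
    while len(learned) < max_rules:
        # Candidate rules are instantiated only at positions that are currently wrong.
        candidates = {(current[i], gold_tags[i], current[i - 1])
                      for i in range(1, len(current)) if current[i] != gold_tags[i]}
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = errors(current, gold_tags) - errors(apply_rule(current, rule), gold_tags)
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:      # stopping criterion: no rule reduces the error count
            break
        learned.append(best)
        current = apply_rule(current, best)
    return learned, current

# Hypothetical training data: most-likely-tag output vs. hand-labeled gold tags.
initial = ["VBZ", "VBN", "TO", "NN", "NN", "DT", "NN"]
gold = ["VBZ", "VBN", "TO", "VB", "NN", "DT", "NN"]
rules, final = learn_rules(initial, gold)
```

The returned `rules` list is ordered, so at tagging time the rules are replayed in the same order in which they were learned.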
Slide from Bonnie Dorr
TBL: Rule Learning (cont.)
Problem
Could have too many rules
Solution
Constrain the set of transformations with “templates”: Replace tag X with tag Y, provided tag Z or word Z’ appears in some position
Advantages
Rules are learned in ordered sequence
Rules may interact.
Rules are compact and can be inspected by humans
Slide from Bonnie Dorr
Templates for TBL
Slide from Bonnie Dorr
Hidden Markov Model Tagging
Using an HMM to do POS tagging
Is a special case of Bayesian inference
Foundational work in computational linguistics
Bledsoe 1959: OCR
Mosteller and Wallace 1964: authorship identification
It is also related to the “noisy channel” model applied to many tasks
speech recognition
machine translation
Getting to HMM
We want, out of all sequences of n tags t1…tn the single tag sequence such that P(t1…tn|w1…wn) is highest.
Hat ^ means “our estimate of the best one”
argmax_x f(x) means “the x such that f(x) is maximized”
Getting to HMM
This equation is guaranteed to give us the best tag sequence
But how to make it operational? How to compute this value?
Intuition of Bayesian classification:
Use Bayes rule to transform into a set of other probabilities that are easier to compute
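Written out, the Bayes-rule step looks roughly as follows. The shorthand $t_1^n$ and $w_1^n$ for the tag and word sequences is the usual textbook notation and is an assumption of this sketch.

```latex
\hat{t}_1^n
  = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
  = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
  = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)

% With the two HMM independence assumptions
% (a word depends only on its own tag; a tag depends only on the previous tag):
\hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The denominator $P(w_1^n)$ is the same for every candidate tag sequence, so dropping it does not change which sequence wins. The first factor is the likelihood and the second is the prior.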
Using Bayes Rule
Likelihood and prior
Two kinds of probabilities (1)
Tag transition probabilities p(ti|ti-1)
Determiners likely to precede adjectives/nouns
– That/DT flight/NN
– The/DT yellow/JJ hat/NN
– So we expect P(NN|DT) and P(JJ|DT) to be high
– But what do we expect P(DT|JJ) to be?
Compute P(NN|DT) by counting in a labeled corpus:
Two kinds of probabilities (2)
Word likelihood probabilities p(wi|ti)
VBZ (3sg Pres verb) likely to be “is” or “’s”
Compute P(is|VBZ) by counting in a labeled corpus:
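Both kinds of probability are relative frequencies: P(ti|ti-1) = C(ti-1, ti) / C(ti-1) and P(wi|ti) = C(ti, wi) / C(ti). The sketch below computes them from a tiny hand-made corpus. The corpus, the `<s>` sentence-start pseudo-tag, and the function name `estimate_hmm` are assumptions of this example.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of P(tag | previous tag) and P(word | tag)."""
    tag_counts, transition_counts, emission_counts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        previous = "<s>"                      # assumed sentence-start pseudo-tag
        tag_counts[previous] += 1
        for word, tag in sentence:
            transition_counts[(previous, tag)] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            previous = tag
    transitions = {(p, t): c / tag_counts[p] for (p, t), c in transition_counts.items()}
    emissions = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transitions, emissions

# Hypothetical three-sentence training corpus.
corpus = [
    [("That", "DT"), ("flight", "NN")],
    [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")],
    [("He", "PRP"), ("is", "VBZ"), ("here", "RB")],
]
transitions, emissions = estimate_hmm(corpus)
```

Pairs that never occur, such as (JJ, DT) here, are simply absent from the table, which is one reason real taggers add smoothing.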
An Example: the verb “race”
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
How do we pick the right tag?
Disambiguating “race”
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012
P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN)=.00000000032
So we (correctly) choose the verb reading for the word race
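The comparison can be checked with a few lines of arithmetic, using only the probabilities given on the slide:

```python
# Probabilities as given on the slide.
p_verb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
best_tag = "VB" if p_verb > p_noun else "NN"
```

The verb reading wins by roughly three orders of magnitude, mostly because P(VB|TO) is so much larger than P(NN|TO).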
Definitions
A weighted finite-state automaton adds probabilities to the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
Markov chains can’t represent inherently ambiguous problems
Useful for assigning probabilities to unambiguous sequences
Markov chain for weather
Markov chain for words
Markov chain = “First-order observable Markov Model”
a set of states
Q = q1, q2…qN; the state at time t is qt
Transition probabilities:
a set of probabilities A = a01a02…an1…ann.
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
aij = P(qt = j | qt−1 = i) 1 ≤ i, j ≤ N
$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
Distinguished start and end states
Another representation for start state
Instead of start state
Special initial probability vector π
An initial probability distribution over start states
πi = P(q1 = i), 1 ≤ i ≤ N
The weather figure using pi
The weather figure: specific example
Markov chain for weather
What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy
I.e., state sequence is 3-3-3-3
P(3,3,3,3) = π3 a33 a33 a33 = 0.2 x (0.6)^3 = 0.0432
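The same calculation as a small function: one initial probability followed by one transition probability per step. Only the two numbers used on the slide (0.2 and 0.6 for the rainy state 3) are filled in; the rest of the weather model is deliberately left out.

```python
def sequence_probability(states, initial, transitions):
    """P(q1..qT) = pi[q1] * product of a[q(t-1), q(t)] for a first-order Markov chain."""
    probability = initial[states[0]]
    for previous, current in zip(states, states[1:]):
        probability *= transitions[(previous, current)]
    return probability

# Only the two numbers used on the slide; the rest of the weather model is omitted.
initial = {3: 0.2}
transitions = {(3, 3): 0.6}
p_rain = sequence_probability([3, 3, 3, 3], initial, transitions)
```

Four days means three transitions, which is why the exponent on 0.6 is 3.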
HMM for Ice Cream and Weather
Observation:
How many ice-creams someone ate every day
1, 2, 3
State:
Weather
Cold, Hot
Our job
Given ice-cream sequence, produce weather sequence
Hidden Markov Model
For Markov chains, the output symbols are the same as the states.
See hot weather: we’re in state hot
But in part-of-speech tagging (and other things)
The output symbols are words
But the hidden states are part-of-speech tags
Need an extension
A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
This means we don’t know which state we are in
Hidden Markov Models
States Q = q1, q2…qN;
Observations O = o1, o2…oT;
Transition probabilities
Transition probability matrix A = {aij}
aij = P(qt = j | qt−1 = i) 1 ≤ i, j ≤ N
Each observation is a symbol from a vocabulary V = {v1,v2,…vV}
Observation likelihoods
Output probability matrix B = {bi(k)}
bi(k) = P(Xt = ok | qt = i)
Special initial probability vector π
πi = P(q1 = i), 1 ≤ i ≤ N
Hidden Markov Models
Some constraints
$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
$\sum_{k=1}^{M} b_i(k) = 1$
$\sum_{j=1}^{N} \pi_j = 1$
πi = P(q1 = i), 1 ≤ i ≤ N
Assumptions
Markov assumption: P(qi | q1 ...qi−1) = P(qi | qi−1 )
Output-independence assumption: $P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t)$
Example of weather information
Given
Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
Produce:
Weather Sequence: H,C,H,H,H,C…
HMM for ice cream
Transitions between the hidden states of HMM, showing A probs
B observation likelihoods for POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
Viterbi intuition: Find the best path
[Figure: trellis over “promised to back the bill” with state columns S1–S5 and candidate tags VBD, VBN, TO, VB, JJ, NN, RB, NNP, DT]
Slide from Dekang Lin
The Viterbi Algorithm
Intuition
The value in each cell is computed by taking the MAX over all paths that lead to this cell.
An extension of a path from state i at time t-1 is computed by multiplying:
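In the standard formulation, the three factors are the best path probability so far, the transition probability, and the observation likelihood. The symbol $v_t(j)$ for the Viterbi cell value is assumed notation; $a_{ij}$ and $b_j(o_t)$ are as defined for the HMM.

```latex
v_t(j) = \max_{1 \le i \le N} \; v_{t-1}(i)\, a_{ij}\, b_j(o_t)
% v_{t-1}(i) : probability of the best path ending in state i at time t-1
% a_{ij}     : transition probability from state i to state j
% b_j(o_t)   : likelihood of observation o_t given state j
```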
Viterbi example
Error Analysis of typical tagger
Look at a confusion matrix
See what errors are causing problems
Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
Adverb (RB) vs Particle (RP) vs Prep (IN)
Past (VBD) vs Participle (VBN) vs Adjective (JJ)
Evaluation
The result is compared with a manually coded “Gold Standard”
Typically accuracy reaches 96-97%
This may be compared with result for a baseline tagger (one that uses no context).
Important: 100% is impossible even for human annotators.
HMMs more formally
Three fundamental problems
1. Given HMM, calculate likelihood of observation sequence (e.g., words)
2. Given observation and HMM, find the best state sequence (e.g., POS)
3. Given only observation sequences, learn the HMM model (A, B, π)
The Three Basic Problems for HMMs
1. (Evaluation): Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B), how do we efficiently compute P(O| Φ), the probability of the observation sequence, given the model?
2. (Decoding): Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B), how do we choose a corresponding state sequence Q=(q1q2…qT) that is optimal in some sense (i.e., best explains the observations)?
3. (Learning): How do we adjust the model parameters Φ = (A,B) to maximize P(O| Φ)?
P1: computing observation likelihood
How likely is the sequence 3 1 3 to be generated by this HMM?
How to compute likelihood
For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
But for an HMM, we don’t know the states
To start, compute the observation likelihood for a given hidden state sequence
Suppose we knew the weather and wanted to predict how much ice cream someone would eat
i.e. P( 3 1 3 | H H C)
Computing likelihood of 3 1 3 given hidden state sequence
Computing joint probability of observation and state sequence
Computing total likelihood of 3 1 3
We would need to sum over
Hot hot cold
Hot hot hot
Hot cold hot
….
Too many possible hidden state sequences
For HMM with N hidden states and a sequence of T observations?
Number of state sequences = N^T
Many subsequences are the same, so there is redundant computation
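To make the cost concrete, here is a brute-force sketch that enumerates every hidden state sequence and sums the joint probabilities. The HMM parameters are illustrative assumptions for an ice-cream model with states H and C; they are not read off the lecture's figures.

```python
from itertools import product

# Illustrative ice-cream HMM; these numbers are assumptions, not taken from the slides.
STATES = ["H", "C"]
PI = {"H": 0.8, "C": 0.2}
A = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
B = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint_probability(observations, states):
    """P(O, Q) = P(Q) * P(O | Q) for one hidden state sequence Q."""
    p = PI[states[0]] * B[states[0]][observations[0]]
    for t in range(1, len(observations)):
        p *= A[(states[t - 1], states[t])] * B[states[t]][observations[t]]
    return p

def brute_force_likelihood(observations):
    """P(O): sum the joint probability over all N**T hidden state sequences."""
    return sum(joint_probability(observations, q)
               for q in product(STATES, repeat=len(observations)))

n_sequences = len(list(product(STATES, repeat=3)))
likelihood = brute_force_likelihood([3, 1, 3])
```

With 2 states and 3 observations there are only 8 sequences, but the count grows exponentially in T, which motivates the forward algorithm.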
Instead: Forward Algorithm
A kind of dynamic programming algorithm
Uses a table to store intermediate values
Idea:
Compute the likelihood of the observation sequence
By summing over all possible hidden state sequences
But doing this efficiently
– By folding all the sequences into a single trellis
The Forward Trellis
The forward algorithm
Each cell of the forward trellis, αt(j), holds the partial solution for the first t (not T) observations
Subject to the condition
After seeing the first t observations
The HMM is in state j at time t
The αt(j) form a trellis (lattice) of cells of forward probability
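A compact sketch of the forward computation follows. Each cell sums over all predecessor states, multiplies by the transition probability, and then by the observation likelihood. The ice-cream parameters are illustrative assumptions, not values from the lecture's figures, and the time index in the code is zero-based.

```python
def forward(observations, states, pi, a, b):
    """Return P(O) and the trellis alpha[t][j] = P(o_1..o_t, q_t = j)."""
    # Initialization: start probability times likelihood of the first observation.
    alpha = [{j: pi[j] * b[j][observations[0]] for j in states}]
    # Recursion: sum over all predecessor states i.
    for t in range(1, len(observations)):
        alpha.append({
            j: sum(alpha[t - 1][i] * a[(i, j)] for i in states) * b[j][observations[t]]
            for j in states
        })
    # Termination: sum over the final column.
    return sum(alpha[-1].values()), alpha

# Illustrative ice-cream HMM; these numbers are assumptions, not taken from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
a = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

likelihood, trellis = forward([3, 1, 3], states, pi, a, b)
```

The work is proportional to N² · T rather than N^T, because each trellis cell is computed once and reused.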
We update each cell
The Forward Algorithm by Induction
The Forward Algorithm
P2. Decoding
Given an observation sequence and HMM
3 1 3
The task of the decoder
To find the best hidden state sequence (e.g., H C H)
Formally
Given the observation sequence O=(o1o2…oT), and an HMM model Φ = (A,B),
Find state sequence Q=(q1q2…qT)
which best explains the observations
Decoding
One possibility:
For each hidden state sequence Q
– HHH, HHC, HCH, …
Compute P(O|Q)
Pick the highest one
Why not?
There are N^T state sequences
Instead:
The Viterbi algorithm
Is again a dynamic programming algorithm
Uses a similar trellis to the Forward algorithm
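A minimal Viterbi sketch is given below. It has the same shape as the forward computation, but takes a max instead of a sum and keeps backpointers so the best path can be read off at the end. The ice-cream parameters are again illustrative assumptions rather than values from the lecture's figures.

```python
def viterbi(observations, states, pi, a, b):
    """Return the most probable hidden state sequence and its joint probability."""
    v = [{j: pi[j] * b[j][observations[0]] for j in states}]
    backpointers = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointers.append({})
        for j in states:
            # Best predecessor: max (not sum) over previous states.
            best_prev = max(states, key=lambda i: v[t - 1][i] * a[(i, j)])
            v[t][j] = v[t - 1][best_prev] * a[(best_prev, j)] * b[j][observations[t]]
            backpointers[t][j] = best_prev
    # Backtrace from the best final state.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointers[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Illustrative ice-cream HMM; these numbers are assumptions, not taken from the slides.
states = ["H", "C"]
pi = {"H": 0.8, "C": 0.2}
a = {("H", "H"): 0.7, ("H", "C"): 0.3, ("C", "H"): 0.4, ("C", "C"): 0.6}
b = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

best_path, best_probability = viterbi([3, 1, 3], states, pi, a, b)
```

Because the parameters are assumed, the decoded sequence for 3 1 3 need not match the example sequence mentioned on the decoding slide.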
The Viterbi trellis
Viterbi intuition
Process observation sequence left to right
Filling out the trellis with the forward prob (now with max instead of sum) :
Viterbi Algorithm
Viterbi backtrace
Viterbi Recursion
Why “Dynamic Programming”
“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation.
Why “Dynamic Programming”
What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word, “programming.” I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”
Richard Bellman, “Eye of the Hurricane: An Autobiography”, 1984.
Viterbi example
Backward Algorithm
• backward probability βi(m) = probability of wm, wm+1, …, wN with wm having tag Ti.
βi(m) = P(wm, …, wN & wm/Ti)
• Similar to forward probability, except starting at the end of the sentence and working backwards
• P3: The best way to estimate transition and lexical tag probabilities
• Use both forward and backward probabilities
Go from si to sj at time t, emitting ot+1 = vk
Re-estimate aij
• Visible Markov Model
• Hidden Markov Model
Enter sj at time t and emit ot (=vk)
Re-estimate bjk
HMM Taggers: Supervised vs. Unsupervised
• Supervised training
Relative frequency
Relative Frequency with further Maximum Likelihood training
• Unsupervised training
Maximum Likelihood training with random start
Read corpus, take counts and build transition and emission tables
Use Forward-Backward to estimate lexical probabilities
Compute most likely hidden state sequence
Determine POS role that each state most likely plays
Hidden Markov Model Taggers
• When to use unsupervised training?
– To tag a text from a special domain with probabilities different from those in available training texts
– To tag text in a foreign language for which training corpora do not exist at all
• Two ways of initialization
– Randomly initialize lexical probabilities involved in HMM
– Use dictionary information
• Jelinek’s method
– Dictionary + Uniform distribution
• Kupiec’s method
– Dictionary + Equivalence Class
• Group all the words according to the set of their possible tags in dictionary
• E.g., bottom, top → JJ-NN class
Hidden Markov Model Taggers
Jelinek’s method
Assumes that each word is equally likely to occur with each of its possible tags
Kupiec’s method
Reduce the total number of parameters
word classes: words with the same possible POS’s
Estimate lexical probability of words in word class as if they are one word
The 100 most frequent words are not put into equivalence classes but are treated as one-word classes
Fewer parameters, more reliable estimation
Can be used in unsupervised HMM
Or in supervised HMM as a way of smoothing