most-frequent-tag

Shared by: XIAOHUI MA
posted:
10/22/2009
Computational Linguistics
Lecture 3: Part of Speech Tagging I
Based on Dan Jurafsky’s lecture notes for the textbook Speech and Language Processing
Additional slides by Jim Martin and Bonnie Dorr
CS 563100NLP Spring 2008

Outline
• Probability
• Part of speech tagging
  – Parts of speech
  – What’s POS tagging good for anyhow?
  – Tag sets
  – Rule-based tagging
  – Statistical tagging: simple most-frequent-tag baseline
  – Important ideas: training sets and test sets, unknown words, error analysis
  – HMM tagging

Big Ideas for today
• Methodology
  – Evaluation
  – Gold standards
  – Training sets
  – Test sets
  – % correct
• Models
  – Rule-based
  – Statistical

Part of Speech tagging
• Part of speech tagging
  – Parts of speech
  – What’s POS tagging good for anyhow?
  – Tag sets
  – Rule-based tagging
  – Statistical tagging: simple most-frequent-tag baseline
  – Important ideas: training sets and test sets, unknown words
  – HMM tagging

Parts of Speech
• 8 traditional parts of speech
  – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• The idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
  under different names:
  – parts of speech (POS)
  – lexical categories/tags
  – word/morphological classes
• Lots of debate in linguistics about their number, nature, and universality (we will ignore it)

POS examples
  N    noun         chair, bandwidth, pacing
  V    verb         study, debate, munch
  ADJ  adjective    purple, tall, ridiculous
  ADV  adverb       unfortunately, slowly
  P    preposition  of, by, to
  PRO  pronoun      I, me, mine
  DET  determiner   the, a, that, those

POS Tagging: Definition
• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
  WORDS: the koala put the keys on the table
  TAGS: N, V, P, DET

POS Tagging example
  WORD  the  koala  put  the  keys  on  the  table
  TAG   DET  N      V    DET  N     P   DET  N

What is POS tagging good for?
• The first step of many NLP tasks
  – Speech synthesis, parsing, machine translation
• Speech synthesis: pronunciation and stress
  – How to pronounce “lead”? NN or VB
  – Where to put stress? INsult/inSULT, OBject/obJECT, DIScount/disCOUNT, CONtent/conTENT
• Parsing: need to know if a word is an N or V before you can parse
  – Grammar is written using POS
  – NP → DT NN
• Machine translation
  – Different lexical translations for different POSes
  – China/NNP vs. china/NN

Open and closed class words
• Closed class: a relatively fixed membership
  – Prepositions: of, in, by, …
  – Auxiliaries: may, can, will, had, been, …
  – Pronouns: I, you, she, mine, his, them, …
  – Usually function words (short common words which play a role in grammar)
• Open class: new ones can be created all the time
  – English has 4: nouns, verbs, adjectives, adverbs
  – Many languages have all 4, but not all!
  – In Chinese, what English treats as adjectives act more like verbs

Open class words
• Nouns
  – Proper nouns (Stanford University, Boulder, Neal Snider, Margaret Jacks Hall). English capitalizes these.
  – Common nouns (the rest).
    German capitalizes these.
  – Count nouns and mass nouns
    – Count: have plurals, get counted: goat/goats, one goat, two goats
    – Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
  – Unfortunately, John walked home extremely slowly yesterday
  – Directional/locative adverbs (here, home, downhill)
  – Degree adverbs (extremely, very, somewhat)
  – Manner adverbs (slowly, slinkily, delicately)
• Verbs
  – In English, have morphological affixes (eat/eats/eaten)

Closed Class Words
• Compared to open classes, closed classes differ more from language to language
• Examples:
  – prepositions: on, under, over, …
  – particles: up, down, on, off, …
  – determiners: a, an, the, …
  – pronouns: she, who, I, …
  – conjunctions: and, but, or, …
  – auxiliary verbs: can, may, should, …
  – numerals: one, two, three, third, …

Prepositions from CELEX online dictionary

English particles
Quirk et al. (1985)

Pronouns: CELEX online dictionary

Conjunctions

POS tagging: Choosing a tagset
• There are so many parts of speech, so many potential distinctions we can draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick very coarse tagsets
  – N, V, Adj, Adv
• A more commonly used set is finer grained: the “UPenn TreeBank tagset”, 45 tags
  – PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist

Penn TreeBank POS Tag set

Using the UPenn tagset
• Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP ...”)
• Except the preposition/complementizer “to”, which is just marked TO
• Why?
  – Because it is difficult to tell whether it is a preposition or an infinitive marker
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
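The word/TAG notation in the example sentence above is easy to read programmatically. The following is a minimal sketch, assuming whitespace-separated tokens and a slash between word and tag; the helper name `parse_tagged` is made up for illustration and is not part of any tagging toolkit.

```python
from collections import Counter


def parse_tagged(text):
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that a token such as './.' still
        # yields the word '.' and the tag '.'.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(sentence)

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pairs)

print(pairs[:3])
print(tag_counts)
```

Splitting on the last slash rather than the first is a design choice that keeps punctuation tokens and words containing a slash intact.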
POS Tagging
• Words often have more than one POS: back
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin

POS tagging is not a hard problem

3 methods for POS tagging
1. Rule-based tagging
   – (ENGTWOL)
2. Stochastic (= probabilistic) tagging
   – HMM (Hidden Markov Model) tagging
3. Transformation-based tagging
   – Brill tagger

Rule-based tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• Leaving the correct tag for each word

Start with a dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
• Etc. for some 100,000 words of English

Use the dictionary to assign every possible tag
  She       PRP
  promised  VBN, VBD
  to        TO
  back      NN, RB, JJ, VB
  the       DT
  bill      VB, NN

Write rules to eliminate tags
• Eliminate VBN if VBD is an option when VBN|VBD follows “PRP”
  She       PRP
  promised  VBD (VBN eliminated)
  to        TO
  back      NN, RB, JJ, VB
  the       DT
  bill      VB, NN

Sample ENGTWOL Lexicon
(Voutilainen 1995 ENGCG)

Stage 1 of ENGTWOL Tagging
• First stage: run words through an FST morphological analyzer to get all parts of speech.
• Example: Pavlov had shown that salivation …
  Pavlov      PAVLOV N NOM SG PROPER
  had         HAVE V PAST VFIN SVO
              HAVE PCP2 SVO
  shown       SHOW PCP2 SVOO SVO SV
  that        ADV
              PRON DEM SG
              DET CENTRAL DEM SG
              CS
  salivation  N NOM SG

Stage 2 of ENGTWOL Tagging
• Second stage: apply NEGATIVE constraints.
• Example: adverbial “that” rule
  – Eliminates all readings of “that” except the one in “It isn’t that odd”

  Given input: “that”
  If
    (+1 A/ADV/QUANT)   ; if next word is adj/adv/quantifier
    (+2 SENT-LIM)      ; following which is E-O-S
    (NOT -1 SVOC/A)    ; and the previous word is not a verb like “consider”,
                       ; which allows adjective complements
                       ; as in “I consider that odd”
  Then eliminate non-ADV tags
  Else eliminate ADV

Statistical Tagging
• Based on probability theory
• Model the probability of
  – Lexical information
  – Contextual information

Conditional Probability and Tags
• P(Verb) is the probability of a randomly selected word being a verb.
• P(Verb|race) is “what’s the probability of a word being a verb given that it’s the word ‘race’?”
• Race can be a noun or a verb. Is it more likely to be a verb?
• P(Verb|race) can be estimated by counting related instances in an annotated corpus:
  $P(V \mid \text{race}) = \dfrac{\text{Count}(\text{race is verb})}{\text{total Count}(\text{race})}$
• In the Brown corpus, race is tagged as a noun 96 times out of 98, so P(N|race) = 96/98 = .98 and P(V|race) = .02

Most frequent tag
• Some ambiguous words have a more frequent tag and a less frequent tag
• Consider the word “a” in these 2 sentences:
  – would/MD prohibit/VB a/DT suit/NN for/IN refund/NN
  – of/IN section/NN 381/CD (/( a/NN )/) ./.
• DT is far more frequent than NN

Counting in a corpus
• We could count in a corpus
• A corpus: an on-line collection of text, often linguistically annotated
• The Brown Corpus: 1 million words from 1961
  – Part-of-speech tagged at U Penn
• After counting in the Brown Corpus, the results for “a”:
  21830  DT
      6  NN
      3  FW

Test set
• We take a set of test sentences
• Hand-label them for part of speech
• The result is a “Gold Standard” test set
• Who does this?
  – Get a set of sentences (e.g., Brown corpus)
  – More than one tagger (e.g., U Penn grad students in linguistics)
• Did they agree with each other?
  – Most of the time (97%)
  – But on about 3% of tags: disagreements
  – If the taggers discuss the remaining 3%, they often reach agreement

Training and test sets
• To test a tagging method, we need 2 things:
  – A hand-labeled training set: the data that we compute frequencies from, etc.
  – A hand-labeled test set: the data that we use to compute our accuracy rate

Computing accuracy rate
• Of all the words in the test set
• For what percent of them did the tag chosen by the tagger equal the human-selected tag?
  $\%\text{correct} = \dfrac{\#\text{ of words tagged correctly in test set}}{\text{total } \#\text{ of words in test set}}$
• Human tag set: (“Gold Standard” set)

Training and Test sets
• Often they come from the same corpus
• We just use 90% of the corpus for training and save out 10% for testing
• Even better: cross-validation
  – Take 90% training, 10% test, calculate the accuracy rate
  – Now take a different 10% test, 90% training, calculate the accuracy rate
  – Do this 10 times and average the accuracy rates

Summary
• Probability
• Part of speech tagging
  – Parts of speech
  – What’s POS tagging good for anyhow?
  – Tag sets
  – 3 taggers
    – Rule-based tagging
    – Statistical tagging: simple most-frequent-tag baseline
    – Transformation-based learning
  – Important ideas
    – Evaluation: % correct, training sets and test sets
    – Unknown words
• What is ahead:
  – TBL tagging (“Brill tagging”) and HMM tagging

Unknown Words
• What about words that don’t appear in the training set?
• For example, here are some words that occur in a small Brown Corpus test set but not the training set:
  – Abernathy
  – absolution
  – Adrien
  – ajar
  – Alicia
  – all-american-boy
  – alligator
  – asparagus
  – azalea
  – baby-sitter
  – bantered
  – bare-armed
  – big-boned
  – boathouses
  – boxcar
  – boxcars
  – bumped

Unknown words
• 20+ new words added to (newspaper) language per month
• Plus many proper names …
• Increases error rates by 1-2%
• Methods
  – Assume they are nouns
  – Assume the unknown words have a probability distribution similar to words occurring only once in the training set
  – Use morphological information, e.g., words ending with -ed tend to be tagged VBN
  – Combine several methods (probability functions)
Slide from Bonnie Dorr

Transformation-Based Tagging (Brill Tagging)
• Combines rules and statistics
  – Like rule-based tagging, because rules are used to specify tags in a certain environment
  – Like the stochastic approach, because machine learning is used, with a tagged corpus as input
• Input:
  – tagged corpus
  – dictionary (with most frequent tags)
Slide from Bonnie Dorr

Transformation-Based Tagging (cont.)
• Basic idea:
  – Set the most probable tag for each word as a start value
  – Change tags according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”, applied in a specific order
• Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among the set of rules, find the one with the highest score
  3. Continue from 2 until the lowest score threshold is passed
  4. Keep the ordered set of rules
• Rules make errors that are corrected by later rules
Slide from Bonnie Dorr

TBL Rule Application
• Tagger labels every word with its most likely tag
  – For example, race has the following probabilities in the Brown corpus:
    – P(NN|race) = .98
    – P(VB|race) = .02
• Transformation rules make changes to tags
  – “Change NN to VB when previous tag is TO”
    … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
    becomes
    … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
Slide from Bonnie Dorr

TBL: Rule Learning
• 2 parts to a rule
  – Triggering environment
  – Rewrite rule
• The range of triggering environments of templates (from Manning & Schütze 1999:363)
  – Table: nine schemas (1-9) over the tag positions $t_{i-3}, t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2}, t_{i+3}$; in every schema, * marks $t_i$
Slide from Bonnie Dorr

TBL: The Tagging Algorithm
• Label every word with its most likely tag (from the dictionary)
• Check every possible transformation and select the one which most improves tagging
• Re-tag the corpus applying the rules
• Repeat rule learning and tagging until some criterion is reached, e.g., X% correct with respect to the training corpus
• RESULT: sequence of transformation rules
Slide from Bonnie Dorr

TBL: Rule Learning (cont.)
• Problem
  – Could have too many rules
• Solution
  – Constrain the set of transformations with “templates”: replace tag X with tag Y, provided tag Z or word Z’ appears in some position
• Advantages
  – Rules are learned in an ordered sequence
  – Rules may interact.
  – Rules are compact and can be inspected by humans
Slide from Bonnie Dorr

Templates for TBL
Slide from Bonnie Dorr

Hidden Markov Model Tagging
• Using an HMM to do POS tagging
• Is a special case of Bayesian inference
  – Foundational work in computational linguistics
  – Bledsoe 1959: OCR
  – Mosteller and Wallace 1964: authorship identification
• It is also related to the “noisy channel” model applied to many tasks
  – speech recognition
  – machine translation

Getting to HMM
• We want, out of all sequences of n tags $t_1 \ldots t_n$, the single tag sequence such that $P(t_1 \ldots t_n \mid w_1 \ldots w_n)$ is highest:
  $\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$
• Hat ^ means “our estimate of the best one”
• $\operatorname{argmax}_x f(x)$ means “the x such that f(x) is maximized”

Getting to HMM
• This equation is guaranteed to give us the best tag sequence
• But how to make it operational? How to compute this value?
• Intuition of Bayesian classification:
  – Use Bayes rule to transform it into a set of other probabilities that are easier to compute

Using Bayes Rule

Likelihood and prior

Two kinds of probabilities (1)
• Tag transition probabilities $p(t_i \mid t_{i-1})$
  – Determiners are likely to precede adjectives and nouns
    – That/DT flight/NN
    – The/DT yellow/JJ hat/NN
    – So we expect P(NN|DT) and P(JJ|DT) to be high
    – But P(DT|JJ) to be low
  – Compute P(NN|DT) by counting in a labeled corpus:
    $P(NN \mid DT) = \dfrac{C(DT, NN)}{C(DT)}$

Two kinds of probabilities (2)
• Word likelihood probabilities $p(w_i \mid t_i)$
  – VBZ (3sg pres verb) likely to be “is” or “’s”
  – Compute P(is|VBZ) by counting in a labeled corpus:
    $P(\text{is} \mid VBZ) = \dfrac{C(VBZ, \text{is})}{C(VBZ)}$

An Example: the verb “race”
• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?
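The two kinds of probabilities above are both relative frequencies, so they can be estimated with a single pass over a hand-tagged corpus. The sketch below assumes a corpus given as lists of (word, tag) pairs; the two-sentence toy corpus, the `<s>` start marker, and all function names are illustrative assumptions rather than anything defined in the slides. No smoothing is applied, so unseen events get probability 0.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker (an assumption, not from the slides)


def count_hmm_statistics(tagged_sentences):
    """Collect the counts needed for transition and word-likelihood estimates."""
    transitions = Counter()  # (previous tag, tag) pair counts
    context = Counter()      # how often each tag (or START) is followed by a tag
    emissions = Counter()    # (tag, word) pair counts
    tag_totals = Counter()   # how often each tag occurs
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[(previous, tag)] += 1
            context[previous] += 1
            emissions[(tag, word.lower())] += 1
            tag_totals[tag] += 1
            previous = tag
    return transitions, context, emissions, tag_totals


def transition_prob(stats, previous, tag):
    """Estimate P(tag | previous) as C(previous, tag) / C(previous)."""
    transitions, context, _, _ = stats
    if context[previous] == 0:
        return 0.0
    return transitions[(previous, tag)] / context[previous]


def emission_prob(stats, tag, word):
    """Estimate P(word | tag) as C(tag, word) / C(tag)."""
    _, _, emissions, tag_totals = stats
    if tag_totals[tag] == 0:
        return 0.0
    return emissions[(tag, word.lower())] / tag_totals[tag]


# Toy corpus built from the two example phrases on the slide.
toy_corpus = [
    [("That", "DT"), ("flight", "NN")],
    [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")],
]
stats = count_hmm_statistics(toy_corpus)
print(transition_prob(stats, "DT", "NN"))   # P(NN|DT)
print(transition_prob(stats, "DT", "JJ"))   # P(JJ|DT)
print(transition_prob(stats, "JJ", "DT"))   # P(DT|JJ)
print(emission_prob(stats, "NN", "flight")) # P(flight|NN)
```

On a corpus this small the estimates are only a sanity check; with real data the same counts would normally be smoothed before use.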
Disambiguating “race”
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) P(NR|VB) P(race|VB) = .00000027
• P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
• So we (correctly) choose the verb reading for the word race

Definitions
• A weighted finite-state automaton adds probabilities to the arcs
  – The probabilities on the arcs leaving any state must sum to one
• A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through
• Markov chains can’t represent inherently ambiguous problems
  – Useful for assigning probabilities to unambiguous sequences

Markov chain for weather

Markov chain for words

Markov chain = “First-order observable Markov Model”
• A set of states
  – $Q = q_1, q_2 \ldots q_N$; the state at time t is $q_t$
• Transition probabilities:
  – A set of probabilities $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$
  – Each $a_{ij}$ represents the probability of transitioning from state i to state j
  – The set of these is the transition probability matrix A
    $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$
    $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
• Distinguished start and end states

Another representation for start state
• Instead of a start state
• Special initial probability vector π
  – An initial probability distribution over start states
    $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

The weather figure using pi

The weather figure: specific example

Markov chain for weather
• What is the probability of 4 consecutive rainy days?
• Sequence is rainy-rainy-rainy-rainy
• I.e., state sequence is 3-3-3-3
• $P(3,3,3,3) = \pi_3 a_{33} a_{33} a_{33} = 0.2 \times (0.6)^3 = 0.0432$

HMM for Ice Cream and Weather
• Observation:
  – How many ice creams someone ate every day
  – 1, 2, 3
• State:
  – Weather
  – Cold, Hot
• Our job
  – Given an ice-cream sequence, produce the weather sequence

Hidden Markov Model
• For Markov chains, the output symbols are the same as the states.
  – See hot weather: we’re in state hot
• But in part-of-speech tagging (and other things)
  – The output symbols are words
  – But the hidden states are part-of-speech tags
• Need an extension
• A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
• This means we don’t know which state we are in

Hidden Markov Models
• States $Q = q_1, q_2 \ldots q_N$
• Observations $O = o_1, o_2 \ldots o_T$
• Transition probabilities
  – Transition probability matrix $A = \{a_{ij}\}$
    $a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N$
• Each observation is a symbol from a vocabulary $V = \{v_1, v_2, \ldots v_V\}$
• Observation likelihoods
  – Output probability matrix $B = \{b_i(k)\}$
    $b_i(k) = P(X_t = o_k \mid q_t = i)$
• Special initial probability vector π
    $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

Hidden Markov Models
• Some constraints (M is the number of vocabulary symbols)
    $\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$
    $\sum_{k=1}^{M} b_i(k) = 1$
    $\sum_{j=1}^{N} \pi_j = 1$, where $\pi_i = P(q_1 = i), \quad 1 \le i \le N$

Assumptions
• Markov assumption:
    $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$
• Output-independence assumption:
    $P(o_t \mid o_1^{t-1}, q_1^{t}) = P(o_t \mid q_t)$

Example of weather information
• Given
  – Ice cream observation sequence: 1,2,3,2,2,2,3…
• Produce:
  – Weather sequence: H,C,H,H,H,C…

HMM for ice cream

Transitions between the hidden states of HMM, showing A probs

B observation likelihoods for POS HMM

The A matrix for the
POS HMM

The B matrix for the POS HMM

Viterbi intuition: Find the best path
  S1  promised  VBN, VBD
  S2  to        TO
  S3  back      RB, NN, JJ, VB
  S4  the       NNP, DT
  S5  bill      NN, VB
Slide from Dekang Lin

The Viterbi Algorithm

Intuition
• The value in each cell is computed by taking the MAX over all paths that lead to this cell.
• An extension of a path from state i at time t-1 is computed by multiplying the previous path probability $v_{t-1}(i)$, the transition probability $a_{ij}$, and the observation likelihood $b_j(o_t)$

Viterbi example

Error Analysis of typical tagger
• Look at a confusion matrix
• See what errors are causing problems
  – Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
  – Adverb (RB) vs Particle (RP) vs Prep (IN)
  – Past (VBD) vs Participle (VBN) vs Adjective (JJ)

Evaluation
• The result is compared with a manually coded “Gold Standard”
  – Typically accuracy reaches 96-97%
  – This may be compared with the result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators.

HMMs more formally
• Three fundamental problems
  1. Given an HMM, calculate the likelihood of an observation sequence (e.g., words)
  2. Given an observation sequence and an HMM, find the best state sequence (e.g., POS)
  3. Given only observation sequences, learn the HMM model (A, B, π)

The Three Basic Problems for HMMs
1. (Evaluation): Given the observation sequence $O = (o_1 o_2 \ldots o_T)$ and an HMM model Φ = (A,B), how do we efficiently compute P(O|Φ), the probability of the observation sequence given the model?
2. (Decoding): Given the observation sequence $O = (o_1 o_2 \ldots o_T)$ and an HMM model Φ = (A,B), how do we choose a corresponding state sequence $Q = (q_1 q_2 \ldots q_T)$ that is optimal in some sense (i.e., best explains the observations)?
3. (Learning): How do we adjust the model parameters Φ = (A,B) to maximize P(O|Φ)?
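The decoding problem (problem 2) is what the Viterbi intuition above addresses: each trellis cell keeps the maximum over all paths reaching it, and backpointers recover the best state sequence. The following is a minimal sketch of that idea for the ice-cream setting. The probability values are illustrative assumptions, since the figures carrying the actual numbers are not part of this text, and the function and variable names are made up for the example.

```python
def viterbi(observations, states, start_prob, trans_prob, emit_prob):
    """Return (best state sequence, its joint probability with the observations)."""
    # trellis[t][s]: probability of the best path ending in state s at time t.
    trellis = [{s: start_prob[s] * emit_prob[s][observations[0]] for s in states}]
    backpointers = [{}]
    for obs in observations[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best predecessor: maximise previous path probability times transition.
            best_prev = max(states, key=lambda p: trellis[-1][p] * trans_prob[p][s])
            column[s] = trellis[-1][best_prev] * trans_prob[best_prev][s] * emit_prob[s][obs]
            pointers[s] = best_prev
        trellis.append(column)
        backpointers.append(pointers)
    # Pick the best final state, then follow the backpointers to the start.
    last = max(states, key=lambda s: trellis[-1][s])
    path = [last]
    for pointers in reversed(backpointers[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, trellis[-1][last]


# Assumed toy parameters for a Hot/Cold ice-cream HMM (not taken from the slides).
states = ["H", "C"]
start_prob = {"H": 0.8, "C": 0.2}
trans_prob = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_prob = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

best_path, best_prob = viterbi([3, 1, 3], states, start_prob, trans_prob, emit_prob)
print(best_path, best_prob)
```

Replacing each `max` with a sum over predecessors (and dropping the backpointers) would turn the same trellis into a likelihood computation for problem 1.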
P1: computing observation likelihood
• How likely is the sequence 3 1 3 to be generated by this HMM?

How to compute likelihood
• For a Markov chain, we just follow the states 3 1 3 and multiply the probabilities
• But for an HMM, we don’t know the states
• To start, compute the observation likelihood for a given hidden state sequence
  – Suppose we knew the weather and wanted to predict how much ice cream someone would eat
  – I.e., P(3 1 3 | H H C)

Computing likelihood of 3 1 3 given hidden state sequence

Computing joint probability of observation and state sequence

Computing total likelihood of 3 1 3
• We would need to sum over
  – Hot hot cold
  – Hot hot hot
  – Hot cold hot
  – …
• Too many possible hidden state sequences
  – For an HMM with N hidden states and a sequence of T observations, the number of state sequences is $N^T$
• Many subsequences are the same → redundant computation

Instead: Forward Algorithm
• A kind of dynamic programming algorithm
  – Uses a table to store intermediate values
• Idea:
  – Compute the likelihood of the observation sequence
  – By summing over all possible hidden state sequences
  – But doing this efficiently, by folding all the sequences into a single trellis

The Forward Trellis

The forward algorithm
• Each cell of the forward algorithm computes the partial solution of size t (not T), $\alpha_t(j)$,
  subject to the conditions:
  – After seeing the first t observations
  – Observation number t is in state j
• The $\alpha_t(j)$ form a trellis (lattice) of cells of forward probability

We update each cell

The Forward Algorithm by Induction

The Forward Algorithm

P2. Decoding
• Given an observation sequence and an HMM
  – 3 1 3
• The task of the decoder
  – To find the best hidden state sequence (e.g., H C H)
• Formally
  – Given the observation sequence $O = (o_1 o_2 \ldots o_T)$ and an HMM model Φ = (A,B),
  – Find the state sequence $Q = (q_1 q_2 \ldots q_T)$
  – which best explains the observations

Decoding
• One possibility:
  – For each hidden state sequence Q (HHH, HHC, HCH, …)
  – Compute P(O|Q)
  – Pick the highest one
• Why not?
  – There are $N^T$ sequences
• Instead:
  – The Viterbi algorithm
  – Is again a dynamic programming algorithm
  – Uses a similar trellis to the Forward algorithm

The Viterbi trellis

Viterbi intuition
• Process the observation sequence left to right
• Filling out the trellis with the forward probability (now with max instead of sum)

Viterbi Algorithm

Viterbi backtrace

Viterbi Recursion

Why “Dynamic Programming”
“I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence.
You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation.

Why “Dynamic Programming” (cont.)
What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word, “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.”
Richard Bellman, “Eye of the Hurricane: An Autobiography”, 1984.

Viterbi example

Backward Algorithm
• Backward probability $\beta_i(m)$ = probability of $w_m, w_{m+1}, \ldots, w_N$ with $w_m$ having tag $T_i$.
  $\beta_i(m) = P(w_m, \ldots, w_N \text{ and } w_m/T_i)$
• Similar to the forward probability, except that we start at the end of the sentence and work backwards
• P3: the best way to estimate transition and lexical tag probabilities
  – Use both forward and backward probabilities

Go from $s_i$ to $s_j$ at time t, emitting $o_{t+1} = v_k$

Re-estimate $a_{ij}$
• Visible Markov Model
• Hidden Markov Model

Enter $s_j$ at time t and emit $o_t$ (= $v_k$)

Re-estimate $b_{jk}$

HMM Taggers: Supervised vs. Unsupervised
• Supervised training
  – Relative frequency
  – Relative frequency with further Maximum Likelihood training
• Unsupervised training
  – Maximum Likelihood training with random start
  – Read corpus, take counts and build transition and emission tables
  – Use Forward-Backward to estimate lexical probabilities
  – Compute most likely hidden state sequence
  – Determine the POS role that each state most likely plays

Hidden Markov Model Taggers
• When to use unsupervised training?
  – To tag a text from a special domain with probabilities different from those in available training texts
  – To tag text in a foreign language for which training corpora do not exist at all
• Two ways of initialization
  – Randomly initialize the lexical probabilities involved in the HMM
  – Use dictionary information
    • Jelinek’s method: dictionary + uniform distribution
    • Kupiec’s method: dictionary + equivalence classes
      • Group all the words according to the set of their possible tags in the dictionary
      • E.g., bottom, top → JJ-NN class

Hidden Markov Model Taggers
• Jelinek’s method
  – Assume that words occur equally likely with each of their possible tags

Kupiec’s method
• Reduce the total number of parameters
  – Word classes: words with the same possible POSs
  – Estimate the lexical probability of the words in a word class as if they were one word
  – The 100 most frequent words are not put into equivalence classes; each is treated as a one-word class
• Fewer parameters, more reliable estimation
• Can be used in an unsupervised HMM
• Or in a supervised HMM as a way of smoothing
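The grouping step in Kupiec’s method can be sketched in a few lines: words that share the same set of possible tags fall into one class, while very frequent words keep a class of their own. The toy dictionary, the class-naming scheme, and the function name below are assumptions made for illustration; only the bottom/top → JJ-NN example comes from the slides.

```python
from collections import defaultdict


def equivalence_classes(tag_dictionary, frequent_words=frozenset()):
    """Group words by their set of possible tags.

    Words listed in frequent_words are kept as one-word classes,
    mirroring the treatment of the most frequent words.
    """
    classes = defaultdict(list)
    for word, tags in sorted(tag_dictionary.items()):
        if word in frequent_words:
            key = "word:" + word          # one-word class (naming is hypothetical)
        else:
            key = "-".join(sorted(tags))  # e.g. {"NN", "JJ"} becomes "JJ-NN"
        classes[key].append(word)
    return dict(classes)


# Hypothetical toy dictionary mapping each word to its possible tags.
toy_dictionary = {
    "bottom": {"JJ", "NN"},
    "top": {"JJ", "NN"},
    "bill": {"NN", "VB"},
    "the": {"DT"},
}
classes = equivalence_classes(toy_dictionary, frequent_words={"the"})
print(classes)
```

Lexical probabilities would then be estimated once per class key rather than once per word, which is where the reduction in parameters comes from.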
