Natural Language Processing
COMPSCI 423/723, Rohit Kate

Conditional Random Fields (CRFs) for Sequence Labeling

Some of the slides have been adapted from Raymond Mooney's NLP course at UT Austin.

Graphical Models
• If no assumption of independence is made, then an exponential number of parameters must be estimated
  – No realistic amount of training data is sufficient to estimate so many parameters
• If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted
• Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies; they allow less restrictive independence assumptions while limiting the number of parameters that must be estimated
  – Bayesian networks: directed acyclic graphs that indicate causal structure
  – Markov networks: undirected graphs that capture general dependencies

Bayesian Networks
• Directed acyclic graph (DAG)
  – Nodes are random variables
  – Edges indicate causal influences
• Example: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls

Conditional Probability Tables
• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values of its parents (the conditioning case)
  – Roots (sources) of the DAG, which have no parents, are given prior probabilities
• For the burglary network: P(B) = .001, P(E) = .002

    B E | P(A|B,E)        A | P(J|A)        A | P(M|A)
    T T |   .95           T |  .90          T |  .70
    T F |   .94           F |  .05          F |  .01
    F T |   .29
    F F |   .001

Joint Distributions for Bayes Nets
• A Bayesian network implicitly defines (factors) a joint distribution:

    P(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} P(x_i | Parents(X_i))

• Example:

    P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
                            = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00062

Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes net: a single class node Y with children X_1, X_2, ..., X_n
• The priors P(Y) and conditionals P(X_i|Y) of Naïve Bayes provide the CPTs for the network

HMMs as a Bayesian Network
• The directed probabilistic graphical model for the random variables w_1 ... w_n and t_1 ... t_n, with the HMM independence assumptions: a tag chain t_1 → t_2 → ... → t_n with transition probabilities P(t_i|t_{i−1}), and an emission edge t_i → w_i with probability P(w_i|t_i) at each position

Drawbacks of HMMs
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O, Q) and thus only indirectly model P(Q|O), which is what the sequence labeling task needs (O: observation sequence, Q: label sequence)
• They can't use arbitrary features related to the words (e.g. capitalization, prefixes, etc. that can help POS tagging) unless these are explicitly modeled as part of the observations

Undirected Graphical Model
• Also called a Markov network or random field
• An undirected graph over a set of random variables, where an edge represents a dependency
• The Markov blanket of a node X in a Markov net is the set of its neighbors in the graph (nodes that have an edge connecting to X)
• Every node in a Markov net is conditionally independent of every other node given its Markov blanket

Sample Markov Network
• The burglary network as an undirected graph: Burglary – Alarm, Earthquake – Alarm, Alarm – JohnCalls, Alarm – MaryCalls

Distribution for a Markov Network
• The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions) φ_k, one for each clique k in the graph
• For each joint assignment of values to the variables in clique k, φ_k assigns a non-negative real value that represents the compatibility of these values
• The joint distribution of a Markov network is then defined by:

    P(x_1, x_2, ..., x_n) = (1/Z) \prod_k \phi_k(x_{\{k\}})

  where x_{k} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

    Z = \sum_x \prod_k \phi_k(x_{\{k\}})

Sample Markov Network (potentials)

    B A | φ1      E A | φ2      J A | φ3      M A | φ4
    T T | 100     T T |  50     T T |  75     T T |  50
    T F |   1     T F |  10     T F |  10     T F |   1
    F T |   1     F T |   1     F T |   1     F T |  10
    F F | 200     F F | 200     F F | 200     F F | 200

    P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = φ1(F,T) φ2(F,T) φ3(T,T) φ4(T,T) / Z = (1 × 1 × 75 × 50) / Z

Discriminative Markov Network, or Conditional Random Field
• Directly models P(Y|X):

    P(y_1, ..., y_m | x_1, ..., x_n) = (1/Z(X)) \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})

    Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})

• The potential functions can be based on arbitrary features of X and Y, and they are expressed as exponentials

Random Field (Undirected Graphical Model)
• Variables v_1, ..., v_n with joint distribution

    P(v_1, ..., v_n) = (1/Z) \prod_k \phi_k(v_{\{k\}}),    Z = \sum_v \prod_k \phi_k(v_{\{k\}})

Conditional Random Field (CRF)
• Two types of variables, x and y; there is no factor with only x variables
• Output variables Y_1, ..., Y_m conditioned on inputs X_1, ..., X_n:

    P(y_1, ..., y_m | x_1, ..., x_n) = (1/Z(X)) \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})

    Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})

Linear-Chain Conditional Random Field (CRF)
• The Ys are connected in a linear chain, so each factor involves a pair of adjacent labels:

    P(y_1, ..., y_n | x_1, ..., x_n) = (1/Z(X)) \prod_i \phi_i(y_i, y_{i+1}, x_{\{i\}})

    Z(X) = \sum_Y \prod_i \phi_i(y_i, y_{i+1}, x_{\{i\}})

Logistic Regression as the Simplest CRF
• Logistic regression is a simple CRF with only one output variable Y and inputs X_1, ..., X_n
• It models the conditional distribution P(Y|X) and not the full joint P(X, Y)

Simplification Assumption for MaxEnt
• The probability P(c | X_1 ... X_n) can be factored as:

    P(c | x) = exp( \sum_{i=0}^{N} \lambda_{c,i} f_i(c, x) ) / \sum_{c' ∈ Classes} exp( \sum_{i=0}^{N} \lambda_{c',i} f_i(c', x) )

Generative vs. Discriminative Sequence Labeling Models
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling.
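As a sanity check on the clique-potential arithmetic in the sample Markov network above, here is a minimal brute-force sketch (the function names are my own) that hard-codes the four potential tables from the slides and computes the joint by enumerating all 2^5 assignments:

```python
from itertools import product

# Clique potentials from the sample Markov network slide,
# each keyed by the truth values of (first variable, Alarm).
phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}  # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}  # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}  # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}  # (M, A)

def unnormalized(b, e, a, j, m):
    """Product of clique potentials for one joint assignment."""
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

# Z sums the potential product over all 2^5 joint assignments.
Z = sum(unnormalized(*assign) for assign in product([True, False], repeat=5))

def joint(b, e, a, j, m):
    """P(B, E, A, J, M) = product of potentials / Z."""
    return unnormalized(b, e, a, j, m) / Z

# P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) ∝ 1 × 1 × 75 × 50, matching the slide
print(unnormalized(False, False, True, True, True))  # 3750
```

Enumerating Z is exponential in the number of variables, which is exactly why real Markov-network inference exploits the graph structure rather than brute force.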
• HMMs model the joint distribution P(O, Q)
• HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task
• Conditional random fields (CRFs) are specifically designed and trained to maximize the performance of sequence labeling; they model the conditional distribution P(Q | O)

Classification vs. Sequence Labeling
• Classification: Naïve Bayes (generative) vs. logistic regression (discriminative, conditional); both have a single output Y with inputs X_1, ..., X_n
• Sequence labeling: HMM (generative) vs. linear-chain CRF (discriminative, conditional); both have outputs Y_1, ..., Y_T with inputs X_1, ..., X_T

Simple Linear-Chain CRF Features
• Modeling the conditional distribution is similar to that used in multinomial logistic regression
• Create feature functions f_k(Y_t, Y_{t−1}, X_t)
  – A feature for each state-transition pair (i, j):
    f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise
  – A feature for each state–observation pair (i, o):
    f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e. tags)

Conditional Distribution for a Linear-Chain CRF
• Using these feature functions for a simple linear-chain CRF, we can define:

    P(Y | X) = (1/Z(X)) exp( \sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t−1}, X_t) )

    Z(X) = \sum_Y exp( \sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t−1}, X_t) )

Adding Token Features to a CRF
• Token features X_{i,j} can be added: each token X_i is represented by features X_{i,1}, ..., X_{i,m}
• Additional feature functions for each token feature are added to model the conditional distribution

Features in POS Tagging
• For POS tagging, use lexicographic features of tokens:
  – Is it capitalized?
  – Does it start with a numeral?
  – Does it end in a given suffix (e.g. "s", "ed", "ly")?

Enhanced Linear-Chain CRF (standard approach)
• Can also condition transitions on the current token features.
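To make the linear-chain CRF distribution concrete, here is a toy brute-force sketch using transition and observation indicator features of the kind defined above. The tags, words, and λ weights are invented for illustration, and Z(X) is computed by exhaustive enumeration rather than the dynamic programming a real implementation would use:

```python
from itertools import product
from math import exp

TAGS = ["N", "V"]  # hypothetical tag set

# λ weights for transition features f_{i,j}, keyed (y_{t-1}, y_t),
# and for observation features f_{i,o}, keyed (y_t, x_t). All values invented.
trans_w = {("N", "N"): 0.3, ("N", "V"): 1.0, ("V", "N"): 0.8, ("V", "V"): -0.5}
obs_w = {("N", "dog"): 1.5, ("V", "dog"): -1.0, ("N", "runs"): -0.5, ("V", "runs"): 1.2}

def score(tags, words):
    """The exponent sum_t sum_m λ_m f_m(y_t, y_{t-1}, x_t); only the
    indicator features that fire contribute their weight."""
    s = 0.0
    for t, (tag, word) in enumerate(zip(tags, words)):
        if t > 0:
            s += trans_w[(tags[t - 1], tag)]
        s += obs_w.get((tag, word), 0.0)
    return s

def prob(tags, words):
    """P(Y|X) with Z(X) computed by enumerating every tag sequence
    (exponential in sentence length; fine for toy sizes only)."""
    Z = sum(exp(score(y, words)) for y in product(TAGS, repeat=len(words)))
    return exp(score(tags, words)) / Z

words = ["dog", "runs"]
print(prob(("N", "V"), words))
```

Because the model is globally normalized by Z(X), the probabilities of all |TAGS|^T tag sequences for a fixed sentence sum to 1.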
  (Labels Y_1, ..., Y_T; each token t has features X_{t,1}, ..., X_{t,m} connected to the label chain)

• Add feature functions:
  f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise

Supervised Learning (Parameter Estimation)
• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the conditional log-likelihood (CLL) of the supervised training data

Sequence Tagging (Inference)
• A variant of the dynamic-programming (Viterbi) algorithm can be used to efficiently, in O(TN²), determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y | X)

Skip-Chain CRFs
• Can model some long-distance dependencies (e.g. the same word appearing in different parts of the text) by including long-distance edges in the Markov model
  – Example: in "Michael Dell said ... Dell bought ...", a long-distance edge connects the labels of the two occurrences of "Dell" (Y_2 and Y_100 on the slide)
• The additional links make exact inference intractable, so one must resort to approximate inference to try to find the most probable labeling

CRF Results
• Experimental results verify that CRFs have superior accuracy on various sequence labeling tasks:
  – Part-of-speech tagging
  – Noun-phrase chunking
  – Named-entity recognition
  – Semantic role labeling
• However, CRFs are much slower to train and do not scale as well to large amounts of training data
  – Training for POS on the full Penn Treebank (~1M words) currently takes "over a week"
• Skip-chain CRFs improve results on IE

CRF Summary
• CRFs are a discriminative approach to sequence labeling, whereas HMMs are generative
• Discriminative methods are usually more accurate since they are trained for a specific performance task
• CRFs also easily allow adding additional token features without making additional independence assumptions
• Training time is increased since a complex optimization procedure is needed to fit the supervised training data
• CRFs are a state-of-the-art method for sequence labeling

Phrase Structure
• Most languages have a word order
• Words are organized into phrases, groups of words that act as a single unit, or constituent
  – [The dog] [chased] [the cat].
  – [The fat dog] [chased] [the thin cat].
  – [The fat dog with red collar] [chased] [the thin old cat].
  – [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

Phrases
• Noun phrase: a syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded
  – An optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases
• Prepositional phrase: headed by a preposition; expresses spatial, temporal, or other attributes
• Verb phrase: the part of the sentence that depends on the verb; headed by the verb
• Adjective phrase: acts like an adjective

Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence
  – [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  – [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
• Some applications need all the noun phrases in a sentence

Phrase Chunking as Sequence Labeling
• Tag individual words with one of three tags:
  – B (Begin): word starts a new target phrase
  – I (Inside): word is part of a target phrase but not its first word
  – O (Other): word is not part of a target phrase
• Sample for NP chunking (words marked Begin/Inside/Other on the slide):
  – He reckons the current account deficit will narrow to only # 1.8 billion in September.

Evaluating Chunking
• Per-token accuracy does not evaluate finding correct full chunks. Instead use:

    Precision = (number of correct chunks found) / (total number of chunks found)

    Recall = (number of correct chunks found) / (total number of actual chunks)

• Take the harmonic mean to produce a single evaluation metric called the F measure:

    F_1 = 1 / ( (1/P + 1/R) / 2 ) = 2PR / (P + R)

Current Chunking Results
• Best system for NP chunking: F_1 = 96%
• Typical results for finding a range of chunk types (CoNLL-2000 shared task: NP, VP, PP, ADV, SBAR, ADJP) are F_1 = 92–94%
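The chunk-level evaluation above can be sketched as follows; the B/I/O sequences in the example are made up, and predicted chunks count as correct only when they match a gold chunk's span exactly:

```python
def bio_to_chunks(tags):
    """Convert a B/I/O tag sequence into a set of (start, end) chunk spans (end exclusive)."""
    chunks, start = set(), None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:  # the current chunk ends here
            chunks.add((start, i))
            start = None
        if tag == "B":  # a new chunk begins
            start = i
    if start is not None:  # chunk running to the end of the sentence
        chunks.add((start, len(tags)))
    return chunks

def chunk_f1(gold_tags, pred_tags):
    """Chunk-level precision, recall, and F1 as defined above."""
    gold, pred = bio_to_chunks(gold_tags), bio_to_chunks(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical gold vs. predicted NP tags for a 6-word sentence:
# the prediction truncates the second chunk, so it matches only 1 of 2 chunks.
gold = ["B", "O", "B", "I", "I", "O"]
pred = ["B", "O", "B", "I", "O", "O"]
print(chunk_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Note how the truncated chunk counts as wrong for both precision and recall even though 5 of its 6 tags are correct, which is exactly why per-token accuracy overstates chunking quality.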
