Natural Language Processing
COMPSCI 423/723, Rohit Kate
Conditional Random Fields (CRFs) for Sequence Labeling
Some of the slides have been adapted from Raymond Mooney's NLP course at UT Austin.

Graphical Models
• If no assumption of independence is made, then an exponential number of parameters must be estimated
– No realistic amount of training data is sufficient to estimate so many parameters
• If a blanket assumption of conditional independence is made, efficient training and inference is possible, but such a strong assumption is rarely warranted
• Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies and allow for less restrictive independence assumptions while limiting the number of parameters that must be estimated
– Bayesian Networks: Directed acyclic graphs that indicate causal structure
– Markov Networks: Undirected graphs that capture general dependencies

Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences
• Example network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls

Conditional Probability Tables
• Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case).
– Roots (sources) of the DAG that have no parents are given prior probabilities.
• CPTs for the burglary network:
– P(B) = .001, P(E) = .002
– P(A | B, E): B=T, E=T: .95; B=T, E=F: .94; B=F, E=T: .29; B=F, E=F: .001
– P(J | A): A=T: .90; A=F: .05
– P(M | A): A=T: .70; A=F: .01

Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines (factors) a joint distribution
P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i))
• Example
P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B, \neg E)\, P(\neg B)\, P(\neg E)
= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00062
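The worked example above can be checked with a few lines of code. The following is a minimal sketch, not a general Bayes-net implementation: the CPTs come from the slides, while the variable and function names (P_B, joint, etc.) are purely illustrative.

```python
# CPTs of the burglary network, as given on the slides.
P_B = {True: 0.001, False: 0.999}                  # prior P(Burglary)
P_E = {True: 0.002, False: 0.998}                  # prior P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,    # P(Alarm=T | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # P(JohnCalls=T | Alarm)
P_M = {True: 0.70, False: 0.01}                    # P(MaryCalls=T | Alarm)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pj * pm * pa * P_B[b] * P_E[e]

print(joint(True, True, True, False, False))       # ~0.00062, as on the slide
```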
Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net: Y → X1, X2, …, Xn
• Priors P(Y) and conditionals P(Xi | Y) for Naïve Bayes provide the CPTs for the network

HMMs as Bayesian Network
• The directed probabilistic graphical model for the random variables w1 to wn and t1 to tn with the independence assumptions:
– t1 → t2 → t3 → … → tn with probabilities P(t1), P(t2 | t1), P(t3 | t2), …, P(tn | tn−1)
– each ti → wi with probabilities P(w1 | t1), P(w2 | t2), …, P(wn | tn)

Drawbacks of HMMs
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q) and thus only indirectly model P(Q|O), which is what is needed for the sequence labeling task (O: observation sequence, Q: label sequence)
• Can't use arbitrary features related to the words (e.g. capitalization, prefixes etc. that can help POS tagging) unless these are explicitly modeled as part of the observations

Undirected Graphical Model
• Also called Markov Network or Random Field
• Undirected graph over a set of random variables, where an edge represents a dependency
• The Markov blanket of a node, X, in a Markov Net is the set of its neighbors in the graph (nodes that have an edge connecting to X)
• Every node in a Markov Net is conditionally independent of every other node given its Markov blanket

Sample Markov Network
• The burglary network with undirected edges: Burglary–Alarm, Earthquake–Alarm, Alarm–JohnCalls, Alarm–MaryCalls

Distribution for a Markov Network
• The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φk, one for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique k, φk assigns a non-negative real value that represents the compatibility of these values.
• The joint distribution of a Markov network is then defined by:
P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})
where x_{\{k\}} represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:
Z = \sum_x \prod_k \phi_k(x_{\{k\}})

Sample Markov Network (with potentials)
• Clique potentials for the burglary network:
– φ1(B, A): T,T: 100; T,F: 1; F,T: 1; F,F: 200
– φ2(E, A): T,T: 50; T,F: 10; F,T: 1; F,F: 200
– φ3(J, A): T,T: 75; T,F: 10; F,T: 1; F,F: 200
– φ4(M, A): T,T: 50; T,F: 1; F,T: 10; F,F: 200
• P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = (1 \times 1 \times 75 \times 50) / Z

Discriminative Markov Network or Conditional Random Field
• Directly models P(Y | X):
P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})
Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})
• The potential functions can be based on arbitrary features of X and Y and are expressed as exponentials

Random Field (Undirected Graphical Model)
• Over variables v1, v2, …, vn:
P(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_k \phi_k(v_{\{k\}}), \quad Z = \sum_v \prod_k \phi_k(v_{\{k\}})

Conditional Random Field (CRF)
• Output variables Y1, Y2, …, Ym conditioned on inputs X1, X2, …, Xn:
P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}}), \quad Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})
• Two types of variables, x and y; there is no factor with only x variables

Linear-Chain Conditional Random Field (CRF)
• The Ys are connected in a linear chain Y1 – Y2 – … – Yn over the inputs X1, X2, …, Xn:
P(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_i \phi_i(y_i, y_{i-1}, x_{\{i\}}), \quad Z(X) = \sum_Y \prod_i \phi_i(y_i, y_{i-1}, x_{\{i\}})

Logistic Regression as the Simplest CRF
• Logistic regression is a simple CRF with only one output variable Y, connected to X1, X2, …, Xn
• Models the conditional distribution P(Y | X) and not the full joint P(X, Y)

Simplification Assumption for MaxEnt
• The probability P(Y | X1 … Xn) can be factored as:
P(c \mid X) = \frac{\exp\left(\sum_{i=0}^{N} w_{c,i} f_i(c, x)\right)}{\sum_{c' \in Classes} \exp\left(\sum_{i=0}^{N} w_{c',i} f_i(c', x)\right)}

Generative vs. Discriminative Sequence Labeling Models
• HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q)
• HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task
• Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling. They model the conditional distribution P(Q | O)
• Generative–discriminative pairs:
– Classification: Naïve Bayes (generative, Y → X1 … Xn) vs. Logistic Regression (conditional/discriminative, Y given X1 … Xn)
– Sequence labeling: HMM (generative, Y1 … YT → X1 … XT) vs. Linear-chain CRF (conditional/discriminative, Y1 … YT given X1 … XT)

Simple Linear Chain CRF Features
• The conditional distribution is modeled in the same way as in multinomial logistic regression (MaxEnt).
• Create feature functions f_k(Y_t, Y_{t−1}, X_t)
– Feature for each state-transition pair i, j
• f_{i,j}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and Y_{t−1} = j, and 0 otherwise
– Feature for each state-observation pair i, o
• f_{i,o}(Y_t, Y_{t−1}, X_t) = 1 if Y_t = i and X_t = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e. tags).

Conditional Distribution for Linear Chain CRF
• Using these feature functions for a simple linear chain CRF, we can define:
P(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)
Z(X) = \sum_Y \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)
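The sketch below illustrates the scoring in the formula above using the transition and observation indicator features defined two slides earlier. The tag set, feature weights, and the dummy <START> state for the first position are made-up assumptions; Z(X) is computed by brute-force enumeration, which is feasible only for toy inputs (in practice the forward algorithm is used).

```python
from itertools import product
from math import exp

TAGS = ["DT", "NN", "VB"]                           # toy tag set
# Hypothetical weights: lambda for transition features f_{i,j}
# (keyed by (Y_{t-1}, Y_t)) and observation features f_{i,o}
# (keyed by (Y_t, X_t)); unlisted features have weight 0.
w_trans = {("DT", "NN"): 1.5, ("NN", "VB"): 1.2}
w_obs = {("DT", "the"): 2.0, ("NN", "dog"): 1.8, ("VB", "barks"): 1.7}

def score(tags, words):
    """Sum_t Sum_m lambda_m f_m(Y_t, Y_{t-1}, X_t) for one label sequence."""
    s, prev = 0.0, "<START>"                        # dummy initial state
    for tag, word in zip(tags, words):
        s += w_trans.get((prev, tag), 0.0)          # state-transition features
        s += w_obs.get((tag, word), 0.0)            # state-observation features
        prev = tag
    return s

def prob(tags, words):
    """P(Y | X): exponentiated score divided by Z(X) over all label sequences."""
    Z = sum(exp(score(y, words)) for y in product(TAGS, repeat=len(words)))
    return exp(score(tags, words)) / Z

words = ["the", "dog", "barks"]
print(prob(["DT", "NN", "VB"], words))              # the highest-probability tagging
```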
Adding Token Features to a CRF
• Can add token features X_{i,j}: each position t of the chain Y1, Y2, …, YT has an associated feature vector X_{t,1}, …, X_{t,m}
• Can add additional feature functions for each token feature to model the conditional distribution.

Features in POS Tagging
• For POS tagging, use lexicographic features of tokens.
– Capitalized?
– Starts with a numeral?
– Ends in a given suffix (e.g. “s”, “ed”, “ly”)?

Enhanced Linear Chain CRF (standard approach)
• Can also condition the transition on the current token features, connecting Y1, Y2, …, YT to the token feature vectors X_{1,1} … X_{1,m}, X_{2,1} … X_{2,m}, …, X_{T,1} … X_{T,m}
• Add feature functions:
f_{i,j,k}(Y_t, Y_{t−1}, X) = 1 if Y_t = i and Y_{t−1} = j and X_{t−1,k} = 1, and 0 otherwise

Supervised Learning (Parameter Estimation)
• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the conditional log-likelihood (CLL) of the supervised training data

Sequence Tagging (Inference)
• A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence using a given log-linear model of the conditional probability P(Y | X)
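A minimal sketch of that Viterbi variant is given below. The tag set and the local scoring function are placeholders (any score(prev_tag, tag, words, t) returning a local log-potential will do, e.g. a sum of weighted feature functions); the nested loops over the previous and current tag at each position give the O(TN²) running time.

```python
def viterbi(words, tags, score):
    """Return the most probable tag sequence under a log-linear chain model.

    `score(prev_tag, tag, words, t)` is the local log-potential at position t;
    prev_tag is None at t = 0 (a dummy start state, an assumption here).
    """
    T, N = len(words), len(tags)
    best = [[float("-inf")] * N for _ in range(T)]  # best[t][j]: best score of a path ending in tags[j]
    back = [[0] * N for _ in range(T)]              # backpointers for path recovery
    for j in range(N):                              # initialization
        best[0][j] = score(None, tags[j], words, 0)
    for t in range(1, T):                           # O(T * N^2) main loop
        for j in range(N):
            for i in range(N):
                s = best[t - 1][i] + score(tags[i], tags[j], words, t)
                if s > best[t][j]:
                    best[t][j], back[t][j] = s, i
    j = max(range(N), key=lambda k: best[T - 1][k])  # best final state
    path = [j]
    for t in range(T - 1, 0, -1):                    # follow backpointers
        j = back[t][j]
        path.append(j)
    return [tags[k] for k in reversed(path)]
```

With the toy weights from the earlier scoring sketch, a suitable score function would be, for example, `lambda prev, tag, words, t: w_trans.get((prev, tag), 0.0) + w_obs.get((tag, words[t]), 0.0)`.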
Skip-Chain CRFs
• Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model.
– Example: in "Michael Dell said … Dell bought …", a long-distance edge connects the labels of the two occurrences of "Dell" (Y2 and Y100)
• The additional links make exact inference intractable, so one must resort to approximate inference to try to find the most probable label sequence

CRF Results
• Experimental results verify that they have superior accuracy on various sequence labeling tasks
– Part-of-speech tagging
– Noun phrase chunking
– Named entity recognition
– Semantic role labeling
• However, CRFs are much slower to train and do not scale as well to large amounts of training data
– Training for POS on the full Penn Treebank (~1M words) currently takes “over a week.”
• Skip-chain CRFs improve results on IE

CRF Summary
• CRFs are a discriminative approach to sequence labeling whereas HMMs are generative
• Discriminative methods are usually more accurate since they are trained for a specific performance task
• CRFs also easily allow adding additional token features without making additional independence assumptions
• Training time is increased since a complex optimization procedure is needed to fit the supervised training data
• CRFs are a state-of-the-art method for sequence labeling

Phrase Structure
• Most languages have a word order
• Words are organized into phrases, groups of words that act as a single unit or constituent
– [The dog] [chased] [the cat].
– [The fat dog] [chased] [the thin cat].
– [The fat dog with red collar] [chased] [the thin old cat].
– [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

Phrases
• Noun phrase: A syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded
– An optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases
• Prepositional phrase: Headed by a preposition; expresses spatial, temporal or other attributes
• Verb phrase: The part of the sentence that depends on the verb. Headed by the verb.
• Adjective phrase: Acts like an adjective.

Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
– [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
– [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
• Some applications need all the noun phrases in a sentence

Phrase Chunking as Sequence Labeling
• Tag individual words with one of 3 tags
– B (Begin): word starts a new target phrase
– I (Inside): word is part of a target phrase but not its first word
– O (Other): word is not part of a target phrase
• Sample for NP chunking (each word labeled Begin, Inside, or Other)
– He reckons the current account deficit will narrow to only # 1.8 billion in September.

Evaluating Chunking
• Per-token accuracy does not evaluate finding correct full chunks. Instead use:
Precision = (number of correct chunks found) / (total number of chunks found)
Recall = (number of correct chunks found) / (total number of actual chunks)
• Take the harmonic mean to produce a single evaluation metric called the F measure:
F_1 = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2} = \frac{2PR}{P + R}

Current Chunking Results
• Best system for NP chunking: F1 = 96%
• Typical results for finding a range of chunk types (CoNLL 2000 shared task: NP, VP, PP, ADV, SBAR, ADJP) are F1 = 92–94%
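The chunk-level metrics defined above can be computed by converting B/I/O tag sequences into chunk spans and comparing predicted spans against gold spans. The following is a minimal sketch under the simplifying assumption of a single chunk type (real chunking uses typed tags such as B-NP, I-NP); the function names and the example tag sequences are illustrative only.

```python
def bio_to_chunks(tags):
    """Convert a B/I/O tag sequence into a set of (start, end) chunk spans."""
    chunks, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B":                      # a new chunk begins here
            if start is not None:
                chunks.add((start, i))      # close the previous chunk
            start = i
        elif tag == "O":                    # close any open chunk
            if start is not None:
                chunks.add((start, i))
            start = None
        # tag == "I": the current chunk continues
    if start is not None:
        chunks.add((start, len(tags)))
    return chunks

def chunk_prf(gold_tags, pred_tags):
    """Precision, recall and F1 over whole chunks (not per-token accuracy)."""
    gold, pred = bio_to_chunks(gold_tags), bio_to_chunks(pred_tags)
    correct = len(gold & pred)              # chunks whose boundaries match exactly
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = ["B", "O", "B", "I", "I", "I", "O", "O"]   # made-up gold NP chunks
pred = ["B", "O", "B", "I", "I", "O", "O", "O"]   # one chunk cut short
print(chunk_prf(gold, pred))                      # (0.5, 0.5, 0.5)
```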