# Conditional Random Fields (CRFs) for Sequence Labeling

Natural Language Processing
COMPSCI 423/723
Rohit Kate

Conditional Random Fields (CRFs) for Sequence Labeling

Some of the slides have been adapted from Raymond Mooney’s NLP course at UT Austin.
## Graphical Models
• If no assumption of independence is made, then an
exponential number of parameters must be estimated
– No realistic amount of training data is sufficient to estimate so
many parameters
• If a blanket assumption of conditional independence is
made, efficient training and inference is possible, but
such a strong assumption is rarely warranted
• Graphical models use directed or undirected graphs
over a set of random variables to explicitly specify
variable dependencies and allow for less restrictive
independence assumptions while limiting the number of
parameters that must be estimated
– Bayesian Networks: Directed acyclic graphs that indicate
causal structure
– Markov Networks: Undirected graphs that capture general
dependencies
## Bayesian Networks
• Directed Acyclic Graph (DAG)
– Nodes are random variables
– Edges indicate causal influences
[Diagram: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]
## Conditional Probability Tables
• Each node has a conditional probability table (CPT)
that gives the probability of each of its values given
every possible combination of values for its parents
(conditioning case).
– Roots (sources) of the DAG that have no parents are given prior
probabilities.
Priors: P(B) = .001, P(E) = .002

| B | E | P(A \| B, E) |
|---|---|--------------|
| T | T | .95          |
| T | F | .94          |
| F | T | .29          |
| F | F | .001         |

| A | P(J \| A) |
|---|-----------|
| T | .90       |
| F | .05       |

| A | P(M \| A) |
|---|-----------|
| T | .70       |
| F | .01       |
## Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines (factors) a joint distribution:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$$

• Example:

$$P(J \land M \land A \land \lnot B \land \lnot E) = P(J \mid A)\, P(M \mid A)\, P(A \mid \lnot B \land \lnot E)\, P(\lnot B)\, P(\lnot E)$$

$$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.00062$$
## Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net
[Diagram: Y with directed edges to X1, X2, …, Xn]
• Priors P(Y) and conditionals P(Xi|Y) for
Naïve Bayes provide CPTs for the network
## HMMs as a Bayesian Network

• The directed probabilistic graphical model for the random variables w1 to wn and t1 to tn, with the independence assumptions:

[Diagram: tag chain t1 → t2 → t3 → … → tn with transition probabilities P(t1), P(t2|t1), P(t3|t2), …, P(tn|tn−1); each ti → wi with emission probability P(wi|ti)]
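The diagram corresponds to the joint factorization P(w1..wn, t1..tn) = P(t1) P(w1|t1) ∏i P(ti|ti−1) P(wi|ti). Here is a short sketch of that product; the probability tables `init`, `trans`, and `emit` are hypothetical placeholders, not from the slides.

```python
# A small sketch (ours) of the HMM joint factorization.
# init[t]      ~ P(t1 = t)
# trans[a][b]  ~ P(t_i = b | t_{i-1} = a)
# emit[t][w]   ~ P(w_i = w | t_i = t)

def hmm_joint(tags, words, init, trans, emit):
    p = init[tags[0]] * emit[tags[0]][words[0]]
    for i in range(1, len(tags)):
        p *= trans[tags[i - 1]][tags[i]] * emit[tags[i]][words[i]]
    return p
```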
## Drawbacks of HMMs
• HMMs are generative models and are not
directly designed to maximize the
performance of sequence labeling. They
model the joint distribution P(O,Q) and thus
only indirectly model P(Q|O) which is what is
needed for the sequence labeling task (O:
observation sequence, Q: label sequence)
• Can’t use arbitrary features related to the
words (e.g. capitalization, prefixes etc. that
can help POS tagging) unless these are
explicitly modeled as part of observations

## Undirected Graphical Model
• Also called Markov Network, Random Field
• Undirected graph over a set of random variables,
where an edge represents a dependency
• The Markov blanket of a node, X, in a Markov Net
is the set of its neighbors in the graph (nodes that
have an edge connecting to X)
• Every node in a Markov Net is conditionally
independent of every other node given its Markov
blanket

## Sample Markov Network

[Diagram: undirected edges Burglary–Alarm, Earthquake–Alarm, Alarm–JohnCalls, Alarm–MaryCalls]
## Distribution for a Markov Network
• The distribution of a Markov net is most compactly described in
terms of a set of potential functions (a.k.a. factors, compatibility
functions), φk, for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique k, φk
assigns a non-negative real value that represents the compatibility
of these values.
• The joint distribution of a Markov network is then defined by:

$$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$$

where $x_{\{k\}}$ represents the joint assignment of the variables in clique k, and Z is a normalizing constant that makes the joint distribution sum to 1:

$$Z = \sum_x \prod_k \phi_k(x_{\{k\}})$$
## Sample Markov Network

[Diagram: the same undirected graph as above, with one potential table per edge clique]

| B | A | φ1(B, A) |
|---|---|----------|
| T | T | 100      |
| T | F | 1        |
| F | T | 1        |
| F | F | 200      |

| E | A | φ2(E, A) |
|---|---|----------|
| T | T | 50       |
| T | F | 10       |
| F | T | 1        |
| F | F | 200      |

| J | A | φ3(J, A) |
|---|---|----------|
| T | T | 75       |
| T | F | 10       |
| F | T | 1        |
| F | F | 200      |

| M | A | φ4(M, A) |
|---|---|----------|
| T | T | 50       |
| T | F | 1        |
| F | T | 10       |
| F | F | 200      |

$$P(J \land M \land A \land \lnot B \land \lnot E) = \frac{\phi_1(\lnot B, A)\,\phi_2(\lnot E, A)\,\phi_3(J, A)\,\phi_4(M, A)}{Z} = \frac{1 \times 1 \times 75 \times 50}{Z}$$
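A brute-force sketch (ours) of this computation follows; Z is obtained by enumerating all 2^5 assignments, which is feasible only for such a tiny network.

```python
# Multiply the edge potentials for an assignment and normalize by Z.
from itertools import product

phi1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}   # (B, A)
phi2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}   # (E, A)
phi3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}   # (J, A)
phi4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}   # (M, A)

def unnormalized(b, e, a, j, m):
    return phi1[(b, a)] * phi2[(e, a)] * phi3[(j, a)] * phi4[(m, a)]

# Z sums the unnormalized product over all assignments of (B, E, A, J, M)
Z = sum(unnormalized(*assign) for assign in product([True, False], repeat=5))
print(unnormalized(False, False, True, True, True) / Z)  # P(J, M, A, ¬B, ¬E)
```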
## Discriminative Markov Network or Conditional Random Field

• Directly models P(Y|X):

$$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

$$Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

• The potential functions can be based on arbitrary features of X and Y, and they are expressed as exponentials
## Random Field (Undirected Graphical Model)

[Diagram: an undirected graph over variables v1, v2, …, vn]

$$P(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_k \phi_k(v_{\{k\}})$$

$$Z = \sum_v \prod_k \phi_k(v_{\{k\}})$$
## Conditional Random Field (CRF)

[Diagram: output variables Y1, Y2, Y3, …, Ym in an undirected graph over the input variables X1, X2, …, Xn]

$$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

$$Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

There are two types of variables, x and y; there is no factor with only x variables.
## Linear-Chain Conditional Random Field (CRF)

[Diagram: Y1 – Y2 – … – Yn connected in a linear chain over the inputs X1, X2, …, Xn]

The Ys are connected in a linear chain, so every factor involves a pair of adjacent labels:

$$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_k, y_{k-1}, x_{\{k\}})$$

$$Z(X) = \sum_Y \prod_k \phi_k(y_k, y_{k-1}, x_{\{k\}})$$
## Logistic Regression as the Simplest CRF

• Logistic regression is a simple CRF with only one output variable

[Diagram: single output Y over inputs X1, X2, …, Xn]

• Models the conditional distribution P(Y | X) and not the full joint P(X, Y)
## Simplification Assumption for MaxEnt

• The probability P(Y|X1..Xn) can be factored as:

$$P(c \mid X) = \frac{\exp\left(\sum_{i=0}^{N} w_{ci}\, f_i(c, x)\right)}{\sum_{c' \in \mathrm{Classes}} \exp\left(\sum_{i=0}^{N} w_{c'i}\, f_i(c', x)\right)}$$
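A minimal sketch (ours) of this softmax form follows; the weight table and feature function are hypothetical placeholders standing in for w_ci and f_i(c, x).

```python
# MaxEnt / multinomial logistic regression: a softmax over per-class
# weighted feature sums.
import math

def maxent_probs(x, classes, weights, features):
    """weights[c] is a list of w_ci; features(c, x) returns [f_i(c, x)]."""
    scores = {c: math.exp(sum(w * f for w, f in zip(weights[c], features(c, x))))
              for c in classes}
    Z = sum(scores.values())          # the denominator over all classes
    return {c: s / Z for c, s in scores.items()}
```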
## Generative vs. Discriminative Sequence Labeling Models
• HMMs are generative models and are not
directly designed to maximize the performance
of sequence labeling. They model the joint
distribution P(O,Q)
• HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task
• Conditional Random Fields (CRFs) are
specifically designed and trained to maximize
performance of sequence labeling. They model
the conditional distribution P(Q | O)
## Classification

• Naïve Bayes (generative): Y with directed edges to X1, X2, …, Xn
• Logistic Regression (discriminative/conditional): models P(Y | X)

## Sequence Labeling

• HMM (generative): label chain Y1 → Y2 → … → YT with observations X1, X2, …, XT
• Linear-chain CRF (discriminative/conditional): undirected chain Y1 – Y2 – … – YT conditioned on X1, X2, …, XT
## Simple Linear Chain CRF Features

• Modeling the conditional distribution is similar to that used in multinomial logistic regression.
• Create feature functions fk(Yt, Yt−1, Xt) (a sketch follows this list)
– Feature for each state transition pair i, j
• fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j, and 0 otherwise
– Feature for each state-observation pair i, o
• fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o, and 0 otherwise
• Note: the number of features grows quadratically in the number of states (i.e., tags).
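Here is a small Python sketch of these two indicator feature families; the tag set and vocabulary are toy examples of our own, not from the slides.

```python
# Build the transition and observation indicator features programmatically.

def make_transition_feature(i, j):
    """f_{i,j}(y_t, y_prev, x_t) = 1 iff y_t == i and y_prev == j."""
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and y_prev == j) else 0

def make_observation_feature(i, o):
    """f_{i,o}(y_t, y_prev, x_t) = 1 iff y_t == i and x_t == o."""
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and x_t == o) else 0

tags = ["DT", "NN", "VB"]            # toy tag set (ours)
vocab = ["the", "dog", "chased"]     # toy vocabulary (ours)
features = ([make_transition_feature(i, j) for i in tags for j in tags] +
            [make_observation_feature(i, o) for i in tags for o in vocab])
# |tags|^2 transition features: quadratic in the number of tags, as noted above
```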
## Conditional Distribution for Linear Chain CRF

• Using these feature functions for a simple linear chain CRF, we can define:

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$$

$$Z(X) = \sum_Y \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$$
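To tie the pieces together, here is a brute-force sketch (ours) of this distribution, reusing the hypothetical `features` and `tags` from the previous sketch; enumerating all label sequences for Z is exponential in T, so this is for toy examples only.

```python
import math
from itertools import product

def score(y_seq, x_seq, weights, features):
    """sum_t sum_m lambda_m f_m(y_t, y_{t-1}, x_t); y_0 is a None start label."""
    total, prev = 0.0, None
    for y_t, x_t in zip(y_seq, x_seq):
        total += sum(w * f(y_t, prev, x_t) for w, f in zip(weights, features))
        prev = y_t
    return total

def crf_prob(y_seq, x_seq, tags, weights, features):
    num = math.exp(score(y_seq, x_seq, weights, features))
    Z = sum(math.exp(score(cand, x_seq, weights, features))
            for cand in product(tags, repeat=len(x_seq)))  # all label sequences
    return num / Z
```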
## CRF Token Features

• Can add multiple token features Xi,j per position:

[Diagram: each label Yt connected to its token features Xt,1, …, Xt,m]

• Each token feature is used to model the conditional distribution.
## Features in POS Tagging

• For POS tagging, use lexicographic features of tokens:
– Capitalized?
– Ends in a given suffix (e.g., “s”, “ed”, “ly”)?
## Enhanced Linear Chain CRF (standard approach)

• Can also condition the transition on the current token features:

[Diagram: each label Yt connected to its own token features Xt,1, …, Xt,m and to the previous position’s features]

• fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1, and 0 otherwise
## Supervised Learning (Parameter Estimation)

• As in logistic regression, use the L-BFGS optimization procedure to set the λ weights to maximize the conditional log-likelihood (CLL) of the supervised training data (a library-based sketch follows)
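As a concrete illustration, here is a minimal sketch using the third-party sklearn-crfsuite package; the slides do not name a library, so this choice and the toy features are assumptions.

```python
# Fit a linear-chain CRF by L-BFGS with sklearn-crfsuite (our library choice,
# not named on the slides). Each sentence is a list of per-token feature
# dicts; the labels are the corresponding tag lists.
import sklearn_crfsuite

X_train = [[{"word": "the", "capitalized": False},
            {"word": "dog", "capitalized": False}]]
y_train = [["DT", "NN"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs",   # maximize (regularized) CLL
                           c1=0.1, c2=0.1,      # L1/L2 regularization strengths
                           max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```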
## Sequence Tagging (Inference)

• A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence under a given log-linear model of the conditional probability P(Y | X) (a sketch follows)
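A minimal sketch (ours) of this Viterbi variant for a linear chain; the local log-score function is a hypothetical stand-in for Σm λm fm(Yt, Yt−1, Xt) at position t.

```python
# Viterbi for a linear-chain CRF: O(T * N^2) for T tokens and N tags.
# score(y, y_prev, t) returns the local log-potential at position t.
def viterbi(T, tags, score):
    best = [{y: score(y, None, 0) for y in tags}]   # log-scores at position 0
    back = [{}]
    for t in range(1, T):
        best.append({})
        back.append({})
        for y in tags:
            prev = max(tags, key=lambda yp: best[t - 1][yp] + score(y, yp, t))
            best[t][y] = best[t - 1][prev] + score(y, prev, t)
            back[t][y] = prev                       # remember the best predecessor
    y = max(tags, key=lambda yt: best[T - 1][yt])   # best final tag
    path = [y]
    for t in range(T - 1, 0, -1):                   # follow backpointers
        y = back[t][y]
        path.append(y)
    return list(reversed(path))
```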
## Skip-Chain CRFs

• Can model some long-distance dependencies (i.e., the same word appearing in different parts of the text) by including long-distance edges in the Markov model

[Diagram: linear chain Y1, Y2, Y3, …, Y100, Y101 over the tokens “Michael Dell said … Dell bought”, with an extra skip edge connecting the labels of the two occurrences of “Dell”]

• The additional edges make exact inference intractable, so one must resort to approximate inference to try to find the most probable labeling
## CRF Results

• Experimental results verify that they have superior accuracy on various sequence labeling tasks:
– Part-of-speech tagging
– Noun phrase chunking
– Named entity recognition
– Semantic role labeling
• However, CRFs are much slower to train and do not scale as well to large amounts of training data
– Training for POS on the full Penn Treebank (~1M words) currently takes “over a week”
• Skip-chain CRFs improve results on IE
## CRF Summary
• CRFs are a discriminative approach to sequence labeling
whereas HMMs are generative
• Discriminative methods are usually more accurate since
they are trained for a specific performance task
• Training time is increased since a complex optimization
procedure is needed to fit supervised training data
• CRFs are a state-of-the-art method for sequence labeling

## Phrase Structure
• Most languages have a word order
• Words are organized into phrases: groups of words that act as a single unit, or constituent
– [The dog] [chased] [the cat].
– [The fat dog] [chased] [the thin cat].
– [The fat dog with red collar] [chased] [the thin old cat].
– [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].
## Phrases

• Noun phrase: a syntactic unit of a sentence which acts like a noun and in which a noun is usually the head
– An optional determiner followed by a noun and zero or more prepositional phrases
• Prepositional phrase: headed by a preposition; expresses spatial, temporal, or other attributes
• Verb phrase: the part of the sentence that depends on the verb; headed by the verb
## Phrase Chunking

• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence
– [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
– [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
• Some applications need all the noun phrases in a sentence
## Phrase Chunking as Sequence Labeling

• Tag individual words with one of 3 tags
– B (Begin): word starts a new target phrase
– I (Inside): word is part of a target phrase but not the first word
– O (Other): word is not part of a target phrase
• Sample for NP chunking, with tags recovered from the bracketing above (see the sketch below for turning tags back into chunks):
– He/B reckons/O the/B current/I account/I deficit/I will/O narrow/O to/O only/B #/I 1.8/I billion/I in/O September/B ./O
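A small sketch (ours) of how chunks can be recovered from B/I/O tags, matching the scheme above; the function name is our own.

```python
def bio_to_chunks(tags):
    """Return (start, end) index pairs, end exclusive, for each chunk."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                       # a new chunk begins here
            if start is not None:
                chunks.append((start, i))
            start = i
        elif tag == "O":                     # outside any chunk
            if start is not None:
                chunks.append((start, i))
            start = None
        # tag == "I": continue the current chunk
    if start is not None:
        chunks.append((start, len(tags)))
    return chunks

tags = list("BOBIIIOOOBIIIOB") + ["O"]       # the sample sentence above
print(bio_to_chunks(tags))                   # [(0, 1), (2, 6), (9, 13), (14, 15)]
```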
## Evaluating Chunking

• Per-token accuracy does not evaluate whether complete chunks are found, so use precision and recall over chunks:

$$\mathrm{Precision} = \frac{\text{Number of correct chunks found}}{\text{Total number of chunks found}}$$

$$\mathrm{Recall} = \frac{\text{Number of correct chunks found}}{\text{Total number of actual chunks}}$$

• Take the harmonic mean to produce a single evaluation metric called F measure (a sketch follows):

$$F_1 = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2} = \frac{2PR}{P + R}$$
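A sketch (ours) of chunk-level precision, recall, and F1 over predicted and gold chunk spans, e.g. the (start, end) pairs produced by the earlier `bio_to_chunks` sketch.

```python
def chunk_f1(predicted, gold):
    """Chunk-level precision, recall, and F1 from two collections of spans."""
    pred, gold = set(predicted), set(gold)
    correct = len(pred & gold)               # chunks with exactly matching spans
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)    # harmonic mean of P and R
    return precision, recall, f1
```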
## Current Chunking Results

• Best system for NP chunking: F1 = 96%
• Typical results for finding a range of chunk types (CoNLL 2000 shared task)