

									Natural Language Processing
      COMPSCI 423/723
         Rohit Kate
  Conditional Random Fields
(CRFs) for Sequence Labeling

Some of the slides have been adapted from Raymond
  Mooney’s NLP course at UT Austin.
               Graphical Models
• If no assumption of independence is made, then an
  exponential number of parameters must be estimated
   – No realistic amount of training data is sufficient to estimate so
     many parameters
• If a blanket assumption of conditional independence is
  made, efficient training and inference is possible, but
  such a strong assumption is rarely warranted
• Graphical models use directed or undirected graphs
  over a set of random variables to explicitly specify
  variable dependencies and allow for less restrictive
  independence assumptions while limiting the number of
  parameters that must be estimated
   – Bayesian Networks: Directed acyclic graphs that indicate
     causal structure
   – Markov Networks: Undirected graphs that capture general
     dependencies
       Bayesian Networks
• Directed Acyclic Graph (DAG)
  – Nodes are random variables
  – Edges indicate causal influences
        [Figure: DAG with edges Burglary → Alarm, Earthquake → Alarm,
         Alarm → JohnCalls, Alarm → MaryCalls]
 Conditional Probability Tables
• Each node has a conditional probability table (CPT)
  that gives the probability of each of its values given
  every possible combination of values for its parents
  (conditioning case).
   – Roots (sources) of the DAG that have no parents are given prior
     probabilities
     [Figure: the burglary network annotated with its CPTs]

     P(B) = .001          P(E) = .002

     B   E   P(A|B,E)
     T   T   .95
     T   F   .94
     F   T   .29
     F   F   .001

     A   P(J|A)           A   P(M|A)
     T   .90              T   .70
     F   .05              F   .01
  Joint Distributions for Bayes Nets
• A Bayesian Network implicitly defines
  (factors) a joint distribution
        $$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i))$$

• Example
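      $$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E)$$
      $$= P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B \wedge \neg E)\, P(\neg B)\, P(\neg E)$$
      $$= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \approx 0.000628$$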
 Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net

              [Figure: class node Y with a directed edge to each of
               X1, X2, …, Xn]

• Priors P(Y) and conditionals P(Xi|Y) for
  Naïve Bayes provide CPTs for the network
HMMs as Bayesian Network

• The directed probabilistic graphical
  model for the random variables w1 to wn
  and t1 to tn with the independence
  assumptions:

 P(t1)           P(t2|t1)    P(t3|t2)       P(tn|tn-1)
         t1           t2           t3   …           tn
P(w1|t1)      P(w2|t2)      P(w3|t3)    P(wn|tn)
         w1          w2           w3    …          wn
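
The chain structure in the figure means the joint P(w, t) is a product of one transition term and one emission term per position. Here is a minimal Python sketch of that product; the tag set, vocabulary, and all probabilities are hypothetical numbers chosen for illustration.

```python
# Toy HMM parameters (hypothetical numbers, for illustration only).
initial = {"DT": 0.6, "NN": 0.4}                       # P(t1)
transition = {("DT", "DT"): 0.1, ("DT", "NN"): 0.9,    # P(t_i | t_{i-1}),
              ("NN", "DT"): 0.5, ("NN", "NN"): 0.5}    # keyed (previous, current)
emission = {("DT", "the"): 0.7, ("DT", "dog"): 0.3,    # P(w_i | t_i),
            ("NN", "the"): 0.2, ("NN", "dog"): 0.8}    # keyed (tag, word)


def hmm_joint(words, tags):
    """P(w_1..w_n, t_1..t_n) = P(t1) P(w1|t1) * prod_i P(t_i|t_{i-1}) P(w_i|t_i)."""
    p = initial[tags[0]] * emission[(tags[0], words[0])]
    for i in range(1, len(words)):
        p *= transition[(tags[i - 1], tags[i])] * emission[(tags[i], words[i])]
    return p


print(hmm_joint(["the", "dog"], ["DT", "NN"]))  # 0.6 * 0.7 * 0.9 * 0.8
```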
       Drawbacks of HMMs
• HMMs are generative models and are not
  directly designed to maximize the
  performance of sequence labeling. They
  model the joint distribution P(O,Q) and thus
  only indirectly model P(Q|O) which is what is
  needed for the sequence labeling task (O:
  observation sequence, Q: label sequence)
• Can’t use arbitrary features related to the
  words (e.g. capitalization, prefixes etc. that
  can help POS tagging) unless these are
  explicitly modeled as part of observations

  Undirected Graphical Model
• Also called Markov Network, Random Field
• Undirected graph over a set of random variables,
  where an edge represents a dependency
• The Markov blanket of a node, X, in a Markov Net
  is the set of its neighbors in the graph (nodes that
  have an edge connecting to X)
• Every node in a Markov Net is conditionally
  independent of every other node given its Markov
  blanket

Sample Markov Network

   [Figure: undirected graph with Alarm connected to Burglary,
    Earthquake, JohnCalls, and MaryCalls]
        Distribution for a Markov Network
• The distribution of a Markov net is most compactly described in
  terms of a set of potential functions (a.k.a. factors, compatibility
  functions), φk, for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique k, φk
  assigns a non-negative real value that represents the compatibility
  of these values.
• The joint distribution of a Markov network is then defined by:

    $$P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$$

    where x{k} represents the joint assignment of the variables in
    clique k, and Z is a normalizing constant that makes the joint
    distribution sum to 1:

    $$Z = \sum_x \prod_k \phi_k(x_{\{k\}})$$
        Sample Markov Network
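    [Figure: Alarm connected by undirected edges to Burglary, Earthquake,
     JohnCalls, and MaryCalls, with one potential table per edge]

    B   A   φ1          E   A   φ2
    T   T   100         T   T   50
    T   F   1           T   F   10
    F   T   1           F   T   1
    F   F   200         F   F   200

    J   A   φ3          M   A   φ4
    T   T   75          T   T   50
    T   F   10          T   F   1
    F   T   1           F   T   10
    F   F   200         F   F   200

    $$P(J \wedge M \wedge A \wedge \neg B \wedge \neg E) = \frac{1 \times 1 \times 75 \times 50}{\sum (\ldots)}$$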
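
Unlike a Bayes net, the product of potentials must be normalized explicitly. The sketch below, in Python, enumerates all 32 assignments of the sample network to obtain Z; the potential values are the ones in the tables above, while the dictionary and function names are illustrative.

```python
from itertools import product

# Pairwise potentials from the sample network; keys are (variable, Alarm).
phi_BA = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}
phi_EA = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}
phi_JA = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}
phi_MA = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}


def score(j, m, a, b, e):
    """Unnormalized product of the four clique potentials."""
    return phi_BA[(b, a)] * phi_EA[(e, a)] * phi_JA[(j, a)] * phi_MA[(m, a)]


# Z sums the unnormalized score over all 2^5 joint assignments.
Z = sum(score(*values) for values in product([True, False], repeat=5))

numerator = score(j=True, m=True, a=True, b=False, e=False)
print(numerator, Z, numerator / Z)
```

With these tables the numerator is 3750 and Z is 1,805,172,660, so the probability is roughly 2.1e-6. Brute-force enumeration is only feasible for tiny networks, since the number of assignments grows exponentially with the number of variables.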
     Discriminative Markov Network
      or Conditional Random Field
     • Directly models P(Y|X)
         $$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

         $$Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

     • The potential functions can be based on arbitrary
       features of X and Y, and they are expressed as
       exponentials
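Random Field
(Undirected Graphical Model)

[Figure: undirected graph over variables v1, v2, v3, …, v10, …, vn]

    $$P(v_1, v_2, \ldots, v_n) = \frac{1}{Z} \prod_k \phi_k(v_{\{k\}})$$

    $$Z = \sum_v \prod_k \phi_k(v_{\{k\}})$$

Conditional Random Field (CRF)

[Figure: output variables Y1, Y2 connected to the inputs X1, X2, …, Xn]

    $$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

    $$Z(X) = \sum_Y \prod_k \phi_k(y_{\{k\}}, x_{\{k\}})$$

Two types of variables, x and y;
there is no factor with only x variables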
Linear-Chain Conditional Random Field (CRF)

[Figure: Y1, Y2, …, Yn over the inputs X1, X2, …, Xn]

Ys are connected in a linear chain

    $$P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z(X)} \prod_k \phi_k(y_i, y_{i-1}, x_{\{k\}})$$

    $$Z(X) = \sum_Y \prod_k \phi_k(y_i, y_{i-1}, x_{\{k\}})$$
     Logistic Regression as a
          Simplest CRF
• Logistic regression is a simple CRF with
  only one output variable

               [Figure: single output Y connected to inputs
                X1, X2, …, Xn]

• Models the conditional distribution, P(Y | X)
  and not the full joint P(X,Y)
 Simplification Assumption for

• The probability P(Y|X1..Xn) can be factored as:
    $$P(c \mid X) = \frac{\exp\left(\sum_{i=0}^{N} w_{ci}\, f_i(c, x)\right)}{\sum_{c' \in \mathrm{Classes}} \exp\left(\sum_{i=0}^{N} w_{c'i}\, f_i(c', x)\right)}$$
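
The ratio above is a softmax over per-class weighted feature sums. A minimal Python sketch follows, assuming a hypothetical two-feature setup (a bias feature and a capitalization feature) with made-up weights; `maxent_probs` and `feature_fn` are illustrative names.

```python
import math


def maxent_probs(x, classes, weights, feature_fn):
    """Return {c: P(c | x)} for a log-linear (multinomial logistic) model.

    weights[c][i] plays the role of w_ci and feature_fn(c, x)[i] of f_i(c, x).
    """
    scores = {c: sum(w * f for w, f in zip(weights[c], feature_fn(c, x)))
              for c in classes}
    top = max(scores.values())             # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())             # denominator: sum over all classes c'
    return {c: e / total for c, e in exps.items()}


# Hypothetical toy setup: f_0 is a bias feature, f_1 is "token is capitalized".
def feature_fn(c, x):
    return [1.0, 1.0 if x[0].isupper() else 0.0]


weights = {"NNP": [0.0, 2.0], "NN": [0.5, -1.0]}
probs = maxent_probs("Dell", ["NNP", "NN"], weights, feature_fn)
print(probs)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged, because the same factor cancels in the numerator and denominator.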


    Generative vs. Discriminative
     Sequence Labeling Models
• HMMs are generative models and are not
  directly designed to maximize the performance
  of sequence labeling. They model the joint
  distribution P(O,Q)
• HMMs are trained to have an accurate
  probabilistic model of the underlying language,
  and not all aspects of this model benefit the
  sequence labeling task
• Conditional Random Fields (CRFs) are
  specifically designed and trained to maximize
  performance of sequence labeling. They model
  the conditional distribution P(Q | O)

[Figure: single-label models over X1, X2, …, Xn extended to sequence
 labeling, with a label chain Y1, Y2, …, YT over X1, X2, …, XT; the
 discriminative sequence model is the linear-chain CRF]
    Simple Linear Chain CRF
• Modeling the conditional distribution is similar to
  that used in multinomial logistic regression.
• Create feature functions fk(Yt, Yt−1, Xt)
   – Feature for each state transition pair i, j
      • fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j and 0 otherwise
   – Feature for each state observation pair i, o
      • fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o and 0 otherwise
• Note: number of features grows quadratically in
  the number of states (i.e. tags).
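
The two feature families can be written as indicator functions. This is a minimal Python sketch, assuming a hypothetical three-tag set and three-word vocabulary; the factory function names are illustrative.

```python
def transition_feature(i, j):
    """f_{i,j}: fires when the current tag is i and the previous tag is j."""
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and y_prev == j) else 0


def observation_feature(i, o):
    """f_{i,o}: fires when the current tag is i and the current token is o."""
    return lambda y_t, y_prev, x_t: 1 if (y_t == i and x_t == o) else 0


tags = ["DT", "NN", "VB"]            # hypothetical tag set
vocab = ["the", "dog", "barks"]      # hypothetical vocabulary

features = ([transition_feature(i, j) for i in tags for j in tags]
            + [observation_feature(i, o) for i in tags for o in vocab])
print(len(features))  # 3*3 transition + 3*3 observation = 18
```

The transition family contributes one feature per ordered tag pair, which is where the quadratic growth in the number of tags comes from.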

       Conditional Distribution for
          Linear Chain CRF
     • Using these feature functions for a
       simple linear chain CRF, we can define:
     $$P(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$$

     $$Z(X) = \sum_Y \exp\left(\sum_{t=1}^{T} \sum_{m=1}^{M} \lambda_m f_m(Y_t, Y_{t-1}, X_t)\right)$$
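
For a very short sentence, P(Y | X) can be computed by brute force, which makes the role of Z(X) concrete. The Python sketch below makes several assumptions not in the slides: a two-tag set, made-up weights stored per indicator feature, and a `<s>` placeholder standing in for Y_0 at t = 1.

```python
import math
from itertools import product

TAGS = ["DT", "NN"]     # hypothetical tag set
START = "<s>"           # assumed placeholder for Y_0

# Hypothetical weights lambda_m, one per indicator feature that has nonzero weight.
trans_w = {("<s>", "DT"): 1.0, ("DT", "NN"): 2.0, ("NN", "NN"): 0.5}  # (previous, current)
obs_w = {("DT", "the"): 2.0, ("NN", "dog"): 1.5}                      # (tag, word)


def score(tags, words):
    """sum_t sum_m lambda_m f_m(Y_t, Y_{t-1}, X_t) for one label sequence."""
    total, prev = 0.0, START
    for tag, word in zip(tags, words):
        total += trans_w.get((prev, tag), 0.0) + obs_w.get((tag, word), 0.0)
        prev = tag
    return total


def crf_probability(tags, words):
    """P(Y | X), with Z(X) summed over every possible label sequence."""
    z = sum(math.exp(score(ys, words)) for ys in product(TAGS, repeat=len(words)))
    return math.exp(score(tags, words)) / z


words = ["the", "dog"]
for ys in product(TAGS, repeat=len(words)):
    print(ys, round(crf_probability(ys, words), 4))
```

Enumerating all N^T label sequences is only practical for toy inputs; real implementations compute Z(X) with the forward algorithm instead.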

       Adding Token Features to a CRF
  • Can add token features Xi,j
       Y1                 Y2          …          YT

X1,1   …    X1,m   X2,1   …    X2,m   …   XT,1   …    XT,m

   • Can add additional feature functions for
     each token feature to model the conditional
     distribution
   Features in POS Tagging

• For POS tagging, use orthographic
  (spelling-based) features of tokens.
  – Capitalized?
  – Start with numeral?
  – Ends in given suffix (e.g. “s”, “ed”, “ly”)?
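
These token features are simple string tests. Here is a minimal Python sketch of an extractor; the function name, the feature names, and the default suffix list are illustrative choices.

```python
def token_features(token, suffixes=("s", "ed", "ly")):
    """Return a dict of binary surface-form features for one token."""
    feats = {
        "capitalized": token[:1].isupper(),
        "starts_with_numeral": token[:1].isdigit(),
    }
    for suffix in suffixes:
        feats[f"ends_in_{suffix}"] = token.lower().endswith(suffix)
    return feats


print(token_features("Quickly"))
print(token_features("1980s"))
```

Each entry of the returned dict corresponds to one token feature X_{t,k} that feature functions can test alongside the tag.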

   Enhanced Linear Chain CRF
      (standard approach)
  • Can also condition transition on the
    current token features.
  Y1                       Y2                    …          YT

            X1,1                    X2,1        …                       XT,1


            X1,m                    X2,m                                XT,m

• Add feature functions:
       • fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1,
         and 0 otherwise
      Supervised Learning
     (Parameter Estimation)
• As in logistic regression, use the L-BFGS
  optimization procedure to set the λ weights
  to maximize the conditional log-likelihood
  (CLL) of the supervised training data
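
Written out for a linear-chain CRF, the objective and its gradient take the following form. This is a sketch under assumed notation: D labeled training sequences indexed by d, with no regularization term, which the slides do not discuss.

```latex
\mathrm{CLL}(\lambda)
  = \sum_{d=1}^{D} \log P\bigl(Y^{(d)} \mid X^{(d)}\bigr)
  = \sum_{d=1}^{D} \Bigl[ \sum_{t=1}^{T_d} \sum_{m=1}^{M}
      \lambda_m f_m\bigl(Y^{(d)}_t, Y^{(d)}_{t-1}, X^{(d)}_t\bigr)
      - \log Z\bigl(X^{(d)}\bigr) \Bigr]

\frac{\partial\, \mathrm{CLL}}{\partial \lambda_m}
  = \sum_{d=1}^{D} \Bigl[ \sum_{t=1}^{T_d}
      f_m\bigl(Y^{(d)}_t, Y^{(d)}_{t-1}, X^{(d)}_t\bigr)
      - \mathbb{E}_{Y \sim P(\cdot \mid X^{(d)})}
        \sum_{t=1}^{T_d} f_m\bigl(Y_t, Y_{t-1}, X^{(d)}_t\bigr) \Bigr]
```

The gradient is the observed count of each feature minus its expected count under the current model, so it vanishes when the two match.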

        Sequence Tagging
• A variant of the dynamic programming (Viterbi)
  algorithm can be used to efficiently, in
  O(TN^2) time for T tokens and N labels,
  determine the globally most probable label
  sequence for a given token sequence using a
  given log-linear model of the conditional
  probability P(Y | X)
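
Since Z(X) is the same for every label sequence, decoding only needs the unnormalized log-scores. Below is a minimal Python sketch of Viterbi decoding for a linear-chain model; it assumes a non-empty sentence and a `<s>` placeholder for Y_0, and the weights are hypothetical.

```python
def viterbi(words, tags, score_fn, start="<s>"):
    """Return the highest-scoring tag sequence under a linear-chain model.

    score_fn(prev_tag, tag, word) is the local log-score
    sum_m lambda_m f_m(Y_t, Y_{t-1}, X_t); `start` stands in for Y_0.
    Assumes `words` is non-empty.
    """
    # best[t][tag] = best score of any labeling of words[:t+1] ending in tag
    best = [{tag: score_fn(start, tag, words[0]) for tag in tags}]
    back = []                                   # back[t-1][tag] = best previous tag
    for word in words[1:]:
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + score_fn(p, tag, word))
            scores[tag] = best[-1][prev] + score_fn(prev, tag, word)
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical weights for a two-tag toy model.
trans_w = {("<s>", "DT"): 1.0, ("DT", "NN"): 2.0, ("NN", "NN"): 0.5}
obs_w = {("DT", "the"): 2.0, ("NN", "dog"): 1.5}


def score_fn(prev_tag, tag, word):
    return trans_w.get((prev_tag, tag), 0.0) + obs_w.get((tag, word), 0.0)


print(viterbi(["the", "dog"], ["DT", "NN"], score_fn))  # ['DT', 'NN']
```

The two nested loops over tags inside the loop over tokens give the O(TN^2) running time.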

              Skip-Chain CRFs
• Can model some long-distance dependencies
  (i.e. the same word appearing in different parts
  of the text) by including long-distance edges in
  the Markov model.

         Y1      Y2     Y3         Y100   Y101

         X1      X2     X3         X100   X101

       Michael   Dell   said       Dell bought

• Additional links make exact inference
  intractable, so must resort to approximate
  inference to try to find the most probable
  labeling
                CRF Results
• Experimental results verify that CRFs have
  superior accuracy on various sequence
  labeling tasks
  –   Part of Speech tagging
  –   Noun phrase chunking
  –   Named entity recognition
  –   Semantic role labeling
• However, CRFs are much slower to train and
  do not scale as well to large amounts of
  training data
  – Training for POS on full Penn Treebank (~1M
    words) currently takes “over a week.”
• Skip-chain CRFs improve results on IE
                CRF Summary
• CRFs are a discriminative approach to sequence labeling
  whereas HMMs are generative
• Discriminative methods are usually more accurate since
  they are trained for a specific performance task
• CRFs also easily allow adding additional token features
  without making additional independence assumptions
• Training time is increased since a complex optimization
  procedure is needed to fit supervised training data
• CRFs are a state-of-the-art method for sequence labeling

          Phrase Structure
• Most languages have a word order
• Words are organized into phrases: groups
  of words that act as a single unit,
  or constituent
  – [The dog] [chased] [the cat].
  – [The fat dog] [chased] [the thin cat].
  – [The fat dog with red collar] [chased] [the
    thin old cat].
  – [The fat dog with red collar named Tom]
    [suddenly chased] [the thin old white cat].
• Noun phrase: A syntactic unit of a sentence
  that acts like a noun and usually contains
  a noun, called its head
  – An optional determiner followed by zero or more
    adjectives, a noun head, and zero or more
    prepositional phrases
• Prepositional phrase: Headed by a
  preposition; expresses spatial, temporal, or
  other attributes
• Verb phrase: The part of the sentence that
  depends on the verb; headed by the verb
• Adjective phrase: Acts like an adjective
          Phrase Chunking
• Find all non-recursive noun phrases
  (NPs) and verb phrases (VPs) in a
  sentence
  – [NP I] [VP ate] [NP the spaghetti] [PP
    with] [NP meatballs].
  – [NP He ] [VP reckons ] [NP the current
    account deficit ] [VP will narrow ] [PP to ]
    [NP only # 1.8 billion ] [PP in ] [NP
    September ]
• Some applications need all the noun
  phrases in a sentence
       Phrase Chunking as
       Sequence Labeling
• Tag individual words with one of 3 tags
  – B (Begin) word starts new target phrase
  – I (Inside) word is part of target phrase but
    not the first word
  – O (Other) word is not part of target phrase
• Sample for NP chunking (B = Begin, I = Inside, O = Other)
  – He/B reckons/O the/B current/I account/I deficit/I will/O
    narrow/O to/O only/B #/I 1.8/I billion/I in/O September/B ./O
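
Recovering chunks from a B/I/O tag sequence is a single left-to-right pass. The following Python sketch shows one way to do it; the function name is illustrative, and treating a stray I (one with no open chunk) as the start of a chunk is an assumption, not something the slides specify.

```python
def bio_to_chunks(tags):
    """Convert a B/I/O tag sequence into (start, end) spans, end exclusive."""
    chunks, start = [], None
    for i, tag in enumerate(tags):
        if tag in ("B", "O") and start is not None:
            chunks.append((start, i))      # close the currently open chunk
            start = None
        if tag == "B" or (tag == "I" and start is None):
            start = i                      # open a chunk (a stray I is treated as B)
    if start is not None:
        chunks.append((start, len(tags)))  # close a chunk that runs to the end
    return chunks


words = "He reckons the current account deficit will narrow to only # 1.8 billion in September .".split()
tags = "B O B I I I O O O B I I I O B O".split()
for start, end in bio_to_chunks(tags):
    print(" ".join(words[start:end]))
```

On the sample sentence this prints the four noun phrases: "He", "the current account deficit", "only # 1.8 billion", and "September".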

       Evaluating Chunking
• Per token accuracy does not evaluate finding
  correct full chunks. Instead use:
      $$\text{Precision} = \frac{\text{Number of correct chunks found}}{\text{Total number of chunks found}}$$

      $$\text{Recall} = \frac{\text{Number of correct chunks found}}{\text{Total number of actual chunks}}$$
• Take harmonic mean to produce a single
  evaluation metric called F measure.
      $$F_1 = \frac{1}{\left(\frac{1}{P} + \frac{1}{R}\right)/2} = \frac{2PR}{P + R}$$
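
Chunk-level scoring counts a predicted chunk as correct only if its boundaries match a gold chunk exactly. Here is a minimal Python sketch for a single sentence, assuming chunks are given as (start, end) spans; the example spans are hypothetical, and for a whole corpus each span would also need a sentence identifier.

```python
def chunk_prf(gold_chunks, predicted_chunks):
    """Chunk-level precision, recall, and F1 using exact span matches."""
    gold, predicted = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1


gold = [(0, 1), (2, 6), (9, 13), (14, 15)]
predicted = [(0, 1), (2, 5), (9, 13)]       # second chunk has a wrong right boundary
p, r, f = chunk_prf(gold, predicted)
print(p, r, f)
```

Here two of three predicted chunks are correct (P = 2/3) and two of four gold chunks are found (R = 1/2), giving F1 = 4/7. The chunk (2, 5) earns no partial credit even though most of its tokens are tagged correctly.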
   Current Chunking Results
• Best system for NP chunking: F1=96%
• Typical results for finding range of
  chunk types (CoNLL 2000 shared task:

