									             CS 388:
  Natural Language Processing:
   Discriminative Training and
Conditional Random Fields (CRFs)
      for Sequence Labeling

      Raymond J. Mooney
    University of Texas at Austin
                          Joint Distribution
•   The joint probability distribution for a set of random variables,
    X1,…,Xn gives the probability of every combination of values (an
    n-dimensional array with v^n values if all variables are discrete with v
    values; all v^n values must sum to 1): P(X1,…,Xn)
                   positive                         negative
                   circle     square               circle      square
         red       0.20       0.02        red      0.05        0.30
         blue      0.02       0.01        blue     0.20        0.20
•   The marginal probability of all possible conjunctions (assignments of
    values to some subset of variables) can be calculated by summing the
    appropriate subset of values from the joint distribution.

•   Therefore, all conditional probabilities can also be calculated.
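As a minimal sketch of the two bullets above, the following Python stores the example table as a dictionary and computes one marginal and one conditional by summation. The dictionary layout and the helper name `prob` are illustrative choices, not part of the slides.

```python
# Joint distribution P(Category, Color, Shape) taken from the table above.
joint = {
    ("positive", "red", "circle"): 0.20,
    ("positive", "red", "square"): 0.02,
    ("positive", "blue", "circle"): 0.02,
    ("positive", "blue", "square"): 0.01,
    ("negative", "red", "circle"): 0.05,
    ("negative", "red", "square"): 0.30,
    ("negative", "blue", "circle"): 0.20,
    ("negative", "blue", "square"): 0.20,
}

def prob(category=None, color=None, shape=None):
    """Marginal probability of a partial assignment (None means summed out)."""
    total = 0.0
    for (cat, col, shp), p in joint.items():
        if category in (None, cat) and color in (None, col) and shape in (None, shp):
            total += p
    return total

# Marginal: P(red, circle) sums the matching entries over both categories.
p_red_circle = prob(color="red", shape="circle")

# Conditional: P(positive | red, circle) = P(positive, red, circle) / P(red, circle).
p_pos_given_red_circle = prob("positive", "red", "circle") / p_red_circle

print(round(p_red_circle, 2), round(p_pos_given_red_circle, 2))
```

The same helper answers any query over the three variables, which is exactly why the full joint needs exponentially many entries.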

            Probabilistic Classification
• Let Y be the random variable for the class which takes
  values {y1,y2,…ym}.
• Let X be the random variable describing an instance
  consisting of a vector of values for n features
  <X1,X2…Xn>, let xk be a possible vector value for X and
  xij a possible value for Xi.
• For classification, we need to compute P(Y=yi | X=xk)
  for i = 1…m
• Could be done using joint distribution but this requires
  estimating an exponential number of parameters.

           Bayesian Categorization
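• Determine category of xk by determining for each yi
    \[ P(Y=y_i \mid X=x_k) = \frac{P(Y=y_i)\, P(X=x_k \mid Y=y_i)}{P(X=x_k)} \]
• P(X=xk) can be determined since categories are
  complete and disjoint.
    \[ \sum_{i=1}^{m} P(Y=y_i \mid X=x_k) = 1 \;\Rightarrow\; P(X=x_k) = \sum_{i=1}^{m} P(Y=y_i)\, P(X=x_k \mid Y=y_i) \]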

      Bayesian Categorization (cont.)
• Need to know:
   – Priors: P(Y=yi)
   – Conditionals: P(X=xk | Y=yi)
• P(Y=yi) are easily estimated from data.
   – If ni of the examples in D are in yi then P(Y=yi) = ni / |D|
• Too many possible instances (e.g. 2^n for binary
  features) to estimate all P(X=xk | Y=yi).
• Still need to make some sort of independence
  assumptions about the features to make learning
  tractable.
            Naïve Bayes Generative Model

   [Figure: a collection of category labels (pos, neg) and, for each category
    (Positive, Negative), collections of Size (sm, med, lg), Color (red, blue,
    grn), and Shape (circ, sqr, tri) values.]
            Naïve Bayes Inference Problem

   [Figure: the same category labels and per-category Size, Color, and Shape
    collections, with a new instance <lg, red, circ> whose category is
    unknown (??).]
       Naïve Bayesian Categorization
• If we assume features of an instance are independent given
  the category (conditionally independent), then
    \[ P(X \mid Y) = P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y) \]
• Therefore, we only need to know P(Xi | Y) for each
  possible pair of a feature-value and a category.
• If Y and all Xi are binary, this requires specifying only 2n
  parameters:
   – P(Xi=true | Y=true) and P(Xi=true | Y=false) for each Xi
   – P(Xi=false | Y) = 1 – P(Xi=true | Y)

• Compared to specifying 2^n parameters without any
  independence assumptions.
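A minimal sketch of naïve Bayesian categorization, assuming a tiny hypothetical training set over the Size/Color/Shape features and plain relative-frequency estimates (no smoothing). The data and all names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (size, color, shape) -> category.
data = [
    (("sm", "red", "circ"), "pos"),
    (("lg", "red", "circ"), "pos"),
    (("med", "blue", "circ"), "pos"),
    (("lg", "blue", "sqr"), "neg"),
    (("sm", "grn", "circ"), "neg"),
    (("lg", "red", "tri"), "neg"),
]

# Priors P(Y) and conditionals P(X_i | Y) estimated by relative frequency.
class_counts = Counter(y for _, y in data)
value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
for x, y in data:
    for i, value in enumerate(x):
        value_counts[(i, y)][value] += 1

def posterior(x):
    """Return P(Y | X=x) under the naive Bayes independence assumption."""
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / len(data)                          # P(Y=y)
        for i, value in enumerate(x):
            score *= value_counts[(i, y)][value] / n_y   # P(X_i=value | Y=y)
        scores[y] = score
    z = sum(scores.values())                             # P(X=x)
    return {y: s / z for y, s in scores.items()}

result = posterior(("lg", "red", "circ"))
print(result)
```

Without smoothing, a feature value never seen with a class drives that class's score to zero, and an instance unseen under every class makes `z` zero, so a practical version would add smoothing.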

    Generative vs. Discriminative Models

•  Generative models are not directly designed to
  maximize the performance of classification. They model the
  complete joint distribution P(X,Y).
• Classification is then done using Bayesian inference given
  the generative model of the joint distribution.
• But a generative model can also be used to perform any other
  inference task, e.g. P(X1 | X2, …Xn, Y)
    – “Jack of all trades, master of none.”
• Discriminative models are specifically designed and trained
  to maximize performance of classification. They only model
  the conditional distribution P(Y | X).
• By focusing on modeling the conditional distribution, they
  generally perform better on classification than generative
  models when given a reasonable amount of training data.

                Logistic Regression
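• Assumes a parametric form for directly estimating
  P(Y | X). For binary concepts, one standard way to write this is:
    \[ P(Y=1 \mid X) = \frac{1}{1 + \exp\!\big(-(w_0 + \sum_{i=1}^{n} w_i X_i)\big)}, \qquad P(Y=0 \mid X) = 1 - P(Y=1 \mid X) \]
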
• Equivalent to a one-layer backpropagation neural net.
   – Logistic regression is the source of the sigmoid function
     used in backpropagation.
   – Objective function for training is somewhat different.
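A minimal sketch of the binary form, assuming the common sign convention P(Y=1 | X) = 1 / (1 + exp(−(w0 + Σ wi Xi))). The function name and the example weights are hypothetical.

```python
import math

def p_y1_given_x(x, w0, w):
    """P(Y=1 | X=x) for binary logistic regression (assumed sign convention)."""
    z = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Here w0 + w.x = -1 + 0.5 + 0 + 0.5 = 0, which sits on the decision boundary.
p = p_y1_given_x([1.0, 0.0, 2.0], w0=-1.0, w=[0.5, 2.0, 0.25])
print(p)
```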
    Logistic Regression as a Log-Linear Model
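• Logistic regression is basically a linear model, which
  is demonstrated by taking logs: the log odds are linear in the
  features.
    \[ \ln \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = w_0 + \sum_{i=1}^{n} w_i X_i \]
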
• Also called a maximum entropy model (MaxEnt)
  because it can be shown that standard training for
  logistic regression gives the distribution with maximum
  entropy that is consistent with the training data.
       Logistic Regression Training
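• Weights are set during training to maximize the
  conditional data likelihood:
    \[ W \leftarrow \arg\max_W \prod_{d \in D} P(Y^d \mid X^d, W) \]
  where D is the set of training examples and Yd and
  Xd denote, respectively, the values of Y and X for
  example d.
• Equivalently viewed as maximizing the
  conditional log likelihood (CLL):
    \[ W \leftarrow \arg\max_W \sum_{d \in D} \ln P(Y^d \mid X^d, W) \]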
        Logistic Regression Training

• Like neural-nets, can use standard gradient
  descent to find the parameters (weights) that
  optimize the CLL objective function.
• Many other more advanced training
  methods are possible to speed convergence.
  –   Conjugate gradient
  –   Generalized Iterative Scaling (GIS)
  –   Improved Iterative Scaling (IIS)
  –   Limited-memory quasi-Newton (L-BFGS)
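A minimal sketch of the simplest option above: batch gradient ascent on the CLL for binary logistic regression. It assumes the sigmoid convention P(Y=1 | X) = 1 / (1 + exp(−W·X)) with a constant 1 prepended to each feature vector for the bias. The toy data, step size, and iteration count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cll(data, w):
    """Conditional log likelihood: sum over examples of ln P(Y^d | X^d, W)."""
    total = 0.0
    for x, y in data:
        p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total += math.log(p1 if y == 1 else 1.0 - p1)
    return total

def train(data, steps=500, rate=0.05):
    """Batch gradient ascent on the CLL; w[0] is the bias (x[0] is always 1)."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            error = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += error * xj      # partial derivative of CLL w.r.t. w_j
        w = [wj + rate * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical, non-separable toy data: ([1, feature], label).
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 2.0], 1),
        ([1.0, 3.0], 0), ([1.0, 4.0], 1), ([1.0, 5.0], 1)]
w = train(data)
print(w, cll(data, [0.0, 0.0]), cll(data, w))
```

The methods listed above optimize the same objective; they differ in how quickly they converge.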
Preventing Overfitting in Logistic Regression

• To prevent overfitting, one can use regularization
  (a.k.a. smoothing) by penalizing large weights by
  changing the training objective:
    \[ W \leftarrow \arg\max_W \sum_{d \in D} \ln P(Y^d \mid X^d, W) - \frac{\lambda}{2} \lVert W \rVert^2 \]
   where λ is a constant that determines the amount of smoothing.

• This can be shown to be equivalent to MAP
  parameter estimation assuming a Gaussian prior
  for W with zero mean and a variance related to 1/λ.
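The MAP connection can be sketched as follows, assuming an independent zero-mean Gaussian prior with variance σ² on each weight and a penalty written as (λ/2)‖W‖². The symbol σ² is introduced here for illustration and does not appear in the slides.

```latex
\begin{align*}
W_{\text{MAP}} &= \arg\max_W \; \ln P(W) + \sum_{d \in D} \ln P(Y^d \mid X^d, W) \\
P(W) &\propto \exp\!\left(-\frac{\lVert W \rVert^2}{2\sigma^2}\right)
  \;\Rightarrow\; \ln P(W) = -\frac{1}{2\sigma^2}\lVert W \rVert^2 + \text{const} \\
W_{\text{MAP}} &= \arg\max_W \; \sum_{d \in D} \ln P(Y^d \mid X^d, W) - \frac{\lambda}{2}\lVert W \rVert^2,
  \qquad \lambda = \frac{1}{\sigma^2}
\end{align*}
```

Under these assumptions a smaller prior variance corresponds to a larger λ, that is, to more smoothing.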
       Multinomial Logistic Regression
• Logistic regression can be generalized to multi-class
  problems (where Y has a multinomial distribution).
• Create feature functions for each combination of a
  class value y´ and each feature Xj and another for the
  “bias weight” of each class.
   – f y´, j (Y, X) = Xj if Y= y´ and 0 otherwise
   – f y´ (Y, X) = 1 if Y= y´ and 0 otherwise
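• The final conditional distribution is:
    \[ P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_k \lambda_k f_k(Y, X) \Big) \]      (λk are weights)

    \[ Z(X) = \sum_{Y'} \exp\Big( \sum_k \lambda_k f_k(Y', X) \Big) \]      (normalizing constant)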
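A minimal sketch of the multinomial case, assuming the feature functions described above, so that for a candidate class only that class's feature weights and bias weight contribute to the sum. The class names and weights are hypothetical.

```python
import math

def class_scores(x, classes, weights, bias):
    """Sum of lambda_k * f_k(y, x) for each candidate class y.

    Only the feature functions tied to class y are non-zero, so the sum
    reduces to bias[y] + sum_j weights[y][j] * x[j].
    """
    return {y: bias[y] + sum(w * xj for w, xj in zip(weights[y], x))
            for y in classes}

def conditional(x, classes, weights, bias):
    """P(Y | X=x) = exp(score(y)) / Z(x)."""
    scores = class_scores(x, classes, weights, bias)
    z = sum(math.exp(s) for s in scores.values())    # normalizing constant Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Hypothetical weights for three classes and two features.
classes = ["a", "b", "c"]
weights = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 0.0]}
bias = {"a": 0.0, "b": 0.0, "c": 0.5}
dist = conditional([2.0, 1.0], classes, weights, bias)
print(dist)
```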
                     Graphical Models
• If no assumption of independence is made, then an
  exponential number of parameters must be estimated for
  sound probabilistic inference.
   – No realistic amount of training data is sufficient to estimate so many
     parameters.
• If a blanket assumption of conditional independence is made,
  efficient training and inference is possible, but such a strong
  assumption is rarely warranted.
• Graphical models use directed or undirected graphs over a
  set of random variables to explicitly specify variable
  dependencies and allow for less restrictive independence
  assumptions while limiting the number of parameters that
  must be estimated.
   – Bayesian Networks: Directed acyclic graphs that indicate causal
     structure.
   – Markov Networks: Undirected graphs that capture general
     dependencies.
             Bayesian Networks

• Directed Acyclic Graph (DAG)
  – Nodes are random variables
  – Edges indicate causal influences

         Burglary            Earthquake

                     Alarm

         JohnCalls           MaryCalls

   (Edges: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls,
    Alarm → MaryCalls)
           Conditional Probability Tables
• Each node has a conditional probability table (CPT) that
  gives the probability of each of its values given every possible
  combination of values for its parents (conditioning case).
   – Roots (sources) of the DAG that have no parents are given prior
     probabilities.

   Burglary:     P(B)   [value missing]
   Earthquake:   P(E) = .002

   Alarm:        B   E   P(A)
                 T   T   .95
                 T   F   .94
                 F   T   .29
                 F   F   .001

   JohnCalls:    A   P(J)          MaryCalls:    A   P(M)
                 T   .90                         T   .70
                 F   .05                         F   .01
    Joint Distributions for Bayes Nets
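• A Bayesian Network implicitly defines a joint
  distribution:
    \[ P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i)) \]
• Example:
    \[ P(J, M, A, \neg B, \neg E) = P(J \mid A)\, P(M \mid A)\, P(A \mid \neg B, \neg E)\, P(\neg B)\, P(\neg E) \]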
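As an example of this factorization, here is a minimal sketch that multiplies CPT entries for the alarm network. The CPT values come from the tables on the previous slide, except P(B), which does not survive in the text; the value 0.001 used below is an assumption made only so the example runs.

```python
from itertools import product

P_B = 0.001   # ASSUMPTION: prior for Burglary is not shown in the text above
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=T | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=T | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=T | A)

def bern(p_true, value):
    """Probability that a Boolean variable takes `value`, given P(true)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of each node's CPT entry."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

# Both neighbors call and the alarm sounds, with no burglary or earthquake.
p = joint(b=False, e=False, a=True, j=True, m=True)

# Sanity check: the 32 joint entries implied by the CPTs sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
print(p, total)
```

The network stores 10 numbers (1 + 1 + 4 + 2 + 2) instead of the 31 independent entries of the full joint table.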
       Naïve Bayes as a Bayes Net
• Naïve Bayes is a simple Bayes Net

                     Y

             X1   X2       …   Xn

   (Edges: Y → Xi for each i)

• Priors P(Y) and conditionals P(Xi|Y) for
  Naïve Bayes provide CPTs for the network.
             Markov Networks
• Undirected graph over a set of random
  variables, where an edge represents a
  dependency.
• The Markov blanket of a node, X, in a
  Markov Net is the set of its neighbors in the
  graph (nodes that have an edge connecting
  to X).
• Every node in a Markov Net is
  conditionally independent of every other
  node given its Markov blanket.
      Distribution for a Markov Network
• The distribution of a Markov net is most compactly described
  in terms of a set of potential functions (a.k.a. factors,
  compatibility functions), φk, for each clique, k, in the graph.
• For each joint assignment of values to the variables in clique
  k, φk assigns a non-negative real value that represents the
  compatibility of these values.
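• The joint distribution of a Markov network is then defined by:
    \[ P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}), \qquad Z = \sum_{x} \prod_k \phi_k(x_{\{k\}}) \]
    where x{k} represents the joint assignment of the variables
    in clique k, and Z is a normalizing constant that makes the
    joint distribution sum to 1.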
              Sample Markov Network
   Nodes: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)

   B   A   f1               E   A   f2
   T   T   100              T   T   50
   T   F   1                T   F   10
   F   T   1                F   T   1
   F   F   200              F   F   200

   J   A   f3               M   A   f4
   T   T   75               T   T   50
   T   F   10               T   F   1
   F   T   1                F   T   10
   F   F   200              F   F   200
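A minimal sketch of how the four potential tables of the sample network define a joint distribution, assuming each potential couples Alarm with one other variable, as the table headers indicate. Z is computed by brute-force enumeration, which is feasible only because there are five Boolean variables.

```python
from itertools import product

# Potential tables from the sample network, keyed by (variable, Alarm).
f1 = {(True, True): 100, (True, False): 1, (False, True): 1, (False, False): 200}  # (B, A)
f2 = {(True, True): 50, (True, False): 10, (False, True): 1, (False, False): 200}  # (E, A)
f3 = {(True, True): 75, (True, False): 10, (False, True): 1, (False, False): 200}  # (J, A)
f4 = {(True, True): 50, (True, False): 1, (False, True): 10, (False, False): 200}  # (M, A)

def score(b, e, a, j, m):
    """Unnormalized product of the clique potentials."""
    return f1[(b, a)] * f2[(e, a)] * f3[(j, a)] * f4[(m, a)]

assignments = list(product([True, False], repeat=5))
Z = sum(score(*x) for x in assignments)          # normalizing constant

def prob(b, e, a, j, m):
    """Joint probability: potential product divided by Z."""
    return score(b, e, a, j, m) / Z

best = max(assignments, key=lambda x: score(*x))
print(Z, best, prob(*best))
```

Unlike CPT entries, the potentials are not probabilities; only their product divided by Z is.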
  Logistic Regression as a Markov Net
• Logistic regression is a simple Markov Net

                      Y

              X1   X2       …   Xn

• But only models the conditional distribution,
  P(Y | X) and not the full joint P(X,Y)
• Same as a discriminatively trained naïve
  Bayes.
          Generative vs. Discriminative
           Sequence Labeling Models

• HMMs are generative models and are not directly
  designed to maximize the performance of sequence
  labeling. They model the joint distribution P(O,Q).
• HMMs are trained to have an accurate probabilistic
  model of the underlying language, and not all
  aspects of this model benefit the sequence labeling
  task.
• Conditional Random Fields (CRFs) are
  specifically designed and trained to maximize
  performance of sequence labeling. They model the
  conditional distribution P(Q | O).

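   [Figure: graphical models over features X1, X2, …, Xn for classification
    and, under "Sequence Labeling", over labels Y1, Y2, …, YT and observations
    X1, X2, …, XT, the last of which is labeled "Linear-chain CRF".]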
   Simple Linear Chain CRF Features

• Modeling the conditional distribution is
  similar to that used in multinomial logistic
  regression.
• Create feature functions fk(Yt, Yt−1, Xt)
  – Feature for each state transition pair i, j
     • fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j and 0 otherwise
  – Feature for each state observation pair i, o
     • fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o and 0 otherwise
• Note: number of features grows quadratically
  in the number of states (i.e. tags).
       Conditional Distribution for
           Linear Chain CRF
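• Using these feature functions for a simple
  linear chain CRF, we can define:
    \[ P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(Y_t, Y_{t-1}, X_t) \Big) \]
    \[ Z(X) = \sum_{Y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(Y'_t, Y'_{t-1}, X_t) \Big) \]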
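A minimal sketch of this conditional distribution, computing Z(X) by brute-force enumeration of all label sequences (exponential in T, so suitable only for tiny examples). The tag set, weights, sentence, and the start symbol `<s>` standing in for Y0 are all hypothetical.

```python
import math
from itertools import product

def sequence_score(ys, xs, trans, emit, start="<s>"):
    """Sum over positions t of the weights of the features that fire at t."""
    total, prev = 0.0, start
    for y, x in zip(ys, xs):
        total += trans.get((y, prev), 0.0)   # weight of f_{i,j}: Y_t = i, Y_{t-1} = j
        total += emit.get((y, x), 0.0)       # weight of f_{i,o}: Y_t = i, X_t = o
        prev = y
    return total

def crf_probability(ys, xs, states, trans, emit):
    """P(Y=ys | X=xs) = exp(score(ys)) / Z(xs), with Z summed over all sequences."""
    z = sum(math.exp(sequence_score(cand, xs, trans, emit))
            for cand in product(states, repeat=len(xs)))
    return math.exp(sequence_score(ys, xs, trans, emit)) / z

# Hypothetical tags, weights, and sentence.
states = ["N", "V"]
trans = {("N", "<s>"): 1.0, ("V", "N"): 1.5, ("N", "V"): 0.5}
emit = {("N", "dogs"): 2.0, ("V", "bark"): 2.0, ("N", "bark"): 0.5}
xs = ["dogs", "bark"]
probs = {ys: crf_probability(ys, xs, states, trans, emit)
         for ys in product(states, repeat=len(xs))}
print(probs)
```

Practical implementations compute Z(X) with a forward-style dynamic program instead of enumeration.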

        Adding Token Features to a CRF

   • Can add token features Xi,j

       Y1                 Y2          …          YT

X1,1   …    X1,m   X2,1   …    X2,m   …   XT,1   …    XT,m

   • Can add additional feature functions for
     each token feature to model the conditional
     distribution.
         Features in POS Tagging

• For POS Tagging, use lexicographic
  features of tokens.
  – Capitalized?
  – Start with numeral?
  – Ends in given suffix (e.g. “s”, “ed”, “ly”)?
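A minimal sketch of extracting the token features listed above as binary values. The function name, the feature names, and the default suffix list are illustrative choices.

```python
def token_features(token, suffixes=("s", "ed", "ly")):
    """Binary surface features of a single token."""
    features = {
        "capitalized": token[:1].isupper(),
        "starts_with_numeral": token[:1].isdigit(),
    }
    for suffix in suffixes:
        features["ends_in_" + suffix] = token.endswith(suffix)
    return features

print(token_features("Quickly"))
print(token_features("3rd"))
```

Each entry corresponds to one binary token feature Xt,k that feature functions can then pair with tags.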

            Enhanced Linear Chain CRF
               (standard approach)
  • Can also condition transition on the current
    token features.
  Y1                       Y2                     …         YT

            X1,1                    X2,1        …                       XT,1


            X1,m                    X2,m                                XT,m

• Add feature functions:
       • fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1
         and 0 otherwise
           Supervised Learning
          (Parameter Estimation)
• As in logistic regression, use the L-BFGS
  optimization procedure to set the λ weights to
  maximize the CLL of the supervised training
  data.
• See paper for details.

             Sequence Tagging
• A variant of the Viterbi algorithm can be used to
  efficiently, in O(TN^2) time for T tokens and N
  states, determine the globally most probable label
  sequence for a given token sequence using a given
  log-linear model of the conditional probability P(Y | X).
• See paper for details.
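A minimal sketch of Viterbi decoding for a log-linear chain model, assuming transition and observation weights stored in dictionaries and a hypothetical start symbol `<s>`. Because Z(X) is the same for every label sequence, maximizing the summed weights also maximizes P(Y | X). The example weights are invented.

```python
def viterbi(xs, states, trans, emit, start="<s>"):
    """Highest-scoring label sequence for tokens xs in O(T * N^2) time."""
    # scores[y]: best total weight of any label prefix ending in state y.
    scores = {y: trans.get((y, start), 0.0) + emit.get((y, xs[0]), 0.0)
              for y in states}
    backpointers = []
    for x in xs[1:]:
        new_scores, pointers = {}, {}
        for y in states:
            # Best previous state for current state y (N choices per state).
            prev = max(states, key=lambda p: scores[p] + trans.get((y, p), 0.0))
            new_scores[y] = (scores[prev] + trans.get((y, prev), 0.0)
                             + emit.get((y, x), 0.0))
            pointers[y] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Follow backpointers from the best final state.
    last = max(states, key=lambda y: scores[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]

# Hypothetical tags, weights, and sentence.
states = ["N", "V"]
trans = {("N", "<s>"): 1.0, ("V", "N"): 1.5, ("N", "V"): 0.5}
emit = {("N", "dogs"): 2.0, ("V", "bark"): 2.0, ("N", "bark"): 0.5}
path, score = viterbi(["dogs", "bark"], states, trans, emit)
print(path, score)
```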

                  Skip-Chain CRFs
• Can model some long-distance dependencies (e.g. the
  same word appearing in different parts of the text) by
  including long-distance edges in the Markov model.

          Y1      Y2     Y3         Y100   Y101

          X1      X2     X3         X100   X101

        Michael   Dell   said       Dell bought

• Additional links make exact inference intractable,
  so must resort to approximate inference to try to
  find the most probable labeling.
                     CRF Results
• Experimental results verify that CRFs have superior
  accuracy on various sequence labeling tasks:
   –   Part of Speech tagging
   –   Noun phrase chunking
   –   Named entity recognition
   –   Semantic role labeling
• However, CRFs are much slower to train and do
  not scale as well to large amounts of training data.
   – Training for POS on full Penn Treebank (~1M words)
     currently takes “over a week.”
• Skip-chain CRFs improve results on IE.
                   CRF Summary
• CRFs are a discriminative approach to sequence
  labeling whereas HMMs are generative.
• Discriminative methods are usually more accurate
  since they are trained for a specific performance task.
• CRFs also easily allow adding additional token features
  without making additional independence assumptions.
• Training time is increased since a complex
  optimization procedure is needed to fit supervised
  training data.
• CRFs are a state-of-the-art method for sequence
  labeling.
