Mini-CRF Tutorial Part 1: Model Formulation and Training


    Robuddies: October 11, 2006
          Mark Schmidt
                Logistic Regression
         (discriminative binary classifier)

[Diagram: label node Y above feature node X]

y: class label
w: weights on features
x: features (pixel intensities, outputs of filters, etc.)

    Likelihood function:
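
In standard notation, and assuming the {-1,1} label convention with w^T x denoting the inner product of weights and features (bias term omitted), the logistic regression likelihood can be written as:

```latex
p(y = 1 \mid x, w) = \frac{1}{1 + \exp(-w^T x)}, \qquad
p(y = -1 \mid x, w) = 1 - p(y = 1 \mid x, w) = \frac{1}{1 + \exp(w^T x)}
```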
Logistic Regression Inference

      Given an instance x with weights w, assign
      x to the class with the highest likelihood:
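
A minimal sketch of this decision rule, assuming labels in {-1,1} and the standard logistic likelihood 1 / (1 + exp(-y w^T x)). The function names and the example weights are hypothetical, chosen only for illustration.

```python
import math

def likelihood(y, x, w):
    """p(y | x, w) for y in {-1, +1}: 1 / (1 + exp(-y * w.x))."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-y * score))

def predict(x, w):
    """Return the label in {-1, +1} with the highest likelihood."""
    return max((-1, 1), key=lambda y: likelihood(y, x, w))

w = [2.0, -1.0]                  # hypothetical weights
print(predict([1.0, 0.5], w))    # here w.x = 1.5, which is positive
print(predict([0.0, 3.0], w))    # here w.x = -3.0, which is negative
```

Because the two likelihoods sum to 1, this rule is equivalent to checking the sign of w^T x.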

    Logistic Regression Training Objective
Supervised Training:


Assuming a class representation of {-1,1} we can write the likelihood of x being
in the ‘right’ class as:

                                                              (Equation 2 in paper)
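
With labels in {-1,1}, the two class likelihoods of standard logistic regression collapse into a single expression, and the likelihood of a training set is the product over examples. The notation (x^{(m)}, y^{(m)}) for the m-th of M training examples and the name L(w) are assumptions made here for illustration:

```latex
p(y \mid x, w) = \frac{1}{1 + \exp(-y\, w^T x)}, \qquad
L(w) = \prod_{m=1}^{M} \frac{1}{1 + \exp\left(-y^{(m)} w^T x^{(m)}\right)}
```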

Maximum likelihood training finds the weights w that maximize the product of
this expression over all training examples
         (in practice, a bias term is added and we compute an MAP estimate)
Logistic Regression Training Optimization

 We write the log-likelihood:
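
For the standard logistic likelihood with labels in {-1,1}, taking the log turns the product over training examples into a sum. The indexing (x^{(m)}, y^{(m)}), m = 1..M, is an assumed notation:

```latex
\log L(w) = \sum_{m=1}^{M} \log p\left(y^{(m)} \mid x^{(m)}, w\right)
          = -\sum_{m=1}^{M} \log\left(1 + \exp\left(-y^{(m)} w^T x^{(m)}\right)\right)
```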

The log-likelihood is concave in w, so any local maximum is a global maximum
(with a Gaussian prior on w, as in an MAP estimate, the maximum is unique).
Gradient methods (Newton, quasi-Newton, non-linear CG, etc.) can therefore be
used to find a w yielding this maximum.

 Note that the gradient of the log-likelihood has a special form:
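
One way to write this gradient, assuming the standard logistic likelihood with labels in {-1,1}, is as a sum over training examples of (1 - p(y | x, w)) y x: each example pushes w along y x, weighted by how far the model is from being certain of the right class. The sketch below uses that form with plain fixed-step gradient ascent as a simple stand-in for the Newton or quasi-Newton methods mentioned above. The toy data, step size, and function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log_likelihood(w, X, Y):
    """Sum over examples of log p(y | x, w), with y in {-1, +1}."""
    return sum(math.log(sigmoid(y * dot(w, x))) for x, y in zip(X, Y))

def gradient(w, X, Y):
    """Sum over examples of (1 - p(y | x, w)) * y * x."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        coeff = (1.0 - sigmoid(y * dot(w, x))) * y
        for j, xj in enumerate(x):
            g[j] += coeff * xj
    return g

def train(X, Y, step=0.1, iters=500):
    """Fixed-step gradient ascent on the log-likelihood, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(w, X, Y)
        w = [wj + step * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical toy data; the constant 1.0 in each row acts as a bias feature.
# The classes overlap, so the maximum likelihood weights are finite.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
Y = [-1, 1, -1, 1]
w = train(X, Y)
print(w, log_likelihood(w, X, Y))
```

Adding a prior on w (the MAP estimate) would only add the gradient of the log-prior to `gradient`.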
  Expressing the Likelihood in terms of Potentials

An alternate view to comparing the normalized likelihoods
of the two classes is that we are finding the maximum
among non-negative un-normalized potentials for each class.

 Using exp(0) == 1, we can re-write the likelihoods for each class as follows:

 Here, the class potentials are defined as the numerators:

 The denominator sums the potentials for the different classes.
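
Concretely, starting from the standard logistic likelihood 1 / (1 + exp(-w^T x)) for class 1 and multiplying numerator and denominator by exp(w^T x), the rewrite looks as follows. The symbol \phi for a class potential is an assumed name:

```latex
p(y = 1 \mid x, w) = \frac{\exp(w^T x)}{\exp(w^T x) + \exp(0)}, \qquad
p(y = -1 \mid x, w) = \frac{\exp(0)}{\exp(w^T x) + \exp(0)}

\phi(y = 1) = \exp(w^T x), \qquad \phi(y = -1) = \exp(0) = 1
```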
              Potentials vs. Likelihood
New view: classification is finding the class with the highest
potential:

We can re-write the likelihood in terms of class potentials:

Note that the denominator forces the sum of the likelihoods to be 1.
Training can be viewed as maximizing the potential of the “right”
class relative to the wrong one.
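
A small sketch of the potential view, assuming the class potentials exp(w^T x) for class 1 and exp(0) for class -1. The function names and example values are hypothetical. It shows that the class with the highest potential is also the class with the highest normalized likelihood.

```python
import math

def class_potentials(x, w):
    """Un-normalized, non-negative potentials for classes +1 and -1."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return {1: math.exp(score), -1: math.exp(0.0)}

def likelihoods(potentials):
    """Divide each potential by the sum over classes so the results sum to 1."""
    total = sum(potentials.values())
    return {c: p / total for c, p in potentials.items()}

phi = class_potentials([1.0, 0.5], [2.0, -1.0])   # hypothetical x and w
probs = likelihoods(phi)
best = max(phi, key=phi.get)    # same winner as max(probs, key=probs.get)
print(phi, probs, best)
```

Normalization rescales both potentials by the same positive constant, which is why it never changes the winner.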
     From likelihood to pseudolikelihood

We know how to do this:   How do we do this?

         Y                  Y1           Y2
          w                  w             w

         X                  X1           X2

x: features                        x_i: node features
w: weights                         w: node weights
y: label                           y_i: node label
                                   x_ij: edge features (typically a distance
                                         measure between node features)
                                   v: edge weights
Node and Edge Potentials

We typically re-write the class potential factorized into
node and edge potentials
(called association + interaction in the paper):

 In general:
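
One common parameterization, given here as an assumption rather than as the exact form used in the slides or the paper, is an Ising-like model with labels in {-1,1}. The symbols \phi_i (node potential), \psi_{ij} (edge potential), and E (the set of edges) are assumed names:

```latex
\phi_i(y_i) = \exp\left(y_i\, w^T x_i\right), \qquad
\psi_{ij}(y_i, y_j) = \exp\left(y_i y_j\, v^T x_{ij}\right)

\Phi(y) = \prod_i \phi_i(y_i) \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j)
```

With no edges, normalizing \phi_i over the two labels gives 1 / (1 + exp(-2 y_i w^T x_i)), which is logistic regression with w rescaled by a factor of 2.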
                      Pseudolikelihood training
                                           (Similar to Equation 3) (Equation 4 in paper)

Logistic Regression            Pseudolikelihood
Define:                        Define:

                                                      (Take log and add prior on
                                                      v to get expression similar
                                                      to Equation 5)

    Training for pseudolikelihood is almost identical to logistic regression, except:
             - you jointly optimize w and v instead of just w
             - you compute the likelihood for each node of each data point, instead
                       of 1 node per data point
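
The sketch below evaluates a log-pseudolikelihood: for each node, the potential of its observed label given its neighbours' observed labels, normalized over that node's two possible labels, then summed in log space over nodes. It assumes Ising-like potentials exp(y_i w^T x_i) for nodes and exp(y_i y_j v^T x_ij) for edges, which is one common choice and not necessarily the one in the paper. The data structures, names, and toy chain are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def node_log_potential(c, i, y, x_node, x_edge, neighbors, w, v):
    """Log potential of label c at node i, holding neighbour labels fixed."""
    total = c * dot(w, x_node[i])
    for j in neighbors[i]:
        key = (min(i, j), max(i, j))          # edges are stored once, as (low, high)
        total += c * y[j] * dot(v, x_edge[key])
    return total

def log_pseudolikelihood(y, x_node, x_edge, neighbors, w, v):
    """Sum over nodes of log p(y_i | neighbour labels, x, w, v)."""
    total = 0.0
    for i in range(len(y)):
        logs = {c: node_log_potential(c, i, y, x_node, x_edge, neighbors, w, v)
                for c in (-1, 1)}
        log_norm = math.log(sum(math.exp(s) for s in logs.values()))
        total += logs[y[i]] - log_norm
    return total

# Hypothetical 3-node chain Y1 - Y2 - Y3 with 1-D node and edge features.
x_node = [[1.0], [0.2], [-1.0]]
x_edge = {(0, 1): [1.0], (1, 2): [1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
w, v = [1.0], [0.5]
print(log_pseudolikelihood([1, 1, -1], x_node, x_edge, neighbors, w, v))
```

Training would maximize this quantity, summed over data points, jointly in w and v with the same gradient methods used for logistic regression.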
           Can use arbitrary structure connecting Y nodes

       v                   v                   v            v
Y1               Y2                                 Yn-1        Yn

 w                 w                                    w        w

X1               X2                                 Xn-1        Xn

                           Y1                      Y2
                   v                       v
     OR:               v                            w
                 Y3                Y4

                           X1          w           X2

                 X3                X4
Can use Global Features (or mixed):
                                 Y1                    Y2
                        v                      v
                            v                      w
                    Y3                  Y4
                            w          w


Can use Untied parameters (or mixed):

          v1                v2                 v9            v10
   Y1             Y2                                   Y9          Y10

     w1            w2                                   w3          w4

   X1             X2                                   X9          X10
                Pseudolikelihood Caveat


                                                       Y1              Y2

                                                         w               w

                                                       X1              X2

       Can’t find the probability of class label c for node y1 without the label y2
       For training, this is fine
       For testing, we usually don’t have this information
  (graph cuts are used in the paper to find an MAP estimate of a different model)
From Pseudolikelihood to Conditional Random Fields

 Pseudolikelihood:
 Likelihood is the potential for a node’s class label, normalized
 over all possible class labels.

 Conditional Random Field:
 Likelihood is the potential for a joint assignment of all nodes’ labels,
 normalized over all possible joint class label assignments.
Pseudolikelihood:           Y1       Y2

                             w        w

                            X1       X2

Conditional Random Field:
In general, CRFs use the following likelihood function:
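
In terms of node and edge potentials, the generic form is the joint potential of the labeling divided by the sum of joint potentials over all labelings. The symbols \phi_i, \psi_{ij}, E (the edge set), and y' (a candidate joint labeling) are assumed names:

```latex
p(y \mid x, w, v) =
  \frac{\prod_i \phi_i(y_i) \prod_{(i,j) \in E} \psi_{ij}(y_i, y_j)}
       {\sum_{y'} \prod_i \phi_i(y'_i) \prod_{(i,j) \in E} \psi_{ij}(y'_i, y'_j)}
```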

   The denominator is the ‘partition’ function and is typically written simply as Z.

                          (Equation 1 in paper)

 The log-likelihood is often written as (where w concatenates w and v):

 The gradient of the above is:

  As before, the term coming from the denominator can be expressed as an expectation:
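
For potentials that are exponentials of linear functions of the parameters, these three expressions take the standard log-linear form below. Here w is the concatenation of node and edge weights, and F(x, y) is an assumed name for the matching concatenation of node and edge features summed over the graph:

```latex
\begin{aligned}
\log p(y \mid x, w) &= w^T F(x, y) - \log Z(x, w), \qquad
  Z(x, w) = \sum_{y'} \exp\left(w^T F(x, y')\right) \\
\nabla_w \log p(y \mid x, w) &= F(x, y) -
  \frac{\sum_{y'} F(x, y') \exp\left(w^T F(x, y')\right)}{Z(x, w)} \\
&= F(x, y) - \mathbb{E}_{y' \sim p(\cdot \mid x, w)}\left[F(x, y')\right]
\end{aligned}
```

The gradient is the observed features minus their expectation under the model, so it vanishes when the two match.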
                  Conditional Random Fields

Training is the same as before:
         - find parameters {w,v} that jointly maximize the likelihood of the training data
Unlike pseudolikelihood, we can do inference:
        - find the marginal class probability for each y_i, or find the joint assignment
                 that has the highest potential (MAP)

       - if we have k nodes, then the denominator will have 2^k terms
But, not impractical:
          - for small k, everything can be done by brute force
          - for some graph structures, we can compute Z and do inference
                   exactly with dynamic programming
         - for binary labels and a restricted class of potentials, we can find
                   the MAP labeling with graph cuts
         - for other structures, there exist approximations with nice
(not actually needed for optimization w/ gradient methods or finding
MAP assignment)

                (needed for training w/ gradient methods or finding
                optimal max marginal assignment)
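
For small k, the brute-force route can be sketched directly: enumerate all 2^k joint labelings, sum their potentials to get Z, and read off node marginals and the MAP labeling. The sketch assumes binary labels in {-1,1} and Ising-like potentials exp(y_i w^T x_i) and exp(y_i y_j v^T x_ij), which is one common choice and not necessarily the paper's. All names and the toy chain are hypothetical.

```python
import itertools
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def log_potential(y, x_node, x_edge, w, v):
    """Log of the joint potential: node terms plus edge terms."""
    total = sum(yi * dot(w, xi) for yi, xi in zip(y, x_node))
    total += sum(y[i] * y[j] * dot(v, xij) for (i, j), xij in x_edge.items())
    return total

def brute_force(x_node, x_edge, w, v):
    """Enumerate all 2^k labelings; return Z, marginals p(y_i = 1), and the MAP labeling."""
    k = len(x_node)
    labelings = list(itertools.product((-1, 1), repeat=k))
    pots = [math.exp(log_potential(y, x_node, x_edge, w, v)) for y in labelings]
    Z = sum(pots)                                   # partition function
    marginals = [sum(p for y, p in zip(labelings, pots) if y[i] == 1) / Z
                 for i in range(k)]
    map_labeling = labelings[pots.index(max(pots))]  # highest potential; Z not needed
    return Z, marginals, map_labeling

# Hypothetical 3-node chain Y1 - Y2 - Y3 with 1-D node and edge features.
x_node = [[1.0], [0.2], [-1.0]]
x_edge = {(0, 1): [1.0], (1, 2): [1.0]}
w, v = [1.0], [0.5]
Z, marginals, map_labeling = brute_force(x_node, x_edge, w, v)
print(Z, marginals, map_labeling)
```

The cost grows as 2^k, so this is only a reference implementation for checking dynamic programming or approximate methods on tiny graphs.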
