# Mini-CRF Tutorial Part 1: Model Formulation and Training

```
Mini-CRF Tutorial Part 1: Model Formulation and Training

Mark Schmidt
Robuddies: October 11, 2006
Logistic Regression
(discriminative binary classifier)

[Figure: graphical model with class label Y linked to features X through weights w]

  y: class label
  w: weights on features
  x: features (pixel intensities, outputs of filters, etc.)

Likelihood function:

  p(y | x, w) = 1 / (1 + exp(-y w^T x)),   y in {-1, +1}
Logistic Regression Inference

[Figure: Y-w-X model as above]

Given an instance x with weights w, assign x to the class with the
highest likelihood:

  y* = argmax_{y in {-1, +1}} p(y | x, w)
Logistic Regression Training Objective

Supervised Training:

[Figure: Y-w-X model as above]

Assuming a class representation of {-1, +1}, we can write the likelihood
of x being in the 'right' class as (Equation 2 in paper):

  p(y | x, w) = 1 / (1 + exp(-y w^T x))

Maximum Likelihood training finds the w that maximizes this expression,
giving the weights that maximize the likelihood function over the
training data (in practice, a bias term is added and we compute a MAP
estimate).
Logistic Regression Training Optimization

Using the likelihood above, we write the log-likelihood of the training
data:

  l(w) = sum_i log p(y_i | x_i, w) = -sum_i log(1 + exp(-y_i w^T x_i))

The log-likelihood is concave in w, so it has a unique global maximum;
gradient methods (Newton, quasi-Newton, non-linear CG, etc.) can be used
to find a w yielding this maximum.

Note that the gradient of the log-likelihood has a special form:

  grad l(w) = sum_i y_i x_i (1 - p(y_i | x_i, w))
Expressing the Likelihood in terms of Potentials

An alternate view, from comparing the normalized likelihoods of the two
classes, is that we are finding the maximum among non-negative,
un-normalized potentials for each class.

Using exp(0) == 1, we can re-write the likelihoods for each class as
follows:

  p(y = +1 | x, w) = exp(w^T x) / (exp(w^T x) + exp(0))
  p(y = -1 | x, w) = exp(0)     / (exp(w^T x) + exp(0))

Here, the class potentials are defined as the numerators; the
denominator sums the potentials for the different classes.
Potentials vs. Likelihood

New view: classification is finding the class with the highest
potential:

  y* = argmax_y phi(y, x)

We can re-write the likelihood in terms of class potentials:

  p(y | x, w) = phi(y, x) / sum_{y'} phi(y', x)

Note that the denominator forces the sum of the likelihoods to be 1.
Training can be viewed as maximizing the potential of the 'right' class
relative to the wrong one.
From Likelihood to Pseudolikelihood

We know how to do this (a single labelled node):

[Figure: Y-w-X model]

How do we do this (labelled nodes connected by edges)?

[Figure: nodes Y1 and Y2 joined by an edge with weights v; each Yi has
its own features Xi and shares node weights w]

  x: features          x_i:  node features
  w: weights           w:    node weights
  y: label             y_i:  node label
                       x_ij: edge features (typically a distance measure
                             between node features)
                       v:    edge weights
Node and Edge Potentials

We typically re-write the class potential factorized into node and edge
potentials (called association + interaction in the paper), e.g.:

  phi_i(y_i, x_i)  = exp(y_i w^T x_i)        (node potential)
  psi_ij(y_i, y_j) = exp(y_i y_j v^T x_ij)   (edge potential)

In general:

  phi(y, x) = prod_i phi_i(y_i, x_i) * prod_{ij} psi_ij(y_i, y_j)
Pseudolikelihood Training
(Similar to Equation 3; Equation 4 in paper)

Logistic Regression defines:

  p(y | x, w) = phi(y, x) / sum_{y'} phi(y', x)

Pseudolikelihood defines, for each node (with its neighbours' labels
held fixed):

  p(y_i | x, y_{N(i)}) = phi_i(y_i, x_i) * prod_{j in N(i)} psi_ij(y_i, y_j)
                         / sum_{y_i'} [ phi_i(y_i', x_i) * prod_{j in N(i)} psi_ij(y_i', y_j) ]

(Take the log and add a prior on v to get an expression similar to
Equation 5.)

Training is almost identical to logistic regression, except:
- you jointly optimize over w and v instead of just w
- you compute the likelihood for each node of each data point, instead
  of one node per data point
Can use arbitrary structure connecting Y nodes:

[Figure: chain model Y1 - Y2 - ... - Yn-1 - Yn; each Yi has its own Xi,
all nodes share weights w, and all edges share weights v]

OR:

[Figure: 2x2 grid of nodes Y1..Y4 with the same shared w and v]

Can use global features (or mixed):

[Figure: grid where the label nodes also connect to shared, global
features]

Can use untied parameters (or mixed):

[Figure: chain Y1..Y10 where each node has its own weights (w1, w2, ...)
and each edge its own weights (v1, v2, ..., v9, v10)]
Pseudolikelihood Caveat

Recall:

[Figure: Y1 - Y2 joined by v; each Yi has its own Xi and weights w]

Problem: we can't find the probability of class label c for node y1
without knowing label y2.
- For training, this is fine: the neighbouring labels are observed.
- For testing, we usually don't have this information.
  (Graph cuts are used in the paper to find a MAP estimate of a
  different model.)
From Pseudolikelihood to Conditional Random Fields

Pseudolikelihood: the likelihood is the potential for a single node's
class label, normalized over all possible class labels for that node.

Conditional Random Field: the likelihood is the potential for a joint
assignment of all nodes' labels, normalized over all possible joint
class label assignments.

[Figure: Y1 - Y2 model as before]

In general, CRFs use the following likelihood function (Equation 1 in
paper):

  p(y | x) = prod_i phi_i(y_i, x_i) * prod_{ij} psi_ij(y_i, y_j)
             / sum_{y'} [ prod_i phi_i(y'_i, x_i) * prod_{ij} psi_ij(y'_i, y'_j) ]

The denominator is the 'partition' function and is typically written
simply as Z.

The log-likelihood is often written as (where w concatenates w and v):

  l(w) = sum_i log phi_i(y_i, x_i) + sum_{ij} log psi_ij(y_i, y_j) - log Z

The gradient of the above is:

  grad l(w) = F(y, x) - sum_{y'} p(y' | x) F(y', x)

where F(y, x) collects the node and edge features; as before, the term
from the denominator can be expressed as an expectation (of the features
under the model).
Conditional Random Fields

Training is the same as before:
- find parameters {w, v} that jointly maximize the likelihood of the
  training data

Unlike pseudolikelihood, we can do inference:
- find the marginal class probability for each y_i, or find the joint
  assignment that has the highest potential (MAP)

Problem:
- if we have k (binary-labelled) nodes, then the denominator Z will have
  2^k terms

But this is not impractical:
- for small k, everything can be done by brute force
- for some graph structures, we can compute Z and do inference exactly
  with dynamic programming
  (computing Z is needed for training with gradient methods or for
  finding the optimal max-marginal assignment)
- for binary labels and a restricted class of potentials, we can find
  the MAP labeling with graph cuts
  (Z is not actually needed for finding the MAP assignment)
- for other structures, there exist approximations with nice properties

```
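The logistic-regression training described in the slides can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses the {-1, +1} likelihood, the concave log-likelihood, and the "special form" of its gradient, with plain gradient ascent standing in for the Newton/quasi-Newton methods the slides mention; the toy dataset is invented, and its constant second feature plays the role of the bias term.

```python
import math

def p_correct(w, x, y):
    # Likelihood that x is in the 'right' class y, with y in {-1, +1}:
    # p(y | x, w) = 1 / (1 + exp(-y * w^T x))
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-y * s))

def log_likelihood(w, data):
    return sum(math.log(p_correct(w, x, y)) for x, y in data)

def gradient(w, data):
    # Special form of the gradient: sum_i y_i * x_i * (1 - p(y_i | x_i, w))
    g = [0.0] * len(w)
    for x, y in data:
        r = y * (1.0 - p_correct(w, x, y))
        for j, xj in enumerate(x):
            g[j] += r * xj
    return g

def train(data, dim, steps=500, lr=0.1):
    # The objective is concave, so plain gradient ascent heads toward
    # the unique global maximum.
    w = [0.0] * dim
    for _ in range(steps):
        g = gradient(w, data)
        w = [wj + lr * gj for wj, gj in zip(w, g)]
    return w

# Toy, linearly separable data (invented): the label matches the sign of
# the first feature; the second feature is a constant 1 (bias term).
data = [([2.0, 1.0], 1), ([1.5, 1.0], 1), ([-1.0, 1.0], -1), ([-2.5, 1.0], -1)]
w = train(data, dim=2)
```

After training, every example sits on the likely side of the decision boundary (p_correct > 0.5) and the log-likelihood is higher than at w = 0.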
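The potentials view from the slides is easy to check numerically: with class potentials exp(w^T x) and exp(0) as numerators, normalizing by their sum reproduces the sigmoid likelihood exactly, and the two class likelihoods sum to 1. A small sketch (the example numbers are arbitrary):

```python
import math

def potentials(w, x):
    # Non-negative, un-normalized potentials for classes +1 and -1.
    # Using exp(0) == 1, the class potentials are the numerators of the
    # rewritten likelihoods.
    s = sum(wi * xi for wi, xi in zip(w, x))
    return {+1: math.exp(s), -1: math.exp(0.0)}

def likelihood_from_potentials(w, x, y):
    # The denominator sums the potentials for the different classes,
    # which forces the likelihoods to sum to 1.
    pot = potentials(w, x)
    return pot[y] / sum(pot.values())

def sigmoid_likelihood(w, x, y):
    # The original form: p(y | x, w) = 1 / (1 + exp(-y * w^T x))
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-y * s))
```

Both routes give the same number, which is the point of the "alternate view": classification can equally be phrased as picking the class with the larger potential.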
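Pseudolikelihood can be sketched the same way. This assumes the log-linear node/edge potentials exp(y_i w^T x_i) and exp(y_i y_j v^T x_ij) (one standard choice, not necessarily the paper's exact form); the graph encoding (node feature vectors, edges as (i, j, x_ij) tuples) and the toy numbers are invented for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def node_pot(w, x_i, y_i):
    # Node (association) potential for label y_i in {-1, +1}.
    return math.exp(y_i * dot(w, x_i))

def edge_pot(v, x_ij, y_i, y_j):
    # Edge (interaction) potential on a pair of neighbouring labels.
    return math.exp(y_i * y_j * dot(v, x_ij))

def pseudolikelihood(w, v, nodes, edges, labels):
    # Product over nodes of p(y_i | x, y_neighbours): each node's
    # potential (times its incident edge potentials) normalized over
    # that node's two possible labels, neighbours' labels held fixed.
    pl = 1.0
    for i, x_i in enumerate(nodes):
        def unnorm(y_i):
            val = node_pot(w, x_i, y_i)
            for (a, b, x_ij) in edges:
                if a == i:
                    val *= edge_pot(v, x_ij, y_i, labels[b])
                elif b == i:
                    val *= edge_pot(v, x_ij, labels[a], y_i)
            return val
        pl *= unnorm(labels[i]) / (unnorm(-1) + unnorm(+1))
    return pl
```

Note the training-time convenience the slides point out: each factor conditions on the observed neighbouring labels, which is exactly what is unavailable at test time. With all weights zero, each node's conditional is 0.5, so a two-node pseudolikelihood is 0.25.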
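Finally, the "for small k, everything can be done by brute force" remark can be made concrete: with k binary nodes, enumerate all 2^k joint assignments to compute the partition function Z, the CRF likelihood, and the MAP labeling. Same assumed log-linear potentials and invented toy numbers as above.

```python
import math
from itertools import product

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def joint_potential(w, v, nodes, edges, labels):
    # Product of node potentials exp(y_i w^T x_i) and edge potentials
    # exp(y_i y_j v^T x_ij) for one joint labeling.
    val = 1.0
    for i, x_i in enumerate(nodes):
        val *= math.exp(labels[i] * dot(w, x_i))
    for (a, b, x_ij) in edges:
        val *= math.exp(labels[a] * labels[b] * dot(v, x_ij))
    return val

def crf_likelihood(w, v, nodes, edges, labels):
    # Brute-force partition function: sum over all 2^k joint assignments.
    Z = sum(joint_potential(w, v, nodes, edges, ys)
            for ys in product((-1, 1), repeat=len(nodes)))
    return joint_potential(w, v, nodes, edges, labels) / Z

def map_assignment(w, v, nodes, edges):
    # MAP: the joint assignment with the highest potential.
    # Note Z cancels here, so it is not needed for the MAP labeling.
    return max(product((-1, 1), repeat=len(nodes)),
               key=lambda ys: joint_potential(w, v, nodes, edges, ys))
```

Normalizing by Z forces the likelihoods of all joint assignments to sum to 1, and the argmax over joint potentials gives the MAP labeling without ever computing Z, matching the slides' closing remarks.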