
Mini-CRF Tutorial Part 1: Model Formulation and Training
Robuddies: October 11, 2006
Mark Schmidt

Logistic Regression (discriminative binary classifier)

[Graphical model: feature node X, with weights w, feeding the class label Y.]

y: class label
w: weights on features
x: features (pixel intensities, outputs of filters, etc.)

Likelihood function:

  p(y = 1 | x, w) = 1 / (1 + exp(-w'x))

Logistic Regression Inference

Given an instance x with weights w, assign x to the class with the highest likelihood:

  y* = argmax_y p(y | x, w)

Logistic Regression Training Objective

Supervised training: assuming a class representation of {-1, +1}, we can write the likelihood of x being in the 'right' class as (Equation 2 in the paper):

  p(y | x, w) = 1 / (1 + exp(-y w'x))

Maximum likelihood training finds the w that maximizes this expression, giving the weights that maximize the likelihood function over the training data. (In practice, a bias term is added and we compute a MAP estimate.)

Logistic Regression Training Optimization

We write the log-likelihood over the training set:

  l(w) = sum_i log p(y_i | x_i, w) = -sum_i log(1 + exp(-y_i w'x_i))

The log-likelihood is concave in w, so it has a unique global maximum, and gradient methods (Newton, quasi-Newton, nonlinear CG, etc.) can be used to find a w yielding this maximum. Note that the gradient of the log-likelihood has a special form:

  grad l(w) = sum_i y_i x_i (1 - p(y_i | x_i, w))

Expressing the Likelihood in terms of Potentials

An alternate view, from comparing the normalized likelihoods of the two classes, is that we are finding the maximum among non-negative un-normalized potentials for each class. Using exp(0) = 1, we can re-write the likelihoods for each class as:

  p(y = +1 | x, w) = exp(w'x) / (exp(w'x) + exp(0))
  p(y = -1 | x, w) = exp(0)   / (exp(w'x) + exp(0))

Here, the class potentials are defined as the numerators, and the denominator sums the potentials for the different classes.

Potentials vs. Likelihood

New view: classification is finding the class with the highest potential. We can re-write the likelihood in terms of class potentials:

  p(y | x, w) = psi(y) / sum_y' psi(y')

Note that the denominator forces the sum of the likelihoods to be 1. Training can be viewed as maximizing the potential of the 'right' class relative to the wrong one.

From Likelihood to Pseudolikelihood

We know how to do this for a single label Y with features X. How do we do it for two connected labels Y1, Y2 with features X1, X2?
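As a concrete check of the logistic-regression formulas above, here is a minimal NumPy sketch of the {-1, +1} objective, its gradient, and plain gradient-descent training. The data and all function names are illustrative, not from the tutorial.

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood for labels y in {-1, +1}: sum_i log(1 + exp(-y_i w'x_i))."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))  # log(1 + exp(-m)), numerically stable

def gradient(w, X, y):
    """Gradient of the negative log-likelihood: -sum_i y_i x_i (1 - p(y_i | x_i, w))."""
    margins = y * (X @ w)
    p_right = 1.0 / (1.0 + np.exp(-margins))  # probability of the 'right' class
    return -(X.T @ (y * (1.0 - p_right)))

# Tiny synthetic problem: two well-separated Gaussian clusters (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

# Plain gradient descent; because the objective is convex, any gradient method
# (Newton, quasi-Newton, nonlinear CG) would find the same optimum.
w = np.zeros(2)
for _ in range(200):
    w -= 0.01 * gradient(w, X, y)

print("training accuracy:", np.mean(np.sign(X @ w) == y))
```

Gradient descent is used here only for self-containedness; the slides' point that concavity makes the optimizer's choice a matter of speed, not correctness, applies unchanged.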
[Graphical model: labels Y1 and Y2 joined by an edge with weights v; each Yi has its own features Xi with shared node weights w.]

x_i: node features      w: node weights
y_i: node label
x_ij: edge features     v: edge weights
(x_ij is typically a distance measure between node features)

Node and Edge Potentials

We typically re-write the class potential factorized into node and edge potentials (called association and interaction potentials in the paper). In general:

  psi_i(y_i)       = exp(y_i w'x_i)         (node / association)
  psi_ij(y_i, y_j) = exp(y_i y_j v'x_ij)    (edge / interaction)

Pseudolikelihood Training

(Similar to Equation 3; this is Equation 4 in the paper.)

By analogy with logistic regression, define the conditional likelihood of each node given its neighbours' labels:

  p(y_i | y_N(i), x, w, v) = psi_i(y_i) prod_{j in N(i)} psi_ij(y_i, y_j) / Z_i

where Z_i sums the numerator over the two values of y_i. The pseudolikelihood is the product of these per-node conditionals:

  PL(w, v) = prod_i p(y_i | y_N(i), x, w, v)

(Take the log and add a prior on v to get an expression similar to Equation 5.)

Training is almost identical to logistic regression, except:
- you jointly optimize over w and v instead of just w
- you compute the likelihood for each node of each data point, instead of one node per data point

You can use an arbitrary structure connecting the Y nodes: a chain Y1 - Y2 - ... - Yn, or a grid over Y1, ..., Y4, and so on. You can use global features (or a mix of local and global features). You can also use untied parameters (or a mix): separate edge weights v1, v2, ... and node weights w1, w2, ... instead of a single shared v and w.

Pseudolikelihood Caveat

Problem: we can't find the probability of a class label for node Y1 without knowing the label of Y2. For training this is fine, since all labels are observed. For testing, we usually don't have this information. (Graph cuts are used in the paper to find a MAP estimate of a different model.)

From Pseudolikelihood to Conditional Random Fields

Pseudolikelihood: the likelihood is the potential for a node's class label, normalized over all possible class labels of that node.

Conditional Random Field: the likelihood is the potential for a joint assignment of all nodes' labels, normalized over all possible joint class label assignments.

In general, CRFs use the following likelihood function:

  p(y | x, w, v) = (1 / Z(x)) prod_i psi_i(y_i) prod_{ij} psi_ij(y_i, y_j)

The denominator is the 'partition' function and is typically written
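A minimal sketch of the per-node conditionals and the log-pseudolikelihood, assuming the exponential node and edge potentials above; the two-node example, its numbers, and all function names are illustrative.

```python
import numpy as np

def node_pot(y_i, w, x_i):
    """Association potential: exp(y_i * w'x_i), with y_i in {-1, +1}."""
    return np.exp(y_i * (w @ x_i))

def edge_pot(y_i, y_j, v, x_ij):
    """Interaction potential: exp(y_i * y_j * v'x_ij)."""
    return np.exp(y_i * y_j * (v @ x_ij))

def node_conditional(y_i, y_nbrs, w, v, x_i, x_edges):
    """p(y_i | neighbours' labels, x): normalize over the two values of y_i."""
    def unnorm(label):
        p = node_pot(label, w, x_i)
        for y_j, x_ij in zip(y_nbrs, x_edges):
            p *= edge_pot(label, y_j, v, x_ij)
        return p
    return unnorm(y_i) / (unnorm(+1) + unnorm(-1))

def log_pseudolikelihood(labels, nbrs, w, v, xs, x_edges):
    """Sum of per-node log conditionals -- one term per node, as in training."""
    total = 0.0
    for i, y_i in enumerate(labels):
        y_nbrs = [labels[j] for j in nbrs[i]]
        e_feats = [x_edges[frozenset((i, j))] for j in nbrs[i]]
        total += np.log(node_conditional(y_i, y_nbrs, w, v, xs[i], e_feats))
    return total

# Two-node example (all numbers illustrative).
xs = [np.array([1.0, 0.5]), np.array([-0.8, 0.2])]
x_edges = {frozenset((0, 1)): np.array([abs(xs[0][0] - xs[1][0])])}  # distance-style edge feature
nbrs = {0: [1], 1: [0]}
w, v = np.array([1.0, -0.5]), np.array([0.3])
print(log_pseudolikelihood([+1, -1], nbrs, w, v, xs, x_edges))
```

Note how the caveat from the slides shows up directly: `node_conditional` cannot be evaluated without the neighbours' labels, which is why this objective works for training (labels observed) but not for test-time inference.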
simply as Z. (Equation 1 in the paper.)

The log-likelihood is often written as (where theta concatenates w and v):

  l(theta) = sum_i log psi_i(y_i) + sum_{ij} log psi_ij(y_i, y_j) - log Z(x)

The gradient of the above is the observed feature counts minus their expectation under the model:

  grad l(theta) = F(y, x) - E_{p(y' | x, theta)}[ F(y', x) ]

As before, the denominator can be expressed as an expectation.

Conditional Random Fields

Training is the same as before: find parameters {w, v} that jointly maximize the likelihood of the training data. Unlike pseudolikelihood, we can do inference: find the marginal class probability for each y_i, or find the joint assignment that has the highest potential (MAP).

Problem: if we have k nodes, then the denominator has 2^k terms.

But this is not impractical:
- for small k, everything can be done by brute force
- for some graph structures, we can compute Z and do inference exactly with dynamic programming
- for binary labels and a restricted class of potentials, we can find the MAP labeling with graph cuts
- for other structures, there exist approximations with nice properties

(Computing Z is needed for training with gradient methods or finding the optimal max-marginal assignment, but not for finding a MAP assignment.)
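For small k, the brute-force option above is easy to write down: enumerate all 2^k labelings to get Z, the node marginals, and the MAP assignment. A minimal sketch, on an illustrative three-node chain with made-up numbers:

```python
import itertools
import numpy as np

def joint_potential(labels, w, v, xs, edges, x_edges):
    """Unnormalized potential of a full labeling: exp(sum of node and edge scores)."""
    score = sum(y_i * (w @ x_i) for y_i, x_i in zip(labels, xs))
    score += sum(labels[i] * labels[j] * (v @ x_edges[(i, j)]) for (i, j) in edges)
    return np.exp(score)

def brute_force(w, v, xs, edges, x_edges):
    """Enumerate all 2^k labelings: returns Z, node marginals p(y_i = +1 | x), and the MAP labeling."""
    k = len(xs)
    Z, marg, best, best_pot = 0.0, np.zeros(k), None, -np.inf
    for labels in itertools.product([-1, +1], repeat=k):
        pot = joint_potential(labels, w, v, xs, edges, x_edges)
        Z += pot
        marg += pot * (np.array(labels) == 1)  # accumulate mass where y_i = +1
        if pot > best_pot:
            best, best_pot = labels, pot
    return Z, marg / Z, best

# Three-node chain Y1 - Y2 - Y3 (all numbers illustrative).
xs = [np.array([1.0]), np.array([0.1]), np.array([-1.2])]
edges = [(0, 1), (1, 2)]
x_edges = {(0, 1): np.array([0.9]), (1, 2): np.array([1.3])}
w, v = np.array([0.8]), np.array([0.5])
Z, marginals, map_labels = brute_force(w, v, xs, edges, x_edges)
print("Z =", Z, " marginals:", marginals, " MAP:", map_labels)
```

For a chain like this one, the same quantities could be computed in O(k) with dynamic programming (forward-backward / Viterbi); the enumeration here is just the 2^k baseline the slides mention.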
