An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators
by Percy Liang and Michael Jordan (ICML 2008)

Presented by Lihan He, ECE, Duke University, June 27, 2008

Outline
• Introduction
• Exponential family estimators
  - Generative
  - Fully discriminative
  - Pseudolikelihood discriminative
• Asymptotic analysis
• Experiments
• Conclusions

Introduction
• Data points are not considered to be drawn independently.
• There are correlations between data points.
• Given data $z = (x, y) = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, we have to consider the joint distribution over all the data points:
  $p(z) = p(z_1, \dots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \dots, z_{n-1})$
• Correspondingly, the overall likelihood is not the product of the likelihoods of the individual data points.

Introduction
Generative vs. Discriminative

Generative model:
• A model for randomly generating observed data;
• Learns a joint probability distribution over both observations and labels:
  $p(x, y) = p(x_1, \dots, x_n, y_1, \dots, y_n)$

Discriminative model:
• A model of only the label variables, conditional on the observed data;
• Learns a conditional distribution over labels given observations:
  $p(y \mid x) = p(y_1, \dots, y_n \mid x_1, \dots, x_n)$

Introduction
Full Likelihood vs. Pseudolikelihood

Full likelihood:
  $p(z) = p(z_1, \dots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \dots, z_{n-1})$
• Could be intractable;
• Computationally inefficient.

Pseudolikelihood:
• An approximation of the full likelihood;
• Computationally more efficient.
  $p(z) \approx \prod_i p(z_i \mid z_j \text{ for all } \{z_i, z_j\} \in E)$
  where $E$ is a set of dependencies between data points.
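The contrast between the two likelihoods can be sketched on a tiny example. The following is a minimal illustration, assuming a hypothetical three-variable binary chain with dependency set `EDGES` and coupling strength `W` (neither comes from the slides). The full likelihood needs a normalizer over all 2^n configurations, while each pseudolikelihood term normalizes over only the two values of one variable. In this pairwise model, conditioning on all other variables is the same as conditioning on the neighbours in E, because non-neighbour terms cancel.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2)]   # hypothetical dependency set E on three binary variables
W = 0.8                    # hypothetical coupling strength

def score(z):
    """Unnormalized log-probability: reward neighbours that agree."""
    return sum(W for i, j in EDGES if z[i] == z[j])

def full_log_likelihood(z):
    """Exact log p(z); the normalizer sums over all 2^n configurations."""
    log_norm = math.log(sum(math.exp(score(c))
                            for c in itertools.product([0, 1], repeat=len(z))))
    return score(z) - log_norm

def log_pseudolikelihood(z):
    """sum_i log p(z_i | rest); each term normalizes over only two values."""
    total = 0.0
    for i in range(len(z)):
        alternatives = [z[:i] + (v,) + z[i + 1:] for v in (0, 1)]
        log_norm = math.log(sum(math.exp(score(c)) for c in alternatives))
        total += score(z) - log_norm
    return total

z = (1, 1, 0)
full = full_log_likelihood(z)
pseudo = log_pseudolikelihood(z)
```

The two quantities generally differ; the pseudolikelihood is a surrogate objective, not an estimate of the full log-likelihood value.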

Estimators
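Exponential Family Estimators

  $p_\theta(z) = \exp\{\phi(z)^T \theta - A(\theta)\}$ for $z \in \mathcal{Z}$

where
• $z = (x, y)$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}_x$;
• $\phi(z)$: features;
• $\theta$: model parameters;
• $A(\theta)$: normalization (the log-partition function).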

Example: conditional random field
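As a concrete illustration of the exponential family form, here is a minimal sketch assuming a sample space small enough to enumerate. The feature function `phi`, the two-variable space, and the parameter values are all hypothetical choices made for the example.

```python
import itertools
import math

def phi(z):
    """Hypothetical feature vector for z = (x, y) with binary x and y."""
    x, y = z
    return [1.0 if x == y else 0.0, float(y)]

def log_partition(theta, space):
    """A(theta) = log sum_z exp(phi(z)^T theta), by brute-force enumeration."""
    scores = [sum(f * t for f, t in zip(phi(z), theta)) for z in space]
    m = max(scores)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(s - m) for s in scores))

def log_prob(z, theta, space):
    """log p_theta(z) = phi(z)^T theta - A(theta)."""
    score = sum(f * t for f, t in zip(phi(z), theta))
    return score - log_partition(theta, space)

space = list(itertools.product([0, 1], repeat=2))  # all (x, y) pairs
theta = [0.5, -0.3]
probs = {z: math.exp(log_prob(z, theta, space)) for z in space}
total = sum(probs.values())
```

Subtracting $A(\theta)$ is what makes the probabilities sum to one; for structured models such as CRFs the enumeration would be replaced by dynamic programming.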

Estimators
Composite Likelihood Estimators [Lindsay 1988]
• One class of pseudolikelihood estimators;
• Consists of a weighted sum of component likelihoods, each of which is the probability of one subset of data points conditioned on another;
• Each component partitions the output space (the partition is denoted by $r$ and is drawn according to a fixed distribution $P_r$) and yields one component likelihood;
• Defines the criterion function
  $m_\theta(z) = E_{r \sim P_r} \log p_\theta(z \mid z \in r(z))$,
  which reflects the quality of the estimator;
• The maximum composite likelihood estimator is
  $\hat{\theta} = \arg\max_\theta \hat{E}_z[m_\theta(z)]$.


Estimators
Three estimators are compared in the paper:
• Generative: one component, $r_g(x, y) = \mathcal{X} \times \mathcal{Y}$;
• Fully discriminative: one component, $r_d(x, y) = x \times \mathcal{Y}_x$;
• Pseudolikelihood discriminative: one component for each data point $i$,
  $r_i(x, y) = \{(x', y') : x' = x,\ y' \in \mathcal{Y},\ y'_j = y_j \text{ for } j \neq i\}$.
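The three choices of component differ only in the set that the likelihood is normalized over. A minimal sketch follows, assuming a hypothetical model with one binary input x and two binary outputs y = (y1, y2), and assuming a uniform weighting over the per-coordinate pseudolikelihood components; the score function and parameter values are illustrative, not taken from the paper.

```python
import itertools
import math

X_VALUES = [0, 1]
Y_VALUES = list(itertools.product([0, 1], repeat=2))  # y = (y1, y2)

def score(x, y, theta):
    """Hypothetical phi(z)^T theta: y1-y2 agreement plus x-y agreement."""
    a, b = theta
    return a * (y[0] == y[1]) + b * ((x == y[0]) + (x == y[1]))

def log_component(x, y, theta, r):
    """log p_theta(z | z in r(z)), normalizing only over the set r(z)."""
    log_norm = math.log(sum(math.exp(score(xp, yp, theta)) for xp, yp in r))
    return score(x, y, theta) - log_norm

def r_generative(x, y):
    """Whole sample space: the full joint likelihood."""
    return [(xp, yp) for xp in X_VALUES for yp in Y_VALUES]

def r_discriminative(x, y):
    """Fix x, vary all of y: the conditional likelihood."""
    return [(x, yp) for yp in Y_VALUES]

def r_pseudo(x, y, i):
    """Fix x and every y_j with j != i, vary only y_i."""
    return [(x, yp) for yp in Y_VALUES
            if all(yp[j] == y[j] for j in range(len(y)) if j != i)]

def criteria(x, y, theta):
    gen = log_component(x, y, theta, r_generative(x, y))
    disc = log_component(x, y, theta, r_discriminative(x, y))
    # Assumption: uniform weights over the per-coordinate components.
    pseudo = sum(log_component(x, y, theta, r_pseudo(x, y, i))
                 for i in range(len(y))) / len(y)
    return gen, disc, pseudo

gen, disc, pseudo = criteria(1, (1, 0), theta=(0.5, 1.0))
```

Because the pseudolikelihood sets are contained in the discriminative set, which is contained in the generative set, the three values are ordered for any single point; that ordering says nothing by itself about which estimator has lower risk.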

Estimators
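Risk Decomposition

Bayes risk: $R^* = H(Y \mid X) = E_{(X,Y) \sim p^*}[-\log p^*(Y \mid X)]$

Define $\theta^\circ = \arg\max_\theta E_{Z \sim p^*}\, m_\theta(Z)$, which is unrelated to the data samples $z$.

With $R(\theta) = E_{(X,Y) \sim p^*}[-\log p_\theta(Y \mid X)]$, the excess risk of the estimate $\hat{\theta}$ decomposes as
  $R(\hat{\theta}) - R^* = [R(\hat{\theta}) - R(\theta^\circ)] + [R(\theta^\circ) - R^*]$
• Estimation error $R(\hat{\theta}) - R(\theta^\circ)$: arises because we have only finite data;
• Approximation error $R(\theta^\circ) - R^*$: the intrinsic suboptimality of the estimator.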

Asymptotic Analysis

Well-specified model: all three estimators achieve an $O(n^{-1})$ convergence rate.
Misspecified model: only the fully discriminative estimator achieves an $O(n^{-1})$ rate.

Experiments
Toy example: four-node binary-valued graphical model, $z = (x_1, x_2, y_1, y_2)$.

True model:
  $\phi(z)^T \theta^* = \alpha^*\, 1(y_1 = y_2) + \beta^* [1(x_1 = y_1) + 1(x_2 = y_2)] + \gamma^* [1(x_1 = y_2) + 1(x_2 = y_1)]$

Learned model:
  $\phi(z)^T \theta = \alpha\, 1(y_1 = y_2) + \beta [1(x_1 = y_1) + 1(x_2 = y_2)]$

When $\gamma^* = 0$, the learned model is well-specified; when $\gamma^* \neq 0$, the learned model is misspecified.

Experiments
[Figure: results on the toy example. Labels recovered: $\gamma^* = 0$ (well-specified); $\gamma^* = 0.5$ (misspecified); $n = 20000$; panel (g): two true parameters both equal to 1; panel (h): one true parameter equal to 1 and another equal to 0.]
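The toy model is small enough to enumerate exactly. The sketch below writes the three coefficients of the true model as `alpha`, `beta`, and `gamma` (names assumed here), with `gamma` weighting the cross terms that the learned model lacks. It compares the conditional log-loss of a learned-family model against the Bayes risk. As a simplification, the learned model simply reuses the true `alpha` and `beta` with `gamma = 0`; it is not the risk-minimizing parameter within the learned family, and the default values of 1.0 are hypothetical.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=4))  # z = (x1, x2, y1, y2)

def true_score(z, alpha, beta, gamma):
    x1, x2, y1, y2 = z
    return (alpha * (y1 == y2)
            + beta * ((x1 == y1) + (x2 == y2))
            + gamma * ((x1 == y2) + (x2 == y1)))

def joint(alpha, beta, gamma):
    """Exact distribution over all 16 states."""
    weights = {z: math.exp(true_score(z, alpha, beta, gamma)) for z in STATES}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

def conditional(p):
    """p(y | x) from a joint table p(x, y)."""
    marg = {}
    for z, pz in p.items():
        marg[z[:2]] = marg.get(z[:2], 0.0) + pz
    return {z: pz / marg[z[:2]] for z, pz in p.items()}

def risk(p_true, p_model):
    """E_{p*}[-log p_model(Y | X)], the expected conditional log-loss."""
    cond = conditional(p_model)
    return -sum(pz * math.log(cond[z]) for z, pz in p_true.items())

def excess_risk(gamma_star, alpha=1.0, beta=1.0):
    p_true = joint(alpha, beta, gamma_star)
    p_learned = joint(alpha, beta, 0.0)   # learned family has no gamma term
    return risk(p_true, p_learned) - risk(p_true, p_true)

gap_well = excess_risk(0.0)   # true model lies inside the learned family
gap_mis = excess_risk(0.5)    # true model lies outside the learned family
```

When the cross-term weight is zero the gap vanishes, and when it is nonzero the plugged-in model pays a strictly positive excess risk.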

Experiments
Part-of-speech (POS) Tagging:
Input: a sequence of words $x = (x_1, \dots, x_l)$.
Output: a sequence of POS tags $y = (y_1, \dots, y_l)$, e.g., noun, verb, etc. (45 tags total).

Specified model:
• Node features $\phi_{node}(y_i, x_i)$: indicator functions of the form $1(y_i = a, x_i = b)$;
• Edge features $\phi_{edge}(y_i, y_{i+1})$: indicator functions of the form $1(y_i = a, y_{i+1} = b)$.

Training: Wall Street Journal, 38K sentences.
Testing: Wall Street Journal, 5.5K sentences, from different sections than the training data.
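The node and edge indicator features can be sketched as sparse counts over one tagged sentence. This is a minimal illustration: the function names, the tuple-keyed feature encoding, and the example sentence are hypothetical and not taken from the paper's implementation.

```python
from collections import Counter

def crf_features(words, tags):
    """Count node and edge indicator features for one tagged sentence."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    feats = Counter()
    for word, tag in zip(words, tags):
        feats[("node", tag, word)] += 1           # tag a emitted with word b
    for prev_tag, next_tag in zip(tags, tags[1:]):
        feats[("edge", prev_tag, next_tag)] += 1  # tag a followed by tag b
    return feats

def score(feats, theta):
    """phi(z)^T theta, with theta stored sparsely as a dict."""
    return sum(count * theta.get(name, 0.0) for name, count in feats.items())

words = ["the", "dog", "barks"]
tags = ["DT", "NN", "VBZ"]
feats = crf_features(words, tags)
```

A sentence of length l contributes l node features and l - 1 edge features, so the summed feature vector stays sparse even with 45 tags and a large vocabulary.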

Experiments

Use the learned generative model to sample 1000 training samples and 1000 test samples, as synthetic data.

Conclusions
• When the model is well-specified:
  - All three estimators achieve an $O(n^{-1})$ convergence rate;
  - There is no approximation error;
  - The asymptotic estimation errors are ordered: generative < fully discriminative < pseudolikelihood discriminative.
• When the model is misspecified:
  - The fully discriminative estimator still achieves an $O(n^{-1})$ convergence rate, but the other two estimators achieve only an $O(n^{-1/2})$ convergence rate;
  - The approximation error and asymptotic estimation error of the fully discriminative estimator are lower than those of the generative estimator and the pseudolikelihood discriminative estimator.