An Asymptotic Analysis of Generative, Discriminative,
and Pseudolikelihood Estimators
by
Percy Liang and Michael Jordan
(ICML 2008)
Presented by Lihan He, ECE, Duke University, June 27, 2008
Outline
• Introduction
• Exponential family estimators
  • Generative
  • Fully discriminative
  • Pseudolikelihood discriminative
• Asymptotic analysis
• Experiments
• Conclusions
Introduction
Data points are not considered to be drawn independently.
There are correlations between data points.
Given data $z = (x, y) = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, we have to consider the joint distribution over all the data points:
$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$
Correspondingly, the overall likelihood is not the product of the likelihood for each data point.
Introduction
Generative vs. Discriminative
Generative model:
• A model for randomly generating observed data;
• Learns a joint probability distribution over both observations and labels:
$p(x, y) = p(x_1, \ldots, x_n, y_1, \ldots, y_n)$
Discriminative model:
• A model only of the label variables conditional on the observed data;
• Learns a conditional distribution over labels given observations:
$p(y \mid x) = p(y_1, \ldots, y_n \mid x_1, \ldots, x_n)$
Introduction
Full Likelihood vs. Pseudolikelihood
Full likelihood:
$p(z) = p(z_1, \ldots, z_n) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_n \mid z_1, \ldots, z_{n-1})$
• Could be intractable;
• Computationally inefficient.
Pseudolikelihood:
• An approximation of the full likelihood;
• Computationally more efficient.
$p(z) \approx \prod_i p(z_i \mid z_j \text{ for all } \{z_i, z_j\} \in E)$
where $E$ is a set of dependencies between data points.
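The contrast between the two likelihoods can be sketched on a tiny example. The following is a minimal illustration, not the method from the paper: the three-variable binary chain, the edge set `EDGES`, and the weight `COUPLING` are all hypothetical choices made only to show where the computational saving comes from.

```python
import itertools
import math

# Hypothetical toy setup: three binary variables on a chain z1 - z2 - z3.
EDGES = [(0, 1), (1, 2)]
COUPLING = 0.8  # assumed edge weight, chosen only for illustration


def score(z):
    """Unnormalized log-probability: reward each edge whose endpoints agree."""
    return sum(COUPLING * (z[i] == z[j]) for i, j in EDGES)


def full_log_likelihood(z):
    """Exact log p(z); the normalizer sums over all 2^n configurations."""
    configs = itertools.product([0, 1], repeat=len(z))
    log_norm = math.log(sum(math.exp(score(c)) for c in configs))
    return score(z) - log_norm


def log_pseudolikelihood(z):
    """Sum over i of log p(z_i | all other z_j).

    In a Markov random field this equals conditioning on the neighbors of
    z_i only, and each normalizer sums over just the 2 values of z_i.
    """
    total = 0.0
    for i in range(len(z)):
        variants = [z[:i] + (v,) + z[i + 1:] for v in (0, 1)]
        log_norm_i = math.log(sum(math.exp(score(c)) for c in variants))
        total += score(z) - log_norm_i
    return total
```

The full likelihood needs one sum over every joint configuration, while the pseudolikelihood needs only n small sums, which is why it scales to models where the exact normalizer is out of reach.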
Estimators
Exponential Family Estimators
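$p_\theta(z) = \exp\{\phi(z)^T \theta - A(\theta)\}$ for $z \in \mathcal{Z}$
$z = (x, y)$ and $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}_x$
$\phi(z)$: features
$\theta$: model parameters
$A(\theta)$: normalization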
Example: conditional random field
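As a small sanity check on the exponential family form, the normalization term can be computed by brute force when the space is tiny. This is only a sketch under stated assumptions: the two-variable binary space `SPACE` and the two features in `features` are hypothetical and are not taken from the paper.

```python
import itertools
import math

# Hypothetical finite space: one binary input x and one binary label y.
SPACE = list(itertools.product([0, 1], repeat=2))


def features(z):
    """Illustrative feature map phi(z); both features are assumptions."""
    x, y = z
    return [float(x == y), float(y == 1)]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def log_partition(theta):
    """A(theta) = log of the sum over z in Z of exp(phi(z)^T theta)."""
    return math.log(sum(math.exp(dot(features(z), theta)) for z in SPACE))


def prob(z, theta):
    """p_theta(z) = exp(phi(z)^T theta - A(theta))."""
    return math.exp(dot(features(z), theta) - log_partition(theta))
```

Subtracting `log_partition(theta)` is what makes the probabilities sum to one; with `theta = [0.0, 0.0]` every configuration receives the same probability.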
Estimators
Composite Likelihood Estimators [Lindsay 1988]
• One class of pseudolikelihood estimators;
• Consists of a weighted sum of component likelihoods, each of which is the probability of one subset of data points conditioned on another;
• Partitions the output space (denoted by $r$) according to a fixed distribution $P_r$, and obtains the component likelihood;
• Defines the criterion function
$m_\theta(z) = E_{r \sim P_r} \log p_\theta(z \mid z \in r(z))$
which reflects the quality of the estimator.
The maximum composite likelihood estimator:
$\hat{\theta} = \arg\max_\theta \hat{E}_z[m_\theta(z)]$
Estimators
Three estimators to be compared in the paper:
• Generative: one component
$r_g(x, y) = \mathcal{X} \times \mathcal{Y}$
• Fully discriminative: one component
$r_d(x, y) = \{x\} \times \mathcal{Y}_x$
• Pseudolikelihood discriminative: for each data point, we have one component
$r_i(x, y) = \{(x', y') : x' = x,\ y' \in \mathcal{Y},\ y'_j = y_j \text{ for } j \neq i\}$
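The three estimators differ only in the set over which each component likelihood is normalized. The sketch below makes that concrete on a hypothetical model with two binary inputs and two binary outputs; the `score` function, its weights, and the uniform weighting over pseudolikelihood components are assumptions made for illustration.

```python
import itertools
import math

X_SPACE = list(itertools.product([0, 1], repeat=2))  # x = (x1, x2)
Y_SPACE = list(itertools.product([0, 1], repeat=2))  # y = (y1, y2)


def score(x, y, alpha=1.0, beta=1.0):
    """Unnormalized log-probability; the weights are illustrative assumptions."""
    return alpha * (y[0] == y[1]) + beta * ((x[0] == y[0]) + (x[1] == y[1]))


def component_log_lik(x, y, neighborhood):
    """log p(z | z in r(z)): normalize only over the neighborhood r(z)."""
    log_norm = math.log(sum(math.exp(score(xp, yp)) for xp, yp in neighborhood))
    return score(x, y) - log_norm


def r_generative(x, y):
    """Whole space: every (x', y') pair."""
    return [(xp, yp) for xp in X_SPACE for yp in Y_SPACE]


def r_discriminative(x, y):
    """Fix x; let the full output y' vary."""
    return [(x, yp) for yp in Y_SPACE]


def r_pseudo(x, y, i):
    """Fix x and every y_j with j != i; only y_i varies."""
    return [(x, y[:i] + (v,) + y[i + 1:]) for v in (0, 1)]


def pseudo_criterion(x, y):
    """Average of the per-coordinate components (uniform P_r is an assumption)."""
    parts = [component_log_lik(x, y, r_pseudo(x, y, i)) for i in range(len(y))]
    return sum(parts) / len(parts)
```

The neighborhoods shrink from 16 to 4 to 2 configurations here, so the normalizer becomes cheaper at each step while the component conditions on more of the data.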
Estimators
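Risk Decomposition
Bayes risk: $R^* = H(Y \mid X) = E_{(X,Y) \sim p^*}[-\log p^*(Y \mid X)]$
Risk of a parameter: $R(\theta) = E_{(X,Y) \sim p^*}[-\log p_\theta(Y \mid X)]$
Define $\theta^\circ = \arg\max_\theta E_{Z \sim p^*}\, m_\theta(Z)$, which is unrelated to the data samples $z$.
The excess risk $R(\hat{\theta}) - R^*$ splits into two parts:
• Estimation error, $R(\hat{\theta}) - R(\theta^\circ)$: arises because we have only finite data;
• Approximation error, $R(\theta^\circ) - R^*$: the intrinsic suboptimality of the estimator.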
Asymptotic Analysis
Well-specified model: all three estimators achieve an $O(n^{-1})$ convergence rate.
Misspecified model: only the fully discriminative estimator achieves the $O(n^{-1})$ rate.
Experiments
Toy example: four-node binary-valued graphical model $z = (x_1, x_2, y_1, y_2)$
True model:
$\phi(z)^T \theta = \alpha^* 1(y_1 = y_2) + \beta^* [1(x_1 = y_1) + 1(x_2 = y_2)] + \gamma^* [1(x_1 = y_2) + 1(x_2 = y_1)]$
Learned model:
$\phi(z)^T \theta = \alpha\, 1(y_1 = y_2) + \beta\, [1(x_1 = y_1) + 1(x_2 = y_2)]$
When $\gamma^* = 0$, the learned model is well-specified; when $\gamma^* \neq 0$, the learned model is misspecified.
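The effect of the extra cross term can be checked numerically by enumerating all 16 configurations. This is a rough sketch rather than the paper's experiment: the weights passed to `joint` are hypothetical values, and `risk` measures the expected conditional log-loss of a fixed model rather than of a fitted estimator.

```python
import itertools
import math

CONFIGS = list(itertools.product([0, 1], repeat=4))  # z = (x1, x2, y1, y2)


def score(z, alpha, beta, gamma=0.0):
    """Log-linear score; gamma weights the cross term the learned model lacks."""
    x1, x2, y1, y2 = z
    return (alpha * (y1 == y2)
            + beta * ((x1 == y1) + (x2 == y2))
            + gamma * ((x1 == y2) + (x2 == y1)))


def joint(alpha, beta, gamma=0.0):
    """Normalized joint distribution over all 16 configurations."""
    weights = {z: math.exp(score(z, alpha, beta, gamma)) for z in CONFIGS}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def conditional(p):
    """p(y | x), obtained from a joint p by dividing by the marginal of x."""
    px = {}
    for z, prob in p.items():
        px[z[:2]] = px.get(z[:2], 0.0) + prob
    return {z: prob / px[z[:2]] for z, prob in p.items()}


def risk(p_true, p_model):
    """Expected conditional log-loss of p_model under the true distribution."""
    cond = conditional(p_model)
    return -sum(prob * math.log(cond[z]) for z, prob in p_true.items())
```

For example, with the assumed values `true = joint(1.0, 1.0, 0.5)`, the quantity `risk(true, true)` is the Bayes risk, and `risk(true, joint(1.0, 1.0))` is strictly larger because a model without the cross term assigns different conditionals. Setting the third argument to `0.0` makes the two risks coincide.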
Experiments
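[Figure: toy-example results with $n = 20000$, for $\gamma^* = 0$ (well-specified) and $\gamma^* = 0.5$ (misspecified), where $\gamma^*$ is the true-model weight absent from the learned model; panels (g) and (h).]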
Experiments
Part-of-speech (POS) Tagging:
Input: a sequence of words $x = (x_1, \ldots, x_l)$
Output: a sequence of POS tags $y = (y_1, \ldots, y_l)$, such as noun or verb (45 tags total)
Specified model:
• Node features $\phi_{node}(y_i, x_i)$: indicator functions of the form $1(y_i = a, x_i = b)$
• Edge features $\phi_{edge}(y_i, y_{i+1})$: indicator functions of the form $1(y_i = a, y_{i+1} = b)$
Training: Wall Street Journal, 38K sentences.
Testing: Wall Street Journal, 5.5K sentences, from different sections than the training data.
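The node and edge indicator features can be illustrated with a short feature-counting routine. This is a minimal sketch assuming a sentence is given as parallel lists of words and tags; the function name `extract_features` and the tuple keys are hypothetical, and the example tags are ordinary Penn Treebank labels used only for demonstration.

```python
from collections import Counter


def extract_features(words, tags):
    """Count the node and edge indicator features that fire on one sentence.

    A node feature ("node", a, b) fires when tag a labels word b.
    An edge feature ("edge", a, b) fires when tag a is followed by tag b.
    """
    feats = Counter()
    for i, (word, tag) in enumerate(zip(words, tags)):
        feats[("node", tag, word)] += 1
        if i + 1 < len(tags):
            feats[("edge", tag, tags[i + 1])] += 1
    return feats


sentence = ["the", "dog", "barks"]
labels = ["DT", "NN", "VBZ"]
counts = extract_features(sentence, labels)
```

A sentence of length l fires l node features and l - 1 edge features, so the score of a tagging is a sum of the corresponding weights.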
Experiments
Use the learned generative model to sample 1000 training samples and 1000 test samples, as synthetic data.
Conclusions
When the model is well-specified:
• All three estimators achieve an $O(n^{-1})$ convergence rate;
• There is no approximation error;
• The asymptotic estimation error is ordered: generative < fully discriminative < pseudolikelihood discriminative.
When the model is misspecified:
• The fully discriminative estimator still achieves an $O(n^{-1})$ convergence rate, but the other two estimators achieve an $O(n^{-1/2})$ convergence rate;
• The approximation error and asymptotic estimation error of the fully discriminative estimator are lower than those of the generative estimator and the pseudolikelihood discriminative estimator.