Minimum Classification Error (MCE)
Approach in Pattern Recognition
Wu Chou, Avaya Labs Research, Avaya Inc., USA
Presented by: Fang-Hui Chu
Outline (1/2)
• Introduction
• Optimal Classifier from Bayes Decision Theory
• Discriminant Function Approach to Classifier Design
• Speech Recognition and Hidden Markov Modeling
– Hidden Markov Modeling of Speech
• MCE Classifier Design Using Discriminant Functions
– MCE Classifier Design Strategy
– Optimization Methods
– Other Optimization Methods
– HMM as a Discriminant Function
– Relation Between MCE and MMI
– Discussions and Comments
Outline (2/2)
• MCE TRAINING BASED ON EMBEDDED STRING
MODEL
– String-Model-Based MCE Approach
– Combined String-Model-Based MCE Approach
– Discriminative Language Model Estimation
• SUMMARY
Introduction
• The advent of powerful computing devices and the success
of statistical approaches
– A renewed pursuit of more powerful methods to reduce the
recognition error rate
• Although MCE-based discriminative methods are rooted in
classical Bayes decision theory, they do not convert the
classification task into a distribution estimation problem;
instead, they take a discriminant-function-based statistical
pattern classification approach
• For a given family of discriminant functions, optimal
classifier/recognizer design involves finding a set of
parameters that minimizes the empirical pattern
recognition error rate
Introduction
• Why do we take this approach to classifier design?
– We lack complete knowledge of the form of the distribution
– Training data are inadequate
• How is it done?
– Formulate the problem of self-learning as a classification
problem: find the optimal partition of the observation
space into regions $X_k$ for which the expected risk $R$ is
minimized
– Then apply the generalized probabilistic descent (GPD) algorithm to
achieve that goal
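Optimal Classifier from Bayes Decision Theory
Classes $C_1, C_2, \ldots, C_M$ with joint probabilities $P(x, C_1), \ldots, P(x, C_M)$
A random observation $x$ is to be classified
$P(x, C_i)$: the probability that $x$, whose true class is uncertain, is assigned to $C_i$
However, we do not know the ground-truth answer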
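Optimal Classifier from Bayes Decision Theory
Define a loss function $e_{ij} : X \times Y \to \mathbb{R}$
$X$: sample space, $Y$: categorical set
It can be viewed as a distance between Class i and Class j, with $e_{ii} = 0$
$e_{ij}$ is the cost of misclassifying a Class i observation into Class j
Suppose $x$ is assigned to Class i; the expected cost (conditional risk) of that decision is
\[ R(C_i \mid x) = \sum_{j=1}^{M} e_{ji} \, P(C_j \mid x) \tag{1} \]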
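Optimal Classifier from Bayes Decision Theory
When we make a decision, $x \to C(x)$
Although we do not know the correct answer, we can compute the cost of making this decision
Taking the expectation of the cost $R(C(x) \mid x)$ over $x$:
\[ L = \int R(C(x) \mid x) \, dP(x) \tag{2} \]
How can we make a more accurate decision?
Even without knowing the correct answer, the smaller the cost, the more accurate the decision
\[ C(x) = \arg\min_i R(C_i \mid x) \quad \text{[Decision Rule]} \]
\[ R(C(x) \mid x) = \min_i R(C_i \mid x) \tag{3} \]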
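Optimal Classifier from Bayes Decision Theory
In speech recognition (SR) and many other applications, the commonly used loss function is the zero-one loss
\[ e_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \]
With this loss, the conditional risk is expressed through the posterior probability:
\[ R(C_i \mid x) = \sum_{j \neq i} P(C_j \mid x) = 1 - P(C_i \mid x) \tag{5} \]
So the [Decision Rule] can be rewritten (minimizing the Bayes risk):
\[ C(x) = \arg\min_i R(C_i \mid x) = \arg\min_i \bigl( 1 - P(C_i \mid x) \bigr) = \arg\max_i P(C_i \mid x) \tag{6} \]
This is the MAP decision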
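The step from the conditional risk (1) to the MAP decision (6) can be checked numerically. The following is a minimal Python sketch under stated assumptions: the posterior values are made up, and the helper names (`conditional_risk`, `zero_one_loss`, `bayes_decision`, `map_decision`) are illustrative, not part of the original slides.

```python
def conditional_risk(posteriors, loss):
    """R(C_i | x) = sum_j e_ji * P(C_j | x), for every candidate decision i."""
    m = len(posteriors)
    return [sum(loss[j][i] * posteriors[j] for j in range(m)) for i in range(m)]


def zero_one_loss(m):
    """e_ij = 0 if i == j, else 1."""
    return [[0.0 if i == j else 1.0 for j in range(m)] for i in range(m)]


def bayes_decision(posteriors, loss):
    """Pick the class with the smallest conditional risk (the decision rule)."""
    risks = conditional_risk(posteriors, loss)
    return min(range(len(risks)), key=risks.__getitem__)


def map_decision(posteriors):
    """Pick the class with the largest posterior probability."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)


# Hypothetical posteriors P(C_i | x) for M = 3 classes (assumed values).
posteriors = [0.2, 0.5, 0.3]
risks = conditional_risk(posteriors, zero_one_loss(3))
```

Under the zero-one loss each risk equals one minus the corresponding posterior, so `bayes_decision` and `map_decision` select the same class.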
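Optimal Classifier from Bayes Decision Theory
If the posterior probability is known, the problem is solved
In general, however, the posterior probability must be estimated from labeled
training data of known class (which is not easy to obtain)
What started as a classifier design problem
→ becomes a distribution estimation problem:
estimate the a posteriori probabilities $P(C_i \mid x)$, $i = 1, 2, \ldots, M$,
for any $x$ to implement the maximum a posteriori decision for
minimum Bayes risk
By Bayes' theorem
\[ P(C_i \mid x) = \frac{P(x \mid C_i) \, P(C_i)}{P(x)} \tag{7} \]
$P(x)$ can be omitted, since it does not depend on the class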
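Optimal Classifier from Bayes Decision Theory
• Three issues:
• The classifier design must estimate the distribution
parameters accurately, but in the real world the form of the distribution is often a compromise
made for tractability, using a simpler or computationally easier distribution
such as the Gaussian
• In the real world, the distribution parameters are always estimated from a finite
training data set, which rests on a major premise: when the
size of the training data set changes, the trained parameters must
remain consistent
– unachievable
• Otherwise, a sufficient amount of training data is needed to make the parameters
reliable, but the data are sparse
– unachievable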
Optimal Classifier from Bayes Decision Theory
• Despite the conceptual optimality of Bayes decision
theory and its applications to pattern recognition, it cannot
always be realized in practice
• Most practical “MAP” decisions in speech and language
processing are not true MAP decisions
Discriminant Function Approach to Classifier Design
First consider only the 2-class case $\{M_1, M_2\}$
Define a discriminant function $g(x)$ used for classification
\[ g(x) > 0 \Rightarrow x \text{ is classified to } M_1, \qquad g(x) < 0 \Rightarrow x \text{ is classified to } M_2 \]
One well-studied family of discriminant functions is the
linear discriminant function, which has
computational advantages
\[ g(x) = w^T x + w_0 \tag{9} \]
where $w^T = [w_1, w_2, \ldots, w_k]$
and $w_0$ is a real number
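Discriminant Function Approach to Classifier Design
More generally
\[ g(x) = [w_1, w_2, \ldots, w_k] \begin{bmatrix} \varphi_1(x) \\ \varphi_2(x) \\ \vdots \\ \varphi_k(x) \end{bmatrix} + w_0 = a^T y(x) \tag{10} \]
where
\[ a^T = [w_0, w_1, \ldots, w_k] = [w_0, w^T], \qquad y^T = [1, \varphi_1, \ldots, \varphi_k] = [1, \varphi^T] \tag{11} \]
and the $\varphi_i$ are known linearly independent functions of $x$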
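Discriminant Function Approach to Classifier Design
Next consider the M-class case $\{C_1, C_2, \ldots, C_M\}$ with discriminant functions
$g_1(x), g_2(x), \ldots, g_M(x)$
※ $\sum_i g_i(x)$ need not equal 1, and $g_i(x)$ need not be a probability
\[ C(x) = \arg\max_i g_i(x) \tag{12} \]
In other words, we want a set of "best discriminant functions"
\[ \{ g_i(x) \mid i = 1, \ldots, M \} = \arg\min_{g_i(x) \in F(X \times Y)} \int R(C(x) \mid x) \, dP(x) \tag{13} \]
when the loss function $R(C(x) \mid x)$ is specified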
Discriminant Function Approach to Classifier Design
If the true posterior probability $P(C_i \mid x)$ is used to implement $g_i(x)$
→ MAP decision
If $\{ g_i(x) \mid i = 1, \ldots, M \}$ is an optimal solution,
then $\{ a \, g_i(x) + b \mid i = 1, \ldots, M \}$ is also an optimal solution,
where $\{ (a, b) \mid a > 0, \; b \in \mathbb{R} \}$
This is quite different from the distribution-estimation-based
approach in pattern classification
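The invariance of the decision under $a \, g_i(x) + b$ with $a > 0$ can be illustrated with a few lines of Python. This is a minimal sketch; the score values and the helper names `classify` and `affine` are assumptions made for the example.

```python
def classify(scores):
    """C(x) = argmax_i g_i(x), given the list of discriminant scores for one x."""
    return max(range(len(scores)), key=scores.__getitem__)


def affine(scores, a, b):
    """Apply the same transform a * g_i(x) + b (with a > 0) to every score."""
    if a <= 0:
        raise ValueError("a must be positive to preserve the ordering")
    return [a * s + b for s in scores]


# Hypothetical discriminant scores g_1(x), g_2(x), g_3(x); they need not be probabilities.
scores = [-3.2, 1.7, 0.4]
decisions = [
    classify(affine(scores, a, b))
    for a, b in [(1.0, 0.0), (0.01, 100.0), (250.0, -7.5)]
]
```

Every entry of `decisions` is the same class index, because a positive scale and a common shift do not change which score is largest. A density estimate has no such freedom, since it must stay normalized.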
Speech Recognition and Hidden Markov Modeling
• A decoder performs a maximum a posteriori decision
\[ \hat{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W) \, P(W) \]
where $X$ is the acoustic feature sequence, $W$ a word sequence, $\hat{W}$ the best word sequence,
$P(X \mid W)$ the score from the acoustic model, and $P(W)$ the score from the language model
Speech Recognition and Hidden Markov Modeling
• Basic components:
• Acoustic Feature Extraction:
– Used to extract the features from waveform.
– We use $X = (x_1, \ldots, x_T)$ to represent the acoustic observation feature
vector sequence.
• Acoustic Modeling:
– Provides statistical modeling for the acoustic observation X.
– Hidden Markov Model is the prevalent choice.
• Language Modeling:
– Provides linguistic constraints to the text sequence W.
– Based on statistical N-gram language models
Speech Recognition and Hidden Markov Modeling
• Decoding Engine:
– Searches for the best word sequence given the features and the models
– This is achieved through Viterbi decoding
Discrete observation probability:
\[ \hat{W} = \arg\max_W P(X, W_Q \mid W) \]
Continuous density HMMs:
\[ \hat{W} = \arg\max_W \log f(X, W_Q \mid W) \]
where $W$ is the word string and $W_Q$ the state sequence
Speech Recognition and Hidden Markov Modeling
• Hidden Markov modeling is a powerful statistical framework
for time-varying quasi-stationary processes and a popular choice
for statistical modeling of the speech signal
\[ P(X \mid \pi, A, \{ b_j \}_{j=1}^{N}) = P(X \mid \Lambda) = \sum_q P(X, q \mid \Lambda) = \sum_q \pi_{q_0} \prod_{t=1}^{T} a_{q_{t-1} q_t} \, b_{q_t}(x_t) \]
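The sum over all state sequences in $P(X \mid \Lambda)$ can be evaluated directly for a tiny model and compared with the standard forward recursion. The sketch below assumes a discrete-observation HMM with made-up parameters and follows the slide's convention that the initial state $q_0$ emits nothing; the function names are illustrative.

```python
from itertools import product


def likelihood_brute_force(pi, A, B, obs):
    """Sum pi[q0] * prod_t A[q_{t-1}][q_t] * B[q_t][x_t] over all sequences q_0..q_T."""
    n = len(pi)
    total = 0.0
    for q in product(range(n), repeat=len(obs) + 1):
        p = pi[q[0]]
        for t, x in enumerate(obs, start=1):
            p *= A[q[t - 1]][q[t]] * B[q[t]][x]
        total += p
    return total


def likelihood_forward(pi, A, B, obs):
    """Same quantity via the forward recursion, linear in the sequence length."""
    n = len(pi)
    alpha = list(pi)  # alpha_0(j) = pi_j; no emission at t = 0 in this convention
    for x in obs:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][x] for j in range(n)]
    return sum(alpha)


# Hypothetical 2-state model with 2 observation symbols (assumed values).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]  # B[j][x] = b_j(x)
obs = [0, 1, 1]
```

The brute-force sum grows exponentially with $T$, which is why the forward recursion is used for the evaluation problem in practice.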
SPEECH RECOGNITION AND HIDDEN MARKOV
MODELING
• Three basic problems have to be resolved:
• The evaluation problem
– estimate the probability
• The decoding problem
– find a best state sequence q
• The estimation problem
– estimate HMM parameters from a given set of training samples
(ML-based algorithms such as the Baum-Welch algorithm)
MCE Classifier Design Using Discriminant Functions
Consider a set of discriminant functions
\[ g_i(x \mid \Lambda), \quad i = 1, 2, \ldots, M \]
where $\Lambda$ is the parameter set
\[ C(x) = \arg\max_i g_i(x \mid \Lambda) \tag{19} \]
MCE classifier design is based on 3 steps
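MCE Classifier Design Using Discriminant Functions
• Misclassification measure
\[ d_i(X) = -g_i(X \mid \Lambda) + \log \Bigl[ \frac{1}{M-1} \sum_{j, j \neq i} e^{g_j(X \mid \Lambda) \eta} \Bigr]^{1/\eta} \tag{20} \]
If $\eta \to \infty$, the $L_\eta$ norm becomes the $L_\infty$ norm
and the right-hand term becomes $\max_{j, j \neq i} g_j(X \mid \Lambda)$
Generally we use
\[ d_i(X) = -g_i(X \mid \Lambda) + \max_{j, j \neq i} g_j(X \mid \Lambda) \]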
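MCE Classifier Design Using Discriminant Functions
• Let $e^{g_j(X \mid \Lambda)} = a_j$,
and let $a_k = \max_{j, j \neq i} a_j$
The right-hand term becomes
\[ \lim_{\eta \to \infty} \log \Bigl[ \frac{1}{M-1} \sum_{j \neq i} a_j^{\eta} \Bigr]^{1/\eta}
 = \lim_{\eta \to \infty} \log \Bigl\{ a_k \Bigl[ \frac{1}{M-1} \sum_{j \neq i} \Bigl( \frac{a_j}{a_k} \Bigr)^{\eta} \Bigr]^{1/\eta} \Bigr\}
 = \log a_k = g_k(X \mid \Lambda) \]
since every ratio $a_j / a_k \le 1$, so the bracketed factor tends to 1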
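The limit above can be observed numerically. The following is a minimal Python sketch of the misclassification measure (20) and its max-competitor form, assuming made-up discriminant scores; the function names are illustrative, and the log-mean is evaluated by factoring out the largest competitor to avoid overflow.

```python
import math


def misclassification_measure(scores, i, eta):
    """d_i = -g_i + (1/eta) * log( mean over j != i of exp(eta * g_j) )."""
    others = [g for j, g in enumerate(scores) if j != i]
    top = max(others)
    # Factor exp(eta * top) out of the sum so that every exponent is <= 0.
    mean_ratio = sum(math.exp(eta * (g - top)) for g in others) / len(others)
    return -scores[i] + top + math.log(mean_ratio) / eta


def misclassification_measure_max(scores, i):
    """Limiting form for eta -> infinity: d_i = -g_i + max over j != i of g_j."""
    return -scores[i] + max(g for j, g in enumerate(scores) if j != i)


# Hypothetical scores g_1, g_2, g_3 for one observation whose true class is index 0.
scores = [2.0, 0.5, 1.5]
d_small_eta = misclassification_measure(scores, 0, eta=1.0)
d_large_eta = misclassification_measure(scores, 0, eta=200.0)
d_limit = misclassification_measure_max(scores, 0)
```

As `eta` grows, the soft measure approaches the max-competitor value from below. A negative value indicates a correct decision for this observation.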
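MCE Classifier Design Using Discriminant Functions
• Loss function
\[ l_i(X \mid \Lambda) = l(d_i(X \mid \Lambda)) \tag{21} \]
$l(\cdot)$ is a sigmoid function
\[ l(d_i(X \mid \Lambda)) = \frac{1}{1 + e^{-\gamma (d_i(X \mid \Lambda) + \theta)}} \tag{22} \]
with $\theta$ normally set to 0 and $\gamma$ set to $\ge 1$
$d_i(X \mid \Lambda) < 0$ (correct): $l \to 0$; $d_i(X \mid \Lambda) = 0$: $l = \frac{1}{2}$; $d_i(X \mid \Lambda) > 0$ (wrong): $l \to 1$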
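The behavior of the sigmoid loss can be sketched in Python as follows. The function name `sigmoid_loss` is illustrative, and the exact placement of $\theta$ inside the exponent is an assumption of this sketch; with the usual choice $\theta = 0$ it makes no difference.

```python
import math


def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smoothed zero-one loss l(d) = 1 / (1 + exp(-gamma * (d + theta))).

    The form of the exponent is an assumption of this sketch.
    Both branches compute the same value; the split avoids overflow in exp.
    """
    z = gamma * (d + theta)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Loss at a fixed negative, zero, and positive misclassification measure.
correct, boundary, wrong = (sigmoid_loss(d, gamma=4.0) for d in (-5.0, 0.0, 5.0))
```

A larger `gamma` makes the sigmoid steeper, so the loss behaves more like a hard error count while remaining differentiable.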
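MCE Classifier Design Using Discriminant Functions
• Classifier Performance Measure
\[ l(X \mid \Lambda) = \sum_{i=1}^{M} l_i(X \mid \Lambda) \, 1(X \in C_i) \tag{23} \]
$1(\cdot)$ is an indicator function
Expected loss
\[ L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int l_k(x \mid \Lambda) \, 1(x \in C_k) \, dP(x) = \sum_{k=1}^{M} \int_{x \in C_k} l_k(x \mid \Lambda) \, dP(x) \tag{24} \]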
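MCE Classifier Design Using Discriminant Functions
If the posterior probability $P(C_i \mid x)$ is used,
then the Bayes minimum risk is
\[ L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x) \, 1(x \in C_k) \, dP(x) \tag{25} \]
where $\Omega_k = \{ x \mid P(C_k \mid x) \neq \max_j P(C_j \mid x) \}$
On $\Omega_k$ the posterior probability of Class k is not the largest, so this is the loss due to misclassification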
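MCE Classifier Design Using Discriminant Functions
If the posterior probability $P(C_i \mid x)$ is used,
then the Bayes minimum risk is
\begin{align*}
L(\Lambda) = E_X[l(X \mid \Lambda)] &= \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x) \, 1(x \in C_k) \, dP(x) \\
&= \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x) \, 1(x \in C_k) \, 1\bigl( P(C_k \mid x) \neq \max_j P(C_j \mid x) \bigr) \, dP(x) \\
&\approx \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x) \, 1(x \in C_k) \, l(d_k(x \mid \Lambda)) \, dP(x) \tag{26}
\end{align*}
Empirical loss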
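Optimization Methods
• Expected Loss
\[ L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int_{x \in C_k} l_k(x \mid \Lambda) \, dP(x) \tag{27} \]
We use the GPD-based minimization algorithm to minimize it
\[ \Lambda_{t+1} = \Lambda_t - \epsilon_t U_t \nabla l(X_t \mid \Lambda) \big|_{\Lambda = \Lambda_t} \tag{28} \]
$U_t$: a positive definite matrix
$\epsilon_t$: a sequence of positive numbers
$\nabla l(X_t \mid \Lambda)$: the gradient of the loss function
$X_t$: the t-th training sample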
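The update (28) can be sketched on a toy problem. The Python example below is not the HMM case treated later: it assumes linear discriminants $g_i(x) = w_i^T x + b_i$, the max-competitor misclassification measure, $\gamma = 1$, $U_t$ equal to the identity, a step-size schedule $\epsilon_t = \epsilon_0 / (1 + 0.01 t)$, and a small made-up data set. All names are illustrative.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def scores(params, x):
    """g_i(x) = w_i . x + b_i for every class i; params is a list of (w_i, b_i)."""
    return [sum(w * xi for w, xi in zip(ws, x)) + b for ws, b in params]


def gpd_step(params, x, label, eps, gamma=1.0):
    """One update Lambda <- Lambda - eps * grad l(x | Lambda), with U_t = identity."""
    g = scores(params, x)
    rival = max((j for j in range(len(g)) if j != label), key=g.__getitem__)
    d = -g[label] + g[rival]                  # max-competitor misclassification measure
    loss = sigmoid(gamma * d)
    slope = gamma * loss * (1.0 - loss)       # dl/dd
    new = [(list(ws), b) for ws, b in params]
    # dd/dg_label = -1 and dd/dg_rival = +1; dg_i/dw_i = x and dg_i/db_i = 1.
    for idx, sign in ((label, -1.0), (rival, 1.0)):
        ws, b = new[idx]
        step = eps * slope * sign
        new[idx] = ([w - step * xi for w, xi in zip(ws, x)], b - step)
    return new, loss


def train(data, n_classes, dim, epochs=50, eps0=0.5):
    params = [([0.0] * dim, 0.0) for _ in range(n_classes)]
    t = 0
    for _ in range(epochs):
        for x, label in data:
            eps = eps0 / (1.0 + 0.01 * t)     # positive, slowly decreasing step sizes
            params, _ = gpd_step(params, x, label, eps)
            t += 1
    return params


def classify(params, x):
    g = scores(params, x)
    return max(range(len(g)), key=g.__getitem__)


# Hypothetical, well-separated 2-D training set with three classes.
data = [((2.0, 0.1), 0), ((1.8, -0.3), 0), ((2.3, 0.4), 0),
        ((-1.0, 1.8), 1), ((-0.7, 2.1), 1), ((-1.3, 1.5), 1),
        ((-1.1, -1.7), 2), ((-0.8, -2.0), 2), ((-1.4, -1.4), 2)]
params = train(data, n_classes=3, dim=2)
```

Each step touches only the correct class and its strongest rival, which is a consequence of using the max-competitor form of the misclassification measure.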
Optimization Methods
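If the following three conditions are satisfied, $\Lambda_t$ converges
C1: $\epsilon_t$ is a sequence such that $\sum_{t=1}^{\infty} \epsilon_t \to \infty$ and
$\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$, with $\epsilon_t \ge 0$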
C 2 : 0 V t the inner product
Rt ( t , t ) l ( X , t ), H ( X , t t t l ( X , t ))l ( X , t ) V
H : Hessian matrix of 2nd order partial derivatives
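C3: if $\Lambda^* = \arg\min_\Lambda E_X[l(X, \Lambda)]$ is the unique $\Lambda$ such that
\[ \nabla L(\Lambda) \big|_{\Lambda = \Lambda^*} = E_X[\nabla l(X, \Lambda)] \big|_{\Lambda = \Lambda^*} = 0 \]
then $\Lambda_{t+1} = \Lambda_t - \epsilon_t \nabla l(X_t, \Lambda) \big|_{\Lambda = \Lambda_t}$ will converge to $\Lambda^*$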
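Optimization Methods
• Empirical Loss
\[ L_0(\Lambda) = \frac{1}{I} \sum_{j=1}^{I} \sum_{i=1}^{M} l_i(x_j \mid \Lambda) \, 1(x_j \in C_i) = \int l(x \mid \Lambda) \, dP_I \tag{31} \]
$I$ is the size of the training set
$P_I$ is the empirical measure defined on the training set
\[ \lim_{I \to \infty} \int f \, dP_I = \int f \, dP \tag{32} \]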
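The empirical loss (31) is an average of smoothed per-sample losses, and it can be compared with the plain error rate on the same data. The Python sketch below assumes a small table of made-up discriminant scores, the max-competitor misclassification measure, and $\theta = 0$; the names are illustrative.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def empirical_loss(score_table, labels, gamma=1.0):
    """L_0 = (1/I) * sum_j l(d_{label_j}(x_j)), using the max-competitor measure."""
    total = 0.0
    for g, i in zip(score_table, labels):
        d = -g[i] + max(s for j, s in enumerate(g) if j != i)
        total += sigmoid(gamma * d)
    return total / len(labels)


def error_rate(score_table, labels):
    """Fraction of samples whose arg-max class differs from the label."""
    wrong = sum(
        1 for g, i in zip(score_table, labels)
        if max(range(len(g)), key=g.__getitem__) != i
    )
    return wrong / len(labels)


# Hypothetical scores g_1..g_3 for I = 4 training samples, and their labels.
score_table = [[2.0, 0.5, -1.0], [0.2, 0.9, 0.1], [1.5, 0.3, 1.1], [-0.4, 0.0, 2.2]]
labels = [0, 1, 2, 2]
smooth = empirical_loss(score_table, labels, gamma=1.0)
sharp = empirical_loss(score_table, labels, gamma=50.0)
errors = error_rate(score_table, labels)
```

With a large `gamma` the empirical loss is numerically close to the error rate, which is the sense in which minimizing it targets the recognition error directly.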
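HMM as a Discriminant Function
Use the HMM as the discriminant function
For Class i:
\[ P^{(i)}(X, q \mid \Lambda) = \pi^{(i)}_{q_0} \prod_{t=1}^{T} a^{(i)}_{q_{t-1} q_t} \, b^{(i)}_{q_t}(x_t) = g_i(X, q \mid \Lambda) \tag{34} \]
The discriminant function $g_i(X \mid \Lambda)$ can be formed from it in three ways
1) $g_i(X \mid \Lambda) = \sum_q g_i(X, q \mid \Lambda)$ (35)
2) $g_i(X \mid \Lambda) = \max_q g_i(X, q \mid \Lambda)$ (36)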
Q
1
3) gi ( X | ) gi ( X , q | ) (37)
q
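HMM as a Discriminant Function
Define $X = (x_1, x_2, \ldots, x_T)$ with
\[ x_t = (x_{t1}, x_{t2}, \ldots, x_{tD})^T, \quad D \text{ is the dimension} \]
\begin{align*}
g_i(X \mid \Lambda) &= \log \{ \max_q g_i(X, q \mid \Lambda) \} \\
&= \log g_i(X, \bar{q} \mid \Lambda), \quad \bar{q} \text{ is the optimal state sequence} \\
&= \sum_{t=1}^{T} \bigl[ \log a^{(i)}_{\bar{q}_{t-1} \bar{q}_t} + \log b^{(i)}_{\bar{q}_t}(x_t) \bigr] + \log \pi^{(i)}_{\bar{q}_0}
\end{align*}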
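The log score of the optimal state sequence is what Viterbi decoding computes. The following Python sketch evaluates $g_i(X \mid \Lambda)$ for one class model and checks it against exhaustive search. It assumes made-up parameters and takes the per-frame state likelihoods $b_j(x_t)$ as a precomputed table; the function names are illustrative.

```python
import math
from itertools import product


def path_log_score(log_pi, log_A, log_b, q):
    """log pi[q0] + sum_t (log A[q_{t-1}][q_t] + log b_{q_t}(x_t)) for one path q."""
    score = log_pi[q[0]]
    for t, frame in enumerate(log_b, start=1):
        score += log_A[q[t - 1]][q[t]] + frame[q[t]]
    return score


def viterbi_log_score(log_pi, log_A, log_b):
    """Return (max over q of the path log score, the maximizing path q_0..q_T)."""
    n = len(log_pi)
    delta = list(log_pi)   # best partial score ending in each state at t = 0
    back = []
    for frame in log_b:    # frame[j] = log b_j(x_t) for t = 1..T
        prev = [max(range(n), key=lambda i, j=j: delta[i] + log_A[i][j]) for j in range(n)]
        delta = [delta[prev[j]] + log_A[prev[j]][j] + frame[j] for j in range(n)]
        back.append(prev)
    state = max(range(n), key=delta.__getitem__)
    best = delta[state]
    path = [state]
    for prev in reversed(back):   # trace the back-pointers from T down to 0
        state = prev[state]
        path.append(state)
    path.reverse()
    return best, path


def brute_force_log_score(log_pi, log_A, log_b):
    n = len(log_pi)
    return max(
        path_log_score(log_pi, log_A, log_b, q)
        for q in product(range(n), repeat=len(log_b) + 1)
    )


# Hypothetical 2-state model and T = 3 frames of state likelihoods b_j(x_t).
log_pi = [math.log(p) for p in (0.6, 0.4)]
log_A = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.2, 0.8))]
log_b = [[math.log(p) for p in frame] for frame in ((0.9, 0.3), (0.1, 0.7), (0.1, 0.7))]
best, path = viterbi_log_score(log_pi, log_A, log_b)
```

In an MCE setting this score would be computed once per class model, and the resulting values compared through the misclassification measure.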
HMM as a Discriminant Function
Assume
\[ b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk} \, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \]
Maintain the original HMM constraints
1) function being nonnegative
2) $\sum_j a_{ij} = 1$
3) $\sum_k c_{jk} = 1$
4) $\sigma_{jkl} > 0$
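HMM as a Discriminant Function
So we use parameter transformations to preserve these
constraints
1) $a_{ij} \to \tilde{a}_{ij}$, where $a_{ij} = \dfrac{e^{\tilde{a}_{ij}}}{\sum_k e^{\tilde{a}_{ik}}}$
2) $c_{jk} \to \tilde{c}_{jk}$, where $c_{jk} = \dfrac{e^{\tilde{c}_{jk}}}{\sum_m e^{\tilde{c}_{jm}}}$
3) $\mu_{jkl} \to \tilde{\mu}_{jkl} = \dfrac{\mu_{jkl}}{\sigma_{jkl}}$
4) $\sigma_{jkl} \to \tilde{\sigma}_{jkl} = \log \sigma_{jkl}$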
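The point of these transformations is that any real value of the transformed parameters maps back to a valid HMM parameter, so gradient steps can be taken without constraint handling. A minimal Python sketch follows; the unconstrained input values and the function names are assumptions of the example.

```python
import math


def softmax(raw):
    """Map unconstrained values to positive weights that sum to 1."""
    top = max(raw)                       # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]


def to_constrained(a_tilde_row, c_tilde_row, mu_tilde, sigma_tilde):
    """Invert the four transformations for one row, one mixture set, one dimension."""
    a_row = softmax(a_tilde_row)         # a_ij = exp(a~_ij) / sum_k exp(a~_ik)
    c_row = softmax(c_tilde_row)         # c_jk = exp(c~_jk) / sum_m exp(c~_jm)
    sigma = math.exp(sigma_tilde)        # sigma~ = log sigma
    mu = mu_tilde * sigma                # mu~ = mu / sigma
    return a_row, c_row, mu, sigma


# Arbitrary unconstrained values, e.g. the result of an unrestricted gradient step.
a_row, c_row, mu, sigma = to_constrained([5.0, -3.0, 0.2], [-40.0, 2.5], -1.7, -9.0)
```

Whatever values the optimizer produces, `a_row` and `c_row` remain probability vectors and `sigma` remains positive.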
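HMM as a Discriminant Function
Then, for $X_n \in C_i$, the discriminative adjustment of the mean vector is
\[ \tilde{\mu}_{jkl}(n+1) = \tilde{\mu}_{jkl}(n) - \epsilon_n \frac{\partial l_i(X_n \mid \Lambda)}{\partial \tilde{\mu}_{jkl}} \Big|_{\Lambda = \Lambda_n} \]
where $l_i = (1 + \exp(-r d_i))^{-1}$ and
\[ \frac{\partial l_i(X_n \mid \Lambda)}{\partial \tilde{\mu}_{jkl}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{\mu}_{jkl}} \]
\begin{align*}
\frac{\partial l_i}{\partial d_i} &= -1 \, (1 + \exp(-r d_i))^{-2} \, \exp(-r d_i) \, (-r) \\
&= r \, l_i(d_i)^2 \, (1 + \exp(-r d_i) - 1) \\
&= r \, l_i(d_i)^2 \, (l_i(d_i)^{-1} - 1) \\
&= r \, l_i(d_i) \, (1 - l_i(d_i))
\end{align*}
\[ \frac{\partial d_i}{\partial \tilde{\mu}_{jkl}} = -\sum_{t=1}^{T} \delta(\bar{q}_t - j) \frac{\partial \log b_j(x_t)}{\partial \tilde{\mu}_{jkl}} \]
where $\delta(\cdot)$ denotes the Kronecker delta function
\[ \delta(n) = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases} \]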
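The closed form $\partial l_i / \partial d_i = r \, l_i (1 - l_i)$ can be checked against a finite-difference estimate. This is a minimal Python sketch; the test points and function names are assumptions of the example.

```python
import math


def loss(d, r=1.0):
    """l(d) = (1 + exp(-r * d))^-1."""
    return 1.0 / (1.0 + math.exp(-r * d))


def loss_slope(d, r=1.0):
    """Closed-form derivative dl/dd = r * l * (1 - l)."""
    value = loss(d, r)
    return r * value * (1.0 - value)


def finite_difference(d, r=1.0, h=1e-6):
    """Central-difference estimate of dl/dd, used only as a check."""
    return (loss(d + h, r) - loss(d - h, r)) / (2.0 * h)


# Largest disagreement between the two over a few arbitrary (d, r) pairs.
max_gap = max(
    abs(loss_slope(d, r) - finite_difference(d, r))
    for d in (-2.0, -0.3, 0.0, 0.7, 3.0)
    for r in (0.5, 1.0, 3.0)
)
```

The slope peaks at $d_i = 0$ and vanishes for large $|d_i|$, so samples near the decision boundary contribute most to each update.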
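HMM as a Discriminant Function
and
\[ \frac{\partial \log b_j(x_t)}{\partial \tilde{\mu}_{jkl}} = c_{jk} (2\pi)^{-D/2} |R_{jk}|^{-1/2} (b_j(x_t))^{-1} \Bigl( \frac{x_{tl}}{\sigma_{jkl}} - \tilde{\mu}_{jkl} \Bigr) \exp \Bigl\{ -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl}}{\sigma_{jkl}} - \tilde{\mu}_{jkl} \Bigr)^2 \Bigr\} \]
Finally
\[ \mu_{jkl}(n+1) = \sigma_{jkl} \, \tilde{\mu}_{jkl}(n+1) \]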
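HMM as a Discriminant Function
\[ b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk} \, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \]
\[ N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] = (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl( -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} \Bigr)^2 \Bigr) \]
Using $\dfrac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} = \dfrac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl}$:
\begin{align*}
\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\mu}^{(i)}_{jkl}}
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} \frac{\partial N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]}{\partial \tilde{\mu}^{(i)}_{jkl}} \\
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl\{ -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \Bigr)^2 \Bigr\} \Bigl( -\Bigl( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \Bigr) \Bigr) (-1) \\
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl\{ -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \Bigr)^2 \Bigr\} \Bigl( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \Bigr)
\end{align*}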
HMM as a Discriminant Function
Then, for $X_n \in C_i$, discriminative adjustment of the variance
HMM as a Discriminant Function
\[ b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk} \, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}], \qquad N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] = (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl( -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} \Bigr)^2 \Bigr) \]
With $\sigma^{(i)}_{jkl} = \exp(\tilde{\sigma}^{(i)}_{jkl})$ and $|R^{(i)}_{jk}|^{-1/2} = \prod_{z=1}^{D} \exp(-\tilde{\sigma}^{(i)}_{jkz})$:
\begin{align*}
\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\sigma}^{(i)}_{jkl}}
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} \frac{\partial N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]}{\partial \tilde{\sigma}^{(i)}_{jkl}} \\
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} \frac{\partial}{\partial \tilde{\sigma}^{(i)}_{jkl}} \Bigl[ \prod_{z=1}^{D} \exp(-\tilde{\sigma}^{(i)}_{jkz}) \exp \Bigl\{ -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \Bigr)^2 \Bigr\} \Bigr] \\
&= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl\{ -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \Bigr)^2 \Bigr\} \Bigl[ -1 + \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \Bigr)^2 \Bigr]
\end{align*}
The $-1$ comes from differentiating the factor $\exp(-\tilde{\sigma}^{(i)}_{jkl})$ in $|R^{(i)}_{jk}|^{-1/2}$, and the squared term from differentiating the exponent
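HMM as a Discriminant Function
\[ \frac{\partial l_i(X_n; \Lambda)}{\partial \tilde{c}^{(i)}_{jk}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{c}^{(i)}_{jk}} = r \, l_i(d_i)^2 \, (l_i(d_i)^{-1} - 1) \frac{\partial d_i}{\partial \tilde{c}^{(i)}_{jk}} \]
\[ \frac{\partial d_i}{\partial \tilde{c}^{(i)}_{jk}} = -\sum_{t=1}^{T} \delta(\bar{q}_t - j) \frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{c}^{(i)}_{jk}} \]
with
\[ b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk} \, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}], \qquad c^{(i)}_{jk} = \frac{\exp(\tilde{c}^{(i)}_{jk})}{\sum_z \exp(\tilde{c}^{(i)}_{jz})} \]
\[ N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] = (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp \Bigl( -\frac{1}{2} \sum_{l=1}^{D} \Bigl( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} \Bigr)^2 \Bigr) \]
\begin{align*}
\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{c}^{(i)}_{jk}}
&= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \frac{\partial c^{(i)}_{jk}}{\partial \tilde{c}^{(i)}_{jk}} \\
&= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \frac{\exp(\tilde{c}^{(i)}_{jk}) \sum_z \exp(\tilde{c}^{(i)}_{jz}) - \exp(\tilde{c}^{(i)}_{jk})^2}{\bigl( \sum_z \exp(\tilde{c}^{(i)}_{jz}) \bigr)^2} \\
&= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \bigl( c^{(i)}_{jk} - (c^{(i)}_{jk})^2 \bigr) \\
&= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \, c^{(i)}_{jk} (1 - c^{(i)}_{jk})
\end{align*}
This keeps only the k-th mixture term. Since the other weights $c^{(i)}_{jz}$, $z \neq k$, also depend on $\tilde{c}^{(i)}_{jk}$ through the normalization, the complete derivative is $c^{(i)}_{jk} \bigl( N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] - b^{(i)}_j(x_t) \bigr) / b^{(i)}_j(x_t)$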
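HMM as a Discriminant Function
\[ \frac{\partial l_i(X_n; \Lambda)}{\partial \tilde{a}^{(i)}_{jk}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{a}^{(i)}_{jk}} = r \, l_i(d_i)^2 \, (l_i(d_i)^{-1} - 1) \frac{\partial d_i}{\partial \tilde{a}^{(i)}_{jk}} \]
The contribution of each transition $j \to k$ on the optimal state sequence to $\partial d_i / \partial \tilde{a}^{(i)}_{jk}$ is
\[ -\frac{\partial \log a^{(i)}_{jk}}{\partial \tilde{a}^{(i)}_{jk}}, \qquad a^{(i)}_{jk} = \frac{\exp(\tilde{a}^{(i)}_{jk})}{\sum_z \exp(\tilde{a}^{(i)}_{jz})} \]
\begin{align*}
\frac{\partial a^{(i)}_{jk}}{\partial \tilde{a}^{(i)}_{jk}}
&= \frac{\exp(\tilde{a}^{(i)}_{jk}) \sum_z \exp(\tilde{a}^{(i)}_{jz}) - \exp(\tilde{a}^{(i)}_{jk})^2}{\bigl( \sum_z \exp(\tilde{a}^{(i)}_{jz}) \bigr)^2} \\
&= a^{(i)}_{jk} - (a^{(i)}_{jk})^2
\end{align*}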
HMM as a Discriminant Function
• How should the step size be designed?
– If the step size is too large, the classifier is degraded at the
start and sequential learning cannot succeed
– If the step size is too small, the algorithm converges
too slowly to be practically useful
• Designing it is difficult, and a general solution is still lacking
HMM as a Discriminant Function
• Why do we normalize the mean vector?
– The magnitude of the variances can vary in the range between 100
and 10^-5
– If a constant step size is used for all mean vectors, the algorithm
will either fail to converge or be too slow to be practically
useful
• Normalization removes the dependence on the variance
variations
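Relation between MCE and MMI
MMI approach: the mutual information $I(W_c, X)$ between the acoustic observation $X$ and the correct lexical symbol $W_c$
\begin{align*}
I(W_c, X) &= \log \frac{p(W_c, X)}{p(W_c) \, p(X)} = \log \frac{p(X \mid W_c)}{p(X)} \\
&= \log \frac{p(X \mid W_c)}{\sum_{k=1}^{N} p(W_k, X)} = \log \frac{p(X \mid W_c)}{\sum_{k=1}^{N} p(W_k) \, p(X \mid W_k)}
\end{align*}
Let $e^{r_c(X)} = p(X \mid W_c)$; the expression above becomes
\[ I(W_c, X) = r_c(X) - \log \Bigl( \sum_{k=1}^{N} p(W_k) \, e^{r_k(X)} \Bigr) \]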
Relation between MCE and MMI
The log posterior probability is related to $I(W_c, X)$
\[ \log p(W_c \mid X) = \log \frac{p(W_c, X)}{p(X)} = \log \Bigl[ \frac{p(W_c, X)}{p(X) \, p(W_c)} \, p(W_c) \Bigr] = I(W_c, X) + \log p(W_c) \]
MMI maximizes the average mutual information $I(W_c, X)$ (using ML)
Now, the relation between MMI and MCE
First assume the language model is uniform: $p(W_k) = \frac{1}{N}$, $k = 1, \ldots, N$
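Relation between MCE and MMI
Then
\begin{align*}
I(W_c, X) &= r_c(X) - \log \Bigl( \sum_{k=1}^{N} p(W_k) \, e^{r_k(X)} \Bigr) \\
&= r_c(X) - \log \Bigl( \frac{1}{N} \sum_{k=1}^{N} e^{r_k(X)} \Bigr) \\
&= r_c(X) - \log \Bigl( \sum_{k=1}^{N} e^{r_k(X)} \Bigr) + \log N
\end{align*}
\[ \hat{\Lambda}_{MMI} = \arg\max_\Lambda E_X[I(W_c, X)] = \arg\max_\Lambda E_X \Bigl[ r_c(X) - \log \Bigl( \sum_{k=1}^{N} e^{r_k(X)} \Bigr) + \log N \Bigr] \]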
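Relation between MCE and MMI
MCE
1) The Misclassification Measure
\[ d_c(X) = -r_c(X) + \log \Bigl[ \frac{1}{N-1} \sum_{k: W_k \neq W_c} e^{r_k(X) \eta} \Bigr]^{1/\eta}, \qquad r_c(X) = \log p(X \mid W_c) \]
2) The Loss Function
\[ l(d_c(X)) = \frac{1}{1 + e^{-\gamma d_c(X)}}, \quad \gamma > 0 \]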
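Relation between MCE and MMI
3) The Expected Loss
$E_X[l(d_c(X))]$ → minimize
\[ \hat{\Lambda}_{MCE} = \arg\min_\Lambda E_X[l(d_c(X))] \]
Consider $\eta = 1$:
\begin{align*}
d_c(X) &= -r_c(X) + \log \Bigl[ \frac{\sum_{k: W_k \neq W_c} e^{r_k(X)}}{N-1} \Bigr] \\
&= -r_c(X) + \log \sum_{k: W_k \neq W_c} e^{r_k(X)} - \log(N-1)
\end{align*}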
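Relation between MCE and MMI
\begin{align*}
I(W_c, X) &= r_c(X) - \log \Bigl( \sum_{k=1}^{N} e^{r_k(X)} \Bigr) + \log N \\
&= -\Bigl[ -r_c(X) + \log \Bigl( \sum_{k=1}^{N} e^{r_k(X)} \Bigr) \Bigr] + \log N \\
&= -\Bigl[ \log e^{-r_c(X)} + \log \Bigl( \sum_{k=1}^{N} e^{r_k(X)} \Bigr) \Bigr] + \log N \\
&= -\log \Bigl( e^{-r_c(X)} \sum_{k=1}^{N} e^{r_k(X)} \Bigr) + \log N \\
&= -\log \Bigl( e^{-r_c(X) + \log \sum_{k: W_k \neq W_c} e^{r_k(X)}} + 1 \Bigr) + \log N \\
&= -\log \Bigl( e^{d_c(X) + \log(N-1)} + 1 \Bigr) + \log N \\
&= \log \Bigl( \frac{1}{1 + e^{d_c(X) + \log(N-1)}} \Bigr) + \log N \\
&= \log \bigl[ l(-d_c(X) - \log(N-1)) \bigr] + \log N
\end{align*}
where $d_c(X)$ is the MCE misclassification measure with $\eta = 1$ and $l(\cdot)$ is the sigmoid with $\gamma = 1$
The term $l(-d_c(X) - \log(N-1))$ plays the role of the loss function of MMI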
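The identity derived above can be verified numerically. The Python sketch below assumes a uniform language model and made-up log-likelihood values $r_k(X)$, and compares the direct expression for $I(W_c, X)$ with the sigmoid form written in terms of $d_c(X)$. The function names are illustrative.

```python
import math


def mutual_information(r, c):
    """I(W_c, X) = r_c - log(sum_k exp(r_k)) + log N, uniform language model."""
    n = len(r)
    return r[c] - math.log(sum(math.exp(v) for v in r)) + math.log(n)


def mce_measure_eta1(r, c):
    """d_c = -r_c + log(sum over k != c of exp(r_k)) - log(N - 1), i.e. eta = 1."""
    n = len(r)
    rivals = sum(math.exp(v) for k, v in enumerate(r) if k != c)
    return -r[c] + math.log(rivals) - math.log(n - 1)


def sigmoid(z):
    """l(z) = 1 / (1 + exp(-z)), i.e. gamma = 1."""
    return 1.0 / (1.0 + math.exp(-z))


def mutual_information_via_mce(r, c):
    """log l(-d_c - log(N - 1)) + log N."""
    n = len(r)
    d = mce_measure_eta1(r, c)
    return math.log(sigmoid(-d - math.log(n - 1))) + math.log(n)


# Hypothetical log-likelihoods r_k(X) = log p(X | W_k) for N = 4 candidates.
r = [-3.0, -1.2, -2.5, -0.7]
gaps = [abs(mutual_information(r, c) - mutual_information_via_mce(r, c)) for c in range(len(r))]
```

The two expressions agree for every choice of the correct candidate `c`, so MMI can be read as optimizing a shifted sigmoid of the same misclassification measure.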
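Relation between MCE and MMI
Range: $I(W_c, X)$ lies between $\log 0 + \log N$ and $\log 1 + \log N$, that is, between $-\infty$ and $\log N$
maximize $I(W_c, X)$ ⇒ minimize $\log \bigl( e^{d_c(X) + \log(N-1)} + 1 \bigr)$
⇒ minimize $d_c(X)$
⇒ maximize $p(X \mid W_c)$ and minimize $\sum_{k: W_k \neq W_c} e^{r_k(X)}$
MMI originally aims to maximize the posterior probability $p(W_c \mid X)$,
but this now amounts to maximizing $p(X \mid W_c)$, which is modeled by a distribution
The MMI objective function is asymmetrical
Objective function value versus $d_c(X)$:
                                   d_c(X) → -∞     d_c(X) = 0          d_c(X) → +∞
MMI: l(-d_c(X) - log(N-1))         1               1 / (1 + (N-1))     0
MCE: l(d_c(X))                     0               1/2                 1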
Relation between MCE and MMI
• MCE approach has several advantages in classifier
design:
– It is meaningful in the sense of minimizing the empirical
recognition error rate of the classifier
– If the true class posterior distributions are used as discriminant
functions, the asymptotic behavior of the classifier will
approximate the minimum Bayes risk
SUMMARY
• We examined the classical Bayes’ decision theory
approach to the problem of pattern classification.
• We do not know the actual probability distribution,
• so we minimize the expected loss to obtain a set of
parameters.
• We saw what MCE is and how to use it to solve
problems