
Minimum Classification Error (MCE) Approach in Pattern Recognition

Wu Chou, Avaya Labs Research, Avaya Inc., USA

Presented by: Fang-Hui Chu

Outline (1/2)

• Introduction
• Optimal Classifier from Bayes Decision Theory
• Discriminant Function Approach to Classifier Design
• Speech Recognition and Hidden Markov Modeling
– Hidden Markov Modeling of Speech
• MCE Classifier Design Using Discriminant Functions
– MCE Classifier Design Strategy
– Optimization Methods
– Other Optimization Methods
– HMM as a Discriminant Function
– Relation Between MCE and MMI
– Discussions and Comments

Outline (2/2)

• MCE Training Based on Embedded String Model
– String-Model-Based MCE Approach
– Combined String-Model-Based MCE Approach
– Discriminative Language Model Estimation
• Summary

Introduction

• The advent of powerful computing devices and the success of statistical approaches
– A renewed pursuit of more powerful methods to reduce the recognition error rate

• Although MCE-based discriminative methods are rooted in classical Bayes decision theory, they do not convert the classification task into a distribution estimation problem; instead they take a discriminant-function-based statistical pattern classification approach

• For a given family of discriminant functions, optimal classifier/recognizer design involves finding a set of parameters that minimizes the empirical pattern recognition error rate

Introduction

• Why take this approach to classifier design?
– We lack complete knowledge of the form of the distribution
– Training data are inadequate

• How is it done?
– Formulate the problem of self-learning as a classification problem that consists of optimally partitioning the observation space into regions $X_k$ for which the expected risk $R$ is minimized
– Then apply the generalized probabilistic descent (GPD) algorithm to achieve that goal

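Optimal Classifier from Bayes Decision Theory

[Figure: a random observation x is to be classified into one of the classes $C_1, C_2, \ldots, C_M$, with joint probabilities $P(x, C_1), \ldots, P(x, C_M)$.]

$P(x, C_i)$: the probability that x, whose true class is uncertain, is assigned to $C_i$.

However, we do not know the true answer.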

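Optimal Classifier from Bayes Decision Theory

Define the loss function $e_{ij}: X \times Y \to \mathbb{R}$, where X is the sample space and Y is the categorical set.

• $e_{ij}$ can be thought of as a distance between class i and class j, with $e_{ii} = 0$
• It is the cost of misclassifying a class i observation into class j

If x is assigned to class $C_i$, the expected cost of that decision (the conditional risk) is

$$R(C_i \mid x) = \sum_{j=1}^{M} e_{ji}\, P(C_j \mid x) \qquad (1)$$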

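Optimal Classifier from Bayes Decision Theory

A decision maps each observation to a class, $x \to C(x)$. Although we do not know the correct answer, we can compute the cost incurred by making this decision.

The cost is $R(C(x) \mid x)$; taking the expectation over x gives

$$L = \int R(C(x) \mid x)\, dP(x) \qquad (2)$$

How can we make a more accurate decision? Although the correct answer is unknown, the smaller the cost incurred, the more accurate the decision.

Decision rule:

$$C(x) = \arg\min_i R(C_i \mid x)$$

$$R(C(x) \mid x) = \min_i R(C_i \mid x) \qquad (3)$$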

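Optimal Classifier from Bayes Decision Theory

In speech recognition and many other applications, the commonly used loss function is the zero-one loss

$$e_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$

With this loss the conditional risk depends only on the posterior probability:

$$R(C_i \mid x) = \sum_{j \neq i} P(C_j \mid x) = 1 - P(C_i \mid x) \qquad (5)$$

So the decision rule (minimum Bayes risk) can be rewritten as

$$C(x) = \arg\min_i R(C_i \mid x)$$

$$\Rightarrow\; C(x) = \arg\min_i \left[ 1 - P(C_i \mid x) \right]$$

$$\Rightarrow\; C(x) = \arg\max_i P(C_i \mid x) \qquad (6)$$

which is the MAP decision.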
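The equivalence between the minimum-risk rule and the MAP rule under zero-one loss can be checked numerically. The sketch below is only an illustration: the posterior values and the helper name `conditional_risk` are hypothetical, not taken from the slides.

```python
import numpy as np

def conditional_risk(posteriors, loss):
    """R(C_i | x) = sum_j e_ji * P(C_j | x), with loss[j, i] holding e_ji."""
    return loss.T @ posteriors

# Hypothetical posteriors P(C_j | x) for M = 3 classes.
posteriors = np.array([0.2, 0.5, 0.3])

# Zero-one loss: e_ij = 0 if i == j, 1 otherwise.
zero_one = 1.0 - np.eye(3)

risk = conditional_risk(posteriors, zero_one)
min_risk_class = int(np.argmin(risk))      # decision rule (3)
map_class = int(np.argmax(posteriors))     # MAP decision (6)
```

Under zero-one loss the risk vector is simply one minus the posterior vector, so both rules select the same class.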

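Optimal Classifier from Bayes Decision Theory

If the posterior probability is known, the problem is solved.

In general, however, the posterior probability must be estimated from labeled training data whose classes are known, and such data are not easy to obtain.

What was originally a classifier design problem thus becomes a distribution estimation problem: estimate the a posteriori probabilities $P(C_i \mid x)$, $i = 1, 2, \ldots, M$, for any x, to implement the maximum a posteriori decision for minimum Bayes risk.

By Bayes' theorem,

$$P(C_i \mid x) = \frac{P(x \mid C_i)\, P(C_i)}{P(x)} \qquad (7)$$

where $P(x)$ can be omitted because it does not depend on the class.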

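Optimal Classifier from Bayes Decision Theory

• Three issues:
• The classifier design must estimate the parameters of the distribution correctly. In the real world, however, the form of the distribution is often a compromise made for tractability: a simpler distribution that is easier to compute with, such as a Gaussian, is used.
• In the real world, the parameters of the distribution are always estimated from a finite training set. This rests on a strong premise: the trained parameters must stay consistent when the size of the training set changes.
– unachievable
• Otherwise, a sufficiently large training set is needed to make the parameters reliable, but the data are sparse.
– unachievable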

Optimal Classifier from Bayes Decision Theory

• Despite the conceptual optimality of Bayes decision theory and its applications to pattern recognition, it cannot always be accomplished in practice

• Most practical "MAP" decisions in speech and language processing are not true MAP decisions

Discriminant Function Approach to Classifier Design

First consider only the two-class case $\{M_1, M_2\}$.

Define a discriminant function $g(x)$ used for classification:

$$g(x) > 0 \;\Rightarrow\; x \text{ is classified to } M_1$$

$$g(x) < 0 \;\Rightarrow\; x \text{ is classified to } M_2$$

One well-studied family of discriminant functions is the linear discriminant function, which has computational advantages:

$$g(x) = w^T x + w_0 \qquad (9)$$

where $w^T = [w_1, w_2, \ldots, w_k]$ and $w_0$ is a real number.

Discriminant Function Approach to Classifier Design

More generally,

$$g(x) = [w_1, w_2, \ldots, w_k] \begin{bmatrix} \phi_1(x) \\ \phi_2(x) \\ \vdots \\ \phi_k(x) \end{bmatrix} + w_0 = a^T y(x) \qquad (10)$$

where

$$a^T = [w_0, w_1, \ldots, w_k] = [w_0, w^T]$$

$$y^T = [1, \phi_1, \ldots, \phi_k] = [1, \phi^T] \qquad (11)$$

and the $\phi_i$ are known linearly independent functions of x.
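The augmented form $a^T y(x)$ in (10) and (11) can be sketched in a few lines. The basis functions, weights, and the helper name `augmented_features` below are hypothetical choices made only to show that the two ways of writing $g(x)$ agree.

```python
import numpy as np

def augmented_features(x, basis):
    """Build y(x) = [1, phi_1(x), ..., phi_k(x)]."""
    return np.concatenate(([1.0], [phi(x) for phi in basis]))

# Hypothetical basis functions phi_i (any known, linearly independent functions of x).
basis = [lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]

w = np.array([1.0, -2.0, 0.5])     # weights w_1..w_k (illustrative values)
w0 = 0.25                          # bias term w_0
a = np.concatenate(([w0], w))      # a^T = [w_0, w^T]

x = np.array([2.0, 1.0])
y = augmented_features(x, basis)

g_direct = w @ y[1:] + w0          # w^T phi(x) + w_0
g_augmented = a @ y                # a^T y(x)
```

Folding the bias into the weight vector lets one gradient expression cover every parameter, which is convenient when the discriminant is trained by descent.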

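Discriminant Function Approach to Classifier Design

Next consider the M-class case $\{C_1, C_2, \ldots, C_M\}$, with one discriminant function per class: $g_1(x), g_2(x), \ldots, g_M(x)$.

Note: $\sum_i g_i(x)$ need not equal 1, so $g_i(x)$ need not be a probability.

$$C(x) = \arg\max_i g_i(x) \qquad (12)$$

In other words, we want a set of optimal discriminant functions:

$$\{ g_i(x) \mid i = 1, \ldots, M \} = \arg\min_{g_i(x) \in F(X \times Y)} \int R(C(x) \mid x)\, dP(x) \qquad (13)$$

when the loss function $R(C(x) \mid x)$ is specified.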

Discriminant Function Approach to Classifier Design

If the true posterior probability $P(C_i \mid x)$ is used to implement $g_i(x)$, the result is the MAP decision.

If $\{ g_i(x) \mid i = 1, \ldots, M \}$ is an optimal solution, then $\{ a\, g_i(x) + b \mid i = 1, \ldots, M \}$ is also an optimal solution, where $a > 0$ and $b \in \mathbb{R}$.

This is quite different from the distribution-estimation-based approach to pattern classification.

Speech Recognition and Hidden Markov Modeling

• A decoder performs a maximum a posteriori decision:

$$\hat{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)$$

– $\hat{W}$: best word sequence
– $W$: word sequence
– $X$: acoustic feature
– $P(X \mid W)$: score from the acoustic model
– $P(W)$: score from the language model

Speech Recognition and Hidden Markov Modeling

• Basic components:
• Acoustic Feature Extraction:
– Used to extract the features from the waveform.
– We use $X = (x_1, \ldots, x_T)$ to represent the acoustic observation feature vector sequence.

• Acoustic Modeling:
– Provides statistical modeling for the acoustic observation X.
– The hidden Markov model is the prevalent choice.

• Language Modeling:
– Provides linguistic constraints on the text sequence W.
– Based on statistical N-gram language models

Speech Recognition and Hidden Markov Modeling

• Decoding Engine:
– Searches for the best word sequence given the features and the models
– This is achieved through Viterbi decoding

For discrete observation probabilities:

$$\hat{W} = \arg\max_W P(X, W_Q \mid W)$$

For continuous density HMMs:

$$\hat{W} = \arg\max_W \log f(X, W_Q \mid W)$$

where $W$ is the word string and $W_Q$ is the state sequence.

Speech Recognition and Hidden Markov Modeling

• Hidden Markov modeling is a powerful statistical framework for time-varying quasi-stationary processes and a popular choice for statistical modeling of the speech signal

$$P(X \mid \pi, A, \{b_j\}_{j=1}^{N}) = P(X \mid \Lambda) = \sum_q P(X, q \mid \Lambda) = \sum_q \pi_{q_0} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(x_t)$$

Speech Recognition and Hidden Markov Modeling

• Three basic problems have to be resolved:

• The evaluation problem
– estimate the probability $P(X \mid \Lambda)$

• The decoding problem
– find the best state sequence q

• The estimation problem
– estimate the HMM parameters from a given set of training samples (ML-based algorithms such as the Baum-Welch algorithm)
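The evaluation and decoding problems can be illustrated on a toy discrete HMM. The sketch below assumes the path-probability convention $\pi_{q_0} \prod_t a_{q_{t-1} q_t} b_{q_t}(x_t)$, in which the initial state $q_0$ emits nothing; all parameter values and function names are hypothetical. The forward recursion and the Viterbi recursion are compared against brute-force enumeration of every state sequence.

```python
import itertools
import numpy as np

def forward_prob(pi, A, B, obs):
    """Evaluation problem: P(X | Lambda), summed over all state sequences."""
    alpha = pi.copy()                      # distribution of the initial state q_0
    for o in obs:
        alpha = (alpha @ A) * B[:, o]      # take a transition, then emit x_t
    return alpha.sum()

def viterbi_prob(pi, A, B, obs):
    """Decoding problem: probability of the single best state sequence."""
    delta = pi.copy()
    for o in obs:
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return delta.max()

def path_prob(q, pi, A, B, obs):
    """pi_{q_0} * prod_t a_{q_{t-1} q_t} * b_{q_t}(x_t) for one state sequence q."""
    p = pi[q[0]]
    for t, o in enumerate(obs, start=1):
        p *= A[q[t - 1], q[t]] * B[q[t], o]
    return p

# Toy two-state model with two discrete symbols (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # B[j, o] = b_j(o)
obs = [0, 1, 1]

all_paths = list(itertools.product(range(2), repeat=len(obs) + 1))
brute = [path_prob(q, pi, A, B, obs) for q in all_paths]
```

Enumeration costs $N^{T+1}$ path evaluations, while both recursions cost on the order of $N^2 T$ operations, which is why they are used in practice.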

MCE Classifier Design Using Discriminant Functions

Consider a set of discriminant functions

$$g_i(x \mid \Lambda), \quad i = 1, 2, \ldots, M$$

where $\Lambda$ is the parameter set. The classifier is

$$C(x) = \arg\max_i g_i(x \mid \Lambda) \qquad (19)$$

MCE classifier design is based on three steps.

MCE Classifier Design Using Discriminant Functions

• Misclassification measure

$$d_i(X) = -g_i(X \mid \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j, j \neq i} e^{\eta\, g_j(X \mid \Lambda)} \right]^{1/\eta} \qquad (20)$$

If $\eta \to \infty$, the $L_\eta$ norm becomes the $L_\infty$ norm and the right-hand term becomes $\max_{j, j \neq i} g_j(X \mid \Lambda)$.

Generally we use

$$d_i(X) = -g_i(X \mid \Lambda) + \max_{j, j \neq i} g_j(X \mid \Lambda)$$

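MCE Classifier Design Using Discriminant Functions

As $\eta \to \infty$, the right-hand term of the misclassification measure tends to the largest competing score. Let $a_j = e^{g_j(X \mid \Lambda)}$ and let $a_k = \max_{j, j \neq i} a_j$. The right-hand term becomes

$$\lim_{\eta \to \infty} \log \left[ \frac{a_k^{\eta} \sum_{j, j \neq i} \left( \frac{a_j}{a_k} \right)^{\eta}}{M-1} \right]^{1/\eta} = \lim_{\eta \to \infty} \log \left\{ a_k \left[ \frac{\sum_{j, j \neq i} \left( \frac{a_j}{a_k} \right)^{\eta}}{M-1} \right]^{1/\eta} \right\} = \log a_k = g_k(X \mid \Lambda)$$

Every ratio satisfies $a_j / a_k \le 1$, so the bracketed quantity stays between $1/(M-1)$ and 1 and its $1/\eta$ power tends to 1.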
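The limiting behaviour of the misclassification measure can be observed numerically. The sketch below is an illustration under assumed discriminant scores; the function names are hypothetical, and the logarithm of the mean is evaluated in a shifted form only to avoid overflow for large $\eta$.

```python
import numpy as np

def misclassification(g, i, eta):
    """d_i = -g_i + (1/eta) * log[(1/(M-1)) * sum_{j != i} exp(eta * g_j)]."""
    z = eta * np.delete(g, i)
    m = z.max()
    log_mean = m + np.log(np.mean(np.exp(z - m)))   # numerically stable log of the mean
    return -g[i] + log_mean / eta

def misclassification_max(g, i):
    """Limiting form: d_i = -g_i + max_{j != i} g_j."""
    return -g[i] + np.delete(g, i).max()

# Hypothetical discriminant scores for M = 4 classes; class 0 is the labelled class.
g = np.array([2.0, 0.5, 1.5, -1.0])

d_eta_1 = misclassification(g, 0, eta=1.0)
d_eta_large = misclassification(g, 0, eta=1000.0)
d_limit = misclassification_max(g, 0)
```

A small $\eta$ lets every competing class contribute to the measure, while a large $\eta$ concentrates it on the strongest competitor.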

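MCE Classifier Design Using Discriminant Functions

• Loss function

$$l_i(X \mid \Lambda) = l(d_i(X \mid \Lambda)) \qquad (21)$$

where $l(\cdot)$ is a sigmoid function:

$$l(d_i(X \mid \Lambda)) = \frac{1}{1 + e^{-\gamma\, d_i(X \mid \Lambda) + \theta}} \qquad (22)$$

with $\theta$ normally set to 0 and $\gamma$ set to 1.

• Correct classification, $d_i(X \mid \Lambda) < 0$: $l \to 0$
• Decision boundary, $d_i(X \mid \Lambda) = 0$: $l = \frac{1}{2}$
• Wrong classification, $d_i(X \mid \Lambda) > 0$: $l \to 1$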

MCE Classifier Design Using Discriminant Functions

• Classifier performance measure

$$l(X \mid \Lambda) = \sum_{i=1}^{M} l_i(X \mid \Lambda)\, 1(X \in C_i) \qquad (23)$$

where $1(\cdot)$ is an indicator function.

Expected loss:

$$L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int l_k(x \mid \Lambda)\, 1(x \in C_k)\, dP(x) = \sum_{k=1}^{M} \int_{x \in C_k} l_k(x \mid \Lambda)\, dP(x) \qquad (24)$$

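MCE Classifier Design Using Discriminant Functions

If the posterior probability $P(C_i \mid x)$ is used, the Bayes minimum risk is

$$L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x)\, 1(x \in C_k)\, dP(x) \qquad (25)$$

where

$$\Omega_k = \{ x \in \Omega \mid P(C_k \mid x) \neq \max_j P(C_j \mid x) \}$$

$\Omega_k$ is the region in which the posterior of class k is not the largest, so the integral measures the loss from misclassifying class k observations.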

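MCE Classifier Design Using Discriminant Functions

If the posterior probability $P(C_i \mid x)$ is used, the Bayes minimum risk can be rewritten step by step:

$$L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int_{\Omega_k} P(C_k \mid x)\, 1(x \in C_k)\, dP(x)$$

$$= \sum_{k=1}^{M} \int_{\Omega} P(C_k \mid x)\, 1(x \in C_k)\, 1\!\left( P(C_k \mid x) \neq \max_j P(C_j \mid x) \right) dP(x)$$

$$\approx \sum_{k=1}^{M} \int_{\Omega} P(C_k \mid x)\, 1(x \in C_k)\, l(d_k(x \mid \Lambda))\, dP(x) \qquad (26)$$

The last step replaces the indicator of a misclassification by the smooth loss $l(d_k(x \mid \Lambda))$; evaluated on the training set, this gives the empirical loss.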

Optimization Methods

• Expected loss

$$L(\Lambda) = E_X[l(X \mid \Lambda)] = \sum_{k=1}^{M} \int_{x \in C_k} l_k(x \mid \Lambda)\, dP(x) \qquad (27)$$

We use a GPD-based minimization algorithm to minimize it:

$$\Lambda_{t+1} = \Lambda_t - \epsilon_t U_t \nabla l(X_t \mid \Lambda) \big|_{\Lambda = \Lambda_t} \qquad (28)$$

– $U_t$: a positive definite matrix
– $\epsilon_t$: a sequence of positive numbers
– $\nabla l(X_t \mid \Lambda)$: the gradient of the loss function
– $X_t$: the t-th training sample
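A minimal sketch of the GPD update (28) follows, under several assumptions that are not in the slides: linear discriminants $g_i(x) = w_i^T x$, the max-form misclassification measure, $U_t = I$, and the step-size schedule $\epsilon_t = \epsilon_0 / (1 + t)$. The function names and the toy data are hypothetical.

```python
import numpy as np

def mce_loss_and_grad(W, x, label, gamma=1.0):
    """Sigmoid MCE loss and its gradient for linear discriminants g_i(x) = W[i] @ x."""
    g = W @ x
    masked = np.where(np.arange(len(g)) == label, -np.inf, g)
    rival = int(np.argmax(masked))             # strongest competing class
    d = -g[label] + g[rival]                   # max-form misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d))    # sigmoid loss with theta = 0
    coeff = gamma * loss * (1.0 - loss)        # dl/dd
    grad = np.zeros_like(W)
    grad[label] = -coeff * x                   # dd/dW[label] = -x
    grad[rival] = coeff * x                    # dd/dW[rival] = +x
    return loss, grad

def gpd_train(X, labels, n_classes, epochs=20, eps0=0.5):
    """Sequential GPD: one parameter update per training sample."""
    W = np.zeros((n_classes, X.shape[1]))
    t = 0
    for _ in range(epochs):
        for x, c in zip(X, labels):
            _, grad = mce_loss_and_grad(W, x, c)
            W -= eps0 / (1.0 + t) * grad       # U_t = I, eps_t = eps0 / (1 + t)
            t += 1
    return W

# Toy data: the last feature is a constant 1 that acts as the bias input.
X = np.array([[0.0, 0.0, 1.0], [3.0, 0.0, 1.0], [0.0, 3.0, 1.0]])
labels = [0, 1, 2]
W = gpd_train(X, labels, n_classes=3)
```

Each update raises the score of the labelled class and lowers the score of its strongest rival, so the loss on the current sample decreases after the step.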

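Optimization Methods

If the following three properties are satisfied, then $\Lambda_t$ converges.

C1: $\epsilon_t$ is a sequence of numbers such that

$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty, \qquad \epsilon_t \ge 0$$

C2: there exists $0 \le V < \infty$ such that, for all t, the inner product

$$R_t(\Lambda_t, \epsilon_t) = \left\langle \nabla l(X, \Lambda_t),\; H\, \nabla l(X, \Lambda_t) \right\rangle \le V$$

where H is the Hessian matrix of second-order partial derivatives of l, evaluated at a point displaced from $\Lambda_t$ along the gradient $\nabla l(X, \Lambda_t)$ by a multiple of $\epsilon_t$.

C3: $\Lambda^* = \arg\min_{\Lambda} E_X\, l(X, \Lambda)$ is the unique $\Lambda$ such that

$$\nabla L(\Lambda) \big|_{\Lambda = \Lambda^*} = E_X\, \nabla l(X, \Lambda) \big|_{\Lambda = \Lambda^*} = 0$$

Then $\Lambda_{t+1} = \Lambda_t - \epsilon_t \nabla l(X_t, \Lambda) \big|_{\Lambda = \Lambda_t}$ will converge to $\Lambda^*$.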

Optimization Methods

• Empirical loss

$$L_0(\Lambda) = \frac{1}{I} \sum_{j=1}^{I} \sum_{i=1}^{M} l_i(x_j \mid \Lambda)\, 1(x_j \in C_i) = \int l(x \mid \Lambda)\, dP_I \qquad (31)$$

where I is the size of the training set and $P_I$ is the empirical measure defined on the training set.

$$\lim_{I \to \infty} \int f\, dP_I = \int f\, dP \qquad (32)$$
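The empirical loss is an average of sigmoid losses over the training set, so it acts as a smoothed count of training errors. The sketch below assumes a hypothetical list of misclassification measures, one per training sample, and shows that a steep sigmoid (large $\gamma$) brings the empirical loss close to the raw error rate.

```python
import numpy as np

def empirical_loss(d, gamma=1.0):
    """L_0 = (1/I) * sum_j l(d_j), with the sigmoid loss and theta = 0."""
    d = np.asarray(d, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(-gamma * d))))

def error_rate(d):
    """Fraction of samples with d > 0, i.e. misclassified samples."""
    return float(np.mean(np.asarray(d) > 0))

# Hypothetical misclassification measures d_i(x_j) for four training samples.
d_values = [-2.0, -0.5, 0.3, 1.5]

smooth = empirical_loss(d_values, gamma=1.0)
sharp = empirical_loss(d_values, gamma=50.0)
errors = error_rate(d_values)
```

The smooth version is differentiable in the model parameters, which is what makes gradient-based minimization possible; the error rate itself is piecewise constant.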

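HMM as a Discriminant Function

Use the HMM as the discriminant function. For class i:

$$P^{(i)}(X, q \mid \Lambda) = \pi^{(i)}_{q_0} \prod_{t=1}^{T} a^{(i)}_{q_{t-1} q_t}\, b^{(i)}_{q_t}(x_t) = g_i(X, q \mid \Lambda) \qquad (34)$$

The discriminant function $g_i(X \mid \Lambda)$ can be formed from $g_i(X, q \mid \Lambda)$ in three ways:

$$\text{1)}\quad g_i(X \mid \Lambda) = \sum_q g_i(X, q \mid \Lambda) \qquad (35)$$

$$\text{2)}\quad g_i(X \mid \Lambda) = \max_q g_i(X, q \mid \Lambda) \qquad (36)$$

$$\text{3)}\quad g_i(X \mid \Lambda) = \left[ \sum_{q=1}^{Q} g_i(X, q \mid \Lambda)^{\eta} \right]^{1/\eta} \qquad (37)$$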
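The three ways of collapsing the path scores $g_i(X, q \mid \Lambda)$ into one discriminant value can be compared on a toy model. The sketch below enumerates every state sequence of a small discrete HMM; the parameter values and function names are hypothetical, and the $\eta$-norm form assumes the sum over paths carries no extra normalizing factor.

```python
import itertools
import numpy as np

def path_scores(pi, A, B, obs):
    """g_i(X, q | Lambda) for every state sequence q = (q_0, ..., q_T)."""
    scores = []
    for q in itertools.product(range(len(pi)), repeat=len(obs) + 1):
        p = pi[q[0]]
        for t, o in enumerate(obs, start=1):
            p *= A[q[t - 1], q[t]] * B[q[t], o]
        scores.append(p)
    return np.array(scores)

def g_sum(scores):
    return scores.sum()                               # form 1: sum over all paths

def g_max(scores):
    return scores.max()                               # form 2: best path only

def g_eta(scores, eta):
    return np.sum(scores ** eta) ** (1.0 / eta)       # form 3: eta-norm over paths

# Toy two-state model with two discrete symbols (illustrative values only).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
scores = path_scores(pi, A, B, obs=[0, 1, 1])
```

Form 3 interpolates between the other two: $\eta = 1$ reproduces the sum, and increasing $\eta$ moves the value toward the best-path score.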

HMM as a Discriminant Function

Define $X = (x_1, x_2, \ldots, x_T)$, where each observation is a D-dimensional vector $x_t = [x_{t1}, x_{t2}, \ldots, x_{tD}]^T$ and D is the dimension.

$$g_i(X \mid \Lambda) = \log \left\{ \max_q g_i(X, q \mid \Lambda) \right\} = \log g_i(X, \bar{q} \mid \Lambda)$$

where $\bar{q}$ is the optimal state sequence, so that

$$g_i(X \mid \Lambda) = \sum_{t=1}^{T} \left[ \log a^{(i)}_{\bar{q}_{t-1} \bar{q}_t} + \log b^{(i)}_{\bar{q}_t}(x_t) \right] + \log \pi^{(i)}_{\bar{q}_0}$$

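HMM as a Discriminant Function

Assume the state observation density is a Gaussian mixture:

$$b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk}\, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]$$

The original constraints of the HMM must be maintained:

1) the functions are nonnegative
2) $\sum_j a_{ij} = 1$
3) $\sum_k c_{jk} = 1$
4) $\sigma_{jkl} > 0$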

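HMM as a Discriminant Function

We therefore use parameter transformations to preserve these constraints:

$$\text{1)}\quad a_{ij} \to \tilde{a}_{ij}, \qquad a_{ij} = \frac{e^{\tilde{a}_{ij}}}{\sum_k e^{\tilde{a}_{ik}}}$$

$$\text{2)}\quad c_{jk} \to \tilde{c}_{jk}, \qquad c_{jk} = \frac{e^{\tilde{c}_{jk}}}{\sum_k e^{\tilde{c}_{jk}}}$$

$$\text{3)}\quad \mu_{jkl} \to \tilde{\mu}_{jkl} = \frac{\mu_{jkl}}{\sigma_{jkl}}$$

$$\text{4)}\quad \sigma_{jkl} \to \tilde{\sigma}_{jkl} = \log \sigma_{jkl}$$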
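These transformations let the descent algorithm work on unconstrained values while the HMM parameters stay valid. The sketch below maps arbitrary real-valued arrays back to parameters that satisfy the constraints; the array shapes, the random values, and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise exp(z) / sum(exp(z)), shifted by the row maximum for stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def to_hmm_params(a_tilde, c_tilde, mu_tilde, sigma_tilde):
    """Invert the four transformations: unconstrained values -> valid HMM parameters."""
    a = softmax_rows(a_tilde)        # each row of transition probabilities sums to 1
    c = softmax_rows(c_tilde)        # mixture weights of each state sum to 1
    sigma = np.exp(sigma_tilde)      # standard deviations are strictly positive
    mu = sigma * mu_tilde            # undo the normalization of the means
    return a, c, mu, sigma

# Hypothetical sizes: N = 3 states, K = 2 mixture components, D = 4 dimensions.
rng = np.random.default_rng(0)
a_tilde = rng.normal(size=(3, 3))
c_tilde = rng.normal(size=(3, 2))
mu_tilde = rng.normal(size=(3, 2, 4))
sigma_tilde = rng.normal(size=(3, 2, 4))

a, c, mu, sigma = to_hmm_params(a_tilde, c_tilde, mu_tilde, sigma_tilde)
```

Because any real value of the transformed parameters maps to a valid model, a plain gradient step never needs a projection back onto the constraint set.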

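HMM as a Discriminant Function

For a training sample $X_n \in C_i$, the discriminative adjustment of the mean vector is

$$\tilde{\mu}_{jkl}(n+1) = \tilde{\mu}_{jkl}(n) - \epsilon \left. \frac{\partial l_i(X_n \mid \Lambda)}{\partial \tilde{\mu}_{jkl}} \right|_{\Lambda = \Lambda_n}$$

where $l_i = (1 + \exp(-\gamma d_i + \theta))^{-1}$ and, by the chain rule,

$$\frac{\partial l_i(X_n \mid \Lambda)}{\partial \tilde{\mu}_{jkl}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{\mu}_{jkl}}$$

The first factor is

$$\frac{\partial l_i}{\partial d_i} = -(1 + \exp(-\gamma d_i + \theta))^{-2} \exp(-\gamma d_i + \theta)\, (-\gamma)$$

$$= \gamma\, l_i(d_i)^2 \left( 1 + \exp(-\gamma d_i + \theta) - 1 \right) = \gamma\, l_i(d_i)^2 \left( l_i(d_i)^{-1} - 1 \right) = \gamma\, l_i(d_i) \left( 1 - l_i(d_i) \right)$$

The second factor is

$$\frac{\partial d_i(X_n \mid \Lambda)}{\partial \tilde{\mu}_{jkl}} = -\sum_{t=1}^{T} \delta(\bar{q}_t - j)\, \frac{\partial \log b_j(x_t)}{\partial \tilde{\mu}_{jkl}}$$

where $\delta(\cdot)$ denotes the Kronecker delta function:

$$\delta(n) = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}$$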
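The closed form $\partial l / \partial d = \gamma\, l\, (1 - l)$ can be verified against a finite difference. The sketch below uses only the standard library; the function names and the test points are hypothetical.

```python
import math

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """l(d) = 1 / (1 + exp(-gamma * d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def sigmoid_loss_grad(d, gamma=1.0, theta=0.0):
    """Closed form: dl/dd = gamma * l * (1 - l)."""
    l = sigmoid_loss(d, gamma, theta)
    return gamma * l * (1.0 - l)

def numeric_grad(d, gamma=1.0, theta=0.0, h=1e-6):
    """Central finite difference, used only as a check on the closed form."""
    up = sigmoid_loss(d + h, gamma, theta)
    down = sigmoid_loss(d - h, gamma, theta)
    return (up - down) / (2.0 * h)

points = [-3.0, -0.5, 0.0, 0.7, 2.5]
gaps = [abs(sigmoid_loss_grad(d, 2.0, 0.3) - numeric_grad(d, 2.0, 0.3)) for d in points]
```

The gradient peaks near the decision boundary and vanishes for samples that are far on either side, so training effort concentrates on samples close to being misclassified.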

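HMM as a Discriminant Function

and

$$\frac{\partial \log b_j(x_t)}{\partial \tilde{\mu}_{jkl}} = c_{jk} (2\pi)^{-D/2} |R_{jk}|^{-1/2} (b_j(x_t))^{-1} \left( \frac{x_{tl}}{\sigma_{jkl}} - \tilde{\mu}_{jkl} \right) \exp\left\{ -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl}}{\sigma_{jkl}} - \tilde{\mu}_{jkl} \right)^2 \right\}$$

Finally,

$$\mu_{jkl}(n+1) = \sigma_{jkl}\, \tilde{\mu}_{jkl}(n+1)$$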

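HMM as a Discriminant Function

Derivation of the gradient with respect to the transformed mean. The state observation density and its Gaussian components are

$$b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk}\, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]$$

$$N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] = (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp\left( -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} \right)^2 \right)$$

Differentiating the logarithm:

$$\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\mu}^{(i)}_{jkl}} = (b^{(i)}_j(x_t))^{-1}\, c^{(i)}_{jk}\, \frac{\partial N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]}{\partial \tilde{\mu}^{(i)}_{jkl}}$$

Since $\tilde{\mu}^{(i)}_{jkl} = \mu^{(i)}_{jkl} / \sigma^{(i)}_{jkl}$, the normalized deviation is $\frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} = \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl}$, so

$$\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\mu}^{(i)}_{jkl}} = (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \right)^2 \right\} \left[ -\left( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \right) \right] (-1)$$

$$= (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \right)^2 \right\} \left( \frac{x_{tl}}{\sigma^{(i)}_{jkl}} - \tilde{\mu}^{(i)}_{jkl} \right)$$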

HMM as a Discriminant Function

For a training sample $X_n \in C_i$, the discriminative adjustment of the variance follows in the same way.

HMM as a Discriminant Function

Derivation of the gradient with respect to the transformed variance. As before,

$$b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk}\, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}], \qquad N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] = (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp\left( -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\sigma^{(i)}_{jkl}} \right)^2 \right)$$

$$\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\sigma}^{(i)}_{jkl}} = (b^{(i)}_j(x_t))^{-1}\, c^{(i)}_{jk}\, \frac{\partial N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]}{\partial \tilde{\sigma}^{(i)}_{jkl}}$$

With $\sigma^{(i)}_{jkl} = \exp(\tilde{\sigma}^{(i)}_{jkl})$ and a diagonal covariance, $|R^{(i)}_{jk}|^{-1/2} = \prod_{l=1}^{D} \exp(-\tilde{\sigma}^{(i)}_{jkl})$, so only the l-th factor depends on $\tilde{\sigma}^{(i)}_{jkl}$:

$$\frac{\partial}{\partial \tilde{\sigma}^{(i)}_{jkl}} \left\{ \exp(-\tilde{\sigma}^{(i)}_{jkl}) \exp\left[ -\frac{1}{2} \left( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \right)^2 \right] \right\} = \exp(-\tilde{\sigma}^{(i)}_{jkl}) \exp\left[ -\frac{1}{2} \left( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \right)^2 \right] \left[ -1 + \frac{(x_{tl} - \mu^{(i)}_{jkl})^2}{\exp(\tilde{\sigma}^{(i)}_{jkl})^2} \right]$$

Collecting the factors:

$$\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{\sigma}^{(i)}_{jkl}} = (b^{(i)}_j(x_t))^{-1} c^{(i)}_{jk} (2\pi)^{-D/2} |R^{(i)}_{jk}|^{-1/2} \exp\left\{ -\frac{1}{2} \sum_{l=1}^{D} \left( \frac{x_{tl} - \mu^{(i)}_{jkl}}{\exp(\tilde{\sigma}^{(i)}_{jkl})} \right)^2 \right\} \left[ \frac{(x_{tl} - \mu^{(i)}_{jkl})^2}{\exp(\tilde{\sigma}^{(i)}_{jkl})^2} - 1 \right]$$
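Both Gaussian-mixture gradients can be checked numerically. The sketch below assumes a diagonal-covariance mixture for a single state, treats $\sigma$ as fixed when differentiating with respect to $\tilde{\mu}$ and $\mu$ as fixed when differentiating with respect to $\tilde{\sigma}$, and uses hypothetical parameter values and function names.

```python
import numpy as np

def log_b(x, c, mu, sigma):
    """log b_j(x) for a diagonal Gaussian mixture; also returns each N_k(x)."""
    z = (x - mu) / sigma                                  # shape (K, D)
    log_n = (-0.5 * np.sum(z ** 2, axis=1)
             - np.sum(np.log(sigma), axis=1)
             - 0.5 * x.size * np.log(2.0 * np.pi))
    dens = np.exp(log_n)
    return np.log(np.sum(c * dens)), dens

def grad_mu_tilde(x, c, mu_tilde, sigma, k, l):
    """d log b / d mu_tilde[k, l], with mu = sigma * mu_tilde and sigma held fixed."""
    value, dens = log_b(x, c, sigma * mu_tilde, sigma)
    return c[k] * dens[k] / np.exp(value) * (x[l] / sigma[k, l] - mu_tilde[k, l])

def grad_sigma_tilde(x, c, mu, sigma_tilde, k, l):
    """d log b / d sigma_tilde[k, l], with sigma = exp(sigma_tilde) and mu held fixed."""
    sigma = np.exp(sigma_tilde)
    value, dens = log_b(x, c, mu, sigma)
    z = (x[l] - mu[k, l]) / sigma[k, l]
    return c[k] * dens[k] / np.exp(value) * (z ** 2 - 1.0)

# Hypothetical mixture with K = 2 components in D = 2 dimensions.
x = np.array([0.5, -1.0])
c = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [1.0, -1.0]])
sigma = np.array([[1.0, 2.0], [0.5, 1.5]])
mu_tilde, sigma_tilde = mu / sigma, np.log(sigma)

g_mu = grad_mu_tilde(x, c, mu_tilde, sigma, k=1, l=0)
g_sigma = grad_sigma_tilde(x, c, mu, sigma_tilde, k=0, l=1)
```

The factor `c[k] * dens[k] / b` is the posterior weight of component k given x, so each component is adjusted in proportion to how much it explains the observation.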

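HMM as a Discriminant Function

Gradient with respect to the transformed mixture weights:

$$\frac{\partial l_i(X_n; \Lambda)}{\partial \tilde{c}^{(i)}_{jk}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{c}^{(i)}_{jk}} = \gamma\, l_i(d_i)^2 \left( l_i(d_i)^{-1} - 1 \right) \frac{\partial d_i}{\partial \tilde{c}^{(i)}_{jk}}$$

where $\partial d_i / \partial \tilde{c}^{(i)}_{jk}$ is obtained from $\partial \log b^{(i)}_j(x_t) / \partial \tilde{c}^{(i)}_{jk}$, with

$$b^{(i)}_j(x_t) = \sum_{k=1}^{K} c^{(i)}_{jk}\, N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}], \qquad c^{(i)}_{jk} = \frac{\exp(\tilde{c}^{(i)}_{jk})}{\sum_z \exp(\tilde{c}^{(i)}_{jz})}$$

Then

$$\frac{\partial \log b^{(i)}_j(x_t)}{\partial \tilde{c}^{(i)}_{jk}} = (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]\, \frac{\partial}{\partial \tilde{c}^{(i)}_{jk}} \frac{\exp(\tilde{c}^{(i)}_{jk})}{\sum_{z \neq k} \exp(\tilde{c}^{(i)}_{jz}) + \exp(\tilde{c}^{(i)}_{jk})}$$

$$= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]\, \frac{\exp(\tilde{c}^{(i)}_{jk}) \sum_z \exp(\tilde{c}^{(i)}_{jz}) - \exp(\tilde{c}^{(i)}_{jk})^2}{\left( \sum_z \exp(\tilde{c}^{(i)}_{jz}) \right)^2}$$

$$= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] \left[ c^{(i)}_{jk} - (c^{(i)}_{jk})^2 \right]$$

$$= (b^{(i)}_j(x_t))^{-1} N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}]\, c^{(i)}_{jk} \left( 1 - c^{(i)}_{jk} \right)$$

This derivation keeps only the direct dependence of $c^{(i)}_{jk}$ on $\tilde{c}^{(i)}_{jk}$. The other weights $c^{(i)}_{jz}$, $z \neq k$, also depend on $\tilde{c}^{(i)}_{jk}$ through the normalization; including them gives $c^{(i)}_{jk} \left( N[x_t; \mu^{(i)}_{jk}, R^{(i)}_{jk}] - b^{(i)}_j(x_t) \right) / b^{(i)}_j(x_t)$.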
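The derivative of $\log b_j$ with respect to a transformed mixture weight can be checked with a finite difference. The sketch below compares the numeric gradient against the closed form $c_k (N_k - b) / b$, which accounts for the normalization across all weights, and also evaluates the direct term $b^{-1} N_k c_k (1 - c_k)$ from the derivation above. The component densities and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def log_b(c_tilde, dens):
    """log b = log sum_k c_k * N_k, with c = softmax(c_tilde)."""
    return np.log(np.sum(softmax(c_tilde) * dens))

def full_grad(c_tilde, dens):
    """d log b / d c_tilde_k including the normalization: c_k * (N_k - b) / b."""
    c = softmax(c_tilde)
    b = np.sum(c * dens)
    return c * (dens - b) / b

def direct_term(c_tilde, dens):
    """Direct term only: N_k * c_k * (1 - c_k) / b."""
    c = softmax(c_tilde)
    b = np.sum(c * dens)
    return dens * c * (1.0 - c) / b

# Hypothetical component densities N_k(x_t) and transformed weights for K = 3.
dens = np.array([0.02, 0.30, 0.10])
c_tilde = np.array([0.1, -0.4, 0.8])

h = 1e-6
numeric = np.array([
    (log_b(c_tilde + h * e_k, dens) - log_b(c_tilde - h * e_k, dens)) / (2 * h)
    for e_k in np.eye(3)
])
```

The components of the full gradient sum to zero, which reflects that adding a constant to every transformed weight leaves the mixture unchanged.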

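HMM as a Discriminant Function

Gradient with respect to the transformed transition probabilities:

$$\frac{\partial l_i(X_n; \Lambda)}{\partial \tilde{a}^{(i)}_{jk}} = \frac{\partial l_i}{\partial d_i} \frac{\partial d_i}{\partial \tilde{a}^{(i)}_{jk}} = \gamma\, l_i(d_i)^2 \left( l_i(d_i)^{-1} - 1 \right) \frac{\partial d_i}{\partial \tilde{a}^{(i)}_{jk}}$$

where $\partial d_i / \partial \tilde{a}^{(i)}_{jk}$ is obtained from $\partial \log a^{(i)}_{jk} / \partial \tilde{a}^{(i)}_{jk}$, with

$$a^{(i)}_{jk} = \frac{\exp(\tilde{a}^{(i)}_{jk})}{\sum_z \exp(\tilde{a}^{(i)}_{jz})}$$

Then

$$\frac{\partial a^{(i)}_{jk}}{\partial \tilde{a}^{(i)}_{jk}} = \frac{\partial}{\partial \tilde{a}^{(i)}_{jk}} \frac{\exp(\tilde{a}^{(i)}_{jk})}{\sum_{z \neq k} \exp(\tilde{a}^{(i)}_{jz}) + \exp(\tilde{a}^{(i)}_{jk})} = \frac{\exp(\tilde{a}^{(i)}_{jk}) \sum_z \exp(\tilde{a}^{(i)}_{jz}) - \exp(\tilde{a}^{(i)}_{jk})^2}{\left( \sum_z \exp(\tilde{a}^{(i)}_{jz}) \right)^2} = a^{(i)}_{jk} - (a^{(i)}_{jk})^2$$

so that $\partial \log a^{(i)}_{jk} / \partial \tilde{a}^{(i)}_{jk} = 1 - a^{(i)}_{jk}$.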

HMM as a Discriminant Function

• How should the step size be designed?
– If the step size is too large, the classifier is degraded at the start and sequential learning cannot succeed

– If the step size is too small, the algorithm converges too slowly to be useful in practice

• Designing the step size is difficult, and a general solution is still lacking

HMM as a Discriminant Function

• Why normalize the mean vector?

– The magnitude of the variances can vary in the range between 100 and $10^{-5}$.

– With a constant step size for all mean vectors, the algorithm will either fail to converge or be so slow that it is practically useless

• Normalizing the mean by the standard deviation removes the dependence on the variation of the variances

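Relation between MCE and MMI

The MMI approach is based on the mutual information $I(W_c, X)$ between the acoustic observation $X$ and the correct lexical symbol $W_c$:

$$I(W_c, X) = \log \frac{p(W_c, X)}{p(W_c)\, p(X)} = \log \frac{p(X \mid W_c)}{p(X)}$$

$$= \log \frac{p(X \mid W_c)}{\sum_{k=1}^{N} p(W_k, X)} = \log \frac{p(X \mid W_c)}{\sum_{k=1}^{N} p(W_k)\, p(X \mid W_k)}$$

Let $e^{r_c(X)} = p(X \mid W_c)$. The expression above becomes

$$I(W_c, X) = r_c(X) - \log \left( \sum_{k=1}^{N} p(W_k)\, e^{r_k(X)} \right)$$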

Relation between MCE and MMI

The log-likelihood is related to $I(W_c, X)$:

$$\log p(W_c \mid X) = \log \left( \frac{p(W_c, X)}{p(X)} \right) = \log \left( \frac{p(W_c, X)\, p(W_c)}{p(X)\, p(W_c)} \right) = I(W_c, X) + \log p(W_c)$$

MMI therefore maximizes the average mutual information $I(W_c, X)$, using ML.

Now consider the relation between MMI and MCE. First assume a uniform language model:

$$p(W_k) = \frac{1}{N}, \quad k = 1, \ldots, N$$

Relation between MCE and MMI

Then

$$I(W_c, X) = r_c(X) - \log \left( \sum_{k=1}^{N} p(W_k)\, e^{r_k(X)} \right) = r_c(X) - \log \left( \frac{1}{N} \sum_{k=1}^{N} e^{r_k(X)} \right) = r_c(X) - \log \left( \sum_{k=1}^{N} e^{r_k(X)} \right) + \log N$$

The MMI estimate is

$$\hat{\Lambda}_{MMI} = \arg\max_{\Lambda} E_X\, I(W_c, X) = \arg\max_{\Lambda} E_X \left[ r_c(X) - \log \left( \sum_{k=1}^{N} e^{r_k(X)} \right) + \log N \right]$$

Relation between MCE and MMI

For MCE:

1) The misclassification measure

$$d_c(X) = -r_c(X) + \log \left[ \frac{1}{N-1} \sum_{k: W_k \neq W_c} e^{\eta\, r_k(X)} \right]^{1/\eta}$$

where $r_c(X) = \log p(X \mid W_c)$.

2) The loss function

$$l(d_c(X)) = \frac{1}{1 + e^{-\gamma\, d_c(X)}}, \quad \gamma > 0$$

Relation between MCE and MMI

3) The expected loss $E_X[l(d_c(X))]$ is minimized:

$$\hat{\Lambda}_{MCE} = \arg\min_{\Lambda} E_X[l(d_c(X))]$$

Consider $\eta = 1$:

$$d_c(X) = -r_c(X) + \log \frac{\sum_{k: W_k \neq W_c} e^{r_k(X)}}{N-1} = -r_c(X) + \log \left( \sum_{k: W_k \neq W_c} e^{r_k(X)} \right) - \log(N-1)$$

Relation between MCE and MMI

$$I(W_c, X) = r_c(X) - \log \left( \sum_{k=1}^{N} e^{r_k(X)} \right) + \log N$$

$$= -\left[ -r_c(X) + \log \left( \sum_{k=1}^{N} e^{r_k(X)} \right) \right] + \log N$$

$$= -\left[ \log e^{-r_c(X)} + \log \left( \sum_{k=1}^{N} e^{r_k(X)} \right) \right] + \log N$$

$$= -\log \left( e^{-r_c(X)} \sum_{k=1}^{N} e^{r_k(X)} \right) + \log N$$

$$= -\log \left( e^{-r_c(X)} \sum_{k: W_k \neq W_c} e^{r_k(X)} + 1 \right) + \log N$$

$$= -\log \left( e^{d_c(X) + \log(N-1)} + 1 \right) + \log N$$

$$= \log \left( \frac{1}{1 + e^{d_c(X) + \log(N-1)}} \right) + \log N$$

$$= \log \left[ l\left( -d_c(X) - \log(N-1) \right) \right] + \log N$$

with $\gamma = 1$ in the sigmoid $l(\cdot)$. This is the loss function of MMI.
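The identity linking the mutual information to the sigmoid of the misclassification measure can be confirmed numerically. The sketch below assumes a uniform language model, $\eta = 1$ and $\gamma = 1$; the log-likelihood values and function names are hypothetical.

```python
import numpy as np

def mutual_information(r, c):
    """I(W_c, X) = r_c - log(sum_k exp(r_k)) + log N under a uniform language model."""
    return r[c] - np.log(np.sum(np.exp(r))) + np.log(len(r))

def misclassification(r, c):
    """d_c = -r_c + log[(1/(N-1)) * sum_{k != c} exp(r_k)], i.e. eta = 1."""
    return -r[c] + np.log(np.mean(np.exp(np.delete(r, c))))

def sigmoid(d):
    """l(d) with gamma = 1 and theta = 0."""
    return 1.0 / (1.0 + np.exp(-d))

# Hypothetical log-likelihoods r_k(X) = log p(X | W_k) for N = 4 candidates.
r = np.array([-3.0, -5.0, -4.2, -6.1])
c = 0                                   # index of the correct candidate W_c
n = len(r)

lhs = mutual_information(r, c)
rhs = np.log(sigmoid(-misclassification(r, c) - np.log(n - 1))) + np.log(n)
```

The argument of the sigmoid is shifted by $\log(N-1)$, so the MMI objective is not centred on the decision boundary $d_c = 0$ the way the MCE loss is.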

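Relation between MCE and MMI

$I(W_c, X)$ ranges from $\log 0 + \log N$ to $\log 1 + \log N$, that is, from $-\infty$ to $\log N$.

$$\text{maximize } I(W_c, X) \;\Rightarrow\; \text{minimize } \log \left( e^{d_c(X) + \log(N-1)} + 1 \right) \;\Rightarrow\; \text{minimize } d_c(X)$$

$$\Rightarrow\; \text{maximize } p(X \mid W_c) \text{ and minimize } \sum_{k: W_k \neq W_c} e^{r_k(X)}$$

MMI originally aims to maximize the posterior probability $p(W_c \mid X)$, but this now amounts to maximizing $p(X \mid W_c)$, which is modeled by the distribution.

The objective function of MMI is asymmetrical. Both objective functions range between 0 and 1, but at $d_c(X) = 0$ the MMI objective takes the value $\frac{1}{1 + (N-1)}$, whereas the MCE loss takes the value $\frac{1}{2}$.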

Relation between MCE and MMI

• The MCE approach has several advantages in classifier design:

– It is meaningful in the sense of minimizing the empirical recognition error rate of the classifier

– If the true class posterior distributions are used as discriminant functions, the asymptotic behavior of the classifier will approximate the minimum Bayes risk

Summary

• We examined the classical Bayes decision theory approach to the problem of pattern classification.

• We do not know the actual probability distribution.

• So we minimize the expected loss to obtain a set of parameters.

• We covered what MCE is and how to use it to solve problems.


