# Discriminative Learning for Hidden Markov Models

```
Discriminative Learning for Hidden Markov Models

Li Deng
Microsoft Research

EE 516; UW Spring 2009
Minimum Classification Error (MCE)

- The objective function of MCE training is a smoothed recognition error rate.
- Traditionally, the MCE criterion is optimized by stochastic gradient descent methods (e.g., GPD).
- In this work we propose a Growth Transformation based method for MCE model estimation.

Automatic Speech Recognition (ASR)

Speech recognition (decoding):
  s_r* = argmax_{s_r} log p(s_r | X_r) = argmax_{s_r} log p_Λ(X_r, s_r)

The speech signal of the r-th utterance is segmented into frames t = 1, 2, ..., T;
spectrum analysis yields the feature sequence X_r = x_1, x_2, x_3, x_4, ..., x_t, ..., x_T.
Decoding then returns s_r* = argmax_{s_r} p_Λ(X_r, s_r),
e.g., "(sil) OH (sil) SIX EIGHT (sil)".

Models (feature functions) in ASR

ASR in the log-linear framework:
  p_Λ(X_r, s_r) = exp( Σ_{m=1}^{3} λ_m h_m(s_r, X_r) )

  h_1(s_r, X_r) = log p(X_r | s_r; Λ)   (AM)       λ_1 = 1
  h_2(s_r, X_r) = log p(s_r)            (LM)       λ_2 = s (LM scale)
  h_3(s_r, X_r) = |s_r|                 (#words)   λ_3 = p (word ins. penalty)

Λ is the parameter set of the acoustic model (HMM), which is the quantity
of interest in the MCE training of this work.

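Spelled out, the combined decoding score is (a sketch; s and p are the LM scale and word insertion penalty defined above):

  score(s_r) = log p(X_r | s_r; Λ) + s · log p(s_r) + p · |s_r|

i.e., decoding adds the acoustic log-likelihood, the scaled language-model log-probability, and a per-word insertion penalty.
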
MCE: Misclassification measure

Define the misclassification measure:
  d_r(X_r, Λ) = log p_Λ(X_r, s_{r,1}) - log p_Λ(X_r, S_r)
(in the case of using the correct string and the top one incorrect competing token)

Observation seq.:    X_r = x_1, x_2, x_3, x_4, ..., x_t, ..., x_T
Correct label S_r:         OH  THREE  EIGHT
Competitor s_{r,1}:        OH  SIX    EIGHT

s_{r,1}: the top one incorrect (not equal to S_r) competing string

MCE: Loss function

Classification:  s_r* = argmax_{s_r} log p_Λ(X_r, s_r)

Classification error:
  d_r(X_r, Λ) > 0  ->  1 classification error
  d_r(X_r, Λ) < 0  ->  0 classification error

Loss function (a smoothed error count: a sigmoid rising from 0 to 1 in d):
  l_r(d_r(X_r, Λ)) = 1 / (1 + e^{-d_r(X_r, Λ)})

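A small worked example of the loss (numbers chosen for illustration): if
  log p_Λ(X_r, s_{r,1}) = -100  and  log p_Λ(X_r, S_r) = -98,
then d_r = -100 - (-98) = -2 < 0 (the correct string wins), and
  l_r = 1 / (1 + e^{2}) ≈ 0.12,
a small, smoothed contribution to the error count instead of a hard 0.
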
MCE: Objective function

MCE objective function:
  L_MCE(Λ) = (1/R) Σ_{r=1}^{R} l_r(d_r(X_r, Λ))

L_MCE(Λ) is the smoothed recognition error rate on the string (token) level.
The (acoustic) model is trained to minimize L_MCE(Λ), i.e.,
  Λ* = argmin_Λ { L_MCE(Λ) }

MCE: Optimization

Traditional stochastic GD            New Growth Transformation
- online optimization method         - batch-mode method
- convergence is unstable            - stable convergence
- training process is difficult      - ready for parallelized
  to parallelize                       processing

MCE: Optimization

Growth Transformation based MCE:
If Λ = T(Λ') ensures P(Λ) > P(Λ'), i.e., P(Λ) grows, then
T(∙) is called a growth transformation of Λ for P(Λ).

Chain of reformulations:
  Minimizing  L_MCE(Λ) = Σ l(d(∙))
  -> Maximizing  P(Λ) = G(Λ) / H(Λ)
  -> Maximizing  F(Λ;Λ') = G(Λ) - P(Λ')·H(Λ) + D
  -> Maximizing  F(Λ;Λ') = Σ f(∙)   (integral/sum form)
  -> Maximizing  U(Λ;Λ') = Σ f'(∙) log f(∙)
  -> GT formula:  ∂U(∙)/∂Λ = 0  =>  Λ = T(Λ')

MCE: Optimization

Re-write the MCE loss function as
  l_r(d_r(X_r, Λ)) = p(X_r, s_{r,1} | Λ) / [ p(X_r, s_{r,1} | Λ) + p(X_r, S_r | Λ) ]

Then min. L_MCE(Λ) <=> max. Q(Λ), where
  Q(Λ) = R·(1 - L_MCE(Λ))
       = Σ_{r=1}^{R}  p(X_r, S_r | Λ) / [ p(X_r, s_{r,1} | Λ) + p(X_r, S_r | Λ) ]
       = Σ_{r=1}^{R}  [ Σ_{s_r ∈ {s_{r,1}, S_r}} p(X_r, s_r | Λ) δ(s_r, S_r) ]
                    / [ Σ_{s_r ∈ {s_{r,1}, S_r}} p(X_r, s_r | Λ) ]

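The identity Q(Λ) = R·(1 - L_MCE(Λ)) follows directly from the rewritten loss:
  1 - l_r(d_r) = p(X_r, S_r | Λ) / [ p(X_r, s_{r,1} | Λ) + p(X_r, S_r | Λ) ],
so summing over the R utterances gives
  Σ_r (1 - l_r) = R - R·L_MCE(Λ) = Q(Λ).
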
MCE: Optimization

Q(Λ) is further re-formulated into a single rational function P(Λ):
  P(Λ) = G(Λ) / H(Λ)
where
  G(Λ) = Σ_{s_1} ... Σ_{s_R}  p(X_1, ..., X_R, s_1, ..., s_R | Λ) · Σ_{r=1}^{R} δ(s_r, S_r)
  H(Λ) = Σ_{s_1} ... Σ_{s_R}  p(X_1, ..., X_R, s_1, ..., s_R | Λ)

MCE: Optimization

Increasing P(Λ) can be achieved by maximizing
  F(Λ;Λ') = G(Λ) - P(Λ')·H(Λ) + D
as long as D is a Λ-independent constant, because
  P(Λ) - P(Λ') = [ F(Λ;Λ') - F(Λ';Λ') ] / H(Λ)
(Λ' is the parameter set obtained from the last iteration.)

Substituting G(Λ) and H(Λ) into F(∙),
  F(Λ;Λ') = Σ_q Σ_s  p(X, q, s | Λ) · [ C(s) - P(Λ') ] + D
where q is the HMM state sequence and s = (s_1, ..., s_R).

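The growth property follows in one line. Since G(Λ') = P(Λ')·H(Λ'),
  F(Λ;Λ') - F(Λ';Λ') = G(Λ) - P(Λ')·H(Λ) = H(Λ)·[ P(Λ) - P(Λ') ],
so with H(Λ) > 0, any Λ that increases F(Λ;Λ') also increases P(Λ).
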
MCE: Optimization

Reformulate F(Λ;Λ') as
  F(Λ;Λ') = Σ_s Σ_q ∫ f(χ, q, s, Λ; Λ') dχ
where
  f(χ, q, s, Λ; Λ') = [ Γ(Λ') + d(s) ] · p(χ, q | s, Λ)
  Γ(Λ') = δ(χ, X) · p(s) · [ C(s) - P(Λ') ]
  C(s) = Σ_{r=1}^{R} δ(s_r, S_r)

F(Λ;Λ') is ready for EM-style optimization.
Note: Γ(Λ') is a constant w.r.t. Λ, and log p(χ, q | s, Λ) is easy to decompose.

MCE: Optimization

Increasing F(Λ;Λ') can be achieved by maximizing
  U(Λ;Λ') = Σ_s Σ_q ∫ f(χ, q, s, Λ'; Λ') log f(χ, q, s, Λ; Λ') dχ

Use extended Baum-Welch for the E step.
log f(χ, q, s, Λ; Λ') is decomposable w.r.t. Λ, so the M step is easy to compute.
So the growth transformation of Λ for the CDHMM is:
  ∂U(Λ)/∂Λ = 0   =>   Λ = T(Λ')

MCE: Model estimation formulas

For a Gaussian-mixture CDHMM,
  p(x | μ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )

The GT of the mean and covariance of Gaussian m is
  μ_m = [ Σ_r Σ_t Δγ_{m,r}(t) x_{r,t} + D_m μ'_m ] / [ Σ_r Σ_t Δγ_{m,r}(t) + D_m ]

  Σ_m = [ Σ_r Σ_t Δγ_{m,r}(t) (x_{r,t} - μ_m)(x_{r,t} - μ_m)^T
          + D_m Σ'_m + D_m (μ_m - μ'_m)(μ_m - μ'_m)^T ]
        / [ Σ_r Σ_t Δγ_{m,r}(t) + D_m ]

where
  Δγ_{m,r}(t) = p(S_r | X_r, Λ') p(s_{r,1} | X_r, Λ') [ γ_{m,r,S_r}(t) - γ_{m,r,s_{r,1}}(t) ]
and γ_{m,r,s}(t) is the occupation probability of Gaussian m at frame t given string s.

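The mean update can be read as an interpolation between discriminative statistics and the previous estimate (pseudocode sketch; the accumulator names are illustrative):

  for each utterance r, frame t:
      num_m += Δγ_{m,r}(t) · x_{r,t}
      den_m += Δγ_{m,r}(t)
  μ_m = (num_m + D_m · μ'_m) / (den_m + D_m)

A large D_m keeps μ_m close to μ'_m (stable but slow); a small D_m moves it further along the discriminative direction.
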
MCE: Model estimation formulas

Setting of D_m:

Theoretically, set D_m so that f(χ, q, s, Λ; Λ') > 0.

Empirically,
  D_m = E · Σ_{r=1}^{R} p(S_r | X_r, Λ') p(s_{r,1} | X_r, Λ')
              · [ Σ_t γ_{m,r,S_r}(t) + Σ_t γ_{m,r,s_{r,1}}(t) ]
where E is a constant factor controlling the amount of smoothing.

MCE: Workflow

  Training utterances --(recognition with last iteration's model Λ')--> competing strings
  Competing strings + training transcripts + Λ' --(GT-MCE)--> new model Λ --> next iteration

Experiment: TI-DIGITS

- Vocabulary: "1" to "9", plus "oh" and "zero"
- Training set: 8623 utterances / 28329 words
- Test set: 8700 utterances / 28583 words
- 33-dimensional spectrum feature: energy + 10 MFCCs, plus Δ and ΔΔ features
- Model: Continuous Density HMMs
- Total number of Gaussian components: 3284

Experiment: TI-DIGITS

GT-MCE vs. ML (maximum likelihood) baseline

[Figure: two plots of MCE training on TIdigits for E = 1.0, 2.0, 2.5:
 WER (%) vs. MCE iteration, and the loss function (sigmoid error count)
 vs. MCE iteration]

- Obtains the lowest error rate on this task
- Reduces recognition Word Error Rate (WER) by 23%
- Fast and stable convergence

Experiment: Microsoft Tele. ASR

- Microsoft Speech Server - ENUTEL
- A telephony speech recognition system
- Training set: 2000 hours of speech / 2.7 million utterances
- 33-dim spectrum features: (E + MFCCs) + Δ + ΔΔ
- Acoustic model: Gaussian mixture HMM
- Total number of Gaussian components: 100K
- Vocabulary: 120K (delivered vendor lexicon)
- CPU cluster: 100 CPUs @ 1.8 GHz - 3.4 GHz
- Training cost: 4~5 hours per iteration

Experiment: Microsoft Tele. ASR

- Evaluated on four corpus-independent tests, collected from sites other
  than the training data providers
- Covers major commercial telephony ASR scenarios

Name   Voc. size  # words  Description
MSCT   70K        4356     enterprise call center system (the MS call center we use daily)
SA     20K        43966    major commercial applications (includes many cell phone data)
QSR    55K        5718     name dialing system (many names are OOV; relies on LTS)
ACNT   20K        3219     foreign-accented speech recognition (designed to test system robustness)

Experiment: Microsoft Tele. ASR
WER               ML      GT-MCE       WER reduction
MSCT          11.59%        9.73%           16.04%
SA            11.24%       10.07%           10.40%
QSR            9.55%        8.58%           10.07%
ACNT          32.68%       29.00%           11.25%

Significant performance improvements across-the-board
The first time MCE is successfully applied to a 2000 hr.
speech database
The Growth Transformation based MCE training is well
suited for large scale modeling tasks
22

```
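The two core quantities in the slides, the sigmoid-smoothed error count (slide "MCE: Loss function") and the growth-transformation mean update (slide "MCE: Model estimation formulas"), can be sketched in a few lines. This is a toy 1-D illustration under assumed inputs, not the system described in the deck; the function names and the per-frame weight list are invented for the example.

```python
import math

def mce_loss(log_p_correct, log_p_competitor):
    """Sigmoid-smoothed 0/1 error for one utterance:
    l_r = 1 / (1 + exp(-d_r)),  d_r = log p(X, s_r1) - log p(X, S_r)."""
    d = log_p_competitor - log_p_correct   # misclassification measure
    return 1.0 / (1.0 + math.exp(-d))

def gt_mean_update(delta_gamma, frames, mu_old, D):
    """GT update of a 1-D Gaussian mean:
    mu = (sum_t dg(t) x_t + D mu') / (sum_t dg(t) + D)."""
    num = sum(g * x for g, x in zip(delta_gamma, frames)) + D * mu_old
    den = sum(delta_gamma) + D
    return num / den

# Equal correct/competitor scores give loss 0.5 (on the decision boundary).
print(round(mce_loss(-98.0, -98.0), 3))            # 0.5
# With no discriminative counts, the mean stays at its previous value.
print(gt_mean_update([], [], mu_old=3.0, D=2.0))   # 3.0
```

Note how D plays the stabilizing role described for D_m: as D grows, the update is pulled toward mu_old, which is what makes the batch-mode GT iterations converge stably.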