Hidden Markov Models
Hidden Markov Model
 In some Markov processes, we may not be able to observe the states directly.
Hidden Markov Model
(Figure: a chain of hidden states X1 … Xt-1, Xt, Xt+1 … XT, where each hidden state emits an observation, giving the observation sequence e1 … et-1, et, et+1 … eT.)
 An HMM is a quintuple (S, E, P, A, B):
    S: {s1…sN} are the values for the hidden states
    E: {e1…eT} are the values for the observations
    P: probability distribution of the initial state
    A: transition probability matrix
    B: emission probability matrix
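For concreteness, the quintuple can be stored as plain arrays. The sketch below (Python/numpy) is only an illustration of the data layout; the two states, three observation symbols, and all the numbers are invented for the example.

```python
import numpy as np

# Hypothetical 2-state HMM with a 3-symbol observation alphabet.
P = np.array([0.6, 0.4])                  # P[i]    = P(X1 = s_i)  (initial distribution)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                # A[i, j] = P(X_{t+1} = s_j | X_t = s_i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])           # B[i, k] = P(e_t = o_k | X_t = s_i)

observations = [0, 2, 1, 2]               # an observation sequence e_1:T as symbol indices
```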
Inferences with HMM
 Filtering: P(xt|e1:t)
     Given an observation sequence, compute the probability of the last state.
 Decoding: argmax_{x1:t} P(x1:t|e1:t)
     Given an observation sequence, compute the most likely hidden state sequence.
 Learning: argmax_λ P_λ(e1:t), where λ = (P, A, B) are the parameters of the HMM
     Given an observation sequence, find the transition probability and emission probability tables that assign the observations the highest probability.
     Unsupervised learning
Filtering
P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)
 = P(et+1|Xt+1, e1:t) P(Xt+1|e1:t) / P(et+1|e1:t)
 = P(et+1|Xt+1) P(Xt+1|e1:t) / P(et+1|e1:t)

P(Xt+1|e1:t) = Σ_{xt} P(Xt+1|xt, e1:t) P(xt|e1:t) = Σ_{xt} P(Xt+1|xt) P(xt|e1:t)

P(xt|e1:t) has the same form, so filtering can be computed by recursion.
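As a concrete sketch of this recursion (not part of the slides), the Python below reuses the P, A, B arrays assumed earlier; the division by P(et+1|e1:t) shows up as a plain renormalization.

```python
import numpy as np

def filter_hmm(P, A, B, observations):
    """Return P(X_t | e_1:t) for every t, computed by the recursion above."""
    belief = P * B[:, observations[0]]          # condition the prior on e_1
    belief /= belief.sum()                      # divide by P(e_1)
    beliefs = [belief]
    for e in observations[1:]:
        predicted = belief @ A                  # sum_xt P(X_{t+1}|x_t) P(x_t|e_1:t)
        belief = predicted * B[:, e]            # multiply by P(e_{t+1}|X_{t+1})
        belief /= belief.sum()                  # i.e. divide by P(e_{t+1}|e_1:t)
        beliefs.append(belief)
    return np.array(beliefs)
```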
Filtering Example
 Viterbi Algorithm
Compute argmax_{x1:t} P(x1:t|e1:t)
   Since P(x1:t|e1:t) = P(x1:t, e1:t)/P(e1:t),
   and P(e1:t) remains constant when we consider different x1:t,
   argmax_{x1:t} P(x1:t|e1:t) = argmax_{x1:t} P(x1:t, e1:t)
Since the Markov chain is a Bayes net,
   P(x1:t, e1:t) = P(x0) Π_{i=1..t} P(xi|xi-1) P(ei|xi)
Equivalently, minimize –log P(x1:t, e1:t)
   = –log P(x0) + Σ_{i=1..t} (–log P(xi|xi-1) – log P(ei|xi))
Viterbi Algorithm
 Given an HMM (S, E, P, A, B) and observations o1:t, construct a graph that consists of 1 + tN nodes:
    One initial node
    N nodes at each time i; the jth node at time i represents Xi = sj.
    The link between the nodes Xi-1 = sj and Xi = sk has length
            –log [P(Xi = sk | Xi-1 = sj) P(ei | Xi = sk)]
The problem of finding argmax_{x1:t} P(x1:t|e1:t) then becomes that of finding the shortest path from the initial node x0 to one of the nodes at time t.
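A minimal Python sketch of this shortest-path formulation, again assuming the P, A, B arrays from the earlier example; the edge lengths are exactly the –log terms defined above.

```python
import numpy as np

def viterbi(P, A, B, observations):
    """Return argmax_{x_1:t} P(x_1:t | e_1:t) as a list of state indices."""
    T, N = len(observations), len(P)
    with np.errstate(divide="ignore"):                  # allow log(0) = -inf
        logP, logA, logB = np.log(P), np.log(A), np.log(B)

    cost = np.zeros((T, N))              # cost[t, j]: shortest path length to X_t = s_j
    back = np.zeros((T, N), dtype=int)   # back[t, j]: best predecessor of X_t = s_j
    cost[0] = -(logP + logB[:, observations[0]])
    for t in range(1, T):
        # edge length from s_i to s_j: -log [ P(s_j | s_i) P(e_t | s_j) ]
        step = cost[t - 1][:, None] - logA - logB[:, observations[t]][None, :]
        back[t] = step.argmin(axis=0)
        cost[t] = step.min(axis=0)

    path = [int(cost[-1].argmin())]      # best final state
    for t in range(T - 1, 0, -1):        # follow the back pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```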
Example
Baum-Welch Algorithm
 The previous two kinds of computation need the parameters λ = (P, A, B). Where do the probabilities come from?
 Relative frequency?
     But the states are not observable!
 Solution: the Baum-Welch algorithm
     Unsupervised learning from observations
     Find argmax_λ P_λ(e1:t)
Baum-Welch Algorithm
 Start with an initial set of parameters λ0
    Possibly arbitrary
 Compute pseudo counts
    How many times did the transition from Xi-1 = sj to Xi = sk occur?
 Use the pseudo counts to obtain another (better) set of parameters λ1
 Iterate until the likelihood P_λ(e1:t) stops increasing
 A special case of EM (Expectation-Maximization)
Pseudo Counts
 Given the observation sequence e1:T, the pseudo count of the link from Xt = si to Xt+1 = sj is the probability P(Xt = si, Xt+1 = sj | e1:T)

(Figure: a link from the node Xt = si to the node Xt+1 = sj.)
Update HMM Parameters
 Add P(Xt=si, Xt+1=sj | e1:T) to count(i, j)
 Add P(Xt=si | e1:T) to count(i)
 Add P(Xt=si | e1:T) to count(i, et)
 Updated a_ij = count(i, j)/count(i)
 Updated b_j(et) = count(j, et)/count(j)
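Once the pseudo counts are available, the update is just a normalization. The sketch below is illustrative only: it assumes posterior arrays gamma[t, i] = P(Xt=si|e1:T) and xi[t, i, j] = P(Xt=si, Xt+1=sj|e1:T), which the following slides show how to compute.

```python
import numpy as np

def update_parameters(gamma, xi, observations, num_symbols):
    """M-step: turn pseudo counts into updated transition and emission matrices."""
    count_ij = xi.sum(axis=0)                       # expected number of i -> j transitions
    count_i = gamma[:-1].sum(axis=0)                # expected number of times state i is left
    new_A = count_ij / count_i[:, None]             # a_ij = count(i, j) / count(i)

    N = gamma.shape[1]
    count_je = np.zeros((N, num_symbols))           # expected emission counts
    for t, e in enumerate(observations):
        count_je[:, e] += gamma[t]                  # add P(X_t = s_j | e_1:T) to count(j, e_t)
    new_B = count_je / gamma.sum(axis=0)[:, None]   # b_j(e) = count(j, e) / count(j)
    return new_A, new_B
```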
P(Xt=si,Xt+1=sj|e1:T)
= P(Xt=si, Xt+1=sj, e1:t, et+1, et+2:T) / P(e1:T)
= P(Xt=si, e1:t) P(Xt+1=sj | Xt=si) P(et+1 | Xt+1=sj) P(et+2:T | Xt+1=sj) / P(e1:T)
= P(Xt=si, e1:t) a_ij b_j(et+1) P(et+2:T | Xt+1=sj) / P(e1:T)
= α_i(t) a_ij b_j(et+1) β_j(t+1) / P(e1:T)
Forward Probability

α_i(t) = P(e1 … et, xt = si)

α_j(t+1) = Σ_{i=1..N} P(e1 … et, xt = si) P(xt+1 = sj | xt = si) P(et+1 | xt+1 = sj)
         = Σ_{i=1..N} α_i(t) a_ij b_j(et+1)
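In code the forward recursion is a single loop; this sketch (not from the slides) keeps the numpy conventions of the earlier examples and returns the whole α table.

```python
import numpy as np

def forward(P, A, B, observations):
    """alpha[t, i] = P(e_1 .. e_t, X_t = s_i); note P(e_1:T) = alpha[-1].sum()."""
    T, N = len(observations), len(P)
    alpha = np.zeros((T, N))
    alpha[0] = P * B[:, observations[0]]        # alpha_i(1) = P(X_1 = s_i) b_i(e_1)
    for t in range(1, T):
        # alpha_j(t+1) = sum_i alpha_i(t) a_ij b_j(e_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, observations[t]]
    return alpha
```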
Backward Probability
β_i(t) = P(et+1 … eT | xt = si)
β_i(T) = 1
β_i(t) = Σ_{j=1..N} a_ij b_j(et+1) β_j(t+1)

(Figure: at time t, α_i(t) summarizes the observations up to t at the node Xt = si; the edge to Xt+1 = sj carries weight a_ij b_j(et+1), and β_j(t+1) covers the observations from t+2 onward.)
P(Xt=si|e1:T)
= P(Xt=si, e1:t, et+1:T) / P(e1:T)
= P(et+1:T | Xt=si, e1:t) P(Xt=si, e1:t) / P(e1:T)
= P(et+1:T | Xt=si) P(Xt=si, e1:t) / P(e1:T)
= α_i(t) β_i(t) / P(e1:T)
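Putting α, β, γ, and the pseudo counts ξ together gives the E-step of Baum-Welch. The routine below is a compact illustrative sketch under the same numpy conventions; the names and array shapes are my own choices, not the lecture's.

```python
import numpy as np

def forward_backward(P, A, B, obs):
    """Return gamma[t, i] = P(X_t=s_i | e_1:T) and xi[t, i, j] = P(X_t=s_i, X_{t+1}=s_j | e_1:T)."""
    T, N = len(obs), len(P)

    alpha = np.zeros((T, N))                     # alpha_i(t) = P(e_1..e_t, X_t = s_i)
    alpha[0] = P * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    beta = np.ones((T, N))                       # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_ij b_j(e_{t+1}) beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    evidence = alpha[-1].sum()                   # P(e_1:T)
    gamma = alpha * beta / evidence              # alpha_i(t) beta_i(t) / P(e_1:T)
    # xi[t, i, j] = alpha_i(t) a_ij b_j(e_{t+1}) beta_j(t+1) / P(e_1:T)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence
    return gamma, xi
```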
Speech Recognition
Phones
Speech Signal
 Waveform (figure)
 Spectrogram (figure)
Feature Extraction



(Figure: the signal is divided into successive overlapping frames, Frame 1, Frame 2, …, and each frame is converted into a feature vector X1, X2, ….)
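A toy framing routine, just to make the figure concrete; the 25 ms window, 10 ms hop, and 16 kHz sampling rate are assumptions, not values given in the slides, and the spectral analysis that turns a frame into a feature vector is omitted.

```python
import numpy as np

def split_into_frames(signal, frame_len=400, hop=160):
    """Cut a waveform into overlapping frames (Frame 1, Frame 2, ...)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

frames = split_into_frames(np.zeros(16000))   # 1 s of a dummy signal -> 98 frames of 400 samples
```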
Speech System Architecture
(Figure: speech input → acoustic analysis → feature vectors x1 … xT → global search, which maximizes P(x1 … xT | w1 … wk) · P(w1 … wk) over word sequences w1 … wk, drawing on the phoneme inventory, the pronunciation lexicon, and the language model → recognized word sequence.)
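The global search can be read as scoring each candidate word sequence by log P(x1:T | w1:k) + log P(w1:k). The candidate list and all scores below are invented purely for illustration; a real recognizer gets the acoustic term from the word HMMs and the prior from the language model.

```python
# Hypothetical candidates with made-up log scores.
candidates = {
    "I went to a party":      {"log_p_x_given_w": -120.0, "log_p_w": -9.2},
    "Eye went two a bar tea": {"log_p_x_given_w": -118.5, "log_p_w": -21.7},
}

def recognize(candidates):
    # Maximize P(x_1:T | w_1:k) * P(w_1:k), i.e. the sum of the log scores.
    return max(candidates,
               key=lambda w: candidates[w]["log_p_x_given_w"] + candidates[w]["log_p_w"])

print(recognize(candidates))   # the language model prior favors "I went to a party"
```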
HMM for Speech Recognition
(Figure: word model for the word "need" with states start0, n1, iy2, d3, end4; transitions a01, a12, a23, a34, self-loops a11, a22, a33, and a skip arc a24. Each emitting state has output probabilities bi(ot), which score the observation sequence o1 o2 o3 o4 o5 o6.)
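To make the word model concrete, the sketch below builds a transition matrix with the structure shown above (left-to-right with self-loops and the a24 skip arc). The probability values are placeholders, not trained numbers.

```python
import numpy as np

states = ["start0", "n1", "iy2", "d3", "end4"]   # word model for "need"
A = np.zeros((5, 5))
A[0, 1] = 1.0                                  # a01: enter the first phone state
A[1, 1], A[1, 2] = 0.6, 0.4                    # a11 (self-loop), a12
A[2, 2], A[2, 3], A[2, 4] = 0.5, 0.4, 0.1      # a22, a23, and the skip arc a24
A[3, 3], A[3, 4] = 0.7, 0.3                    # a33, a34
# Each emitting state i also needs an output distribution b_i(o_t) over acoustic
# feature vectors, which scores the observed frames o1 ... o6 in the figure.
```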
Language Modeling
 Goal: determine which sequence of words is more likely:
      I went to a party
      Eye went two a bar tea

• Rudolph the red nose reindeer.
• Rudolph the Red knows rain, dear.
• Rudolph the Red Nose reigned here.
Summary
 HMM
    Filtering
    Decoding
    Learning
 Speech Recognition
    Feature extraction from signal
    HMM for speech recognition

				