Hidden Markov Models Theory
By Johan Walters (SR 2003)

Topics overview
- HMMs as part of speech recognition
- Input / output
- Basics
- A simple HMM
- Formal definition
- Markov assumption
- HMM program
- The evaluation problem
- The decoding problem
- The learning problem
- HMMs in speech recognition

Input / Output
An HMM is a statistical model that describes a probability distribution over a number of possible sequences.
- Input: a sequence of feature vectors
- Output: the words with the highest probability of being spoken
Given a sequence of feature vectors, which words are most probably meant?

Basics
An HMM is built from:
- States
- State transition probabilities
- Symbol emission probabilities

A simple HMM
States {1, 2, 3}, output alphabet O = {o_1, o_2, ..., o_M} = {up, down, unchanged}, and a state transition probability matrix A = {a_{ij}} with a_{ij} = P(s_t = j | s_{t-1} = i):

        0.6  0.2  0.2
    A = 0.5  0.3  0.2
        0.4  0.1  0.5

The initial state distribution is \pi_i = P(s_1 = i), here \pi = (0.5, 0.2, 0.3).

Formal definition HMM
- An output observation alphabet O = {o_1, o_2, ..., o_M}
- A set of states {1, 2, ..., N}
- A transition probability matrix A = {a_{ij}}, where a_{ij} = P(s_t = j | s_{t-1} = i)
- An output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | s_t = i)
- An initial state distribution \pi, where \pi_i = P(s_1 = i)
Formal notation for the whole parameter set: \Phi = (A, B, \pi)

Assumptions
- Markov assumption
- Output independence assumption: the probability that a particular symbol is emitted at time t depends only on the state s_t, not on earlier states or symbols
Both assumptions are made for ease of use and have no significant adverse effect.

Markov assumption
"The probability of the random variable at a given time depends only on its value at the preceding time": P(X_i | X_{i-1}). Applied to the chain rule, this gives

    P(X_1, ..., X_N) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) ... P(X_N | X_1, ..., X_{N-1})
                     = P(X_1) P(X_2 | X_1) P(X_3 | X_2) ... P(X_N | X_{N-1})
                     = P(X_1) \prod_{i=2}^{N} P(X_i | X_{i-1})

HMM program

    t := 1
    Start in state s_i with probability \pi_i (i.e., X_1 = i)
    Forever do
        Emit observation symbol o_t = o_k with probability b_i(k)
        Move from state s_i to state s_j with probability a_{ij} (i.e., X_{t+1} = j)
        i := j;  t := t + 1
    end

A symbol sequence (the observations) is generated by starting at an initial state and moving from state to state until a terminal state is reached. The state sequence is "hidden": only the symbol sequence that the hidden states emit is observable.

HMM
[Figure: a left-to-right HMM with states s_1, s_2, s_3 modeling the phones ㅕ, ㄹ, ㅓ; each state j emits feature vectors x_1, x_2, ..., x_t with probability b_j(x_t); the feature vectors are extracted frame by frame (with a fixed frame shift) from the speech signal over time.]

Problems
- The evaluation problem: given the observation sequence O and the model \Phi, how do we efficiently compute P(O | \Phi), the probability of the observation sequence given the model?
- The decoding problem: finding the sequence of hidden states that most probably generated an observed sequence.
- The learning problem: how can we adjust the model parameters \Phi to maximize the joint probability (likelihood)?

How to evaluate an HMM
Given multiple HMMs (one for each word) and an observation sequence: which HMM most probably generated the sequence?
Simple (expensive) solution:
- Enumerate all possible state sequences S of length T
- Sum the probabilities of all these sequences
- The probability of each path S is its state sequence probability times its joint output probability
The forward algorithm computes the same quantity much more efficiently, with complexity O(N^2 T), by reusing partially computed probabilities recursively (a sketch follows after the figure below).

How to evaluate an HMM (2)
[Figure: block diagram of isolated-word recognition. Speech passes through feature extraction; for each word v = 1, ..., V, a likelihood computation evaluates P(X | \lambda_v) against that word's HMM (e.g., \lambda_1 is the HMM for the word "Seoul"); selecting the maximum likelihood yields the recognized word.]
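To make the forward recursion concrete, here is a minimal pure-Python sketch. The transition matrix A and initial distribution pi are taken from the simple three-state example above; the emission matrix B and the encoding of the observation symbols (up = 0, down = 1, unchanged = 2) are assumed values added purely for illustration.

    # Hypothetical three-state model; A and pi follow the example above,
    # B is made up for illustration.
    A  = [[0.6, 0.2, 0.2],   # A[i][j] = P(s_t = j | s_{t-1} = i)
          [0.5, 0.3, 0.2],
          [0.4, 0.1, 0.5]]
    B  = [[0.7, 0.1, 0.2],   # B[i][k] = P(X_t = o_k | s_t = i), assumed values
          [0.1, 0.6, 0.3],
          [0.3, 0.3, 0.4]]
    pi = [0.5, 0.2, 0.3]     # pi[i] = P(s_1 = i)

    def forward(obs, A, B, pi):
        """P(O | model), summing over all state paths in O(N^2 T) time."""
        N = len(pi)
        # alpha[i] = P(o_1 .. o_t, s_t = i): probability of the partial
        # observation sequence seen so far, ending in state i at time t.
        alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
        for o in obs[1:]:
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                     for j in range(N)]
        return sum(alpha)

    # "up, down, up" encoded as alphabet indices.
    print(forward([0, 1, 0], A, B, pi))

The saving over path enumeration is that alpha at time t summarizes all state paths of length t in just N numbers, which is where the O(N^2 T) complexity comes from.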
How to decode an HMM
- The forward algorithm does not find the best state sequence (the "best path")
- An exhaustive search for the best path is expensive
- The Viterbi algorithm is used instead:
  - It also reuses partially computed results recursively
  - The partially computed results are the best paths so far
  - Each calculated state remembers the optimal previous state that led to it
  - Complexity: O(N^2 T)
- Finding the best path is very important for continuous speech recognition
(A minimal Viterbi sketch is given at the end of this document.)

How to estimate HMM parameters (learning)
The Baum-Welch (or forward-backward) algorithm estimates the model parameters \Phi = (A, B, \pi):
- First make an initial guess of the parameters (which may well be entirely wrong)
- Refine it by assessing its worth and attempting to reduce the errors it provokes when fitted to the given data
- This is a form of iterative hill climbing (Baum-Welch is an instance of the EM algorithm): each re-estimation is guaranteed not to decrease the likelihood, so it converges to a local optimum
- It combines a forward probability term and a backward probability term
- It is similar to the forward and Viterbi algorithms (recursive use of partial results), but more complex
- Learning is unsupervised at the state level: sample speech data is fed in along with the phonemes of the spoken words, but no state-by-state alignment is given
(A sketch of one re-estimation step is given at the end of this document.)

How to estimate HMM parameters (learning) (2)
[Figure: block diagram of HMM training. Waveforms from a speech database pass through feature extraction; Baum-Welch re-estimation updates the word HMMs \lambda_1, \lambda_2, ..., \lambda_7 for the words "il", "i", "chil" (the Korean numerals 1, 2, and 7); if the estimates have not converged, the loop repeats, otherwise training ends.]
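For the decoding problem, here is a minimal Viterbi sketch in the same style, reusing the same hypothetical model arrays as the forward sketch. The backpointer lists implement the idea that each state remembers the optimal previous state that led to it.

    # Same hypothetical model as the forward sketch.
    A  = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
    B  = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
    pi = [0.5, 0.2, 0.3]

    def viterbi(obs, A, B, pi):
        """Best path probability and state sequence, in O(N^2 T) time."""
        N = len(pi)
        delta = [pi[i] * B[i][obs[0]] for i in range(N)]  # best score ending in i
        back = []                                         # backpointers per frame
        for o in obs[1:]:
            prev, new = [], []
            for j in range(N):
                # Like the forward step, but max over predecessors instead of sum.
                p, i = max((delta[i] * A[i][j], i) for i in range(N))
                new.append(p * B[j][o])
                prev.append(i)
            delta = new
            back.append(prev)
        # Pick the best final state and trace the backpointers.
        p, last = max((p, i) for i, p in enumerate(delta))
        path = [last]
        for prev in reversed(back):
            path.append(prev[path[-1]])
        path.reverse()
        return p, path

    print(viterbi([0, 1, 0], A, B, pi))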
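For the learning problem, here is a sketch of a single Baum-Welch re-estimation step over one observation sequence, again with the same hypothetical model. It shows the forward term, the backward term, and the re-estimation formulas in their bare form; a real trainer would add probability scaling for long sequences and accumulate counts over a whole speech database.

    def baum_welch_step(obs, A, B, pi):
        """One forward-backward re-estimation step; returns the new
        parameters and the likelihood under the old ones."""
        N, T = len(pi), len(obs)
        # Forward term: alpha[t][i] = P(o_1 .. o_t, s_t = i)
        alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
        for t in range(1, T):
            alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                          * B[j][obs[t]] for j in range(N)])
        # Backward term: beta[t][i] = P(o_{t+1} .. o_T | s_t = i)
        beta = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(N):
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in range(N))
        likelihood = sum(alpha[T - 1])
        # gamma[t][i] = P(s_t = i | O, model)
        gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
                 for t in range(T)]
        new_pi = gamma[0][:]
        # a_ij := expected transitions i -> j / expected times in state i
        new_A = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                      for t in range(T - 1)) / likelihood
                  / sum(gamma[t][i] for t in range(T - 1))
                  for j in range(N)] for i in range(N)]
        # b_i(k) := expected emissions of o_k from i / expected times in state i
        new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
                  / sum(gamma[t][i] for t in range(T))
                  for k in range(len(B[0]))] for i in range(N)]
        return new_A, new_B, new_pi, likelihood

    # Start from the same hypothetical guess and iterate a fixed number of
    # times; the training diagram above loops until convergence instead.
    A  = [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]]
    B  = [[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
    pi = [0.5, 0.2, 0.3]
    obs = [0, 1, 0, 2, 0]   # made-up observation sequence
    for _ in range(20):
        A, B, pi, L = baum_welch_step(obs, A, B, pi)
    print(L)                # likelihood of obs before the final step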