# Hidden Markov Models

```
Hidden Markov Models

Doug Van Nort
MUMT 611
Overview
•   Definition
•   Simple Example
•   3 Basic Problems
•   Forward Algorithm
•   Viterbi Algorithm
•   Baum - Welch Algorithm
•   Example Application
•   Conclusion
Definition

• Markov Model H = {A, B, P}, where…
Elements
• Set of N states {S1,…,SN}
• M distinct observation symbols per state, V = {v_1,…,v_M}
– Our “finite grammar”
– We assume discrete symbols here, but they could be continuous
• State transition probability matrix A
• Observation probability distribution B
– B = {b_j(k)}, with b_j(k) = P(O_t = v_k | q_t = S_j), 1 \le k \le M, 1 \le j \le N, where O_t and q_t represent the observation and state at time t respectively
– Again, could be a continuous pdf modeled by something like Gaussian mixtures
• Initial state distribution P = {p_i}, with p_i = P(q_1 = S_i), 1 \le i \le N
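Matrix of state transition probabilities
• A = {a_ij}, an N x N matrix
• Where a_ij = P(q_{t+1} = S_j | q_t = S_i), 1 \le i, j \le N, with a_ij \ge 0 and \sum_{j=1}^{N} a_ij = 1
[Figure: Markov chain with 5 states]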
Observable vs. Hidden
• Observable: the state is directly observed at each instant of time
– For example, the output at time t is the state itself: a 2-state heads/tails coin-toss model

• Hidden: the states must be inferred from the observations
– In other words, the observation is a probabilistic function of the state
Simple Example: Urn and Ball
• N urns sitting in a room
• Each urn contains balls of M distinct colors
• Magic genie selects an urn at random, based on some probability distribution
• Genie selects a ball randomly from this urn, tells us the color and puts it back
• The genie then moves on to the next urn based on a second probability distribution, which depends on the current urn, and repeats the process
Obvious Markov Model here:
• Each urn is a state
• Genie’s initial selection is based on initial
state probability, P
• Probability of selecting a certain color
determined by observation probability matrix,
B
• The likelihood of the “next” urn is determined
by the matrix of transition probabilities, A.
• At end we have observation sequence, for
example O = {red, blue, green, red, green,
magenta}
Where’s the genie?
• If the genie's location is known at each time instant t, then the model is observable
• Otherwise, this is a hidden model, and we can only infer the state at time t, given our string of observations and the known probabilities
Three Basic Problems for HMMs
• Problem 1:

Given observation sequence O = O_1 O_2 … O_T and Markov Model H = {A, B, P}, how do we (efficiently) compute P(O | H)?

- Given several candidate models, this can be used to choose the one that best explains the observations

• Problem 2:

Given observation sequence O = O_1 O_2 … O_T and Markov Model H = {A, B, P}, find the optimal state sequence q = q_1 … q_T

- The optimality criterion needs to be determined
- The interest is in finding the “correct” state sequence

• Problem 3:

Given observation sequence O = O_1 O_2 … O_T, estimate the parameters of the model H = {A, B, P} that maximize P(O | H)

- The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon
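Problem 1: Compute P(O | H)

• For a given state sequence q = {q_1,…,q_T} we have
P(O | q, H) = \prod_{t=1}^{T} b_{q_t}(O_t) = b_{q_1}(O_1) b_{q_2}(O_2) … b_{q_T}(O_T)

• The probability of sequence q occurring is
P(q | H) = p_{q_1} a_{q_1 q_2} a_{q_2 q_3} … a_{q_{T-1} q_T}

• Joint probability of O and q is the product of the two:
P(O, q | H) = P(O | q, H) P(q | H)

• Probability of O is the sum of P(O, q | H) over the set Q of all possible state sequences:
P(O | H) = \sum_{q \in Q} P(O | q, H) P(q | H)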
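No Good
• Computation for this direct method is O(2T \cdot N^T)
– For example, N = 5 and T = 100 gives about 2 \cdot 100 \cdot 5^{100} \approx 10^{72} operations

• Not feasible even for small values of N and T

• Need to find an efficient way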
Problem 1: Compute P(O | H)
Efficient solution: the forward algorithm
The Forward Algorithm
• Let f_t(i) = P(O_1 … O_t, q_t = S_i | H)

• Initialization:
f_1(i) = p_i b_i(O_1), 1 \le i \le N

• Induction:
f_{t+1}(j) = [ \sum_{i=1}^{N} f_t(i) a_ij ] b_j(O_{t+1}), 1 \le t \le T-1, 1 \le j \le N

• Finally:
P(O | H) = \sum_{i=1}^{N} f_T(i)

• Requires O(N^2 T) calculations
– Much less than the direct method
Problem 2: Given O, H, find “optimal” q
• Of course, this depends on the optimality criterion
• Several likely candidates:
– Maximize the expected number of correct individual states
• Does not consider transitions, so it may yield an illegal sequence (one containing a transition with a_ij = 0)
– Maximize the expected number of correct pairs, triples, etc. of states
– Find the single best state sequence
• i.e. maximize P(q | O, H), which is equivalent to maximizing P(q, O | H)
• This is the most common criterion, and it is solved via the Viterbi algorithm
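Prob 2 solution: Viterbi Algorithm
• Define:
\delta_t(i) = \max_{q_1,…,q_{t-1}} P(q_1 … q_{t-1}, q_t = S_i, O_1 … O_t | H)

- Highest probability of a single path at time t ending in state S_i

• Inductively speaking:
\delta_{t+1}(j) = [ \max_{1 \le i \le N} \delta_t(i) a_ij ] b_j(O_{t+1})

• Need to keep track of the argument which maximizes our delta function for each time t and state j
– We use array r_t(j)

Now:
• Initialization:
\delta_1(i) = p_i b_i(O_1), 1 \le i \le N
r_1(i) = 0, 1 \le i \le N

• Recursion:
\delta_t(j) = \max_{1 \le i \le N} [ \delta_{t-1}(i) a_ij ] b_j(O_t), 2 \le t \le T, 1 \le j \le N
r_t(j) = \arg\max_{1 \le i \le N} [ \delta_{t-1}(i) a_ij ], 2 \le t \le T, 1 \le j \le N

• At end, we have the final probability and the end state:
P* = \max_{1 \le i \le N} \delta_T(i)
q*_T = \arg\max_{1 \le i \le N} \delta_T(i)

• Backtrack to get the entire path:
q*_t = r_{t+1}(q*_{t+1}), t = T-1, T-2,…, 1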
Problem 3: Given O, estimate parameters for H
to maximize P(O|H)
• No known way to analytically maximize
P(O | H), or to solve for optimal
parameters

• Can locally maximize P(O | H) with
Baum - Welch Algorithm
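Solution to 3: Baum - Welch Algorithm
• Derivation is quite lengthy and beyond our time frame
• Iteratively re-estimates A, B and P from O; an iteration never decreases P(O | H), so it converges to a local maximum
• Baum - Welch is an instance of the EM (expectation-maximization) algorithm; other solutions to 3, such as gradient techniques, are also used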
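Ergodic vs. Left-to-Right
• Ergodic model: every state can be reached from every other state (in the fully connected case, a_ij > 0 for all i, j)

• Left-to-Right model: the state index never decreases over time, i.e. a_ij = 0 for j < i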
Variations on HMM
• Null transition
– Transition between states that produces no output
– For ex: to model alternate word pronunciations
• Tied Parameters
– Set up equivalence relation between parameters
– For ex: between observation prob of 2 states
which have same B
– Reduces size of model, and makes prob 3 easier
• State duration density
– Inherent probability of staying in state S_i for exactly d consecutive time steps is (a_ii)^{d-1} (1 - a_ii)
– This geometric decay may not be appropriate for physical signals, and so an explicit state duration probability density is introduced
Issues with HMM implementation
• Scaling
– The forward variable is a product of many terms smaller than 1, so it underflows machine precision as t grows; we therefore rescale it at each time step
• Multiple observation sequences
– In a left-to-right model, only a small number of observations is available for each state, so several sequences are required for parameter estimation (prob 3)
• Initial estimate
– Random or uniform initial estimates are fine for P and A, but B is sensitive to the initial estimate
– Again, this is an issue for problem 3
• Insufficient training data
– For ex: not enough occurrences of different events in a given state
– Possible solution: reduce the model to a smaller one for which enough data exists, and linearly interpolate between the parameters of the two models
• Interpolation weights are a function of the amount of training data
– Alternatively, could impose a lower bound on individual observation probabilities
• Model choice
– Ergodic vs. LTR (or other), Continuous vs. discrete
observation densities, number of states, etc.
Markov Processes Used in
Composition
•   Xenakis
•   Tenney
•   Hiller
•   Charles Ames
– Student of Hiller
• Many others since
Example Application
• Isolated word recognition (Rabiner)
– Each word v modeled as distinct HMM Hv
– Training set of k occurrences per word O1,…,Ok
• Each of which is an observation sequence
– Need to:
• Estimate parameters for each Hv that maximize P(O1,…,Ok | Hv) (i.e. prob 3)
• Extract features O = (O1,…,OT) from the unknown word
• Calculate P(O | Hv) for all v (prob 1), and choose the v that maximizes it
Make Observation
• Feature extraction: at each frame, cepstral
coefficients and their derivatives are taken
• Vector Quantization: observed frame is
mapped to possible observation (codebook
entry) via nearest neighbor
– Assuming discrete observation probability
– Codebook entries estimated by segmenting
training data, and taking centroid of all frame
vectors for each segment.
• A la k-means clustering
Choice of Model and Parameters
• Left-to-Right model more appropriate
– Thus we have P(q1 = S1) = 1
• Choice of states - two ideas:
– Let state correspond to phoneme
– Let state correspond to analysis frame
• Update model parameters:
– Segment training data into states based on current
model using Viterbi algorithm (prob 2)
– Update A,B probabilities based on observed data
• Ex: b_j(k) = (number of observed vectors in state j whose nearest codebook entry is v_k) divided by (total number of observed vectors in state j)
State Duration Density
• If phoneme segmentation used, it may
be advantageous to determine a state
duration density
– Variable state length for each phoneme
Pyramid of death
Conclusion
– Has contributed quite a bit to speech recognition
– With algorithms we have described, computation
is reasonable
– Complex processes can be modeled with low-
dimensional data
– Works well for time varying classification
• other examples: gesture recognition, formant tracking
• Limitations
– Assumption that successive observations are independent given the state sequence
– First order assumption: the probability of the state at time t depends only on the state at time t-1
– Need to be “tailor made” for specific application
– Needs lots of training data, in order to see all observations

```
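The forward and Viterbi recursions from the slides can be sketched in plain Python as below. This is a minimal illustration, not code from the presentation: the function names, the two-urn model and all of its probabilities are made-up assumptions, and observations are assumed to be integer symbol indices (0 = red, 1 = blue, 2 = green). A brute-force sum over all N^T state sequences is included only to cross-check the forward result on a tiny example.

```python
from itertools import product


def forward(A, B, p, obs):
    """P(O | H) via the forward recursion, O(N^2 T) operations."""
    N = len(p)
    # Initialization: f_1(i) = p_i * b_i(O_1)
    f = [p[i] * B[i][obs[0]] for i in range(N)]
    # Induction: f_{t+1}(j) = (sum_i f_t(i) * a_ij) * b_j(O_{t+1})
    for o in obs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(N)) * B[j][o] for j in range(N)]
    # Termination: P(O | H) = sum_i f_T(i)
    return sum(f)


def brute_force(A, B, p, obs):
    """P(O | H) by summing P(O, q | H) over every state sequence q."""
    N, T = len(p), len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):
        prob = p[q[0]] * B[q[0]][obs[0]]
        for t in range(1, T):
            prob *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += prob
    return total


def viterbi(A, B, p, obs):
    """Single best state sequence and its joint probability P(q, O | H)."""
    N = len(p)
    # Initialization: delta_1(i) = p_i * b_i(O_1)
    delta = [p[i] * B[i][obs[0]] for i in range(N)]
    back = []  # one list of argmax predecessors per time step t = 2..T
    for o in obs[1:]:
        # r_t(j): best predecessor of state j
        r = [max(range(N), key=lambda i, j=j: delta[i] * A[i][j]) for j in range(N)]
        delta = [delta[r[j]] * A[r[j]][j] * B[j][o] for j in range(N)]
        back.append(r)
    # Termination, then backtracking from the best end state
    last = max(range(N), key=lambda i: delta[i])
    path = [last]
    for r in reversed(back):
        path.append(r[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical two-urn, three-color model (all numbers are illustrative).
A = [[0.7, 0.3], [0.4, 0.6]]            # a_ij: urn i -> urn j
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # b_j(k): color k drawn from urn j
p = [0.6, 0.4]                          # initial urn distribution
obs = [0, 1, 2]                         # red, blue, green

if __name__ == "__main__":
    print("forward:    ", forward(A, B, p, obs))
    print("brute force:", brute_force(A, B, p, obs))
    print("viterbi:    ", viterbi(A, B, p, obs))
```

No scaling is applied in this sketch, so for long observation sequences the probabilities would underflow; the scaling issue noted in the slides still applies.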