Segmental HMMs: Modelling
Dynamics and Underlying Structure
for Automatic Speech Recognition
20/20 Speech Limited, UK
A DERA/NXT Joint Venture
• Hidden Markov models (HMMs): advantages and limitations
• Overcoming limitations with segment-based HMMs
• Modelling trajectories of acoustic features
• Theory of trajectory-based segmental HMMs
• Experimental investigations: comparing performance of
different segmental HMMs
• Choice of parameters for trajectory modelling: recognition
using formant trajectories
• A “unified” model for both recognition and synthesis
• Challenges and further issues
Typical speech spectral characteristics
[Spectrogram with phone labels: s i k s th r ee o ne ("six three one")]
• Each sound has particular spectral characteristics.
• Characteristics change continuously with time.
• Patterns of change give cues to phone identity.
• Spectrum includes speaker identity information.
Useful properties of HMMs
1. Appropriate general structure
• Underlying Markov process allows for time-varying
nature of utterances.
• Probability distributions associated with states
represent short-term spectral variability.
• Can incorporate speech knowledge - e.g. context-
dependent models, choice of features.
2. Tractable mathematical framework
• Algorithms for automatically training model
parameters from natural speech data.
• Straightforward recognition algorithms.
Modelling observations with an HMM
[Diagram: model states vs. time t, one observation vector generated per frame]
Conventional HMM assumptions
• Piece-wise stationarity
Assume speech produced by a piece-wise stationary
process with instantaneous transitions between states.
• Independence Assumption
Probability of an acoustic vector given a model state
depends ONLY on the vector and the state.
Assume no dependency of observations, other
than through the state sequence.
• Duration model
State duration conforms to geometric pdf (given by
self-loop transition probability).
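The implicit state-duration model can be sketched as follows (a minimal illustration, not from the slides; `a` denotes the self-loop transition probability):

```python
# With self-loop probability a, the chance of staying exactly d frames
# in a state is geometric: P(d) = a**(d-1) * (1 - a).
def geometric_duration_pmf(a, d):
    """Probability of occupying a state for exactly d frames (d >= 1)."""
    return (a ** (d - 1)) * (1.0 - a)

# The mode is always d = 1, so the shortest duration is always the most
# likely - one reason the geometric model is a poor fit for phone durations.
probs = [geometric_duration_pmf(0.8, d) for d in range(1, 6)]
```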
Limitations of HMM assumptions
• Speech production is not a piece-wise stationary
process, but a continuous one.
• Changes are mostly smoothly time varying.
• Constraints of articulation are such that any one
frame of speech is highly correlated with
previous and following frames.
• Time derivatives capture correlation to some
extent - but not within the model.
• Long-term correlations, e.g. speaker identity.
• Speech sounds have a typical duration, with
shorter and longer durations being less likely, and
limitations on maximum duration.
Addressing HMM limitations
AIMS WERE TO:
• retain advantages of HMMs:
– automatic and tractable algorithms for training models on large
quantities of speech data;
– manageable recognition algorithms (principle of dynamic programming);
• improve the underlying model structure to address
HMM shortcomings as models of speech.
ACHIEVING THE AIMS:
• Associate states with sequences of feature vectors
=> SEGMENTAL HMMS
Modelling observations with Segmental HMMs
[Diagram: model states vs. time, each state generating a variable-length segment (e.g. durations d=3, d=2, d=5)]
• Associate states with sequences of feature
vectors, where these sequences can vary in duration.
• Each state is associated with a meaningful acoustic-
phonetic event (a phone or part of a phone).
• Can easily incorporate realistic duration model.
• Enable relationship between frames comprising a
segment to be modelled explicitly.
• Characterize dynamic behaviour during a segment.
Recognition calculations with HMMs
• Compute most likely path through model (or sequence of models).
• Evaluate efficiently using dynamic programming (Viterbi algorithm).
• To compute probability of emitting observations up to a
given frame time, for any one state need only consider
states which could be occupied at previous frame.
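The recursion described above can be sketched as a minimal log-domain Viterbi implementation (illustrative, not from the slides; `log_trans`, `log_emit` and `log_init` are assumed to be fully specified log-probability tables):

```python
import numpy as np

# log_trans[i, j] = log P(state j at t | state i at t-1)
# log_emit[t, j]  = log P(observation at frame t | state j)
# log_init[j]     = log P(state j at t=0)
def viterbi(log_trans, log_emit, log_init):
    T, N = log_emit.shape
    delta = log_init + log_emit[0]          # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        # for each state, need only consider states occupied at frame t-1
        scores = delta[:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    # trace back the most likely state sequence
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())
```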
[Trellis diagram: state-time lattice over frames 1 to 7]
Segmental HMM recognition calculation
• Principle of dynamic programming still applies.
• BUT, is more complex and computationally intensive.
• For the probability in any one state at any given frame time t:
– assume that frame represents the last frame of a segment;
– consider all possible segment durations from 1 to some maximum D;
– therefore, must consider all possible previous states at
all possible previous frame times from t-1 back to t-D.
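The segmental recursion can be sketched as below, showing the extra loop over durations (illustrative only; `seg_logprob` is an assumed caller-supplied scorer giving the log probability of frames t0..t1-1 as one segment of a state). The inner loops make this roughly a factor of D more work per frame than frame-level Viterbi:

```python
# obs_len: number of frames; D: maximum segment duration.
# best[t][s] = best log-prob of frames 0..t-1 with a segment of state s
#              ending at frame time t.
def segmental_viterbi(obs_len, n_states, seg_logprob, log_init, log_trans, D):
    NEG = float("-inf")
    best = [[NEG] * n_states for _ in range(obs_len + 1)]
    for t in range(1, obs_len + 1):
        for s in range(n_states):
            for d in range(1, min(D, t) + 1):      # all durations 1..D
                score = seg_logprob(s, t - d, t)
                if t - d == 0:                     # segment starts the utterance
                    cand = log_init[s] + score
                else:                              # all possible previous states
                    cand = max(best[t - d][sp] + log_trans[sp][s]
                               for sp in range(n_states)) + score
                best[t][s] = max(best[t][s], cand)
    return max(best[obs_len])
```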
[Trellis diagram: segmental state-time lattice over frames 1 to 7, with all durations considered]
Trajectory-based segmental HMMs
• Approximate relation between successive feature
vectors by some trajectory through feature space.
• Simple trajectory-based segmental HMM: associate a
state with a single mean trajectory, in place of (static)
single mean value used for a standard HMM.
Segmental HMM probability calculations
• Generate observations independently, but
conditioned on the trajectory.
• Aim to provide constraining model of dynamics
without requiring a complex model of correlations.
• BUT, trajectory may be different for different
utterances of the same sound.
• So, if a single trajectory is used to represent all
examples of a given model unit, will not be a very
accurate representation for any one example.
• One possible solution is a mixture of trajectories,
but this needs many components to capture all the variability.
Intra- and Extra-segmental variability
• Model feature dynamics across all segment examples
by, in effect, a continuous mixture of trajectories.
• This is achieved by modelling separately:
– extra-segmental variation (underlying trajectory)
– intra-segmental variation (about trajectory)
=> Probabilistic-trajectory segmental HMMs
Comparing different models
Generating a sequence of 5 observations
[Diagrams: standard HMM vs. segmental HMM vs. trajectory segmental HMM]
Probabilistic-trajectory segmental HMMs
• Parametric trajectory model and Gaussian distributions.
• Simple linear trajectory - characterized by mid-point c
and slope m.
• For illustration show with slope=0.
[Diagram: linear trajectory with intra-segment variability, over frames 1 to D]
PTSHMM probability (general)
• A segment of observations is y = y_0, ..., y_T.
• Probability of y and trajectory f given state S is:
P(y, f | S) = P(f | S) · ∏_{t=0..T} P(y_t | S, f_t)
Alternative segmental models:
1. Define trajectory; model variation in trajectory
2. Fix trajectory and model observations - the standard HMM is the
special case with no trajectory:
P(y | S) = ∏_{t=0..T} P(y_t | S)
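As an illustration of the first formula, the conditional segment likelihood for a fixed linear trajectory with Gaussian intra-segment variation might be computed as follows (a sketch; parameter names are illustrative, with the trajectory parameterised by mid-point c and slope m as in the linear model described later):

```python
import math

def log_gauss(x, mean, var):
    # log of a univariate Gaussian density
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def segment_loglik_fixed(y, m, c, sigma2):
    """log P(y | S, f(m, c)): observations generated independently,
    conditioned on the trajectory f(t) = c + m * (t - midpoint);
    sigma2 is the intra-segment variance."""
    T = len(y)
    mid = (T - 1) / 2.0
    return sum(log_gauss(y[t], c + m * (t - mid), sigma2)
               for t in range(T))
```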
Linear Gaussian PTSHMM
• Linear trajectory: slope m and mid-point c.
• Joint probability of y and linear trajectory is:
P(y, m, c | S) = P(m | S) · P(c | S) · P(y | S, f(m, c))
                  slope      mid-point   intra-segment
• Gaussian distributions for slope, mid-point and intra-segment variation.
• To use model in recognition, need to compute P(y|S).
• but values of trajectory parameters m and c are not
known - they are “hidden” from the observer.
Hidden-trajectory probability calculation
• One possibility: estimate the location of the trajectory,
and compute the probability for that trajectory.
• Used this approach in early work, but suffers problems
due to difficulty in making unbiased trajectory estimate.
• A better alternative is to allow for all possible locations
of the trajectory by integrating out the unknown trajectory parameters.
• In the case of the linear model, the calculation is:
P(y | S) = ∫∫ P(y, m, c | S) dm dc
         = ∫∫ P(m | S) · P(c | S) · ∏_t P(y_t | S, f_t(m, c)) dm dc
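The integral can be approximated numerically; the sketch below uses naive Monte-Carlo sampling from the Gaussian priors on m and c (illustrative only, with assumed parameter names; for the linear-Gaussian model a closed-form solution also exists, since the marginal of y is itself Gaussian with inflated covariance):

```python
import math, random

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ptshmm_loglik(y, m_mean, m_var, c_mean, c_var, sigma2, n_samples=20000):
    """Monte-Carlo estimate of log P(y | S): draw slope m and mid-point c
    from their Gaussian priors, average the conditional segment likelihood."""
    random.seed(0)
    T, mid = len(y), (len(y) - 1) / 2.0
    total = 0.0
    for _ in range(n_samples):
        m = random.gauss(m_mean, math.sqrt(m_var))
        c = random.gauss(c_mean, math.sqrt(c_var))
        ll = sum(log_gauss(y[t], c + m * (t - mid), sigma2) for t in range(T))
        total += math.exp(ll)           # acceptable for short segments
    return math.log(total / n_samples)  # estimate of log P(y | S)
```

With both prior variances set to zero this collapses to the fixed-trajectory likelihood, mirroring the special cases discussed on the next slide.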
Parameters of the linear PTSHMM
• Linear PTSHMM has five model parameters:
mid-point mean and variance,
slope mean and variance,
and intra-segment variance.
• Simpler models arise as special cases, by fixing various parameters:
• If trajectory slope is set to zero
=> “static” PTSHMM.
• If variability in the trajectory is prevented
=> “fixed-trajectory” SHMM.
• Fixed-trajectory static SHMM = standard HMM
with explicit duration model.
Digit recognition experiments
• Speaker-independent connected-digit recognition
• 8 mel cepstrum features + overall energy
• three-state monophone models
• Segmental HMM max. segment dur. 10 frames
(=> maximum phone duration = 300 ms).
• Compared probabilistic-trajectory SHMMs with fixed-
trajectory SHMMs and with standard HMMs.
• Initialised all SHMMs from segmented training data
(using HMM Viterbi alignment).
• Interested in acoustic-modelling aspects, so fixed all
transition and duration probabilities to be equal.
• 5 training iterations.
Digit recognition results: simple SHMMs
                         %Sub.  %Del.  %Ins.  %Err.
Standard HMM              6.2    1.5    0.9    8.6
Add duration constraint   5.2    0.7    0.7    6.6
Linear fixed trajectory   3.8    0.5    0.6    4.9
• Some benefit from simply imposing duration
constraints by introducing the segmental
structure (prevents “silly” segmentations).
• Further benefit from representing dynamics by
incorporating linear trajectory (one trajectory
per model state).
Digit recognition results: static SHMMs
                           %Sub.  %Del.  %Ins.  %Err.
Static fixed SHMM           5.2    0.2    0.7    6.6
Static probabilistic SHMM   5.2    2.2    0.1    7.5
• For static models, no advantage from distinguishing
between extra- and intra-segmental variability.
Digit recognition results: linear SHMMs
                                %Sub.  %Del.  %Ins.  %Err.
Static fixed SHMM                5.2    0.2    0.7    6.6
Linear fixed trajectory          3.8    0.5    0.6    4.9
Linear PTSHMM (slope var=0)      2.0    0.8    0.1    2.9
Linear PTSHMM (flexible slope)   4.9    4.0    0.1    9.0
• Some advantage for linear trajectory.
• Considerable further benefit from modelling
variability in mid-point.
• But modelling variability in both mid-point and slope is
detrimental to recognition performance.
Conclusions from digit experiments
Best trajectory model gives nearly 70% reduction
in error rate (2.9%) compared with standard HMMs
=> advantages from trajectory-based segmental HMM
which also incorporates distinction between intra- and
extra-segmental variability, but:
• Trajectory assumption must be reasonably accurate
(advantage for linear but not for static models).
• Not beneficial to model variability in slope parameter -
possibly too variable between speakers, or too
difficult to estimate reliably for short segments.
Phonetic classification: TIMIT
• Training and recognition with given segment boundaries.
• Train on complete training set (male speakers), with
classification on core test set.
• 12 mel cepstrum features + overall energy.
• Evaluated (constrained) linear PTSHMMs.
• Compared performance with standard-HMM baselines, varying:
– context-dependent (biphone) versus context-independent
– feature set using only the mel cepstrum features versus
one which also included time derivative features.
TIMIT classification results
Feature set    Model type  HMM %err.  PTSHMM %err.  % improvement of PTSHMM
mel-cepstrum   monophone     48.1        44.0               8.5
features only  biphone       43.0        38.2              11.1
include time   monophone     38.7        36.5               5.7
derivatives    biphone       29.4        26.8               8.8
• Improvement with linear PTSHMM is greatest for
more accurate (context-dependent) models.
=> more benefit from modelling trajectories when not
including different phonetic events in one model.
• Most advantage when not using delta features.
=> most benefit from modelling dynamics when not
attempting to represent dynamics in front-end.
Benefit of PTSHMMs for some
different phone classes
                                  No. examples  HMM %error  PTSHMM %error  % improvement
Fricatives (f v th dh s z sh hh)       710         41.7          38.9           6.8
Vowels (iy ih eh ae ah uw uh er)      1178         53.8          48.9           9.1
Semivowels and glides (l r y w)         97         39.2          33.2          15.4
Diphthongs (ey ay oy aw ow)            376         48.9          41.2          15.8
Stops (p t dx k b d g)                 566         56.7          54.8           3.4
Most benefit from linear PTSHMM for sounds
characterised by continuous smooth-changing dynamics.
Summary of findings
• Probabilistic-trajectory segmental HMMs can
outperform standard HMMs and fixed-trajectory
segmental HMMs.
• Separately modelling variability within/between
segments is a powerful approach, provided that:
– trajectory assumptions are appropriate (linear but not static models);
– variability in the parameter can be usefully modelled
(not useful to model variability in the slope parameter).
• The models have been shown to give useful performance improvements.
Issues of modelling speech dynamics
Compare error rates on TIMIT task:
• HMMs with time derivatives: 29.8%
• best segmental HMM result WITHOUT time derivatives: 38.2%
=> time derivatives capture some aspects of
dynamics not modelled in segmental HMMs.
• Time derivative features provide some
measure of dynamics for every frame.
• Current segmental HMMs only model dynamics
within a segment.
Modelling issues and questions (1)
• Choice of model unit (e.g. phone, diphone)
• How to model dynamics and continuity effects
across segment boundaries, to represent
dynamics throughout an utterance.
• How to model context effects. (e.g. could define
trajectories according to previous and following
sounds - but complicates search)
• How to define trajectories (e.g. linear or
higher-order polynomial, versus a dynamical-
system type model with filtered output).
Modelling issues and questions (2)
• Incorporating a realistic duration model.
• How to model any systematic effects of
duration on trajectory realisation - should
reduce remaining variability in trajectories.
• How to model speaker-dependent effects.
• How to deal with other systematic influences
- e.g. speaker stress, speaking rate.
• Dealing with external influences - e.g. noise.
• Choice of features for trajectory modelling.
Spectral representations (1)
• Typical wideband spectrogram - for display compute
spectrum at frequent time intervals (e.g. 2 ms)
[Wideband spectrogram with phone labels: th r ee s I x s I x ("three six six")]
• Typical features for ASR: mfccs computed from
FFT of 25 ms windows at 10 ms intervals:
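The front-end just described can be sketched as follows (a simplified mel-cepstrum computation; the filterbank size and other details are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, win_ms=25, hop_ms=10, n_mels=20, n_ceps=8):
    """25 ms windows every 10 ms -> FFT -> mel filterbank -> log -> DCT."""
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    nfft = 1 << (win - 1).bit_length()
    # triangular mel filterbank, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_mels + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hamming(win)
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2
        logmel = np.log(fbank @ power + 1e-10)
        # DCT-II of the log filterbank energies gives the mel cepstrum
        ceps = [np.sum(logmel * np.cos(np.pi * k * (np.arange(n_mels) + 0.5)
                                       / n_mels)) for k in range(n_ceps)]
        frames.append(ceps)
    return np.array(frames)
```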
Spectral representations (2)
• Using long windows at fixed positions blurs rapid
events - stop bursts and rapid formant transitions.
• An alternative: use a shorter window with “excitation-synchronous” analysis.
[Excitation-synchronous spectrogram with phone labels: th r ee s I x s I x ("three six six")]
• Compare with long fixed-window analysis:
Standard HMM digit recognition experiments
• Compared excitation-synchronous analysis with
fixed analysis for different window lengths.
• In all cases computed FFT then mel cepstrum.
• Shorter window gives lower frequency resolution,
but effect is not so great on mel scale.
• Best fixed-window condition 20 or 25 ms: 2.1% err.
(increased to 4.6% for a 5 ms window).
• Best synchronous-window condition 10 ms: 1.9% err.
(but only increased to 2.1% for a 5 ms window).
=> some advantage to capturing rapid events. But note
short window may be disadvantage for fricatives.
Maybe combine different analyses?
Moving beyond cepstrum trajectories
• Start with spectral analysis: this must preserve all the relevant information.
• But is it appropriate to then model trajectories
directly in the spectral/cepstral domain?
• Motivation for modelling dynamics is from nature
of articulation, and its acoustic consequences.
=> should be modelling in a domain closer to speech production.
• One possibility is an articulatory description.
• Another option is formants - closely related to
articulation but also to acoustics.
Problems with formant analysis
• Unambiguous formant labelling may not be possible
from a single spectral cross-section.
e.g. close formants may merge to give single spectral peak
• A formant may not be apparent in the spectrum.
e.g. formant is weakly excited (F1 in unvoiced sounds).
• NOT useful for certain distinctions, where low
amplitude is the main feature.
e.g. identifying silence or weak fricatives.
=> difficult to identify formants independently from
recognition process, so not generally used as features
for automatic speech recognition.
Estimating formant trajectories
[Spectrogram with estimated formant tracks and phone labels: s i k s th r ee o ne ("six three one")]
• Where there is clear formant structure, F1, F2 and F3 can be identified.
• In voiceless fricatives, higher formant movements are usually continuous
with those in adjacent vowels.
• For F1, arbitrarily connect between adjacent vowels.
Formant analysis method
John Holmes (Proc. EUROSPEECH’97)
• Aims to emulate human abilities:
– ability to label single spectrum cross-sections
– rely heavily on continuity over time
– sometimes need knowledge of what is being said to resolve ambiguities.
• Two fundamental features of the method:
– outputs alternatives when uncertain (“delayed decisions”).
– notion of “confidence” in formant measurement:
when formants cannot be estimated (e.g. during silence),
confidence is low and the estimate is not useful for recognition
=> rely on other features (general spectrum shape).
Example of formant analyser output
• Up to two sets of formants for each frame.
• Alternatives are in terms of sets - F1, F2, F3.
• Specified frame by frame, but are usually continuous over time.
Segmental HMM experiments
• Each segment model is associated with a linear trajectory.
• Model each phone by a sequence of one or more segments.
e.g. monophthongal vowels, fricatives - 1 segment
diphthongs - sequence of 2 segments
aspirated voiceless stops - sequence of 3 segments.
• Set allowed minimum and maximum segment duration
dependent on identity of phone segment (loose constraint).
• Incorporate confidence estimate (as a variance) in the probability calculation.
• Resolve formant alternatives based on probability.
• Use formants + low-order cepstrum features.
Some connected-digit recognition results
Word error rates
                                            8 cep.  5 cep. + 3 for.
Standard-HMM baseline (3 states per phone)   3.5%       2.5%
Standard HMMs, variable state allocation     6.4%       5.9%
Introduce segment structure                  3.2%       2.9%
Introduce linear trajectory                  2.6%       2.3%
• Performance drops when the new state allocation is introduced
(total number of states about half that of the baseline).
• Need segment structure for good performance
• Some advantage from linear trajectory
• Formants show small, but consistent, advantage.
• Expressing a model in terms of formant dynamics offers:
– Potential for modelling systematic effects in a meaningful way: e.g.
speaker identity, speaker stress, speaking rate.
– Potential for a constrained model for speech, which should be more
robust to noise (assuming the noise is also modelled).
• BUT: analysis of formants separately from hypotheses
about what is being said will always be prone to errors.
• FUTURE AIM: integrate formant analysis within
recognition scheme: provided speech model is accurate, this
should overcome any formant tracking errors.
• A good model for speech should be appropriate for
synthesis as well as for recognition: a trajectory-based
formant model offers this possibility.
A “unified” speech model: applied to coding
A simple coding scheme
• Demonstrate principles of coding using same model
for both recognition and synthesis.
• Model represents linear formant trajectories.
• Recognition: linear-trajectory segmental HMMs, as described above.
• Synthesis: JSRU parallel-formant synthesizer.
• Coding is applied to analysed formant trajectories
=> relatively high bit-rate (up to about 1000 bits/s).
• Recognition is used mainly to identify segment
boundaries, but also to guide the coding of the trajectories.
Segment coding scheme overview
Speech Coding results
Coded at about 600 bps           Natural
– Speaker 1: digits – Speaker 1: digits
– Speaker 2: digits – Speaker 2: digits
– Speaker 3: digits – Speaker 3: digits
– Speaker 1: ARM report – Speaker 1: ARM report
Achievements of study: Established principle of
using formant trajectory model for both
recognition and synthesis, including using
information from recognition to assist in coding.
Future work: better quality coding should be
possible by further integrating formant analysis,
recognition and synthesis within a common framework.