Pushing the Envelope - Aside
Nelson Morgan, Qifeng Zhu, Andreas Stolcke,
Kemal Sönmez, Sunil Sivadas, Takahiro Shinozaki,
Mari Ostendorf, Pratibha Jain, Hynek Hermansky,
Dan Ellis, George Doddington, Barry Chen,
Özgür Çetin, Hervé Bourlard, and Marios Athineos
Presenter: Shih-Hsiang
IEEE Signal Processing Magazine, September 2005
References
Ö. Çetin and M. Ostendorf, “Multi-rate and variable-rate
modeling of speech at phone and syllable time scales,”
in Proc. ICASSP, 2005
B. Chen, Q. Zhu, and N. Morgan, “Learning long-term
temporal features in LVCSR using neural networks,”
in Proc. ICSLP, 2004
H. Hermansky and S. Sharma, “TRAPS - Classifiers
of temporal patterns,” in Proc. ICSLP, 1998
H. Hermansky, S. Sharma, and P. Jain, “Data-derived
nonlinear mapping for feature extraction in HMM,”
in Proc. ASRU, 1999
References (cont.)
C. Moreno, Q. Zhu, B. Chen, and N. Morgan,
“Automatic data selection for MLP-based feature
extraction for ASR,” in Proc. ASRU, 2005
N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, “Trapping
conversational speech: Extending TRAP/TANDEM
approaches to conversational telephone speech
recognition,” in Proc. ICASSP, 2004
Today’s topic
Focus on three issues
Using MLPs to extract long-term features
TRAPs
HATs
Considerations when training on large amounts of
data
New HMM models (multi-scale)
Multi-rate, variable-rate
Introduction
The core acoustic operation has essentially remained
the same for decades
A single feature vector is compared to a set of distributions
derived from training data
The feature vector is usually derived from the power spectral
envelope over a 20-30 ms window, stepped forward by
~10 ms per frame
Systems using short-term cepstra for modeling have
been successful both in the laboratory and in numerous
applications
But there are still significant limitations to speech
recognition performance, particularly for conversational
speech and/or speech with significant acoustic
degradations from noise or reverberation
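As a rough illustration of the short-term analysis described above, the following sketch splits a signal into 25 ms frames stepped by 10 ms and takes a log power spectrum per frame. The 8 kHz sampling rate, the Hamming window, and the function name `frame_signal` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, step_ms=10.0):
    """Split a 1-D signal into overlapping frames (no padding at the end)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    signal = np.asarray(signal)
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // step
    # Row i holds samples [i*step, i*step + win)
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]

# One second of synthetic audio at an assumed 8 kHz telephone sampling rate
frames = frame_signal(np.zeros(8000), sample_rate=8000)

# Log power spectrum per frame; the small floor avoids log(0)
windowed = frames * np.hamming(frames.shape[1])
log_power = np.log(np.abs(np.fft.rfft(windowed, axis=1)) ** 2 + 1e-10)
```

Each row of `log_power` is one short-term spectral snapshot; conventional cepstral features are computed from a smoothed version of this kind of representation.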
Introduction (cont.)
Human phonetic categorization is poor for extremely short
segments (<100ms)
suggesting that analysis of longer time regions is somehow
essential to the task
In mid-2002, the authors began working on a DARPA-sponsored
project, EARS
The fundamental goal of this multisite effort was to
push the spectral envelope away from its role as the sole
source of acoustic information incorporated by the statistical
models of modern speech recognition systems (SRSs)
This ultimately would require both a revamping of
acoustical feature extraction and a fresh look at the
incorporation of these features into statistical models
representing speech
Temporal Representation
Replace (or augment) the current notion of a spectral-
energy based vector at time t with variables
Based on posterior probabilities of speech categories for
long and short time functions of the time-frequency plane
These features may be represented as multiple streams of
probabilistic information
Working with narrow spectral subbands and long
temporal windows (up to 500 ms or more, sufficiently
long for two or more syllables)
TempoRAl Patterns (TRAPs)
Hidden Activation TRAPS (HATS)
ICSLP 1998
TempoRAl Patterns (TRAPs)
Replace the conventional spectral feature vector in
phonetic classification with a 1 s long temporal
vector of critical-band logarithmic spectral energies
(Bark critical bands)
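A minimal sketch of how such a temporal vector might be cut out of a matrix of log critical-band energies. It assumes a 10 ms frame step, so 50 frames of context on each side give 101 frames (about 1 s). The edge padding and the mean removal are assumptions of this example, and `trap_vector` is a hypothetical helper name.

```python
import numpy as np

def trap_vector(log_band_energies, band, center, context=50):
    """Return the temporal trajectory of one band around frame `center`.

    log_band_energies: array of shape (n_frames, n_bands).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    padded = np.pad(log_band_energies[:, band], context, mode="edge")
    # After padding, original frame `center` sits at index center + context
    traj = padded[center:center + 2 * context + 1]
    # Mean removal over the window (an assumed normalization choice)
    return traj - traj.mean()

# Synthetic stand-in for 3 s of log energies in 15 bands
energies = np.random.default_rng(0).normal(size=(300, 15))
vec = trap_vector(energies, band=4, center=120)
```

Here `vec` covers frames 70 to 170 of band index 4, that is, a single-band trajectory rather than a full-band spectral slice.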
Bark Critical Band
The scale ranges from 1 to 24 and corresponds to the first 24
critical bands of hearing
$$\mathrm{Bark} = 13 \arctan(0.76 f / 1000) + 3.5 \arctan\left((f / 7500)^2\right)$$
where f is the frequency in Hz
The band edges are (in Hz) 0, 100, 200, 300, 400, 510, 630, 770,
920, 1080, 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400,
5300, 6400, 7700, 9500, 12000, 15500
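The Bark formula and the band-edge list above can be tried out with a few lines of code. This is only an illustrative sketch; the function names `hz_to_bark` and `band_index` are made up for the example.

```python
import math

def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz, using the arctan formula above."""
    return (13.0 * math.atan(0.76 * f_hz / 1000.0)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Band edges in Hz as listed on the slide (25 edges, 24 bands)
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
                 9500, 12000, 15500]

def band_index(f_hz):
    """0-based index of the listed band containing f_hz, or None if outside."""
    for i in range(len(BAND_EDGES_HZ) - 1):
        if BAND_EDGES_HZ[i] <= f_hz < BAND_EDGES_HZ[i + 1]:
            return i
    return None

bark_1k = hz_to_bark(1000.0)   # roughly 8.5 Bark
band_1k = band_index(1000.0)   # falls in the 920-1080 Hz band
```

With telephone-bandwidth speech only the lower bands are populated, which is consistent with the 15 critical bands used by the TRAP systems in these slides.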
TempoRAl Patterns (cont.)
Fig. Mean TRAPs for 16 phonemes at the fifth critical band
ASRU 1999
TempoRAl Patterns (cont.)
The TRAPS system consists of two stages of MLPs
In the first stage
Critical-band MLPs learn phone posterior probabilities
conditioned on the input
In the second stage
A “merger” MLP merges the outputs of these
individual critical-band MLPs, resulting in overall
phone posterior probabilities
TempoRAl Patterns (cont.)
Input to each TRAP is a 1 sec long temporal vector
Output of each TRAP is a vector of estimates of
phoneme-specific likelihoods
Output from the merging MLP is a vector of estimates
of phoneme-specific posterior probabilities
Configuration: 15 critical-band TRAPs, each with 101 input units,
300 hidden units, and 29 output phonetic classes
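The two-stage structure can be sketched as a forward pass with the layer sizes quoted above (15 bands, 101 inputs, 300 hidden units, 29 classes). The weights below are random and untrained, biases are omitted, and the merger hidden size, the activation functions, and the choice to feed softmax outputs to the merger are all assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BANDS, N_IN, N_HID, N_CLASSES = 15, 101, 300, 29
MERGER_HID = 300  # hypothetical size, not given in the source

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1: one small MLP per critical band (random, untrained weights)
W1 = rng.normal(0.0, 0.01, size=(N_BANDS, N_IN, N_HID))
W2 = rng.normal(0.0, 0.01, size=(N_BANDS, N_HID, N_CLASSES))
# Stage 2: merger MLP over the concatenated band outputs
Wm1 = rng.normal(0.0, 0.01, size=(N_BANDS * N_CLASSES, MERGER_HID))
Wm2 = rng.normal(0.0, 0.01, size=(MERGER_HID, N_CLASSES))

def traps_forward(x):
    """x: (N_BANDS, N_IN) temporal vectors, one per critical band."""
    hidden = np.tanh(np.einsum("bi,bih->bh", x, W1))
    band_out = softmax(np.einsum("bh,bhc->bc", hidden, W2))  # per-band phone estimates
    merged = np.tanh(band_out.reshape(-1) @ Wm1)
    return softmax(merged @ Wm2)                              # overall phone posteriors

posteriors = traps_forward(rng.normal(size=(N_BANDS, N_IN)))
```

The point of the sketch is the data flow: each band is classified on its own long temporal trajectory, and only the merger sees all bands at once.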
ICSLP 2004
Hidden Activation TRAPS (HATS)
Use the hidden activations of the critical-band MLPs,
instead of their outputs, as inputs to the “merger”
MLP
Widen the acoustic context by using more frames of
full-band speech energies as input to the MLP
Reduces the word error rate from 25.6% to 23.5%
on the 2001 NIST evaluation set
Reduces the word error rate from 20.3% to 18.3%
on the 2004 NIST evaluation set
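A minimal sketch of the HATS idea: the per-band output layers are dropped and the merger reads the hidden activations directly. It reuses the 15 bands and 29 classes from the TRAP slides and the 51-frame context mentioned elsewhere in these slides. The hidden-layer sizes, the sigmoid and tanh activations, and the random untrained weights are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N_BANDS, N_FRAMES, N_CLASSES = 15, 51, 29
BAND_HID, MERGER_HID = 40, 200  # hypothetical sizes, not given in the source

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Input-to-hidden weights of each critical-band MLP (random, untrained)
W_band = rng.normal(0.0, 0.01, size=(N_BANDS, N_FRAMES, BAND_HID))
# Merger MLP that consumes the concatenated hidden activations
W_m1 = rng.normal(0.0, 0.01, size=(N_BANDS * BAND_HID, MERGER_HID))
W_m2 = rng.normal(0.0, 0.01, size=(MERGER_HID, N_CLASSES))

def hats_forward(x):
    """x: (N_BANDS, N_FRAMES) log critical-band energy trajectories."""
    # Hidden activations of the band MLPs; their output layers are not used
    hidden = 1.0 / (1.0 + np.exp(-np.einsum("bf,bfh->bh", x, W_band)))
    hats_features = hidden.reshape(-1)
    merged = np.tanh(hats_features @ W_m1)
    return hats_features, softmax(merged @ W_m2)

features, posteriors = hats_forward(rng.normal(size=(N_BANDS, N_FRAMES)))
```

Compared with the TRAP arrangement, the merger input here has one entry per band hidden unit rather than one per band output class.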
Hidden Activation TRAPS (cont.)
PLP features were derived from short-term spectral
analysis (25 ms time slices every 10 ms)
PLP/MLP used 9 frames of PLP features and HATS used
51 frames of log critical-band energies
Stability of Results
Switchboard (earlier) and Fisher (later) conversational
data are extremely difficult to recognize
due to their unconstrained vocabulary, speaking style,
and range of telephones used
Increasing the amount of training data can achieve better
performance
Some Practical Considerations
Larger and larger training sets provide the best
improvement
but imply a quadratic growth in training time when the
network size is increased along with the training set
Solution
Hyper-threading on dual CPUs
Gender-specific training
Preliminary network training passes with fewer
training patterns
Customization of the learning regimen to reduce the
number of epochs (training iterations)
Using selected subsets of the data for later training
passes
Some Practical Considerations (cont.)
Faster probabilistic inference algorithms and judicious
model selection methods for controlling model
complexity are needed
ASRU 2005
Some Practical Considerations (cont.)
Data selection is also an important issue
Reducing the redundancy in the database can
help reduce the cost of learning, achieving the same
performance with less effort
Over-represented examples in the database can harm the
generalization capabilities of a given learning machine,
biasing its modeling toward those classes
For data selection based on the filter approach
we need an evaluation method that allows us to sort
the data according to some sampling criterion or
definition of usefulness of the data
Some Practical Considerations (cont.)
Evaluation method
First, train an MLP selector (classifier), s, using a small
subset of the data; this yields a set of parameters, $\Theta$
Then, given those parameters, obtain the posterior
probabilities for the rest of the data
$$P_s(q_k \mid x[n], \Theta) = s(x[n]), \quad k = 0, \ldots, K-1$$
for every feature frame x[n] and phoneme q_k
The entropy of each feature frame can then be computed as
$$h[n] = -\sum_{k=0}^{K-1} P_s(q_k \mid x[n], \Theta) \log_2 P_s(q_k \mid x[n], \Theta)$$
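The per-frame entropy above is straightforward to compute once the selector's posteriors are available. The sketch below assumes the posteriors are stored as a matrix with one row per frame and one column per phoneme class; the clipping constant `eps` is an implementation detail added here to avoid log(0).

```python
import numpy as np

def frame_entropy(posteriors, eps=1e-12):
    """Entropy in bits for each row of an (n_frames, K) posterior matrix."""
    p = np.clip(posteriors, eps, 1.0)
    return -(p * np.log2(p)).sum(axis=1)

post = np.array([[0.25, 0.25, 0.25, 0.25],   # maximally uncertain over K = 4
                 [1.0, 0.0, 0.0, 0.0],       # fully confident
                 [0.5, 0.5, 0.0, 0.0]])      # torn between two classes
h = frame_entropy(post)
```

For K classes the entropy ranges from 0 bits (a one-hot posterior) to log2(K) bits (a uniform posterior), so the three rows above give approximately 2, 0, and 1 bits.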
Some Practical Considerations (cont.)
Sampling criteria
High entropy values indicate that making a decision is
going to be difficult
Low entropy values indicate that the decision is easy to
make (not necessarily implying it will be the right one)
Very high entropy values may indicate outliers or
mislabeled examples: non-separable data
Very low entropy values may indicate overrepresented
or easily learned examples
This overrepresentation can harm the classifier by
forcing too much detail in the corresponding class
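One way to turn these criteria into a filter is to keep only frames whose entropy falls between a lower and an upper threshold, discarding both extremes. The slides do not say how the thresholds were chosen, so the quantile-based rule and the name `select_by_entropy` below are purely illustrative assumptions.

```python
import numpy as np

def select_by_entropy(h, low_q=0.1, high_q=0.9):
    """Indices of frames whose entropy lies between two quantiles of h.

    Frames below the low quantile are treated as over-represented or easy,
    frames above the high quantile as possible outliers or label errors.
    """
    lo, hi = np.quantile(h, [low_q, high_q])
    return np.flatnonzero((h >= lo) & (h <= hi))

# Toy per-frame entropies in bits
h = np.array([0.0, 0.1, 0.5, 1.0, 1.5, 2.0, 3.9, 4.0])
kept = select_by_entropy(h, low_q=0.25, high_q=0.75)
```

In this toy case the two lowest and two highest entropy frames are dropped and the four middle frames (indices 2 to 5) are retained for later training passes.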
Some Practical Considerations (cont.)
Fig. Results on the NIST 2001 evaluation set
Statistical Modeling for the New Features
HMMs are not well suited to long-term features
The use of HMMs as the core acoustic modeling
technology might obscure the gains from new features,
especially those from long time scales
This may be one reason why progress with novel
techniques has been so difficult
The standard way to use longer temporal scales with an
HMM is simply to use a large analysis window and a
small frame step
The successive features at the slow time scale are even
more correlated than those at the fast time scale, leading
to a bias in posteriors
HMMs do not represent the high correlation between
successive frames effectively
Statistical Modeling for the New Features (cont.)
The authors propose instead to focus on the problem of
multistream and multirate process modeling
It is desirable to improve robustness to corruption of
individual streams
The use of multiple streams introduces more flexibility in
characterizing speech at different time and frequency scales
The statistical models and features interact, and simple
HMM-based combination approaches might not fully
utilize complementary information in different feature
sequences
Multi-rate and variable-rate modeling is introduced
ICASSP 2005
Multi-Rate and Variable-Rate Modeling
The traditional approach for utilizing new features is to
concatenate them with existing cepstral features after
over-sampling and use them within a standard HMM-based
model
HMMs have become so tuned to short-term features that
their use might obscure the gains from new features
Traditional HMM
$$P(\{o_t\}, \{s_t\}) = \prod_{t=0}^{T-1} P(s_t \mid s_{t-1})\, p(o_t \mid s_t)$$
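The traditional HMM factorization can be evaluated directly for a given state and observation sequence. The sketch below uses discrete observations for simplicity (real systems use continuous densities), and treats the first transition term P(s_0 | s_{-1}) as an initial state distribution `pi`. Both choices are assumptions of this example.

```python
import numpy as np

def hmm_log_joint(states, obs, pi, A, B):
    """log P({o_t}, {s_t}) for a discrete-observation HMM.

    pi[i]   : initial state probability, standing in for P(s_0 | s_{-1})
    A[i, j] : P(s_t = j | s_{t-1} = i)
    B[i, v] : p(o_t = v | s_t = i)
    """
    logp = np.log(pi[states[0]]) + np.log(B[states[0], obs[0]])
    for t in range(1, len(states)):
        logp += np.log(A[states[t - 1], states[t]]) + np.log(B[states[t], obs[t]])
    return logp

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.5, 0.5]])
lp = hmm_log_joint([0, 0, 1], [0, 0, 1], pi, A, B)
```

For this toy sequence the joint probability is (0.6)(0.9)(0.7)(0.9)(0.3)(0.5) = 0.05103, one transition factor and one emission factor per frame.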
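Multi-Rate and Variable-Rate Modeling (cont.)
Basic Multi-rate HMM
$$P(\{o^1_{t_1}\}, \{s^1_{t_1}\}, \ldots, \{o^K_{t_K}\}, \{s^K_{t_K}\}) = \prod_{k=1}^{K} \prod_{t_k=0}^{T_k-1} P(s^k_{t_k} \mid s^k_{t_k-1}, s^{k-1}_{\lfloor t_k / M_k \rfloor})\, P(o^k_{t_k} \mid s^k_{t_k})$$
Fig. Two-scale example with T_1 = 3, M_2 = 3, and T_2 = M_2 × T_1 = 9,
showing states and observations at the coarser and finer scales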
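For two scales, the multi-rate factorization can be sketched as follows: the coarse chain behaves like an ordinary HMM, and each fine-scale state additionally depends on the coarse state at index floor(t_2 / M). Discrete observations, the array layout, and the use of initial distributions for the first frame of each chain are assumptions of this example.

```python
import numpy as np

def two_rate_log_joint(s1, o1, s2, o2, M, pi1, A1, B1, pi2, A2, B2):
    """Log joint of a two-scale multi-rate HMM with a fixed rate ratio M.

    Coarse chain (scale 1): pi1[i], A1[i, j], B1[i, v].
    Fine chain (scale 2) is conditioned on its coarse parent s1[t2 // M]:
      pi2[p, j]   : P(s2_0 = j | parent = p)
      A2[p, i, j] : P(s2_t = j | s2_{t-1} = i, parent = p)
      B2[j, v]    : P(o2_t = v | s2_t = j)
    """
    assert len(s2) == M * len(s1), "fine sequence must be M times longer"
    logp = 0.0
    for t in range(len(s1)):
        trans = pi1[s1[t]] if t == 0 else A1[s1[t - 1], s1[t]]
        logp += np.log(trans) + np.log(B1[s1[t], o1[t]])
    for t in range(len(s2)):
        parent = s1[t // M]  # coarse state aligned with fine frame t
        trans = pi2[parent, s2[t]] if t == 0 else A2[parent, s2[t - 1], s2[t]]
        logp += np.log(trans) + np.log(B2[s2[t], o2[t]])
    return logp

# Toy model with uniform probabilities: T1 = 3 coarse frames, M = 3, T2 = 9
half = 0.5
lp = two_rate_log_joint(
    s1=[0, 1, 1], o1=[0, 0, 1],
    s2=[0, 1, 0, 1, 1, 0, 0, 0, 1], o2=[0] * 9, M=3,
    pi1=np.full(2, half), A1=np.full((2, 2), half), B1=np.full((2, 2), half),
    pi2=np.full((2, 2), half), A2=np.full((2, 2, 2), half), B2=np.full((2, 2), half),
)
```

With every probability set to 0.5, the result is simply the number of factors times log 0.5: 6 factors from the coarse chain and 18 from the fine chain.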
Multi-Rate and Variable-Rate Modeling (cont.)
Variable-rate Extension (2-rate)
$$P(\{o^1_{t_1}\}, \{s^1_{t_1}\}, \{o^2_{t_2}\}, \{s^2_{t_2}\} \mid \{M_{t_1}\}) = \prod_{t_1=0}^{T_1-1} P(s^1_{t_1} \mid s^1_{t_1-1})\, P(o^1_{t_1} \mid s^1_{t_1}) \prod_{t_2=l(t_1)}^{l(t_1)+M_{t_1}-1} P(s^2_{t_2} \mid s^1_{t_1}, s^2_{t_2-1})\, P(o^2_{t_2} \mid s^2_{t_2})$$
Fig. States and observations at the coarser and finer scales
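In the variable-rate case each coarse frame t_1 spans its own number M_{t_1} of fine frames, starting at index l(t_1). The slide does not define l(t_1); the sketch below assumes it is the running sum of the preceding M values, so that the fine-scale segments tile the sequence without gaps. The function name `fine_segments` is hypothetical.

```python
import numpy as np

def fine_segments(M):
    """Half-open fine-frame ranges (l(t1), l(t1) + M[t1]) for each coarse frame.

    Assumes l(t1) is the cumulative sum of M over the earlier coarse frames.
    """
    starts = np.concatenate(([0], np.cumsum(M)[:-1]))
    return [(int(s), int(s + m)) for s, m in zip(starts, M)]

# Three coarse frames covering 3, 2, and 4 fine frames respectively
segments = fine_segments([3, 2, 4])
```

Setting every entry of M to the same value recovers the fixed-rate alignment, where fine frame t_2 belongs to coarse frame floor(t_2 / M).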
Multi-Rate and Variable-Rate Modeling (cont.)
In their experiments, the authors modeled speech using both
recognition units and feature sequences corresponding
to phone and syllable time scales
Short-time: traditional phone HMMs using cepstral
features (PLP cepstra)
Long-time: characterizes syllable structure and lexical
stress using HATs
Unlike the previously mentioned HAT features, which were
trained on phone targets, these HAT features are trained on
broad consonant/vowel classes with distinctions for syllable
position (onset, coda, and ambisyllabic) for
consonants and low/high stress level for vowels
2% word error rate reduction on the NIST 2001 Hub-5 task
Multi-Rate and Variable-Rate Modeling (cont.)
The experimental results show that explicit modeling of speech
at two time scales via a multirate, coupled-HMM architecture
outperforms the simple HMM-based feature concatenation
approach
Feature extraction and statistical modeling are tailored to
focus more on information-bearing regions (e.g., phone
transitions) as opposed to a uniform emphasis over the whole
signal space
Research directions
Choice of the sampling rates according to the scale/rate of the
larger time-window features
Multirate acoustic models with more than two time scales
A third or higher time scale can represent utterance-level effects
such as speaking rate and style, gender, and noise
What could be next
Determine optimal window sizes and frame rates for
different regions of speech, thus creating a signal-
adaptive front end
The energy-based representations of temporal
trajectories could be replaced by autoregressive models
for these components of the time-frequency plane
FDLP, LP-TRAP
Perceptual linear prediction squared (PLP$^2$)
A spectrogram-like signal representation that is iteratively
approximated by all-pole models applied sequentially in the
time and frequency directions of the spectrotemporal pattern
Unlike conventional feature processing, no frame-based spectral
analysis occurs
Final Words
The authors wrote:
“We implored the reader not to be deterred by initial
results that were poorer than those achieved by more
conventional methods, since this was almost inevitable
when wandering from a well-worn path. However, the
goal was always to ultimately improve performance,
and the explorations into relatively uncharted
territory were only a path to that goal. This process
can be slow and sometimes frustrating.”