Pushing the Envelope - Aside

Nelson Morgan, Qifeng Zhu, Andreas Stolcke,
Kemal Sönmez, Sunil Sivadas, Takahiro Shinozaki,
Mari Ostendorf, Pratibha Jain, Hynek Hermansky,
Dan Ellis, George Doddington, Barry Chen,
Özgür Çetin, Hervé Bourlard, and Marios Athineos

            Presenter: Shih-Hsiang



                IEEE Signal Processing Magazine, September 2005
Reference
 Ö. Çetin and M. Ostendorf, “Multi-rate and variable-rate
  modeling of speech at phone and syllable time
  scales,” in Proc. ICASSP, 2005
 B. Chen, Q. Zhu, and N. Morgan, “Learning long
  term temporal features in LVCSR using neural
  networks,” in Proc. ICSLP, 2004
 H. Hermansky and S. Sharma, “TRAPS - Classifiers
  of temporal patterns,” in Proc. ICSLP, 1998
 H. Hermansky, S. Sharma, and P. Jain, “Data-derived
  nonlinear mapping for feature extraction in
  HMM,” in Proc. ASRU, 1999
Reference (cont.)
 C. Moreno, Q. Zhu, B. Chen, and N. Morgan,
  “Automatic data selection for MLP-based feature
  extraction for ASR,” in Proc. ASRU, 2005
 N. Morgan, B. Chen, Q. Zhu, and A. Stolcke, “Trapping
  conversational speech: Extending TRAP/TANDEM
  approaches to conversational telephone speech
  recognition,” in Proc. ICASSP, 2004
Today’s topic
 Focus on three issues
    Using MLPs to extract long-term features
       TRAPs
       HATs
    Considerations when training on large amounts of
     data
    A new HMM-based model (multi-rate)
       Multi-rate, variable-rate
Introduction
 The core acoustic operation has essentially remained
  the same for decades
       A single feature vector is compared to a set of distributions
        derived from training data
       The feature vector is usually derived from the power spectral
        envelope over a 20-30 ms window, stepped forward by
        ~10 ms per frame
 Systems using short-term cepstra for modeling have
  been successful both in the laboratory and in numerous
  applications
    But there are still significant limitations to speech
     recognition performance, particularly for conversational
     speech and/or speech with significant acoustic
     degradations from noise or reverberation
Introduction (cont.)
 Human phonetic categorization is poor for extremely short
  segments (<100ms)
      suggesting that analysis of longer time regions is somehow
       essential to the task
 In mid-2002, the authors began working on a DARPA-
  sponsored project, EARS
 The fundamental goal of this multisite effort was to
      Push the spectral envelope away from its role as the sole
       source of acoustic information incorporated by the statistical
       models of modern speech recognition systems (SRSs)
 This would ultimately require both a revamping of
  acoustical feature extraction and a fresh look at the
  incorporation of these features into statistical models
  representing speech
Temporal Representation
 Replace (or augment) the current notion of a spectral-
  energy based vector at time t with variables
    Based on posterior probabilities of speech categories for
     long and short time functions of the time-frequency plane
    These features may be represented as multiple streams of
     probabilistic information
 Working with narrow spectral subbands and long
  temporal windows (up to 500 ms or more, sufficiently
  long for two or more syllables)
    TempoRAl Patterns (TRAPs)
    Hidden Activation TRAPS (HATS)
                                            ICSLP 1998


TempoRAl Patterns (TRAPs)
 Replace the conventional spectral feature vector in
  phonetic classification with a 1 s long temporal
  vector of critical-band logarithmic spectral energies
  (Bark critical bands)
Bark Critical Band
 The scale ranges from 1 to 24 and corresponds to the first 24
  critical bands of hearing
   $\mathrm{Bark} = 13 \arctan(0.76 f / 1000) + 3.5 \arctan((f / 7500)^2)$
 The subsequent band edges are (in Hz) 0, 100, 200, 300, 400,
  510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320, 2700,
  3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500
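As a rough illustration of the formula and band edges above, the following Python sketch converts a frequency in Hz to Bark and looks up the critical band that contains it. The function names `hz_to_bark` and `band_index` are hypothetical, and the 0-based band numbering is an assumption made for this example only.

```python
import bisect
import math

# Band edges in Hz, copied from the slide (25 edges delimit 24 bands).
BAND_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700, 9500, 12000, 15500]


def hz_to_bark(f_hz: float) -> float:
    """Bark value of a frequency in Hz, using the arctan formula above."""
    return (13.0 * math.atan(0.76 * f_hz / 1000.0)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


def band_index(f_hz: float) -> int:
    """0-based index of the critical band whose edges bracket f_hz."""
    if not BAND_EDGES_HZ[0] <= f_hz < BAND_EDGES_HZ[-1]:
        raise ValueError("frequency outside the tabulated band edges")
    return bisect.bisect_right(BAND_EDGES_HZ, f_hz) - 1


if __name__ == "__main__":
    for f in (250, 1000, 4000):
        print(f, round(hz_to_bark(f), 2), band_index(f))
```

With this numbering, 1000 Hz falls between the 920 Hz and 1080 Hz edges, that is, in band index 8.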
TempoRAl Patterns (cont.)




      Fig. Mean TRAPs for 16 phonemes at the fifth critical band
                                               ASRU 1999


TempoRAl Patterns (cont.)
 The TRAPS system consists of two stages of MLPs
    In the first stage
     Critical-band MLPs learn phone posterior probabilities
     conditioned on the input
    In the second stage
     A “merger” MLP merges the outputs of the
     individual critical-band MLPs, resulting in overall
     phone posterior probabilities
TempoRAl Patterns (cont.)
 Input to each TRAP is a 1 sec long temporal vector
 Output of each TRAP is a vector of estimates of
  phoneme-specific likelihoods
 Output from the merging MLP is a vector of estimates
  of phoneme-specific posterior probabilities

      Fig. TRAP setup: 15 critical bands; each TRAP has 101 input
      units, 300 hidden units, and 29 output phonetic classes
                                           ICSLP 2004
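The two-stage structure described above can be sketched as a plain forward pass. This is only an illustration under stated assumptions: the weights are random rather than trained, biases are omitted, the sigmoid/softmax activations and the merger hidden size (`MERGER_HIDDEN`) are guesses that the slides do not specify, and all function names are hypothetical. Only the 15 / 101 / 300 / 29 sizes come from the slide.

```python
import math
import random

N_BANDS, N_FRAMES, N_HIDDEN, N_CLASSES = 15, 101, 300, 29  # sizes from the slide
MERGER_HIDDEN = 50  # assumption: the slide gives no merger size


def random_layer(n_in, n_out, rng):
    """Weight matrix with small random values (stand-in for trained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def affine(weights, x):
    """Matrix-vector product; biases are omitted to keep the sketch short."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(mlp, x):
    """One-hidden-layer MLP: sigmoid hidden units, softmax outputs (assumed)."""
    w_hidden, w_out = mlp
    hidden = [1.0 / (1.0 + math.exp(-a)) for a in affine(w_hidden, x)]
    return softmax(affine(w_out, hidden))


def traps_posteriors(band_trajectories, band_mlps, merger):
    """Stage 1: one MLP per critical band. Stage 2: merger over all band outputs."""
    merged_input = []
    for trajectory, mlp in zip(band_trajectories, band_mlps):
        merged_input.extend(mlp_forward(mlp, trajectory))
    return mlp_forward(merger, merged_input)


rng = random.Random(0)
band_mlps = [(random_layer(N_FRAMES, N_HIDDEN, rng),
              random_layer(N_HIDDEN, N_CLASSES, rng)) for _ in range(N_BANDS)]
merger = (random_layer(N_BANDS * N_CLASSES, MERGER_HIDDEN, rng),
          random_layer(MERGER_HIDDEN, N_CLASSES, rng))
# Fake log-energy trajectories: one 101-frame vector per critical band.
trajectories = [[rng.gauss(0.0, 1.0) for _ in range(N_FRAMES)]
                for _ in range(N_BANDS)]
posteriors = traps_posteriors(trajectories, band_mlps, merger)
print(len(posteriors), round(sum(posteriors), 6))
```

The merger sees the concatenation of all 15 band outputs (15 x 29 values), so its output is a single posterior vector over the 29 phonetic classes for the frame at the center of the window.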


Hidden Activation TRAPS (HATS)
 Use the hidden activations of the critical-band MLPs,
  instead of their outputs, as inputs to the “merger”
  MLP
 Widen the acoustic context by using more frames of
  full-band speech energies as input to the MLP
 Word error rate reduced from 25.6% to 23.5%
  on the 2001 NIST evaluation set
 Word error rate reduced from 20.3% to 18.3%
  on the 2004 NIST evaluation set
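To put the two word error rate figures above on a common footing, the absolute and relative reductions can be worked out as below. The helper name `relative_reduction` is hypothetical; the four WER values are the ones quoted on the slide.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


# (evaluation set, WER before, WER after), as quoted on the slide
results = [("NIST 2001", 25.6, 23.5), ("NIST 2004", 20.3, 18.3)]

for name, before, after in results:
    absolute = before - after
    relative = relative_reduction(before, after)
    print(f"{name}: {absolute:.1f} points absolute, {relative:.1f}% relative")
```

Both cases amount to roughly 2 points absolute, or about 8% and 10% relative.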
Hidden Activation TRAPS (cont.)
 PLP features were derived from short-term spectral
  analysis (25 ms time slices every 10 ms)
 PLP/MLP used 9 frames of PLP features, and HATS used
  51 frames of log critical-band energies
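The difference between the 9-frame and 51-frame inputs is just the width of the context window taken around the current frame. A minimal sketch, assuming that frames at the utterance edges are repeated as padding (the slides do not say how edges are handled) and using a hypothetical helper name `context_window`:

```python
def context_window(frames, center, width):
    """Return `width` frames centered on index `center`.

    Frames beyond either end of the utterance are replaced by the nearest
    edge frame (an assumption made for this sketch).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center frame")
    half = width // 2
    last = len(frames) - 1
    return [frames[min(max(i, 0), last)]
            for i in range(center - half, center + half + 1)]


# Toy utterance: frame i is represented by the number i (10 ms step assumed).
utterance = list(range(200))
short_ctx = context_window(utterance, 100, 9)   # PLP/MLP-style context
long_ctx = context_window(utterance, 100, 51)   # HATS-style context
print(len(short_ctx), short_ctx[0], short_ctx[-1])
print(len(long_ctx), long_ctx[0], long_ctx[-1])
```

At a 10 ms frame step, 9 frames span roughly 90 ms while 51 frames span roughly half a second.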
Stability of Results
 Switchboard (earlier) and Fisher (later) conversational
  data are extremely difficult to recognize
    Due to their unconstrained vocabulary, speaking style,
     and range of telephones used
 Increasing the amount of training data can achieve better
  performance
Some Practical Considerations
 Larger and larger training sets can provide the best
  improvement
    But this implies a quadratic growth in training time
     (both the number of training patterns and the network
     size grow)
 Solutions
    Hyper-threading on dual CPUs
    Gender-specific training
    Preliminary network training passes with fewer
     training patterns
    Customization of the learning regimen to reduce the
     number of epochs (training iterations)
    Using selected subsets of the data for later training
     passes
Some Practical Considerations (cont.)
 Faster probabilistic inference algorithms and judicious
  model selection methods for controlling model
  complexity are needed
                                              ASRU 2005


Some Practical Considerations (cont.)
 Data selection is also an important issue
    Reducing the redundancy in the database can reduce
     the cost of learning, achieving the same
     performance with less effort
    Over-represented examples in the database can harm the
     generalization capabilities of a learning machine by
     biasing its modeling toward those classes
 For data selection based on the filter approach,
  we need an evaluation method that allows us to sort
  the data according to some sampling criterion or
  definition of the usefulness of the data
Some Practical Considerations (cont.)
 Evaluation method
   First, train an MLP selector (classifier) $s$ on a small
    subset of the data, which yields a set of parameters
    $\vec{\theta}$
   Then, given those parameters $\vec{\theta}$, obtain the
    posterior probabilities for the rest of the data
     $P_s(q_k \mid x[n], \vec{\theta}) = s(x[n]), \quad k = 0, \ldots, K-1$
     for every feature frame $x[n]$ and phoneme $q_k$
   The entropy of each feature frame is then computed as
     $h[n] = -\sum_{k=0}^{K-1} P_s(q_k \mid x[n], \vec{\theta}) \log_2 P_s(q_k \mid x[n], \vec{\theta})$
Some Practical Considerations (cont.)
 Sampling criteria
    High entropy values indicate that making a decision is
     going to be difficult
    Low entropy values indicate that the decision is easy to
     make (not necessarily implying it will be the right one)
    Very high entropy values may account for outliers or
     mislabeled examples: non-separable data
    Very low entropy values can account for over-represented
     or easily learnt examples
       This over-representation can harm the classifier’s ability by
        forcing too much detail in the corresponding class
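One way to act on these criteria is to keep only frames whose entropy lies inside a chosen band, dropping both extremes. The sketch below is an assumption about how such a filter could look, not the selection rule used in the cited work: the thresholds `low` and `high` and the function name `select_by_entropy` are hypothetical.

```python
def select_by_entropy(entropies, low, high):
    """Indices of frames whose entropy lies in the closed interval [low, high].

    Frames below `low` are treated as over-represented or easy examples,
    frames above `high` as likely outliers or mislabeled examples.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return [n for n, h in enumerate(entropies) if low <= h <= high]


# Toy per-frame entropies in bits (illustrative values only).
h = [0.01, 0.90, 1.50, 3.20, 0.05, 2.10]
kept = select_by_entropy(h, low=0.1, high=3.0)
print(kept)
```

In practice the two thresholds would have to be tuned, for example by checking recognition performance against the amount of data retained.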
Some Practical Considerations (cont.)




                                 NIST 2001
Statistical Modeling for the New Features
 HMMs are not well suited to long-term features
    The use of HMMs as the core acoustic modeling
     technology might obscure the gains from new features,
     especially those from long time scales
    This may be one reason why progress with novel
     techniques has been so difficult
 The standard way to use longer temporal scale with an
  HMM is simply to use a large analysis window and a
  small frame step
    The successive features at the slow time scale are even
     more correlated than those at the fast time scale, leading
     to a bias in posteriors
    HMMs do not represent the high correlation between
     successive frames effectively
Statistical Modeling for the New Features (cont.)
 They propose instead to focus on the problem of
  multistream and multirate process modeling
    It is desirable to improve robustness to corruption of
     individual streams
       The use of multiple streams introduces more flexibility in
        characterizing speech at different time and frequency scales
    The statistical models and features interact, and simple
     HMM-based combination approaches might not fully
     utilize complementary information in different feature
     sequences
 A multi-rate, variable-rate modeling approach is introduced
                                                      ICASSP 2005


Multi-Rate and Variable-Rate Modeling
 The traditional approach for utilizing new features is to
  concatenate them with existing cepstral features after
  over-sampling and use them within a standard HMM-
  based model
    HMMs have become so tuned to short-term features that
     their use might obscure the gains from new features
 Traditional HMM
   $P(\{o_t\}, \{s_t\}) = \prod_{t=0}^{T-1} P(s_t \mid s_{t-1}) \, p(o_t \mid s_t)$
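Multi-Rate and Variable-Rate Modeling (cont.)
 Basic Multi-rate HMM
   $P(\{o^1_{t_1}\}, \{s^1_{t_1}\}, \ldots, \{o^K_{t_K}\}, \{s^K_{t_K}\}) = \prod_{k=1}^{K} \prod_{t_k=0}^{T_k-1} P(s^k_{t_k} \mid s^k_{t_k-1}, s^{k-1}_{\lfloor t_k / M_k \rfloor}) \, P(o^k_{t_k} \mid s^k_{t_k})$

   Fig. Example with $T_1 = 3$, $M_2 = 3$, $T_2 = M_2 \times T_1 = 9$, showing the
   coarser-scale and finer-scale chains (legend: states, observations)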
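The coupling between scales in the basic multi-rate HMM comes down to an index mapping: each finer-scale frame is conditioned on the coarser-scale state whose index is the floor of the fine index divided by the rate ratio. A small sketch for the example in the figure (T1 = 3, M2 = 3), with the hypothetical helper name `coarse_parent`:

```python
def coarse_parent(t_fine: int, rate_ratio: int) -> int:
    """Index of the coarser-scale frame that governs fine frame t_fine."""
    if rate_ratio <= 0:
        raise ValueError("rate_ratio must be positive")
    return t_fine // rate_ratio


T1, M2 = 3, 3    # values from the figure
T2 = M2 * T1     # number of finer-scale frames

parents = [coarse_parent(t2, M2) for t2 in range(T2)]
for t2, t1 in enumerate(parents):
    print(f"fine frame {t2} is conditioned on coarse state {t1}")
```

Every coarse state therefore governs exactly M2 consecutive fine frames, which is the fixed-rate restriction that the variable-rate extension relaxes.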
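Multi-Rate and Variable-Rate Modeling (cont.)
 Variable-rate Extension (2-rate)
   $P(\{o^1_{t_1}\}, \{s^1_{t_1}\}, \{o^2_{t_2}\}, \{s^2_{t_2}\} \mid \{M_{t_1}\}) = \prod_{t_1=0}^{T_1-1} \Big[ P(s^1_{t_1} \mid s^1_{t_1-1}) \, P(o^1_{t_1} \mid s^1_{t_1}) \prod_{t_2=l(t_1)}^{l(t_1)+M_{t_1}-1} P(s^2_{t_2} \mid s^1_{t_1}, s^2_{t_2-1}) \, P(o^2_{t_2} \mid s^2_{t_2}) \Big]$
   Here $l(t_1)$ is the first finer-scale frame under coarse frame $t_1$

   Fig. Coarser-scale and finer-scale chains with a variable number of
   finer-scale frames per coarse frame (legend: states, observations)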
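In the variable-rate case, the number of finer-scale frames under each coarse frame is given by the sequence M_{t1}, and the inner product runs from l(t1) to l(t1) + M_{t1} - 1. A minimal sketch of that bookkeeping, assuming the fine frames are contiguous and start at index 0; the function names and the example values of M are hypothetical.

```python
def segment_starts(frames_per_coarse):
    """l(t1) for each coarse frame: running total of fine frames before it."""
    starts, total = [], 0
    for m in frames_per_coarse:
        if m <= 0:
            raise ValueError("each coarse frame needs at least one fine frame")
        starts.append(total)
        total += m
    return starts


def fine_ranges(frames_per_coarse):
    """Fine-frame index range l(t1) .. l(t1) + M_t1 - 1 for each coarse frame."""
    return [range(l, l + m)
            for l, m in zip(segment_starts(frames_per_coarse), frames_per_coarse)]


M = [2, 4, 3]  # illustrative variable rates for T1 = 3 coarse frames
for t1, r in enumerate(fine_ranges(M)):
    print(f"coarse frame {t1}: fine frames {list(r)}")
```

Setting every entry of M to the same value recovers the fixed-rate mapping of the basic multi-rate model.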
Multi-Rate and Variable-Rate Modeling (cont.)
 In their experiment, they modeled speech using both
  recognition units and feature sequences corresponding
  to phone and syllable time scales
    Short-time: traditional phone HMMs using cepstral
     features (PLP cepstra)
    Long-time: characterizes syllable structure and lexical
     stress using HATs
       Unlike the previously mentioned HAT features that were
        trained on phone targets, these HAT features are trained on
        broad consonant/vowel classes with distinction for syllable
        position (onset, coda, and ambi-syllabification) for
        consonants and low/high stress level for vowels

       2% word error rate reduction on NIST 2001 Hub-5 task
Multi-Rate and Variable-Rate Modeling (cont.)
 The experimental results show that explicitly modeling speech
  at two time scales via a multirate, coupled-HMM architecture
  outperforms the simple HMM-based feature concatenation
  approach
 The feature extraction and statistical modeling are tailored to
  focus more on information-bearing regions (e.g., phone
  transitions) as opposed to a uniform emphasis over the whole
  signal space
 Research direction
      Choice of the sampling rates according to the scale/rate of the
       larger time-window features
      Multirate acoustic models with more than two time scales
          The third or higher time scale can represent utterance-level effects
           such as speaking rate and style, gender and noise
What could be next
 Determine optimal window sizes and frame rates for
  different regions of speech, thus creating a signal-
  adaptive front end
 The energy-based representations of temporal
  trajectories could be replaced by autoregressive models
  for these components of the time-frequency plane
    FDLP, LP-TRAP
    Perceptual linear prediction squared (PLP$^2$)
       A spectrogram-like signal representation that is iteratively
        approximated by all-pole models applied sequentially in the
        time and frequency directions of the spectrotemporal pattern
            Unlike conventional feature processing, no frame-based spectral
             analysis occurs
Final Words
 They wrote some words …
   “We implored the reader not to be deterred by initial
    results that were poorer than those achieved by more
    conventional methods, since this was almost inevitable
    when wandering from a well-worn path. However, the
    goal was always to ultimately improve performance,
    and the explorations into relatively uncharted
    territory were only a path to that goal. This process
    can be slow and sometimes frustrating”
