Word Recognition with Conditional Random Fields

Conditional Random Fields for
Automatic Speech Recognition


 Jeremy Morris
 05/12/2010



                                1
Motivation

   What is the purpose of Automatic Speech
    Recognition?
       Take an acoustic speech signal …
       … and extract higher level information (e.g.
        words) from it



                        [Figure: acoustic signal → “speech”]

                                                       2
Motivation

   How do we extract this higher level
    information from the speech signal?
   First extract lower level information
   Use it to build models of phones, words



                  [Figure: acoustic signal → “speech” / s p iy ch /]

                                                 3
Motivation

   State-of-the-art ASR takes a top-down
    approach to this problem
       Extract acoustic features from the signal
       Model a process that generates these features
       Use these models to find the word sequence that
        best fits the features


                  [Figure: acoustic signal → “speech” / s p iy ch /]

                                                          4
Motivation

   A bottom-up approach
       Look for evidence of speech in the signal
           Phones, phonological features
       Combine this evidence together to find the most
        probable sequence of words in the signal



             [Figure: voicing? burst? frication? → “speech” / s p iy ch /]

                                                            5
Motivation

   How can we combine this evidence?
       Conditional Random Fields (CRFs)
           Discriminative, probabilistic sequence model
           Models the conditional probability of a sequence given
            evidence




             [Figure: voicing? burst? frication? → “speech” / s p iy ch /]

                                                                      6
Outline

   Motivation
   CRF Models
   Phone Recognition
   HMM-CRF Word Recognition
   CRF Word Recognition
   Conclusions



                               7
CRF Models
   Conditional Random Fields (CRFs)
       Discriminative probabilistic sequence model
       Directly defines a posterior probability P(Y|X) of a
        label sequence Y given evidence X




                                                               8
CRF Models
   The structure of the evidence can be arbitrary
       No assumptions of independence
       States can be influenced by any evidence
       Evidence can influence transitions between states




                                                        9–14
CRF Models
   Evidence is incorporated via feature functions


    state feature
      functions




                                                15
CRF Models
   Evidence is incorporated via feature functions

                      transition feature
    state feature         functions
      functions




                                                16
CRF Models
     P(Y \mid X) = \frac{\exp\left( \sum_k \left( \sum_i \lambda_i\, s_i(x, y_k) + \sum_j \mu_j\, t_j(x, y_k, y_{k-1}) \right) \right)}{Z(x)}

        s_i: state feature functions          t_j: transition feature functions


   The form of the CRF is an exponential model of
    weighted feature functions
      Weights trained via gradient descent to maximize
       the conditional likelihood
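The exponential form above can be made concrete with a toy example. The sketch below (hypothetical labels and hand-picked weights, not a trained model) computes P(Y|X) for a two-label linear-chain CRF by brute-force enumeration of Z(x):

```python
import math
from itertools import product

# Toy linear-chain CRF: 2 labels, hand-picked weights (illustrative only).
LABELS = ["A", "B"]

def state_feat(x_t, y_t):
    # s(x, y_k): fires when the observation matches the label's evidence
    return 1.0 if x_t == y_t else 0.0

def trans_feat(y_prev, y_t):
    # t(x, y_k, y_{k-1}): fires on self-transitions
    return 1.0 if y_prev == y_t else 0.0

def score(x, y, lam=2.0, mu=0.5):
    # sum over positions of weighted state + transition features
    s = sum(lam * state_feat(x[k], y[k]) for k in range(len(x)))
    s += sum(mu * trans_feat(y[k - 1], y[k]) for k in range(1, len(x)))
    return s

def posterior(x, y):
    # P(Y|X) = exp(score(x, y)) / Z(x), with Z(x) summed by brute force
    z = sum(math.exp(score(x, yy)) for yy in product(LABELS, repeat=len(x)))
    return math.exp(score(x, y)) / z

x = ["A", "A", "B"]
probs = {yy: posterior(x, list(yy)) for yy in product(LABELS, repeat=3)}
best = max(probs, key=probs.get)  # label sequence matching the evidence
```

A real ASR CRF computes Z(x) with the forward algorithm rather than enumeration; this only illustrates the model's form.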

                                                                             17
Outline

   Motivation
   CRF Models
   Phone Recognition
   HMM-CRF Word Recognition
   CRF Word Recognition
   Conclusions



                               18
Phone Recognition
   What evidence do we have to combine?
       MLP ANN trained to estimate frame-level
        posteriors for phonological features
       MLP ANN trained to estimate frame-level
        posteriors for phone classes
                                                  P(voicing|X)
                                                  P(burst|X)
                                                  P(frication|X)
                                                  …



                                                  P( /ah/ | X)
                                                  P( /t/ | X)
                                                  P( /n/ | X)
                                                  …



                                                                   19
Phone Recognition
   Use these MLP outputs to build state feature
    functions
     s_{/t/,\,P(/t/|x)}(y, x) = \begin{cases} \mathrm{MLP}_{P(/t/|x)}(x), & \text{if } y = /t/ \\ 0, & \text{otherwise} \end{cases}




                                                                                    20
Phone Recognition
   Use these MLP outputs to build state feature
    functions
     s_{/t/,\,P(/t/|x)}(y, x) = \begin{cases} \mathrm{MLP}_{P(/t/|x)}(x), & \text{if } y = /t/ \\ 0, & \text{otherwise} \end{cases}

     s_{/t/,\,P(/d/|x)}(y, x) = \begin{cases} \mathrm{MLP}_{P(/d/|x)}(x), & \text{if } y = /t/ \\ 0, & \text{otherwise} \end{cases}




                                                                                                   21
Phone Recognition
   Use these MLP outputs to build state feature
    functions
     s_{/t/,\,P(/t/|x)}(y, x) = \begin{cases} \mathrm{MLP}_{P(/t/|x)}(x), & \text{if } y = /t/ \\ 0, & \text{otherwise} \end{cases}

     s_{/t/,\,P(stop|x)}(y, x) = \begin{cases} \mathrm{MLP}_{P(stop|x)}(x), & \text{if } y = /t/ \\ 0, & \text{otherwise} \end{cases}
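Such state feature functions can be sketched as follows (hypothetical helper names and made-up posterior values; the real features are per-frame MLP outputs):

```python
# Sketch: one state feature function per (label, MLP output) pair,
# mirroring s_{/t/,P(/t/|x)}(y, x) above.
def make_state_feature(label, mlp_output_key):
    def feature(y, mlp_outputs):
        # passes through the MLP posterior only when the state is `label`
        return mlp_outputs[mlp_output_key] if y == label else 0.0
    return feature

# frame-level MLP posteriors for one frame (made-up numbers)
frame = {"P(/t/|x)": 0.7, "P(/d/|x)": 0.2, "P(stop|x)": 0.85}

s_t_t = make_state_feature("/t/", "P(/t/|x)")
s_t_d = make_state_feature("/t/", "P(/d/|x)")

assert s_t_t("/t/", frame) == 0.7   # state /t/ sees P(/t/|x)
assert s_t_d("/t/", frame) == 0.2   # state /t/ can also see P(/d/|x)
assert s_t_t("/d/", frame) == 0.0   # other states: feature is zero
```

Pairing every label with every MLP output (not just the matching one) is what lets the trained weights exploit cross-class evidence, e.g. confusions between /t/ and /d/.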




                                                                                                      22
Phone Recognition
   Pilot task – phone recognition on TIMIT
       ICSI Quicknet MLPs trained on TIMIT, used as
        inputs to the CRF models
       Compared to Tandem and a standard PLP HMM
        baseline model
   Output of ICSI Quicknet MLPs as inputs
       Phone class attributes (61 outputs)
        Phonological feature attributes (44 outputs)




                                                        23
 Phone Recognition
   Model                                                      Accuracy
   HMM (PLP inputs)                                           68.1%
   CRF (phone classes)                                        70.2%
   HMM Tandem16mix (phone classes)                            70.4%
   CRF (phone classes + phonological features)                71.5%*
   HMM Tandem16mix (phone classes + phonological features)    70.2%




*Significantly (p<0.05) better than comparable Tandem system (Morris & Fosler-Lussier 08)


                                                                                                      24
Phone Recognition
   Moving forward: How do we make use of
    CRF classification for word recognition?
       Attempt to fit CRFs into current state-of-the-art
        models for speech recognition?
       Attempt to use CRFs directly?
   Each approach has its benefits
       Fitting CRFs into a standard framework lets us
        reuse existing code and ideas
       A model that uses CRFs directly opens up new
        directions for investigation
           Requires some rethinking of the standard model for ASR

                                                                 25
Outline

   Motivation
   CRF Models
   Phone Recognition
   HMM-CRF Word Recognition
   CRF Word Recognition
   Conclusions



                               26
HMM-CRF Word Recognition
   Inspired by Tandem HMM systems
       Uses ANN outputs as input features to an HMM




                  [Figure: PCA → “speech” / s p iy ch /]

                                                         27
HMM-CRF Word Recognition
   Inspired by Tandem HMM systems
       Uses ANN outputs as input features to an HMM
       HMM-CRF system (Crandem)
           Use a CRF to generate input features for HMM
           See if improved phone accuracy helps the system
   Problem: CRFs estimate probability of the
    entire sequence, not individual frames

                  [Figure: PCA → “speech” / s p iy ch /]

                                                                 28
HMM-CRF Word Recognition
   One solution: Forward-Backward Algorithm
        Used during CRF training to maximize the
         conditional likelihood
       Provides an estimate of the posterior probability of
        a phone label given the input


                  P(y_{i,t} \mid X) = \frac{\alpha_{i,t}\, \beta_{i,t}}{Z(x)}
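A minimal sketch of this forward–backward computation, with illustrative potentials rather than trained CRF weights:

```python
# Minimal forward-backward over per-frame label scores (illustrative
# potentials, not a trained CRF). psi[t][y] = exp(local score),
# trans[a][b] = exp(transition score).
def fb_posteriors(psi, trans):
    T, Y = len(psi), len(psi[0])
    alpha = [[0.0] * Y for _ in range(T)]
    beta = [[0.0] * Y for _ in range(T)]
    alpha[0] = psi[0][:]
    beta[-1] = [1.0] * Y
    for t in range(1, T):            # forward pass
        for y in range(Y):
            alpha[t][y] = psi[t][y] * sum(
                alpha[t - 1][y2] * trans[y2][y] for y2 in range(Y))
    for t in range(T - 2, -1, -1):   # backward pass
        for y in range(Y):
            beta[t][y] = sum(
                trans[y][y2] * psi[t + 1][y2] * beta[t + 1][y2] for y2 in range(Y))
    z = sum(alpha[-1])
    # P(y at frame t | X) = alpha_t(y) * beta_t(y) / Z(x)
    return [[alpha[t][y] * beta[t][y] / z for y in range(Y)] for t in range(T)]

psi = [[2.0, 1.0], [1.0, 3.0], [1.0, 1.0]]
trans = [[1.0, 0.5], [0.5, 1.0]]
post = fb_posteriors(psi, trans)
```

Each row of `post` is a per-frame posterior over labels and sums to one, which is exactly the form the Crandem system needs as HMM input features.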

                                                               29
HMM-CRF Word Recognition
   Original Tandem system




               [Figure: PCA → “speech” / s p iy ch /]

                                       30
HMM-CRF Word Recognition
   Modified Tandem system (Crandem)




          [Figure: Local Feature Calc. → PCA → “speech” / s p iy ch /]

                                                 31
HMM-CRF Word Recognition
   Pilot task – phone recognition on TIMIT
       Same ICSI Quicknet MLP outputs used as inputs
       Crandem compared to Tandem, a standard PLP
        HMM baseline model, and to the original CRF
   Evidence on transitions
       This work also examines the effect of using the
        same MLP outputs as transition features for the
        CRF




                                                          32
  HMM-CRF Word Recognition
       Pilot Results 1 (Fosler-Lussier & Morris 08)
             Model                                        Phone
                                                          Accuracy
             PLP HMM reference                            68.1%
             Tandem (phone class)                         70.8%
             CRF (phone class)                            70.2%
             Crandem (phone class, state only)            71.1%
             CRF (phone class, state+trans)               70.7%
             Crandem (phone class, state+trans)           71.7%



* A difference of 0.6% between models is significant (p<0.05)
                                                                       33
  HMM-CRF Word Recognition
       Pilot Results 2 (Fosler-Lussier & Morris 08)
              Model                                           Phone
                                                              Accuracy
              PLP HMM reference                               68.1%
              Tandem (phone+phono)                            71.2%
              CRF (phone+phono, state only)                   71.4%
              Crandem (phone+phono, state only)               71.7%
              CRF (phone+phono, state+trans)                  71.6%
              Crandem (phone+phono, state+trans)              72.4%



* A difference of 0.6% between models is significant (p<0.05)
                                                                        34
HMM-CRF Word Recognition
   Extension – Word recognition on WSJ0
       New MLPs and CRFs trained on WSJ0 corpus of
        read speech
           No phone level assignments, only word transcripts
           Initial alignments from HMM forced alignment of MFCC
            features
           Compare Crandem baseline to Tandem and original
            MFCC baselines
        WSJ0 5K Word Recognition task
           Same bigram language model used for all systems



                                                                   35
  HMM-CRF Word Recognition
       Results (Morris & Fosler-Lussier 09)
        Model                                      Dev         Eval
                                                   WER         WER

        MFCC HMM reference                         9.3%        8.7%
        Tandem MLP                                 9.1%        8.4%
        Crandem (1 epoch)                          8.9%        9.4%
        Crandem (10 epochs)                        10.2%      10.4%
        Crandem (20 epochs)                        10.3%      10.5%




* A difference of roughly 0.9% between models is significant (p≤0.05)
                                                                               36
  HMM-CRF Word Recognition
          Model                                      Phone
                                                     Accuracy

          MFCC HMM reference                         70.1%
          Tandem MLP                                 75.6%
          Crandem (1 epoch)                          72.8%
          Crandem (10 epochs)                        72.9%
          Crandem (20 epochs)                        72.3%
          CRF (1 epoch)                              69.5%
          CRF (10 epochs)                            70.6%
          CRF (20 epochs)                            71.0%


* A difference of roughly 0.06% between models is significant (p≤0.05)
                                                                                37
HMM-CRF Word Recognition

        [Figure: Comparison of MLP activation vs. CRF activation]

                                                         38
HMM-CRF Word Recognition

        [Figure: Ranked average per-frame activation, MLP vs. CRF]

                                                         39
HMM-CRF Word Recognition
   Insights from these experiments
       CRF posteriors very different in flavor from MLP
        posteriors
            Overconfident in the local decisions being made
           Higher phone accuracy did not translate to lower WER
   Further experiment to test this idea
       Transform posteriors via taking a root and
        renormalizing
           Bring classes closer together
            Results were no longer significantly different from the
             baseline and no longer degraded with further epochs of
             training (though they did not improve either)
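The transformation described above can be sketched as follows (the root value and posterior numbers are illustrative):

```python
# Sketch of the posterior flattening described above: take a root
# (power < 1) and renormalize, pulling overconfident CRF posteriors
# closer together before they are used as HMM input features.
def flatten_posteriors(posteriors, root=2.0):
    powered = [p ** (1.0 / root) for p in posteriors]
    z = sum(powered)
    return [p / z for p in powered]

overconfident = [0.98, 0.01, 0.01]
softer = flatten_posteriors(overconfident, root=4.0)
```

Taking a root compresses the dynamic range: the top class loses probability mass and the low-probability classes gain it, while the result still sums to one.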
                                                                        40
Outline

   Motivation
   CRF Models
   Phone Recognition
   HMM-CRF Word Recognition
   CRF Word Recognition
   Conclusions



                               41
CRF Word Recognition
   Instead of feeding CRF outputs into an HMM



                               [Figure: “speech” / s p iy ch /]

                                                  42
CRF Word Recognition
   Instead of feeding CRF outputs into an HMM
   Why not decode words directly off the CRF?

            [Figure: two decoding paths, each yielding “speech” / s p iy ch /]

                                                  43
CRF Word Recognition
     \arg\max_W P(W \mid X) \approx \arg\max_{W,\Phi} P(X \mid \Phi)\, P(\Phi \mid W)\, P(W)

                               Acoustic      Lexicon    Language
                                Model         Model       Model


   The standard model of ASR uses likelihood
    based acoustic models
   CRFs provide a conditional acoustic model
    P(Φ|X)


                                                       44
CRF Word Recognition
                                              CRF Acoustic Model

     \arg\max_W P(W \mid X) \approx \arg\max_{W,\Phi} \frac{P(\Phi \mid X)}{P(\Phi)}\, P(\Phi \mid W)\, P(W)

                     Phone Penalty         Lexicon     Language
                        Model               Model        Model




                                                           45
CRF Word Recognition
   Models implemented using OpenFST
       Viterbi beam search to find best word sequence
   Word recognition on WSJ0
        WSJ0 5K Word Recognition task
           Same bigram language model used for all systems
       Same MLPs used for CRF-HMM (Crandem)
        experiments
       CRFs trained using 3-state phone model instead
        of 1-state model
       Compare to Tandem and original MFCC baselines

                                                              46
  CRF Word Recognition
     Results – Phone Classes only
       Model                                        Dev         Eval
                                                    WER         WER

       MFCC HMM reference                           9.3%        8.7%
       Tandem MLP                                   9.1%        8.4%
       CRF (state only)                            11.3%        11.5%
       CRF (state+trans)                            9.2%        8.6%



* A difference of roughly 0.9% between models is significant (p≤0.05)


                                                                               47
  CRF Word Recognition
     Results – Phone & Phonological features
       Model                                        Dev         Eval
                                                    WER         WER

       MFCC HMM reference                           9.3%        8.7%
       Tandem MLP (all)                             9.1%        8.4%
       Tandem MLP (best)                            8.6%        8.1%
       CRF (state+trans)                            8.3%        8.0%



* A difference of roughly 0.9% between models is significant (p≤0.05)


                                                                               48
Outline

   Motivation
   CRF Models
   Phone Recognition
   HMM-CRF Word Recognition
   CRF Word Recognition
   Conclusions



                               49
Conclusions & Future Work
   Designed and developed software for CRF
    training for ASR
   Developed a system for word-level ASR
    using CRFs
       Meets baseline performance of an MLE trained
        HMM system
       Platform for further exploration




                                                       50
Conclusions & Future Work
   Topics for further exploration:
       Feature modeling
           What kinds of features can benefit these kinds of
            models?
       Incorporating context
           Segmental approaches
       Integration between higher level and lower level
        models
           Language model and acoustic model separate models
           Combining low level evidence with higher-level evidence



                                                                  51
  HMM-CRF Word Recognition
       Transformed Results
        Model                                      Dev         Eval
                                                   WER         WER

        MFCC HMM reference                         9.3%        8.7%
        Tandem MLP (39)                            9.1%        8.4%
        Crandem (1 epoch)                          8.9%        9.4%
        Crandem* ( 1 epoch)                        8.4%        8.5%
        Crandem (10 epochs)                        10.2%      10.4%
        Crandem* (10 epochs)                       8.5%        8.8%
        Crandem (20 epochs)                        10.3%      10.5%
        Crandem* (20 epochs)                       8.5%        8.5%
* A difference of roughly 0.9% between models is significant (p≤0.05)
                                                                               54
CRF Word Recognition
               P ( | X ) P ( X )
   P( X | ) 
                   P ( )                 CRF
                                         Acoustic
                                          Model


                            P ( | X )
arg max P(W | X )  arg max            P( | W ) P(W )
   W                  W ,    P ( )
                     Phone
                     Penalty        Lexicon     Language
                     Model           Model        Model




                                                           55
Review - Word Recognition
     \arg\max_W P(W \mid X)



   Problem: For a given input signal X, find the
    word string W that maximizes P(W|X)




                                                    56
Review - Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\, P(W)}{P(X)}

   Problem: For a given input signal X, find the
    word string W that maximizes P(W|X)
   In an HMM, we would make this a generative
    problem




                                                   57
Review - Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)



   Problem: For a given input signal X, find the
    word string W that maximizes P(W|X)
   In an HMM, we would make this a generative
    problem
   We can drop the P(X) because it does not
    affect the choice of W

                                                   58
Review - Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)




   We want to build phone models, not whole
    word models…




                                                   59
Review - Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)
                            \approx \arg\max_W \sum_\Phi P(X \mid \Phi)\, P(\Phi \mid W)\, P(W)




   We want to build phone models, not whole
    word models…
   … so we marginalize over the phones


                                                    60
Review - Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)
                            \approx \arg\max_W \sum_\Phi P(X \mid \Phi)\, P(\Phi \mid W)\, P(W)
                            \approx \arg\max_{W,\Phi} P(X \mid \Phi)\, P(\Phi \mid W)\, P(W)
   We want to build phone models, not whole
    word models…
   … so we marginalize over the phones
   and look for the best sequence that fits these
    constraints
                                                       61
Review - Word Recognition
              P( X | ) P( | W ) P(W )


    Acoustic Model


                                     Language Model




                      Lexicon




                                                      62
Word Recognition
                    P( X | ) P( | W ) P(W )


          Acoustic Model



   However, our CRFs model P(Φ|X) rather
    than P(X|Φ)
       This makes the formulation of the problem
        somewhat different


                                                    63
Word Recognition
     \arg\max_W P(W \mid X)




   We want a formulation that makes use of P(Φ|X)




                                                 64
Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W \sum_\Phi P(W, \Phi \mid X)
                            = \arg\max_W \sum_\Phi P(W \mid \Phi, X)\, P(\Phi \mid X)




   We want a formulation that makes use of P(Φ|X)
   We can get that by marginalizing over the phone
    strings
   But the CRF as we formulate it doesn’t give
    P(Φ|X) directly
                                                    65
Word Recognition

             P(W | , X ) P( | X )




   Φ here is a phone level assignment of phone
    labels
   CRF gives related quantity – P(Q|X) where Q is
    the frame level assignment of phone labels


                                                     66
Word Recognition

   Frame level vs. Phone level
       Mapping from frame level to phone level may not
        be deterministic
       Example: The word “OH” with pronunciation /ow/
       Consider this sequence of frame labels:
          ow    ow    ow    ow    ow     ow    ow
       This sequence can possibly be expanded many
        different ways for the word “OH” (“OH”, “OH OH”,
        etc.)

                                                           67
Word Recognition

   Frame level vs. Phone segment level
       This problem occurs because we’re using a single
        state to represent the phone /ow/
           Phone either transitions to itself or transitions out to
            another phone
       We can change our model to a multi-state model
        and make this decision deterministic
           This brings us closer to a standard ASR HMM topology
            ow1 ow2 ow2 ow2 ow2 ow3 ow3
       Now we can see a single “OH” in this utterance
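Under a 3-state topology, the deterministic frame-to-phone mapping can be sketched as follows (assuming a hypothetical ow1/ow2/ow3 sub-state naming convention):

```python
# Sketch of the deterministic Q -> Phi mapping enabled by the 3-state
# phone model: a new phone segment begins exactly where sub-state 1 appears.
def frames_to_phones(frame_labels):
    phones = []
    for lab in frame_labels:
        base, sub = lab[:-1], lab[-1]   # e.g. "ow2" -> ("ow", "2")
        if sub == "1":                  # entering sub-state 1 starts a phone
            phones.append(base)
    return phones

# one "OH": unambiguous under the multi-state topology
print(frames_to_phones(["ow1", "ow2", "ow2", "ow2", "ow2", "ow3", "ow3"]))
# two "OH"s: the second ow1 marks the second segment
print(frames_to_phones(["ow1", "ow2", "ow3", "ow1", "ow2", "ow3"]))
```

With single-state labels ("ow ow ow …") no such boundary marker exists, which is exactly the ambiguity described above.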

                                                                       68
Word Recognition
 P ( | X )   P ( , Q | X )
                Q
              P( | Q, X ) P(Q | X )
                Q
              P( | Q) P(Q | X )
                 Q

   Multi-state model gives us a deterministic
    mapping of Q -> Φ
       Each frame-level assignment Q has exactly one
        segment level assignment associated with it
       Potential pitfalls if the multi-state model is
        inappropriate for the features we are using

                                                         69
 Word Recognition
     \arg\max_W P(W \mid X) = \arg\max_W \sum_\Phi P(W \mid \Phi, X)\, P(\Phi \mid X)
                            \approx \arg\max_W \sum_{\Phi,Q} P(W \mid \Phi, X)\, P(\Phi \mid Q)\, P(Q \mid X)
                            \approx \arg\max_W \sum_{\Phi,Q} P(W \mid \Phi)\, P(\Phi \mid Q)\, P(Q \mid X)

    Replacing P(Φ|X) we now have a model with our
     CRF in it
    What about P(W| Φ,X)?
        Conditional independence assumption gives P(W| Φ)


                                                             70
Word Recognition
               P(W \mid X) \approx \sum_{\Phi,Q} P(W \mid \Phi)\, P(\Phi \mid Q)\, P(Q \mid X)


   What about P(W|Φ)?
       Non-deterministic across sequences of words
           Φ = / ah f eh r /
           W = ? “a fair”? “affair”?
           The more words in the string, the more possible
            combinations can arise




                                                              71
Word Recognition
                        P( | W ) P(W )
            P(W |  ) 
                            P ( )



   Bayes Rule
       P(W) –language model
       P(Φ|W) – dictionary model
       P(Φ) – prior probability of phone sequences


                                                      72
Word Recognition
   What is P(Φ) ?
       Prior probability over possible phone sequences
           Essentially acts as a “phone fertility/penalty” term –
            lower probability sequences get a larger boost in weight
            than higher probability sequences
       Approximate this with a standard n-gram model
           Seed it with phone-level statistics drawn from the same
            corpus used for our language model




                                                                      73
Word Recognition

                               P( | W ) P(W )
arg max P(W | X )  arg max                    P( | Q) P(Q | X )
   W                  W ,  ,Q     P ( )
   Our final model incorporates all of these pieces together
   Benefit of this approach – reuse of standard models
       Each element can be built as a finite state machine (FSM)
       Evaluation can be performed via FSM composition and best path
        evaluation as for HMM-based systems (Mohri & Riley, 2002)




                                                                        74
Pilot Experiment: TIDIGITS

   First word recognition experiment – TIDIGITS
    recognition
       Both isolated and strings of spoken digits, ZERO
        (or OH) to NINE
       Male and female speakers
   Training set – 112 speakers total
       Random selection of 11 speakers held out as
        development set
       Remaining 101 speakers used for training as
        needed
                                                           75
  Pilot Experiment: TIDIGITS
                               P( | W ) P(W )
arg max P(W | X )  arg max                    P( | Q) P(Q | X )
   W                  W ,  ,Q     P ( )


     Important characteristics of the DIGITS problem:
         A given phone sequence maps to a single word sequence
         A uniform distribution over the words is assumed
     P(W|Φ) easy to implement directly as FSM



                                                                  76
Pilot Experiment: TIDIGITS

   Implementation
       Created a composed dictionary and language
        model FST
           No probabilistic weights applied to these FSTs –
            assumption of uniform probability of any digit sequence
       Modified CRF code to allow composition of above
        FST with phone lattice
           Results scored using standard HTK tools
           Compared to a baseline HMM system trained on the
            same features


                                                                      77
Pilot Experiment: TIDIGITS

   Labels
       Unlike TIMIT, TIDIGITS is only labeled at the word
        level
       Phone labels were generated by force aligning the
        word labels using an HMM-trained, MFCC based
        system
   Features
       TIMIT-trained MLPs applied to TIDIGITS to create
        features for CRF and HMM training


                                                         78
Pilot Experiment: Results
Model                                                        WER
HMM (triphone, 1 Gaussian, ~4500 parameters)                1.26%
HMM (triphone, 16 Gaussians, ~120,000 parameters)           0.57%
CRF (monophone, ~4200 parameters)                            1.11%
CRF (monophone, windowed, ~37000 parameters)                 0.57%
HMM (triphone, 16 Gaussians, MFCCs)                          0.25%


   Basic CRF performance falls in line with HMM performance for a single
    Gaussian model
   Adding more parameters to the CRF enables the CRF to perform as well as
    the HMM on the same features




                                                                              79
Larger Vocabulary
   Wall Street Journal 5K word vocabulary task
       Bigram language model
       MLPs trained on 75 speakers, 6488 utterances
           Cross-validated on 8 speakers, 650 utterances
       Development set of 10 speakers, 368 utterances
        for tuning purposes
   Results compared to HMM-Tandem baseline
    and HMM-MFCC baseline


                                                            80
Larger Vocabulary

   Phone penalty model P(Φ)
       Constructed using the transcripts and the lexicon
       Currently implemented as a phone pair (bigram)
        model
       More complex model might lead to better
        estimates
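A minimal sketch of estimating the bigram phone penalty model P(Φ) from phone-level transcripts, using add-one smoothing over the phone set. The function name and smoothing choice are assumptions for illustration; the actual system stores these probabilities as weights on a finite-state acceptor.

```python
from collections import Counter

def train_phone_bigram(transcripts):
    """Estimate a bigram phone model P(phi_t | phi_{t-1}) from phone
    transcripts, with add-one smoothing over the observed phone set."""
    phones = sorted({p for t in transcripts for p in t})
    bigrams, unigrams = Counter(), Counter()
    for t in transcripts:
        seq = ["<s>"] + list(t)  # sentence-start history symbol
        for prev, cur in zip(seq, seq[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
    vocab_size = len(phones)

    def prob(prev, cur):
        # Add-one smoothed conditional probability.
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)

    return phones, prob

# Toy transcripts; real training would use the WSJ lexicon alignments.
phones, prob = train_phone_bigram([["s", "p", "iy", "ch"], ["iy", "ch"]])
```

A trigram version (mentioned under Next Steps) would condition on two previous phones instead of one, at the cost of a larger model.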




                                                            81
Larger Vocabulary

   Direct finite-state composition not feasible for
    this task
       State space grows too large too quickly
   Instead, Viterbi decoding is performed using the
    weighted finite-state models as constraints
       Time-synchronous beam pruning used to keep
        time and space usage reasonable
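The decoding strategy above can be sketched as a time-synchronous Viterbi search with beam pruning. This toy version works over a dense score matrix rather than the actual weighted finite-state constraints; the function name and beam width are illustrative assumptions.

```python
import math

def beam_viterbi(log_obs, log_trans, beam=5.0):
    """Time-synchronous Viterbi with beam pruning.

    log_obs[t][s]   -- frame score for state s at time t
    log_trans[p][s] -- transition score from state p to state s
    Hypotheses scoring more than `beam` below the best active
    hypothesis at each frame are pruned, bounding time and space.
    """
    n_states = len(log_obs[0])
    active = {s: log_obs[0][s] for s in range(n_states)}
    backptr = []
    for t in range(1, len(log_obs)):
        new, ptrs = {}, {}
        for s in range(n_states):
            best_prev, best_score = None, -math.inf
            for p, score in active.items():
                cand = score + log_trans[p][s]
                if cand > best_score:
                    best_prev, best_score = p, cand
            if best_prev is not None:
                new[s] = best_score + log_obs[t][s]
                ptrs[s] = best_prev
        top = max(new.values())
        # Beam pruning: keep only hypotheses near the frame's best.
        active = {s: v for s, v in new.items() if v >= top - beam}
        backptr.append(ptrs)
    # Trace back the best surviving path.
    state = max(active, key=active.get)
    path = [state]
    for ptrs in reversed(backptr):
        path.insert(0, ptrs[path[0]])
    return path
```

In the full system the "states" are arcs in the composed finite-state models, so the state space is expanded lazily instead of enumerated up front.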




                                                     82
 Larger Vocabulary – Initial Results
Model                                           WER
HMM MFCC Baseline                               9.3%
HMM PLP Baseline                                9.7%
HMM Tandem MLP                                  9.1%
CRF (phone)                                     11.3%
CRF (phone windowed)                            11.7%
CRF (phone + phonological)                      10.9%
CRF (3state phone inputs)                       12.4%
CRF (3state phone + phono)                      11.7%
HMM PLP (monophone labels)                      17.5%

    Preliminary numbers reported on development set only

                                                            83
Next Steps

   Context
       Exploring ways to put more context into the CRF, either at the
        label level or at the feature level
   Feature selection
       Examine what features will help this model, especially features
        that may be useful for the CRF that are not useful for HMMs
   Phone penalty model
       Results reported with just a bigram phone model
       A more interesting model leads to more complexity but may lead
        to better results
       Currently examining trigram phone model to test the impact



                                                                          84
Discussion




             85
References
   J. Lafferty et al, “Conditional Random Fields:
    Probabilistic models for segmenting and labeling
    sequence data”, Proc. ICML, 2001
   A. Gunawardana et al, “Hidden Conditional Random
    Fields for phone classification”, Proc. Interspeech, 2005
   J. Morris and E. Fosler-Lussier. “Conditional Random
    Fields for Integrating Local Discriminative Classifiers”,
    IEEE Transactions on Audio, Speech and Language
    Processing, 2008
   M. Mohri et al, “Weighted finite-state transducers in
    speech recognition”, Computer Speech and Language,
    2002

                                                                86
Background
   Tandem HMM
       Generative probabilistic sequence model
       Uses outputs of a discriminative model (e.g. ANN
        MLPs) as input feature vectors for a standard
        HMM




                                                           87
Background
   Tandem HMM
       ANN MLP classifiers are trained on labeled
        speech data
           Classifiers can be phone classifiers, phonological
            feature classifiers
       Classifiers output posterior probabilities for each
        frame of data
           E.g. P(Q |X), where Q is the phone class label and X is
            the input speech feature vector




                                                                      88
Background
   Tandem HMM
       Posterior feature vectors are used by an HMM as
        inputs
       In practice, posteriors are not used directly
           Log posterior outputs or “linear” outputs are more
            frequently used
               “linear” here means outputs of the MLP with no application
                of a softmax function
           Since HMMs model phones as Gaussian mixtures, the
            goal is to make these outputs look more “Gaussian”
           Additionally, Principal Component Analysis (PCA) is
            applied to features to decorrelate features for diagonal
            covariance matrices
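The log-then-decorrelate pipeline above can be sketched for 2-dimensional features, where the PCA rotation has a closed form. The function name, probability floor, and 2-D restriction are assumptions for illustration; real Tandem systems apply PCA to the full MLP output dimension.

```python
import math

def log_and_decorrelate(posteriors, floor=1e-8):
    """Take log posteriors (floored to avoid log(0)), center them, then
    rotate onto the principal axes so the off-diagonal covariance
    vanishes -- matching the diagonal-covariance Gaussian assumption
    of the HMM. Works on 2-dimensional feature rows only."""
    logs = [[math.log(max(p, floor)) for p in row] for row in posteriors]
    n = len(logs)
    mean = [sum(r[j] for r in logs) / n for j in (0, 1)]
    X = [[r[0] - mean[0], r[1] - mean[1]] for r in logs]
    # 2x2 covariance entries.
    cxx = sum(x * x for x, _ in X) / n
    cyy = sum(y * y for _, y in X) / n
    cxy = sum(x * y for x, y in X) / n
    # Rotation angle that diagonalizes the covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    return [[c * x + s * y, -s * x + c * y] for x, y in X]
```

After this transform the features have zero off-diagonal covariance, which is what makes diagonal-covariance Gaussian mixtures a reasonable model for them.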
                                                                             89
Idea: Crandem
   Use a CRF model to create inputs to a
    Tandem-style HMM
       CRF labels provide a better per-frame accuracy
        than input MLPs
       We’ve shown CRFs to provide better phone
        recognition than a Tandem system with the same
        inputs
   This suggests that we may get some gain
    from using CRF features in an HMM



                                                         90
Idea: Crandem
   Problem: CRF output doesn’t match MLP
    output
       MLP output is a per-frame vector of posteriors
       CRF outputs a probability across the entire
        sequence
   Solution: Use Forward-Backward algorithm to
    generate a vector of posterior probabilities




                                                         91
Forward-Backward Algorithm
   Similar to HMM forward-backward algorithm
   Used during CRF training
   Forward pass collects feature functions for
    the timesteps prior to the current timestep
   Backward pass collects feature functions for
    the timesteps following the current timestep
   Information from both passes is combined
    to determine the probability of being
    in a given state at a particular timestep

                                                     92
Forward-Backward Algorithm
              P(y_i,t | X) = (α_i,t · β_i,t) / Z(X)
   This form allows us to use the CRF to
    compute a vector of local posteriors y at any
    timestep t.
   We use this to generate features for a
    Tandem-style system
       Take log features, decorrelate with PCA
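A toy sketch of computing the per-frame posteriors in the equation above via forward-backward. The potentials here are plain (unnormalized) probabilities for readability; in the actual CRF they are exponentiated feature-function sums, and the computation is done in log space.

```python
def local_posteriors(start, trans, obs):
    """Per-frame posteriors P(y_t = i | X) = alpha_{i,t} * beta_{i,t} / Z.

    start[i]    -- start potential for state i
    trans[i][j] -- transition potential from state i to state j
    obs[t][i]   -- per-frame potential for state i at time t
    """
    T, S = len(obs), len(start)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    for i in range(S):
        alpha[0][i] = start[i] * obs[0][i]
        beta[T - 1][i] = 1.0
    # Forward pass: accumulate potentials up to each timestep.
    for t in range(1, T):
        for j in range(S):
            alpha[t][j] = obs[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(S))
    # Backward pass: accumulate potentials after each timestep.
    for t in range(T - 2, -1, -1):
        for i in range(S):
            beta[t][i] = sum(
                trans[i][j] * obs[t + 1][j] * beta[t + 1][j] for j in range(S))
    Z = sum(alpha[T - 1])  # partition function
    return [[alpha[t][i] * beta[t][i] / Z for i in range(S)] for t in range(T)]
```

Each row of the result is the vector of local posteriors for one frame, which is then logged and PCA-decorrelated to produce Crandem features.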

                                                    93
Phone Recognition
   Pilot task – phone recognition on TIMIT
       61 feature MLPs trained on TIMIT, mapped down
        to 39 features for evaluation
       Crandem compared to Tandem and a standard
        PLP HMM baseline model
       As with previous CRF work, we use the outputs of
        an ANN MLP as inputs to our CRF
   Phone class attributes
       Detector outputs describe the phone label
        associated with a portion of the speech signal
           /t/, /d/, /aa/, etc.
                                                         94
  Results (Fosler-Lussier & Morris 08)
         Model                                      Phone
                                                    Accuracy

         PLP HMM reference                          68.1%
         Tandem                                     70.8%
         CRF                                        69.9%
         Crandem – log                              71.1%




* Significant (p<0.05) improvement at 0.6% difference between models
                                                                         95
Word Recognition
   Second task – Word recognition on WSJ0
       Dictionary for word recognition has 54 distinct
        phones instead of 48
           New CRFs and MLPs trained to provide input features
       MLPs and CRFs trained on WSJ0 corpus of read
        speech
           No phone level assignments, only word transcripts
           Initial alignments from HMM forced alignment of MFCC
            features
           Compare Crandem baseline to Tandem and original
            MFCC baselines


                                                                   96
  Initial Results
         Model                                      WER


         MFCC HMM reference                         9.12%
         Tandem MLP (39)                            8.95%
         Crandem (19) (1 epoch)                     8.85%
         Crandem (19) (10 epochs)                   9.57%
         Crandem (19) (20 epochs)                   9.98%




* Significant (p≤0.05) improvement at roughly 1% difference between models
                                                                             97
Word Recognition
   CRF performs about the same as the
    baseline systems
   But further training of the CRF tends to
    degrade the result of the Crandem system
       Why?
       First thought – maybe the phone recognition
        results are deteriorating (overtraining)




                                                      98
  Initial Results
         Model                                      Phone
                                                    Accuracy

         MFCC HMM reference                         70.09%
         Tandem MLP (39)                            75.58%
         Crandem (19) (1 epoch)                     72.77%
         Crandem (19) (10 epochs)                   72.81%
         Crandem (19) (20 epochs)                   72.93%




* Significant (p≤0.05) improvement at roughly 0.07% difference between models
                                                                                99
Word Recognition
   Further training of the CRF tends to degrade
    the result of the Crandem system
       Why?
       First thought – maybe the phone recognition
        results are deteriorating (overtraining)
           Not the case
       Next thought – examine the pattern of errors
        between iterations




                                                       100
Initial Results
    Model              Total       Insertions     Deletions         Subs.
                      Errors

    Crandem             542            57             144            341
    (1 epoch)
    Crandem             622            77             145            400
    (10 epochs)
    Shared              429            37            131*           261**
    Errors                                           (102)          (211)
    New                 193            40              35            118
    Errors
    (1->10)

* 29 deletions are substitutions in one model and deletions in the other
**50 of these subs are different words between the epoch 1 and epoch 10 models
                                                                            101
Word Recognition
   Training the CRF tends to degrade the result
    of the Crandem system
       Why?
       First thought – maybe the phone recognition
        results are deteriorating (overtraining)
           Not the case
       Next thought – examine the pattern of errors
        between iterations
           There doesn’t seem to be much of a pattern here, other
            than a jump in substitutions
           Word identity doesn’t give a clue – similar words wrong
            in both lists
                                                                  102
Word Recognition
   Further training of the CRF tends to degrade
    the result of the Crandem system
       Why?
       Current thought – perhaps the reduction in scores
        of the correct result is impacting the overall score
           This appears to be happening in at least some cases,
            though it is not sufficient to explain everything




                                                                   103
Word Recognition
MARCH vs. LARGE
Iteration 1
0   0    m     0.952271   l  0.00878177   en 0.00822043   em     0.00821897
0   1    m     0.978378   em   0.00631441  l 0.00500046   en    0.00180805
0   2    m     0.983655   em   0.00579973  l 0.00334182   hh    0.00128429
0   3    m     0.980379   em   0.00679143  l 0.00396782   w     0.00183199
0   4    m     0.935156   aa  0.0268882   em 0.00860147    l    0.00713632
0   5    m     0.710183   aa  0.224002    em 0.0111564    w      0.0104974 l 0.009005



Iteration 10
0   0    m     0.982478   em   0.00661739 en 0.00355534     n   0.00242626 l 0.001504
0   1    m     0.989681   em   0.00626308 l 0.00116445    en    0.0010961
0   2    m     0.991131   em   0.00610071 l 0.00111827    en    0.000643053
0   3    m     0.989432   em   0.00598472 l 0.00145113    aa    0.00127722
0   4    m     0.958312   aa   0.0292846  em 0.00523174   l     0.00233473
0   5    m     0.757673   aa   0.225989   em 0.0034254    l     0.00291158


                                                                                   104
Word Recognition
MARCH vs. LARGE - logspace
Iteration 1
0   0    m     -0.0489053   l     -4.73508    en -4.80113       em -4.80131
0   1    m     -0.0218596   em    -5.06492    l -5.29822        en -6.31551
0   2    m     -0.01648     em    -5.14994    l -5.70124        hh -6.65755
0   3    m     -0.0198163   em    -4.99209    l -5.52954        w -6.30235
0   4    m     -0.0670421    aa   -3.61607    em -4.75582       l -4.94256
0   5    m     -0.342232     aa   -1.4961     em -4.49574       w -4.55662  l -4.71001


Iteration 10
0   0    m     -0.017677    em     -5.01805    en   -5.6393     n     -6.02141    l   -6.49953
0   1    m     -0.0103729   em     -5.07308    l    -6.75551    en     -6.816
0   2    m     -0.0089087   em     -5.09935    l    -6.79597    en     -7.34928
0   3    m     -0.0106245   em     -5.11855    l    -6.53542    aa     -6.66307
0   4    m     -0.0425817   aa    -3.53069    em    -5.25301    l    -6.05986
0   5    m      -0.277504   aa    -1.48727    em     -5.67654   l    -5.83906



                                                                                                 105
Word Recognition
   Additional issues
       Crandem results sensitive to format of input data
           Posterior probability inputs to the CRF give very poor
            results on word recognition.
           I suspect this is related to the same issues described
            previously
       Crandem results also require a much smaller
        vector after PCA
           MLP uses 39 features – Crandem only does well once
            we reduce to 19 features
           However, phone recognition results improve if we use
            39 features in the Crandem system (72.77% -> 74.22%)

                                                                     106
