Speech Recognition (PowerPoint)

					Introduction to Automatic
   Speech Recognition
Define the problem
What is speech?

Feature Selection


       Early methods
       Modern statistical models
Current State of ASR
Future Work
              The ASR Problem
There is no single ASR problem
The problem depends on many factors

       Microphone: Close-mic, throat-mic, microphone
        array, audio-visual
       Sources: band-limited, background noise,
       Speaker: speaker dependent, speaker independent
       Language: open/closed vocabulary, vocabulary
        size, read/spontaneous speech
       Output: Transcription, speaker id, keywords
           Performance Evaluation
   Accuracy
         Percentage of tokens correctly recognized
   Error Rate
         Complement of accuracy (1 - accuracy)
   Token Type
         Phones
         Words*
         Sentences
         Semantics?
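For word tokens, the error rate is usually computed from an edit-distance alignment between the reference transcript and the recognizer output. The following is a minimal sketch under that assumption; the function name `word_error_rate` is illustrative and does not come from the slides, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete every remaining reference word
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)
```

Because insertions count as errors, a rate computed this way can exceed 1 when the hypothesis is much longer than the reference.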
           What is Speech?
Analog signal produced by humans
The speech signal can be thought of as being decomposed into a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Speech Production
Speech Visualization
          Feature Selection
As in any data-driven task, the data must be
represented in some format
Cepstral features have been found to perform

They represent the frequency of the

Mel-frequency cepstral coefficients (MFCCs) are the most common variety
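To make the MFCC idea concrete, here is a rough sketch of one common recipe: frame the signal, take a power spectrum, apply a mel filterbank, take logs, and apply a DCT. The frame sizes, filter count, mel formula constants, and all function names are assumptions chosen for illustration, not values from the slides, and toolkits differ in the details (pre-emphasis, liftering, energy terms).

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an array of shape (n_frames, n_ceps); signal is a 1-D array."""
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each zero-padded frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies, then log compression.
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)
    # Unnormalized DCT-II of the log energies; keep the first n_ceps terms.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T
```

With the defaults above, one second of 16 kHz audio yields one 13-dimensional feature vector per 10 ms hop.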
        Where do we stand?
Defined the multiple problems associated with
Described how speech is produced

Illustrated how speech can be represented in an ASR system
Now that we have the data, how do we recognize the speech?
                Radio Rex
First known attempt at speech recognition
A toy from 1922

Worked by analyzing the signal strength at

      Actual speech recognition
   Originally thought to be a relatively simple
    task requiring a few years of concerted effort
   1969, “Wither speech recognition” is
   A DARPA project ran from 1971-1976 in
    response to the statements in the Pierce
   We can examine a few general systems
         Template-Based ASR
   Originally only worked for isolated words
   Performs best when training and testing
    conditions are similar
   For each word we want to recognize, we
    store a template or example based on actual
   Each test utterance is checked against the
    templates to find the best match
   Uses the Dynamic Time Warping (DTW)
        Dynamic Time Warping
   Create a similarity matrix for the two
   Use dynamic programming to find the lowest
    cost path
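The two steps above can be sketched in a few lines. This is a minimal illustration, assuming one-dimensional feature sequences and an absolute-difference local cost (so the matrix holds distances rather than similarities); the names `dtw_distance` and `recognize` are hypothetical. A real system would compare feature vectors and would typically constrain the warping path.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Lowest cumulative alignment cost between sequences a and b."""
    inf = float("inf")
    # acc[i][j] = lowest cost of aligning a[:i] with b[:j]
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[len(a)][len(b)]

def recognize(utterance, templates):
    """Return the word whose stored template has the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated 2 is absorbed by the warping, which is what lets a template match the same word spoken at a different speed.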
                Hearsay-II
   One of the systems developed during the
    DARPA program
   A blackboard-based system utilizing symbolic
    problem solvers
   Each problem solver was called a knowledge
    source (KS)
   A complex scheduler was used to decide
    when each KS should be called
               DARPA Results
   The Hearsay-II system performed much
    better than the two other similar competing
   However, only one system met the
    performance goals of the project
       The Harpy system was also a CMU-built system
       In many ways it was a predecessor to the
        modern statistical systems
Modern Statistical ASR
                Acoustic Model
   For each frame of data, we need some way
    of describing the likelihood of it belonging to
    any of our classes
   Two methods are commonly used
       Multilayer perceptron (MLP) gives the posterior
        probability of a class given the data
       Gaussian Mixture Model (GMM) gives the
        likelihood of the data given a class
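As a small illustration of the GMM case, the likelihood of a data point given a class is a weighted sum of Gaussian densities. The sketch below assumes one-dimensional data for readability; real acoustic models work on multi-dimensional feature vectors, and the function names are illustrative only.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmm_likelihood(x, weights, means, variances):
    """Likelihood of x under a mixture; weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

One such mixture per class (for example, per phone state) lets each frame be scored against every class.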
Gaussian Distribution
            Pronunciation Model
   While the pronunciation model can be very
    complex, it is typically just a dictionary
   The dictionary contains the valid
    pronunciations for each word
   Examples:
       Cat: k ae t
       Dog: d ao g
       Fox: f aa k s
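A dictionary of this kind maps directly onto a lookup table. The sketch below is only an illustration: the `LEXICON` structure and the `words_to_phones` helper are hypothetical, each word maps to a list of pronunciations, and the first listed pronunciation is used.

```python
# Each word maps to a list of valid pronunciations (lists of phones).
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
}

def words_to_phones(words, lexicon=LEXICON):
    """Expand a word sequence into a phone sequence using the first pronunciation."""
    phones = []
    for word in words:
        phones.extend(lexicon[word.lower()][0])
    return phones
```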
            Language Model
   Now we need some way of representing the
    likelihood of any given word sequence
   Many methods exist, but ngrams are the
    most common
   Ngram models are trained by simply
    counting the occurrences of words in a
    training set
   A unigram is the probability of any word in
   A bigram is the probability of a given word
    given the previous word
   Higher order ngrams continue in a similar
   A backoff probability is used for any unseen
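The counting procedure can be sketched for the bigram case as follows. This is a simplified illustration: the sentence boundary markers `<s>` and `</s>`, the function names, and the fixed backoff weight of 0.4 are assumptions, and the backoff here is a crude scaled unigram estimate that is not renormalized the way a proper backoff scheme would be.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, backoff=0.4):
    """Estimate P(word | prev), backing off to the unigram for unseen pairs."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # Unseen bigram: fall back to a scaled unigram estimate.
    return backoff * unigrams[word] / sum(unigrams.values())
```

Trained on "the cat sat" and "the dog sat", this gives P(cat | the) = 0.5, while the unseen pair (cat, dog) falls back to the scaled unigram probability of "dog".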
     How do we put it together?
   We now have models to represent the three
    parts of our equation
   We need a framework to join these models
   The standard framework used is the Hidden
    Markov Model (HMM)
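The three models correspond to the three factors in the usual Bayes decomposition of the recognition problem. As a sketch, assuming the standard formulation (the symbols X for the acoustic feature sequence, Q for a phone sequence, and W for a word sequence are notation chosen here):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} P(X \mid W)\, P(W)
        \approx \arg\max_{W} \sum_{Q} P(X \mid Q)\, P(Q \mid W)\, P(W)
```

Here P(X | Q) is the acoustic model, P(Q | W) is the pronunciation model, and P(W) is the language model. The approximation assumes the acoustics depend on the words only through the phone sequence.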
                  Markov Model
   A state model using the Markov property
       The Markov property states that the future
        depends only on the present state
   Models the likelihood of transitions between
    states in a model
   Given the model, we can determine the
    likelihood of any sequence of states
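For a plain Markov model, the likelihood of a state sequence is the initial-state probability multiplied by each transition probability along the path. The sketch below assumes the model is stored as nested dictionaries; that layout and the function name are illustrative choices.

```python
def sequence_likelihood(states, initial, transitions):
    """Probability of a state sequence under a first-order Markov model."""
    prob = initial[states[0]]
    # Each step depends only on the immediately preceding state.
    for prev, nxt in zip(states, states[1:]):
        prob *= transitions[prev][nxt]
    return prob
```

For instance, with P(sunny) = 0.6 initially, P(sunny to sunny) = 0.8, and P(sunny to rainy) = 0.2, the sequence sunny, sunny, rainy has likelihood 0.6 × 0.8 × 0.2 = 0.096.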
         Hidden Markov Model
   Similar to a Markov model except the states
    are hidden
   We now have observations tied to the
    individual states
   We no longer know the exact state sequence
    given the data
   Allows for the modeling of an underlying
    unobservable process
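Because the state sequence is hidden, the likelihood of the observations has to sum over every possible state path. A minimal sketch using the forward recursion is shown below, assuming discrete observations and dictionary-based model parameters; the names are illustrative.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """Likelihood of the observations, summed over all hidden state paths."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())
```

In an ASR setting the discrete emission table would be replaced by the acoustic model's score for each frame.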
              HMMs for ASR
   First we build an HMM for each phone
   Next we combine the phone models based
    on the pronunciation model to create word
    level models
   Finally, the word level models are combined
    based on the language model
   We now have a giant network with potentially
    thousands or even millions of states
   Decoding happens in the same way as the
    previous example
   For each time frame we need to maintain two
    pieces of information
       The likelihood of being at any state
       The previous state for every state
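The two pieces of information kept per frame correspond to the score table and the backpointers of Viterbi decoding. The sketch below is a minimal version under simplifying assumptions: discrete observations, dictionary-based parameters, and plain probabilities (practical decoders work with log-probabilities and prune unlikely states). The function and variable names are illustrative.

```python
def viterbi(observations, states, initial, transitions, emissions):
    """Return the most likely hidden state path and its likelihood."""
    # score[s]: likelihood of the best path ending in state s at the current frame
    score = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per frame: best previous state for every state
    for obs in observations[1:]:
        prev_score, score, back = score, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev_score[p] * transitions[p][s])
            score[s] = prev_score[best_prev] * transitions[best_prev][s] * emissions[s][obs]
            back[s] = best_prev
        backpointers.append(back)
    # Trace back from the most likely final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, score[last]
```

Only the scores from the previous frame are needed to compute the current frame, so memory for the scores stays constant while the backpointers grow with the number of frames.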
                State of the Art
   What works well
       Constrained vocabulary systems
       Systems adapted to a given speaker
       Systems in anechoic environments without
        background noise
       Systems expecting read speech
   What doesn't work
       Large unconstrained vocabulary
       Noisy environments
       Conversational speech
                Future Work
   Better representations of audio based on
   Better representation of acoustic elements
    based on articulatory phonology
   Segmental models that do not rely on the
    simple frame-based approach
   Hidden Markov Model Toolkit (HTK)
   CHIME (a freely available dataset)
   Machine Learning Lectures
