5. Automatic speech recognition (ASR)

By an automatic speech recognizer we mean a system that interprets human
speech in some way. Possible applications include speech-controlled devices,
dictation, automatic transcription of meetings, automated telephone services,
searching large audio/video archives, and speech translation.
The performance of current speech recognition systems can be anything
between perfect and poor depending on the following factors:
     • Vocabulary size (only a few words ... unlimited vocabulary)

     • Speech style (isolated words ... natural speech)

     • Language style (formal speech ... spontaneous conversation)

     • Speaker dependency (trained for one speaker ... speaker independent)

     • Recording conditions (quiet room ... noisy restaurant)
Speech recognizer

A typical speech recognizer consists of the following modules. Additionally,
a complete application may require methods for separating speech and non-
speech, dialogue control, speaker recognition, and search methods.
    Speech
      -> Preprocessing
      -> Feature vectors
      -> Phoneme models
      -> Phoneme probabilities
      -> Decoding  <- Probabilities for phoneme sequences <- Language model
      -> Recognition result


How does it work?

When building a recognizer, we train two models:
   • Acoustic phoneme model: Computes phoneme probabilities for a given
     speech signal.

   • Language model: Defines legal phoneme sequences and their
     probabilities.

In principle, after the models have been trained, recognizing a new speech
signal is simple: Find the phoneme sequence that is best according to the
phoneme model and language model.
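
The search described above is commonly written as a maximization over phoneme sequences. The following is a sketch in the usual notation; the symbols X (the sequence of feature vectors) and S (a candidate phoneme sequence) are introduced here and do not appear on the slides.

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid X)
        \;=\; \arg\max_{S} \; p(X \mid S)\, P(S)
```

Here p(X | S) is the score given by the acoustic phoneme model and P(S) the score given by the language model. The second equality follows from Bayes' rule, because p(X) does not depend on S.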
Demonstrations: small vocabulary, large vocabulary.




Lessons learnt from the demo:

    • Recognition with a small vocabulary is easy, especially if the words
      are acoustically different from each other.

    • Simple vocabulary-based recognition does not work well if the
      vocabulary is not restricted. Especially in Finnish, inflections and
      compound words worsen the problem.

Solutions to this problem are presented later during the course (lectures:
statistical language models, methods for speech recognition).


Preprocessing the speech signal

The preprocessing module extracts relevant features from the speech signal.
Typically, features are computed from the short-time Fourier spectrum.
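
One common way to turn the short-time Fourier spectrum into features is to compute mel-frequency cepstral coefficients (MFCC). The sketch below is a minimal illustration, not the course's actual front end. The frame length (400 samples), hop size (128 samples, which gives 125 feature vectors per second at 16000 samples/s), FFT size, 20 mel bands and 12 coefficients are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, used to decorrelate the log mel energies."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def mfcc(signal, sample_rate=16000, frame_len=400, hop=128,
         n_fft=512, n_filters=20, n_ceps=12):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # short-time power spectrum
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)                   # small constant avoids log(0)
    return log_mel @ dct_matrix(n_ceps, n_filters).T       # one row per frame

# Usage: one second of a 440 Hz tone as a stand-in for speech.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))
```

Each row of `features` is one feature vector. Real systems usually append energy and time-derivative terms, which is one way to reach higher-dimensional feature vectors.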



[Figure: one utterance shown in four panels: Waveform; Spectrum (0-8000 Hz);
Mel-spectrum; Mel-Frequency Cepstral Coefficients (MFCC).]

Acoustic phoneme models

Training the phoneme models using Hidden Markov Models (HMM) requires tens
or hundreds of hours of transcribed speech.
    • For each phoneme, fetch the feature vectors of all speech segments that
      contain the phoneme. The feature vectors form a cloud of points in
      the feature vector space, which typically has 26-39 dimensions.
    • Model the cloud with multi-dimensional Gaussian mixtures.
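
The two steps above can be sketched as follows. This is a simplified illustration: it fits a single diagonal-covariance Gaussian per phoneme (a one-component mixture) instead of a full Gaussian mixture, and it uses synthetic 26-dimensional data in place of real labelled feature vectors. All names are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(vectors):
    """Mean and per-dimension variance of one phoneme's feature vectors."""
    return vectors.mean(axis=0), vectors.var(axis=0) + 1e-6

def log_likelihood(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(0)
# Synthetic stand-in for training data: one cloud of points per phoneme.
train = {
    "a": rng.normal(loc=0.0, scale=1.0, size=(500, 26)),
    "s": rng.normal(loc=3.0, scale=1.0, size=(500, 26)),
}
models = {ph: fit_diag_gaussian(v) for ph, v in train.items()}

def classify(x):
    """Return the phoneme whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda ph: log_likelihood(x, *models[ph]))
```

A mixture with several components per phoneme can follow the shape of a cloud more closely than a single Gaussian, at the cost of more parameters to train.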

The simple model above can be improved:

    • Divide each phoneme into three states. The idea is that the first state
      models the transition from the previous phoneme, and the third state
      the transition to the next phoneme.
    • Create context-dependent models for each phoneme: for example, /a/
      between /p/ and /u/, /a/ between /k/ and /t/, and so on.
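
As an illustration of both improvements, the sketch below expands a phoneme sequence into context-dependent, three-state model labels. The `left-phoneme+right.state` label format and the `sil` boundary symbol are conventions assumed for this example, not something the slides specify.

```python
def to_triphone_states(phonemes, n_states=3):
    """Expand a phoneme sequence into context-dependent state labels."""
    labels = []
    for i, ph in enumerate(phonemes):
        # Use a silence symbol as context at the utterance boundaries.
        left = phonemes[i - 1] if i > 0 else "sil"
        right = phonemes[i + 1] if i + 1 < len(phonemes) else "sil"
        for state in range(1, n_states + 1):
            labels.append(f"{left}-{ph}+{right}.{state}")
    return labels

# /a/ between /p/ and /u/ gets its own three states.
labels = to_triphone_states(["p", "a", "u"])
```

Each distinct label would get its own model, so the number of models grows quickly with the size of the phoneme set.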


The feature vector space illustrated

[Figure: a speech signal at 16000 samples / s is converted into 125 feature
vectors / s. The feature vectors of the phonemes a, e, ä, h and s are plotted
as partly overlapping clusters in a 26-dimensional feature space.]




Perfect features would separate different phonemes in the feature space. In
practice, the phoneme clusters overlap, unfortunately.


Phoneme models in the feature space

[Figure: scatter plot of feature vectors labelled a, e, ä, h and s in the
feature space.]




With Hidden Markov Models one can efficiently compute the state sequence
that best matches the given speech signal. More about Hidden Markov
Models later during the course.
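
The standard method for finding the best state sequence in an HMM is the Viterbi algorithm. The sketch below is a minimal version working in log probabilities; the two-state model and its numbers are invented for illustration only.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state sequence.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities, indexed [from, to]
    log_emit:  (T, S) log probability of each frame under each state
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans            # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)            # best predecessor of each state
        score = cand[back[t], np.arange(n_states)] + log_emit[t]
    path = [int(np.argmax(score))]
    for t in range(n_frames - 1, 0, -1):             # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: two states that tend to persist, four frames.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1],
                    [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1],
                   [0.8, 0.2],
                   [0.2, 0.8],
                   [0.1, 0.9]])
best_states = viterbi(log_init, log_trans, log_emit)
```

The cost grows linearly with the number of frames, which is what makes the search efficient compared with enumerating every possible state sequence.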


Language model

Basically, the language model defines what kind of sentences or commands
the recognizer expects from the user.
   • Simple command interface: a list of commands, all equally probable.

   • Telephone service: a state machine describing the phases in a dialogue,
     and the expected responses in each state.

   • Dictation: a statistical n-gram model trained from a large collection of
     text documents.

The language model limits the search space (possible hypotheses) and helps
to rank acoustically similar hypotheses: “Give those papers to me, please.”
versus “Give toes pay purse two me, police.”
Demonstration: Speech recognition using a good n-gram language model.
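
To illustrate how an n-gram model ranks acoustically similar hypotheses, here is a minimal bigram model with add-one smoothing. The three-sentence training corpus is invented for the example and is far too small for real use; punctuation is dropped for simplicity.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence start and end markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    vocab = {w for s in sentences for w in s.lower().split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(words[:-1], words[1:])
    )

corpus = [
    "give those papers to me please",
    "give the papers to me",
    "please give those to me",
]
model = train_bigram(corpus)
good = log_prob("give those papers to me please", model)
bad = log_prob("give toes pay purse two me police", model)
```

The word sequence made of familiar word pairs receives the higher score (`good > bad`), so the decoder would prefer it even when the two hypotheses sound alike.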



Some recent recognition results

Finnish (TKK, 2005):
     Task         Speaker dependent   WER
     Audio book          yes           7
     Radio news           no           22
     TV news              no           35
     TV debate            no           70

English (HTK-system, University of Cambridge)
     Task              Speed   Year   WER
     News (NIST)        10     2004    11
     News (NIST)         1     2004    15
     Telephone (SWB)   > 100   2005    24
     Telephone (SWB)   < 10    2005    27

WER = word error rate (%)
Speed = recognition speed (x Real-time)
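
Word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recognition result into the reference transcript, divided by the number of words in the reference. A minimal sketch using word-level edit distance follows; the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent, based on word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)

# One substitution ("those" -> "toes") and one deletion ("please"): 2 errors in 6 words.
wer = word_error_rate("give those papers to me please",
                      "give toes papers to me")
```

Because insertions count as errors, WER computed this way can exceed 100 %.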


Application: Spoken document retrieval

The user makes queries to a large speech database, and the system tries to
find the most relevant clips. For example, broadcast companies have massive
audio and video archives that are often untranscribed.
A speech recognition system can be used for transcribing the archive
automatically. Relevant clips can be found even if the word error rate in
recognition is more than 20 %.
Demonstration: http://speechfind.utdallas.edu/
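
A rough sketch of why retrieval tolerates recognition errors: a query usually contains several words, and a clip can be found as long as some of them were recognized correctly. The example below builds a simple inverted index over automatic transcripts. The clip names and transcripts are invented, and real systems use more refined ranking than counting matching words.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the set of clips whose transcript contains it."""
    index = defaultdict(set)
    for clip_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(clip_id)
    return index

def search(index, query):
    """Rank clips by the number of query words found in their transcript."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for clip_id in index.get(word, ()):
            scores[clip_id] += 1
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical automatic transcripts containing recognition errors.
transcripts = {
    "clip1": "the prime minister visited helsinki to day",
    "clip2": "whether forecast rain in helsinki",
    "clip3": "football results from the weekend",
}
index = build_index(transcripts)
hits = search(index, "prime minister helsinki")
```

Here `clip1` is ranked first even though part of its transcript is wrong, because the query words themselves were recognized correctly.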




Word graphs or lattices

In some applications, it is useful to get multiple recognition hypotheses or a
so-called word graph or lattice.
[Figure: word graph for a Finnish utterance. Between the !NULL start and end
nodes, parallel arcs carry alternative hypotheses, for example saa / saada /
sala / sama / sauna, kuva / kova / kuka / kuha, ssa / sta, and näkyvä.]

Word graphs can be used for estimating how confident the recognizer is
about the recognition output.
Also, in spoken document retrieval, word graphs are useful: a relevant query
word may be found in the word graph even if it is incorrectly recognized
in the best hypothesis.
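
The retrieval use of word graphs can be sketched with a toy lattice. The edges, words and probabilities below are invented, and the code assumes that nodes are numbered so that every edge goes from a lower to a higher node number.

```python
# Hypothetical toy lattice: (start_node, end_node, word, probability).
EDGES = [
    (0, 1, "give", 1.0),
    (1, 2, "those", 0.4),
    (1, 2, "toes", 0.6),
    (2, 3, "papers", 0.7),
    (2, 3, "pay", 0.3),
]

def best_path(edges, start, end):
    """Highest-probability word sequence from start to end."""
    best = {start: (1.0, [])}
    # Sorting by start node visits edges in topological order
    # under the node-numbering assumption stated above.
    for a, b, word, p in sorted(edges):
        if a in best:
            score = best[a][0] * p
            if b not in best or score > best[b][0]:
                best[b] = (score, best[a][1] + [word])
    return best[end][1]

def lattice_words(edges):
    """All words that appear on any edge of the lattice."""
    return {word for _, _, word, _ in edges}

top = best_path(EDGES, 0, 3)
found = "those" in lattice_words(EDGES)
```

In this toy case the best hypothesis contains "toes" instead of "those", yet a query for "those" still matches because the word is present elsewhere in the lattice.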

