CMU-ECE-1993-016

                    CARNEGIE MELLON




    Speech Recognition
     in the Automobile




     Nobutoshi Hanai

            1993
Speech Recognition in the Automobile



                       Nobutoshi Hanai




     Submitted to the Department of Electrical and Computer Engineering
         in Partial Fulfillment of the Requirements for the Degree of
                          Master of Science at

                  Carnegie Mellon University
                Pittsburgh, Pennsylvania 15213



                           May 1993
Abstract                                                               1
Acknowledgments                                                        2
Chapter 1: Introduction                                                3
Chapter 2: The SPHINX Speech Recognition System                        5
   2.1 Signal Processing                                               5
   2.2 Clustering and Vector Quantization                              6
   2.3 Hidden Markov Models                                            7
   2.4 Speech Unit                                                     7
Chapter 3: The Motorola Car Database and AN4 Database                  8
   3.1 The Motorola Car Database                                       8
   3.2 The AN4 Database                                                9
   3.3 Summary                                                         9
Chapter 4: Noise Characteristics in the Automobile                    11
   4.1 Noise Sources                                                  11
       4.1.1 Running Noise                                            11
       4.1.2 Functional Noise                                         15
       4.1.3 Outer Noise                                              17
   4.2 Summary                                                        17
Chapter 5: Speech Recognition in Adverse Environments: Previous Work  18
   5.1 Auditory-Based Front Ends                                      18
   5.2 Noise and Noise-Word Models                                    18
   5.3 Cepstral Mean Normalization and the RASTA Method               19
   5.4 The CDCN Algorithm                                             19
   5.5 Speech Recognition in the Car Environment                      22
   5.6 Summary                                                        23
Chapter 6: Recognition in the Motorola Car Database Task              24
   6.1 Baseline System                                                24
   6.2 Mel-Frequency Cepstral Coefficients                            25
   6.3 Environmental Compensation Algorithms                          26
       6.3.1 Cepstral Mean Normalization                              26
       6.3.2 CDCN                                                     27
       6.3.3 Combination of Cepstral Mean Normalization and CDCN      28
   6.4 Histogram-based CDCN                                           29
   6.5 Summary                                                        31
Chapter 7: Noise Cancellation for Car Radio                           32
   7.1 Collection of Stereo Data                                      32
   7.2 Adaptive Noise Cancellation                                    32
   7.3 Recognition Results                                            34
   7.4 Summary                                                        35
Chapter 8: Conclusions and Suggestions for Future Work                36
   8.1 Conclusions                                                    36
   8.2 Suggestions for Future Work                                    37
References                                                            39
                                     List of Figures

Figure 2-1: Block Diagram of the SPHINX System                         5
Figure 4-1: Spectrum of Running Noise                                 12
Figure 4-2: Histogram of the Power (C[0] Cepstral Component)          14
Figure 4-3: Spectrum of the Car Radio                                 15
Figure 4-4: Spectrum of the Fan                                       16
Figure 4-5: Spectrum of the Wipers                                    16
Figure 5-1: CDCN estimates a noise vector n and a cepstral
            equalization vector q that best transform the universal
            codebook into the set of input frames of the observed
            speech.                                                   20
Figure 7-1: Block Diagram of the Noise Cancellation Module            33


                                     List of Tables

Table 1: Baseline Recognition Accuracy                                25
Table 2: Recognition Accuracy using MFCC                              26
Table 3: Recognition Accuracy using MFCC + Cepstral Mean
         Normalization                                                27
Table 4: Recognition Accuracy using MFCC + CDCN                       28
Table 5: Recognition Accuracy using MFCC + Cepstral Mean
         Normalization + CDCN                                         29
Table 6: Recognition Accuracy using Histogram-based CDCN              31
Table 7: Recognition Accuracy with and without Noise Cancellation     34
Abstract                                                                                    Page 1


Abstract
   This work investigates automatic speech recognition in the automobile.

   The report contains a description of the SPHINX speech recognition system, a description of
the speech databases used, an overview of the field of robust speech recognition, problems for
speech recognition in the automobile, and the speech recognition experiments that have been done.

    Speech recognition systems work relatively well with high-quality speech. Nevertheless, when
the speech signal has been corrupted by noise sources in the automobile, recognition accuracies
decrease dramatically. In the car databases we use in this research, the recognition accuracy varies
from 78% in the best condition to 17% in the most adverse condition.

   We consider various noise sources in the automobile. Tire noise is generated between the tires
and the road, and it gets louder as the car runs faster. Engine noise has relatively large power in the
low-frequency region. Wind noise gets louder when windows are open, as the car moves faster. The
car radio is also one of the most severe noise sources, especially when radio talk shows are on the
air. Noise from windshield wipers is also recognized as particular words.

    We study a new parameter representation and various compensation techniques that could
overcome these adversities. Mel-frequency cepstral coefficients, whose model is based on human
auditory perception, are shown to provide better parameters for very noisy speech than the cep-
stral coefficients derived from linear prediction that are the original parameters for the SPHINX
system. Environmental normalization algorithms such as CDCN (Codeword-Dependent Cepstral
Normalization) and cepstral mean normalization provide some improvements in almost every con-
dition. An extended version of CDCN called histogram-based CDCN is developed in this research.
This technique attempts to provide more accurate statistical parameters of noise for every incoming
utterance compared to the original CDCN, but it does not work very well on speech with a low sig-
nal-to-noise ratio.

    The adaptive noise cancellation technique is applied to overcome car radio noise. We use the
simple LMS (Least-Mean-Square) algorithm with an added gain-reduction feature to avoid divergence
in enhanced speech. This technique reduces the car radio noise with very little distortion of the speech,
and improves recognition accuracy considerably.

    The techniques we examined provide substantial improvements to speech recognition in the
automobile, but they do not reach the same level of performance as that of clean speech.


Acknowledgments
    First and foremost, I would like to thank Prof. Richard Stern, my advisor, for his invaluable
support, encouragement, and guidance. I also would like to thank Prof. Virginia Stonick for her ad-
vice and comments on this report.

    I am also grateful to the members of the speech group. Fu-Hua Liu, Pedro Moreno, Yoshiaki
Ohshima, and Tom Sullivan informed me about various tricks both in the SPHINX system and in
the Unix environment. Dr. Tomio Takara, who was a visiting scientist from the University of the
Ryukyus, helped me understand speech recognition in general.

    I would like to thank Motorola, Inc. for providing the car database. I could not have carried out
this research without the database.

   I greatly appreciate my colleagues at Mitsubishi Heavy Industries, Ltd. for giving me this
opportunity at CMU, for which I am indebted.

    Finally, I wish to thank my family and friends for their support. My wife, Yukiko, was always
there to share all difficulties and helped me to concentrate on studying. My newborn son, Tomoaki,
always cheered me up during the last moments of completing this report. I will always be grateful
to them.


                                          Chapter 1
                                         Introduction
   Speech recognition has a long history of being one of the most difficult interdisciplinary prob-
lems. It will still take many more years to develop an unconstrained system that is able to recognize
speech under any circumstances. Early speech recognition systems achieved reasonable perfor-
mance only under one or more of the following constraints: (1) speaker dependence, (2) isolated
words, (3) limited vocabulary, (4) constrained grammar, (5) use of a close-talking microphone, and
(6) a quiet recording environment.

    In the last 5 years, remarkable progress has been made in addressing all of these constraints as
computational power has been increasing. Nowadays speaker-independent continuous speech sys-
tems with vocabulary sizes of 5,000 to 20,000 words, using sophisticated language models, have
been demonstrated by research laboratories. However, all of these systems exhibit dramatic degra-
dations in performance when they are operated in environmental conditions different from the ones
with which they were trained.

    Though the field of speech enhancement and robust speech recognition in adverse environ-
ments is relatively young, a considerable amount of work has been carried out. For example, early
efforts in this field were summarized by Lim [1983], and later results were reviewed by Juang [1991].
Previous research in robust speech recognition in adverse environments has focused on several
environments including telephone lines, work offices, aircraft cockpits, and automobiles. This project
deals with the problem of speech recognition in the automobile.

    One of the most promising applications of this problem is that of the hands-free cellular phone.
As cellular phones have become popular, speech recognition in the automobile has become very
attractive over the past 4 years. To reduce the danger of dialing while driving, vocal dialing is desir-
able. Also, when the driver is speaking while driving, speech may not be intelligible and should be
enhanced for hands-free radio communications.

   The main problems of speech recognition in the automobile are noise sources such as:

   •   tire and wind noise while running
   •   engine noise
   •   car radio, fan, wipers, and turn signals
   •   noise coming from outside the car


    The purpose of this work is to understand how badly the noises affect the performance of
speech recognition in the automobile, and to understand how these noises can be defeated by var-
ious techniques, including a newly developed technique. Though these studies have attempted to
improve speech recognition accuracy using specific speech databases, the major goal is to obtain a
better fundamental understanding of the car environment, and to establish an appropriate method-
ology in this field. This work has been focused on the "front end" of the speech recognition system,
and we have made no modification to the recognition engine to achieve environmental robustness.

   The outline of this report is as follows:

   • Chapter 2 provides a brief description of the SPHINX recognition system
   • Chapter 3 describes the speech databases used in this research
   • Chapter 4 describes noise characteristics in the automobile
   • Chapter 5 provides a review of previous work in the area of speech recognition in adverse
     environments
   • Chapter 6 describes a set of experiments and their results, including a new environmental
     compensation algorithm, called histogram-based CDCN
   • Chapter 7 describes the experiment on adaptive noise cancellation for car radio
   • Chapter 8 summarizes the results of this research


                            Chapter 2
                   The SPHINX Speech Recognition System
    The SPHINX system was developed at CMU by Lee et al. [1989]. It is a pioneer in speaker-
independent large-vocabulary continuous speech recognition. We briefly describe the blocks that
comprise the SPHINX system, shown in Figure 2-1.


   [Block diagram: training and testing speech (time waveforms) each pass
   through signal processing to produce LPC cepstra; vector quantization
   yields four codes per frame (cepstrum, differenced cepstrum, second-
   differenced cepstrum, and a combined power / differenced power /
   second-differenced power code); these feed the training and recognition
   stages, the latter producing a sentence interpretation.]

                  Figure 2-1: Block Diagram of the SPHINX System


2.1 Signal Processing

    All speech recognition systems use a parametric representation rather than the waveform itself
as the basis for pattern classification. These parameters carry information about the spectrum.


SPHINX uses frequency-warped LPC (Linear Predictive Coding) cepstral coefficient (LPCC)
features for speech recognition. They are computed as follows (Shikano [1986]):

   •   Speech is digitized at a sampling rate of 16 kHz
   •   A Hamming window of 20 ms (320 samples) is applied every 10 ms (160 samples)
   •   A preemphasis filter H(z) = 1 - 0.97z^-1 is applied
   •   14 autocorrelation coefficients are computed
   •   A Pascal window is applied to the autocorrelation sequence
   •   14 LPC coefficients are derived from the Levinson-Durbin recursion
   •   32 LPC cepstral coefficients are computed using a standard recursion
   •   The cepstral coefficients are frequency-warped using a bilinear transform, producing 12
       warped LPC cepstral coefficients


                                                                                     system
   Althoughadjacent frames of speech are indeed correlated with each other, the SPHINX
assumesthat every frame is statistically independent.In addition to the smile informationprovided
                           also
by the cepstrum, SPHINX uses dynamicinformation represented by first-order and second-
order differences of the cepstral vector.
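A simple way to append such difference features is sketched below, assuming a two-frame spacing for the differences (the exact spacing SPHINX uses is not specified here) and edge padding at the utterance boundaries.

```python
import numpy as np

def add_deltas(cep, k=2):
    """Append first- and second-order differences c[t+k] - c[t-k]
    to a cepstral sequence of shape (T, d), giving (T, 3*d)."""
    pad = np.pad(cep, ((k, k), (0, 0)), mode='edge')
    d1 = pad[2 * k:] - pad[:-2 * k]            # first-order difference
    pad1 = np.pad(d1, ((k, k), (0, 0)), mode='edge')
    d2 = pad1[2 * k:] - pad1[:-2 * k]          # second-order difference
    return np.hstack([cep, d1, d2])
```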

2.2 Clustering and Vector Quantization

    Once the incoming speech has been converted to an n-dimensional vector, a data reduction
technique known as vector quantization (VQ) is used to map a vector into a discrete symbol. A vec-
tor quantizer is defined by a codebook and a distortion measure. The codebook is generated by a
hierarchical clustering algorithm and contains L vectors. It is a quantized representation of the vec-
tor space. The distortion measure estimates the degree of proximity between two vectors. An input
vector is mapped into a symbol in the alphabet by choosing the closest codebook vector according
to the distance metric. In the SPHINX system L is fixed at 256, so a codeword can be represented
with just one byte. The distortion measure is a Euclidean distance.

   VQ greatly reduces the amount of data while maintaining most of the information of speech
events; i.e., a 12-dimensional cepstral vector and power component are compressed to 4 code
words. It also reduces the computational load of the recognition engine. For these reasons
SPHINX uses the VQ technique.
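The quantization step itself might look like the following sketch: each feature vector is mapped to the index of its nearest codebook entry under Euclidean distance. (The codebook training by hierarchical clustering is not shown, and the function name is illustrative.)

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook
    entry under Euclidean distance; with L = 256 entries, as in
    SPHINX, each index fits in one byte."""
    # frames: (T, d); codebook: (L, d)
    # Squared distances from every frame to every codeword: (T, L)
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1).astype(np.uint8)
```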
      2:
Chapter TheSPHINX
                Speech         System
                     Recognition                                                            Page7


2.3 Hidden Markov Models

    The Hidden Markov Model (HMM) is currently the dominant technology in continuous speech
recognition, and it is the technique used for speech recognition in the SPHINX system. A good in-
troduction to this technique is found in Rabiner and Juang [1986].

    Briefly, the HMM is a collection of states connected by transitions. Each transition is described
by two sets of probabilities:

      • A transition probability, which describes the probability of a transition from one state to the
        next.
      • An output probability density function, which defines the conditional probability of emitting
        each output symbol from a finite alphabet when a particular transition takes place.



    Given HMMs and a set of training features, we first estimate the probabilities of each transition
so that each HMM has maximum probability of generating the training features. Using these prob-
abilities for each HMM and a set of testing features, we then find the most likely HMM state se-
quence by looking at the overall probability that the model produces the testing features. HMMs
have become a widely-used approach for speech recognition due to the existence of maximum like-
lihood techniques to estimate the parameters of the models in the training stage and efficient algo-
rithms to find the most likely state sequence in the recognition stage.
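The efficient search for the most likely state sequence is the Viterbi algorithm. A minimal sketch for a discrete-symbol HMM follows; the array names are illustrative, and log-probabilities are used to avoid underflow.

```python
import numpy as np

def viterbi(obs, logA, logB, logpi):
    """Most likely state sequence of a discrete HMM.
    obs: observed symbol indices; logA[i, j]: log transition prob
    i -> j; logB[j, k]: log prob of emitting symbol k in state j;
    logpi: log initial-state probs."""
    T, N = len(obs), len(logpi)
    delta = logpi + logB[:, obs[0]]        # best log score ending in each state
    back = np.zeros((T, N), dtype=int)     # best predecessor pointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # (N, N): score of i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # Backtrack from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Training (the Baum-Welch reestimation mentioned above) follows the same trellis structure but sums over paths instead of maximizing.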

2.4 Speech Unit

   The speech unit of the SPHINX system is the phoneme. Since the same phoneme in different
contexts can be instantiated very differently, SPHINX uses generalized triphone models as a way
to model left and right context.

    In Chapter 3 we describe two databases, the Motorola car database of 7-digit strings and the
AN4 database of census information. Since neither the Motorola car database nor the AN4 database
follows a particular grammar, triphones are concatenated into words without a grammar.


                           Chapter 3
           The Motorola Car Database and AN4 Database
    In this chapter we describe the Motorola car database, which was used to evaluate the effects of
noise in the automobile on the SPHINX system. We also give a brief description of the AN4 data-
base, which was used as training data for SPHINX.

3.1 The Motorola Car Database
    The Motorola car database was created to develop and evaluate a hands-free cellular phone at
Motorola. It consists of 12 speakers: 9 males and 3 females in their 20's and 30's. Each speaker
uttered five 7-digit strings in each of 16 different conditions derived from 3 driving speeds, 0
(engine idle), 30, and 55 m.p.h., and the following other factors:

   •   windows up/down
   •   fan on/off
   •   radio off/music/AM-talk
   •   wipers on/off


   The digit strings were randomly generated with equally likely probabilities for all the digits,
and they were read from a script to the driver, who then repeated them. The digit '0' had two pro-
nunciations: "zero" and "oh".

    In addition to the five 7-digit strings, each speaker recited all the digits discretely in every con-
dition. A sample of the background noise was also recorded in each condition for each of the speak-
ers.

    The speech digit files were recorded on a DAT recorder in various automobiles using 2 micro-
phones located on the driver's visor. One was a low-fidelity microphone, a standard Mo-
torola "hands-free" cellular phone microphone. It uses an electret element and was filtered to have
a bandwidth of approximately 300-3,400 Hz. The other was a high-fidelity microphone, the Sony
ECM-959DT, which uses an electret element and has a flat bandpass of bandwidth 50-18,000 Hz.
The high-fidelity data were lowpass filtered to about 6,720 Hz before sampling. These files were
sampled at 8 and 16 kHz, respectively, using the line inputs of the Ariel Digital Microphone.

   Since the goal for collecting the database was to make it as realistic as possible, the recording
conditions were somewhat variable and reflected what an untrained population of users might pro-
duce. Some of the files for various speakers were missing due to recording problems which were
not noticed until the data were reviewed.

    We used a portion of the files which contained 7-digit strings sampled at 16 kHz, since only
these files were valid for recognition experiments in the SPHINX system. Because the database was
too small for training, we used this database for testing purposes only.

    Though the Motorola car database was collected under a wide variety of conditions, it was not
sufficient to represent the entire range of conditions of the car environment. For example, we should
also consider a variety of weather conditions and road conditions. In a real application, therefore, it is
more feasible to train the speech recognition system with a large database of clean speech and then
to operate the system in the real car environment. For this reason we used the AN4 database for
training in our experiments.

3.2 The AN4 Database
    The AN4 database is composed of 74 speakers (53 males and 21 females) for training, and 10
speakers (7 males and 3 females) for testing. The training speakers and the testing speakers are dif-
ferent. The database consists of strings of letters, numbers, and a few control words. After discard-
ing some utterances because of bad recording conditions, the training set consists of 1018
utterances, and the testing set contains 140 utterances.

   The AN4 database was recorded in stereo in an office environment; the primary channel was
recorded using a Sennheiser HMD224 close-talking microphone, and the second channel of speech
was recorded using an omnidirectional desk-top Crown PZM6fs microphone.

    Only the training set of data from the Sennheiser HMD224 was used in our experiments.

    Since the AN4 database has a larger vocabulary size, we have to provide larger phonetic mod-
els and more HMM models than we really need for the testing data. This results in less accurate
models, and ultimately lower word accuracy.

3.3 Summary
    The Motorola car database was collected under a wide variety of conditions. Though the
number of utterances was small, we could evaluate the effect of noise under some typical condi-
tions in the automobile. However, the database did not contain all weather conditions and road con-
ditions, so we could not evaluate the effect of noise under those conditions.


    It is more feasible to train the speech recognition system with a large database of clean speech
and then operate the system in the real car environment because of the broad variability of the con-
ditions in the real application. For this reason we used the AN4 database for training. Since the
AN4 database has a larger vocabulary than the Motorola car database, we have larger phonetic
models and HMM models for a greater number of words, which degrades recognition accuracy
somewhat.


                              Chapter 4
                Noise Characteristics in the Automobile
    In this chapter we describe the noise characteristics in the automobile. Most of this information
is based on the work of Dal Degan and Prati [1988] and our analysis of the Motorola car database.

4.1 Noise Sources

    There are various noise sources in the automobile. Most of them are additive in nature. We
can categorize these additive noise sources into three groups. The first group, which we call running
noise, consists of the noise sources that are generated, and that vary, as the car is running. The second
group, which we call functional noise, consists of the noise sources that are irrelevant to the motion
of the car but relevant to the functions of the car, such as a car radio. The third group, which we call
outer noise, consists of the noise sources that come from outside of the car, such as the noise coming
from other cars.

4.1.1   Running Noise

    The main noise sources in this category are tire noise, wind noise, and engine noise. Suspension
noise, bumping on the road, and other mechanical noise is also generated while the car is running.

    Tire noise is generated between the tires and the road. Once the car is moving, this noise is in-
evitable, though more recent car body designs have reduced this noise significantly. Tire noise de-
pends not only on the tires and road conditions but also on the car speed. As the car runs at a
higher speed, the noise power increases.

    Wind noise comes from air turbulence around the car body. Because of recent improvements
in aerodynamic design and sound-shielding technology of car bodies, this noise is not very signif-
icant when the windows are closed. It becomes considerable, however, when the car is moving with
open windows. Noise power at higher speeds is greater than at lower speeds, as with tire noise.

    Both tire noise and wind noise appear at the same time when we observe noise in the car, and
cannot be separated. Dal Degan and Prati [1988] found that noise power is very high in the low-
frequency region and decreases with increasing frequency, but there is still considerable noise
power between 1 and 6 kHz at 100 km/h with the engine off and windows up.

   Engine noise comes from the engine. It depends not only on the car speed but on the accelerator
and gear control. When the engine runs at a high speed, the noise becomes quite high. Dal Degan
and Prati [1988] showed that there is a very high peak of power around the fundamental frequency
of the engine noise, i.e. 120 Hz for a four-stroke four-cylinder engine running at 4,000 r.p.m. The
power of engine noise decreases exponentially as the frequency increases, like tire and wind noise.
But the power of engine noise in the region between 1 and 6 kHz is much less than that of tire and
wind noise.

    Dal Degan and Prati [1988] also showed that spectral coherence at different positions in the car
is observed only at the fundamental frequency of the engine noise, since other noise is generated
by many uncorrelated sources of the same order of magnitude. Therefore, a multiple-channel adap-
tive noise cancelling technique works only for engine noise.

    Our spectral analysis of running noise was done under three car speeds, with windows up and down.

    Figure 4-1 (normalized by setting the maximum value to 60 dB) shows power spectra of speech and noise in the automobile measured under various conditions. We can see the peak at very low frequency in the idling condition (Figure 4-1 (a)), like Dal Degan and Prati [1988], but this peak becomes smaller and the noise has a broad spectrum when the car is running (Figure 4-1 (b) & (c)). When the car is not running, there is no significant difference whether the windows are up or down, except that some transient noise appears when the windows are down (Figure 4-1 (d)). However, when the car is running with the windows down, the wind noise becomes considerably high (Figure 4-1 (e)) and the distribution of the noise power becomes broader, as shown in Figure 4-2 (a) and (b).



                            (a) Car Speed 0 m.p.h. (idling), Windows Up
                             Figure 4-1: Spectrum of Running Noise
Chapter 4: Noise Characteristics in the Automobile



                                    (b) Car Speed 30 m.p.h., Windows Up
                                    (c) Car Speed 55 m.p.h., Windows Up
                                    (d) Car Speed 0 m.p.h. (idling), Windows Down
                               Figure 4-1: Spectrum of Running Noise



                                (e) Car Speed 55 m.p.h., Windows Down
                              Figure 4-1: Spectrum of Running Noise


                             (a) Car Speed 0 m.p.h. (idling), Windows Down:
                                 total frame counts = 392, average = -1.612397,
                                 standard dev. = 3.435854, maximum = 4.646345,
                                 minimum = -5.583714
                             (b) Car Speed 55 m.p.h., Windows Down:
                                 total frame counts = 411, average = 2.254208,
                                 standard dev. = 0.939969, maximum = 4.924498
                    Figure 4-2: Histogram of the Power (c[0] Cepstral Component)
Chapter Noise
     4:                  in
            CharacteristicstheAutomobile                                                  Page15


4.1.2 Functional Noise
    Noise sources in this category include the car radio, the fan, the wipers, the turn signals, and the horn.

    The car radio is a significant noise source. Since people in the car want the car radio to be louder than other noise sources, the power of the car radio is usually high compared to the other noise sources. In addition, radio talk shows usually generate human speech sounds, which confuse speech recognition systems very easily. Figure 4-3 shows the spectral characteristics of the speech with both music programs and AM talk shows. Some transient patterns appear in the noise region.
                                    (a) Car Speed 0 m.p.h. (idling), Music
                                    (b) Car Speed 0 m.p.h. (idling), AM Talk Show
                                        Figure 4-3: Spectrum of the Car Radio

    The fan makes broadband noise in the region of 2 kHz - 7 kHz, as shown in Figure 4-4. This characteristic is similar to the wind noise described in the previous section.


                                  Figure 4-4: Spectrum of the Fan


    The wipers, the turn signals, and the horn are nonstationary noise sources which generate transient patterns. As shown in Figure 4-5, the wipers make significant transient noise. This noise is almost as severe as the car radio noise.

                                 Figure 4-5: Spectrum of the Wipers


    Functional noise is directly connected to the functions of the car, so we can utilize information from the car, such as the car radio signal and the switches of the fan, the wipers, and the horn, to predict and cancel the noise.

4.1.3 Outer Noise
    Noise sources in this category include the running noise of other cars, trains, airplanes, the rain, and any other kind of noise coming from outside the car. Rain produces especially serious degradation of the speech signal, since it hits the car body directly.

    Since this category is extremely broad, it is impossible to characterize all noise sources.

4.2 Summary
    We have presented the various noise sources appearing in the car environment. Running noise changes as the running condition of the car changes. Functional noise, with the exception of the fan, produces transient noise characteristics. The noise sources cannot all be characterized as a single form of noise.


                              Chapter 5
             Speech Recognition in Adverse Environments:
                            Previous Work
    In this chapter we summarize a number of techniques that have been used to make speech recognition more robust in the presence of additive noise and/or an unknown linear filter in the channel. We consider in this chapter the use of peripheral signal processing based on the human auditory system, the use of non-lexical models to characterize transient noises and distortions, the use of channel equalization techniques, and various techniques applied to the car environment.

5.1 Auditory-Based Front Ends

    Since the human auditory system is very robust to changes in the acoustical environment, some researchers have tried to develop signal processing schemes that are motivated by the functional organization of the peripheral auditory system. A simple front end motivated by the human auditory system is the Mel-frequency cepstrum¹ (Davis and Mermelstein [1980]). Hermansky [1990] used the Perceptual Linear Predictive coding (PLP) scheme. Seneff [1988] and Zue et al. [1990] used models based on the physiological human auditory system.

    The Mel-frequency cepstrum is simple and computationally cheap, whereas the physiological human auditory models are computationally very expensive. The Mel-frequency cepstrum can also achieve some robustness compared to LPCC.

    We believe that the use of a filter bank could be beneficial for the front ends used in speech recognition systems, since the effect of narrowband noise is suppressed by passing through the bandpass filters while the spectral shape of the other parts is maintained. LPCC tries to fit a polynomial to the spectral shape in the minimum-mean-square-error sense. Thus LPCC changes the spectral shape of the other parts, though it suppresses the effect of the narrowband noise.

5.2 Noise and Noise-Word Models

    Ward [1989] developed a "noise-word modeling" technique to characterize non-stationary noise. The main idea of this technique is to train hidden Markov models (HMMs) of "noise words" to match classes of noise and to use these HMMs to recognize non-stationary noise. Typical non-stationary noises are breath noises, lip smacks, paper rustles, filled pauses ("ah" etc.), coughs, clearing of the throat, phone rings, door slams, etc. To implement this technique we need to transcribe the noise words in a training database. Although this technique is simple and effective, it does not take care of non-stationary noise during speech segments.

    1. The Mel-frequency cepstrum is a representation of the spectrum based on the human perceptual frequency scale (Mel-scale). To estimate the Mel-frequency cepstrum, we compute the inverse cosine transform of the averaged short-term energy output from the channels of a filter bank in which each filter is equally separated along the Mel-scale.

    Varga and Moore [1990], and Gales and Young [1992] introduced a new way to take care of speech and noise simultaneously. This method decomposes the input speech into speech and noise by using HMMs. The method involves the use of a three-dimensional Viterbi search, in which the noise and the speech are decoded at the same time. While this technique is powerful, the computational cost is extremely high, since the Viterbi search has to be solved in three dimensions.

5.3 Cepstral Mean Normalization and the RASTA Method

    Cepstral mean normalization and the RASTA (RelAtive SpecTrAl processing) method (Hermansky et al. [1991]) are techniques that suppress constant channel effects in each log spectral component or cepstral component. Cepstral mean normalization simply subtracts the mean value of each log spectral or cepstral component over an entire utterance, whereas the RASTA method applies a bandpass or highpass filter with a very low cutoff frequency to the running estimate of each log spectral or cepstral component.
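Both operations act independently on each coefficient trajectory of a matrix of per-frame cepstral vectors. The sketch below illustrates this; the highpass filter is a generic first-order difference filter chosen for illustration, not Hermansky's exact RASTA filter:

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-coefficient mean over the whole utterance.
    cepstra: (num_frames, num_coeffs) array. A constant channel adds the
    same offset to every frame in the cepstral domain, so removing the
    utterance mean cancels it."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def rasta_like_highpass(cepstra, alpha=0.98):
    """Illustrative RASTA-style highpass on each coefficient trajectory:
    y[t] = x[t] - x[t-1] + alpha * y[t-1]. The DC gain is zero, so a
    constant channel component is removed, but slow channel changes are
    attenuated as well (the degradation discussed below)."""
    out = np.zeros_like(cepstra)
    for t in range(1, len(cepstra)):
        out[t] = cepstra[t] - cepstra[t - 1] + alpha * out[t - 1]
    return out
```

Note that the mean subtraction leaves every coefficient with exactly zero mean over the utterance, which is why a fixed linear filter in the channel has no effect after normalization.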

    Both techniques achieve good results when training and testing channel conditions are different. The RASTA method, however, degrades performance when training and testing channel conditions are the same. This degradation is caused by the filtering operation: since the filter removes not only the constant component but also slow changes in the channel, static information in the channel is somewhat eliminated.

    Since the human vocal tract can be characterized as a transfer function and each person has different channel characteristics over speech, these techniques may also suppress personal acoustic differences. For this reason, we report results describing the application of cepstral mean normalization in the cepstral domain, even though the main noise source in the car is additive noise.

5.4 The CDCN Algorithm

    Acero [1990] introduced CDCN (Codeword Dependent Cepstral Normalization) as a technique for dealing jointly with additive noise and channel equalization.

    Given the observed noisy speech, CDCN attempts to estimate for each frame a noise vector n and a cepstral equalization vector q. These vectors are chosen to best match the ensemble of cepstral vectors of the incoming speech to the ensemble of cepstral vectors in a universal codebook which is generated from the training corpus (Figure 5-1).




      Figure 5-1: CDCN estimates a noise vector n and a cepstral equalization vector q that
      best transform the universal codebook into the set of input frames of the observed
      speech.


    Wedescribe the CDCN   algorithm in moredetail in this section because we develope someex-
tensions for it in Chapter6.

    In CDCN the cepstral distance d[k] = ||z - y[k]|| between the power cepstral coefficients y[k] associated with the kth codeword in the universal codebook and the observed cepstrum z is assumed to exhibit a Gaussian distribution. The power cepstral coefficients y[k] associated with the kth codeword in the universal codebook can be expressed as follows:

        y[k] = c[k] + r[k] + q

where c[k] is a codeword vector, and r[k] is a noise correction vector which has the following form:

        r[k] = IDFT{ ln(1 + exp(DFT[n - q - c[k]])) }

    Using this distance, with the assumption that the covariance matrices of the codewords are σ²I, and with the assumption that the covariance matrix of the noise is γ²I, the a posteriori probability f[k] of the kth codeword given z is described as follows:

        f[k] = (P[k]/σ) exp(-d²[k]/2σ²) / D     for K > k > 0

        f[0] = (P[0]/γ) exp(-d²[0]/2γ²) / D     for k = 0

        where D = (P[0]/γ) exp(-d²[0]/2γ²) + Σ_{k=1}^{K-1} (P[k]/σ) exp(-d²[k]/2σ²)

where P[0] is the fraction of noise frames present in the speech data, and P[k] is the fraction of frames that are closest to the kth codeword, assumed equal to (1 - P[0])/(K - 1) for all k > 0.

    Given an ensemble of N cepstrum vectors z_0, ..., z_{N-1}, and current estimates n' and q', the new ML estimates of n and q have the following iterative forms:

        n = Σ_{i=0}^{N-1} f_i[0] z_i / Σ_{i=0}^{N-1} f_i[0]

        q = Σ_{i=0}^{N-1} Σ_{k=1}^{K-1} f_i[k] (z_i - c[k] - r'[k]) / Σ_{i=0}^{N-1} Σ_{k=1}^{K-1} f_i[k]

where r'[k] is the current estimate of the noise correction vector.

    After reaching convergence to obtain the final estimates of n and q, the clean speech vector for every frame is estimated by MMSE estimation with the following form:

        x̂_i = z_i - q - Σ_{k=1}^{K-1} f_i[k] r[k]     for i = 0, 1, ..., N-1


    As we can see in the above formulae, the a posteriori probability f[k] is used throughout the CDCN process. To obtain f[k], we need the following information: σ, γ, P[0], and c[k]. The codewords c[k] and the standard deviation of the codewords σ are estimated with a standard Lloyd clustering algorithm in which the CDCN algorithm is embedded. Therefore, we need to estimate the standard deviation of the noise distribution γ and the fraction of noise frames P[0]. In the original implementation of CDCN (Acero [1990]), the values of γ and P[0] are set to 0.3 and 0.25, respectively.
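To make the flow of these updates concrete, the sketch below implements one iteration of a simplified CDCN-style update in NumPy. It is an illustration of the equations above under stated simplifications, not Acero's implementation: the embedded Lloyd clustering and the initialization heuristics described below are omitted, and the function name and array layout are our own.

```python
import numpy as np

def cdcn_iteration(Z, C, n, q, sigma=1.0, gamma=0.3, P0=0.25):
    """One simplified CDCN-style iteration (illustrative sketch).
    Z: (N, D) observed cepstra; C: (K, D) codebook, row 0 = noise codeword.
    Returns updated (n, q) and the MMSE clean-speech estimate X."""
    K = C.shape[0]
    Pk = (1.0 - P0) / (K - 1)

    # noise correction vectors: r[k] = IDFT{ ln(1 + exp(DFT[n - q - c[k]])) }
    spec = np.real(np.fft.fft(n - q - C, axis=1))
    R = np.real(np.fft.ifft(np.log1p(np.exp(spec)), axis=1))

    Y = C + R + q                                        # noisy codewords y[k]
    d2 = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared distances d2[i, k]

    # a posteriori probabilities f_i[k]; codeword 0 uses the noise variance gamma
    w = np.full(K, Pk / sigma)
    w[0] = P0 / gamma
    var = np.full(K, 2.0 * sigma ** 2)
    var[0] = 2.0 * gamma ** 2
    logf = np.log(w)[None, :] - d2 / var[None, :]
    F = np.exp(logf - logf.max(axis=1, keepdims=True))
    F /= F.sum(axis=1, keepdims=True)

    # ML re-estimates of n and q from the posteriors
    n_new = (F[:, :1] * Z).sum(axis=0) / F[:, 0].sum()
    q_new = (F[:, 1:, None] * (Z[:, None, :] - C[None, 1:] - R[None, 1:])).sum(axis=(0, 1)) \
            / F[:, 1:].sum()

    # MMSE clean-speech estimate: x_i = z_i - q - sum_k f_i[k] r[k]
    X = Z - q_new - (F[:, 1:, None] * R[None, 1:]).sum(axis=1)
    return n_new, q_new, X
```

In practice this function would be called repeatedly until n and q converge, with the final X taken as the compensated cepstra.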

    In addition, two other constants are used to initialize the algorithm in real CDCN implementations: the noise threshold and the dynamic range. The noise threshold is needed to obtain an initial estimate of the noise vector n. The initial estimate of n is obtained by averaging all frames whose power component of the cepstrum, c[0], is less than the noise threshold. The dynamic range is needed to obtain an initial estimate of q[0], with the other components of q set to 0, and it serves to set the offset of the speech power from the noise. The values of the noise threshold and the dynamic range are set to 1.0 and 13.0, respectively, in the original implementation of CDCN.

    Since this technique has been very successful in combating the effects of additive noise and channel effects, we will attempt to determine the extent to which CDCN is able to recover the signal from the degradations introduced by the car environment.

5.5 Speech Recognition in the Car Environment

    There have been a number of algorithms developed with the goal of improving speech recognition in the car environment.

    Dal Degan and Prati [1988] tried a series of speech enhancement techniques for mobile radio applications, as well as an acoustic noise analysis inside the car. They found that adaptive noise cancelling techniques can eliminate only engine noise, since only the engine noise is highly correlated both spatially and temporally. A combination of the noise cancelling technique and the spectral subtraction technique performed the best in their experiments, but it also introduced non-linear distortion. They suggested that the use of available information from the tachometer could improve the speech enhancement.
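As a point of reference, the core of a basic magnitude-domain spectral subtraction can be written in a few lines. This is an illustrative sketch with an assumed spectral-floor parameter, not Dal Degan and Prati's implementation:

```python
import numpy as np

def spectral_subtraction(frame, noise_psd, floor=0.01):
    """Subtract an estimated noise power spectrum from one frame and
    resynthesize with the noisy phase. `floor` keeps a fraction of the
    noisy power to avoid negative power estimates, which are one source
    of the non-linear distortion noted above."""
    spec = np.fft.rfft(frame)
    power = np.abs(spec) ** 2
    clean_power = np.maximum(power - noise_psd, floor * power)
    return np.fft.irfft(np.sqrt(clean_power) * np.exp(1j * np.angle(spec)),
                        n=len(frame))
```

The noise power spectrum `noise_psd` would be estimated from speech-free frames; frame-by-frame application over overlapping windows gives the full enhancer.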

    Lecomte et al. [1989] also described both the effect of noise in the car and a series of experiments related to the application of hands-free telecommunications. They tried the Short-Time Modified Coherence (SMC) algorithm proposed by Mansour and Juang [1988], LPC with adaptation to known noise as proposed by Ephraim et al. [1987], and adaptive filtering. The SMC algorithm following highpass filtering achieved a good result, but none of the techniques could sufficiently compensate for the car noise.

    Mokbel and Chollet [1991] compared a speech enhancement algorithm based on Kalman filtering to environment adaptation through spectral transformations. They found that the environment adaptation technique works better, but that the spectral transformation from clean references to noisy speech for arbitrary noise conditions is difficult.

    Oh et al. [1992] presented a microphone array approach for hands-free voice communication. The microphone array approach using a generalized sidelobe canceller method achieved performance superior to a single microphone using spectral subtraction, and stable performance under all conditions. The microphone array approach, however, is computationally expensive.

5.6 Summary

    In this chapter we have given an overview of most of the techniques already available for automatic speech recognition in adverse environments. Concerning applications to the car environment, many adaptive approaches have been taken because of the dynamic changes of the noise in the automobile. In general, one- or two-microphone approaches did not achieve satisfactory results.


                             Chapter 6
           Recognition in the Motorola Car Database Task
    In this chapter we report the results of our experiments using the Motorola car database and the AN4 database. In Section 6.1 we describe our baseline system and the first results on the Motorola car database task. In Section 6.2 we describe the results using another type of cepstral analysis, MFCC. In Section 6.3 we describe the results obtained with the previously developed environmental compensation algorithms: cepstral mean normalization and CDCN. In Section 6.4 we discuss possible improvements in extending the CDCN scheme and show the experimental results.

6.1 Baseline System

    We first consider the baseline recognition accuracy of the Motorola car database in the SPHINX system. We used the AN4 database for training, and trained HMMs using 400 generalized triphone models and a vocabulary size of 104. Since the Motorola car database contains only 11 words, "one" through "nine", "oh", and "zero", we provided both HMM nets and a dictionary for testing the 11 words. This reduced the computational load and improved recognition accuracy.

    The performance of the recognizer is defined in terms of a recognition accuracy, as shown below:

        Recognition Accuracy = (Total - Substitutions - Deletions - Insertions) / Total

where Total refers to the total number of words in the utterances, and Substitutions, Deletions, and Insertions refer to the number of substitution errors, deletion errors, and insertion errors, respectively.
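The error counts in this formula come from an edit-distance alignment between the reference and hypothesis word strings. A minimal version of that computation (a standard dynamic program, not SPHINX's own scoring code) looks like this:

```python
def error_counts(ref, hyp):
    """Count (errors, substitutions, deletions, insertions) via an
    edit-distance alignment of reference and hypothesis word lists."""
    R, H = len(ref), len(hyp)
    d = [[None] * (H + 1) for _ in range(R + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        d[i][0] = (i, 0, i, 0)            # all deletions
    for j in range(1, H + 1):
        d[0][j] = (j, 0, 0, j)            # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [d[i - 1][j - 1]]  # match: no new error
            else:
                e, s, dl, ins = d[i - 1][j - 1]
                cand = [(e + 1, s + 1, dl, ins)]       # substitution
            e, s, dl, ins = d[i - 1][j]
            cand.append((e + 1, s, dl + 1, ins))       # deletion
            e, s, dl, ins = d[i][j - 1]
            cand.append((e + 1, s, dl, ins + 1))       # insertion
            d[i][j] = min(cand)                        # fewest total errors
    return d[R][H]

def recognition_accuracy(ref, hyp):
    _, subs, dels, ins = error_counts(ref, hyp)
    return (len(ref) - subs - dels - ins) / len(ref)
```

Note that insertion errors can drive this accuracy negative, which matters for the noisy conditions below where insertions dominate.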

    From the baseline experiments shown in the first row of Table 1 we can see that recognition accuracy degrades dramatically when the car is running, and degrades further as the car runs faster. In terms of conditions, when the radio is on, especially with AM talk shows, the recognition accuracy degrades considerably. This degradation comes from insertion errors, since the radio sounds make transient noises which are recognized as speech events by SPHINX. When the windows are down while running, the recognition accuracy decreases rapidly because of the great increase of wind noise. The wipers produce a lot of degradation, since the wipers make transient noises like the radio and this noise causes insertion errors. The effect of fan noise is not found to be as significant as the other conditions.




                     Condition                                    Car Speed

   Window        Fan        Radio       Wiper       0 m.p.h.      30 m.p.h.      55 m.p.h.
      up          off         off         off        78.3%          41.5%          26.6%
      up          on          off         off        75.8%          35.7%          28.6%
      up          off       music         off        54.3%          32.7%          22.1%
      up          off      AM-talk        off        40.8%          23.3%          16.9%
     down         off         off         off        76.4%          30.4%          19.1%
      up          off         off         on         67.8%           N/A            N/A
                           Table 1: Baseline Recognition Accuracy

6.2 Mel-Frequency Cepstral Coefficients

    Mel-frequency cepstral coefficients (MFCC) provide an alternative to the LPC-based signal processing approach summarized in Section 2.1. The MFCC are defined as the Discrete Fourier Transform (DFT) of the logarithm of the power spectral density function passed through a bandpass filter bank in which the center frequency of each bandpass filter is equally spaced along the Mel scale. The main difference between MFCC and the current SPHINX front end, LPC cepstral coefficients (LPCC), is the representation of the power spectrum. In MFCC the power spectral density function is estimated by directly using the DFT of the windowed speech. Then the power spectral density function is passed through the bandpass filter bank, which contains 40 triangular-shaped bandpass filters equally spaced along the frequency axis up to 1 kHz and along the log frequency axis above 1 kHz. In LPCC, on the other hand, the power spectrum is estimated by fitting it to an autoregressive process. Frequency warping is performed subsequently using a bilinear transform.
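For concreteness, a per-frame MFCC computation along these lines can be sketched as follows. This is an illustration, not the SPHINX front end: it uses the common 2595·log10(1 + f/700) Mel formula with Mel-spaced filter edges throughout, rather than the exact linear-below-1-kHz spacing described above, and the 13-coefficient output is an assumed choice.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=40, n_ceps=13):
    """Sketch of MFCC for one pre-windowed frame of speech samples."""
    power = np.abs(np.fft.rfft(frame)) ** 2      # power spectrum via the DFT
    n_bins = power.shape[0]

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # triangular filters with edges equally spaced on the Mel scale
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0),
                                     n_filters + 2))
    bins = np.floor(len(frame) * edges_hz / sample_rate).astype(int)

    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, min(hi, n_bins)):
            w = (b - lo) / max(mid - lo, 1) if b < mid \
                else (hi - b) / max(hi - mid, 1)
            fbank[i] += w * power[b]

    log_energy = np.log(fbank + 1e-10)
    # inverse cosine transform (DCT) of the log filter-bank energies
    k = np.arange(n_ceps)[:, None]
    m = np.arange(n_filters)[None, :]
    return (np.cos(np.pi * k * (m + 0.5) / n_filters) * log_energy).sum(axis=1)
```

Because narrowband noise falls into only a few filters, it perturbs only a few filter-bank energies, which is the robustness argument made for filter banks in Section 5.1.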

    Results for the same recognition experiments are described in Table 2, and MFCC achieves better performance in every condition shown. The error reduction rate from the baseline results ranges from 4% to 60%. The greatest improvement is achieved in the cleanest condition (recognition accuracy increases from 78.3% to 91.4%; a 60% reduction in error rate). The smallest improvement is achieved in the AM talk show condition (recognition accuracy increases from 40.8% to 42.9%; only a 4% reduction in error rate). Since AM talk shows contain a human voice, there are a lot of insertion errors in both LPCC and MFCC. MFCC, however, obtained much better results in the presence of stationary noise, and even in the presence of music. This suggests that MFCC is a more robust spectral representation than LPCC.

                      Condition                                   Car Speed

     Window        Fan       Radio       Wiper     0 m.p.h.      30 m.p.h.      55 m.p.h.
       up          off        off         off       91.4%          60.6%          47.4%
       up          on         off         off       86.0%          67.8%          50.8%
       up          off       music        off       73.2%          54.8%          42.3%
       up          off      AM-talk       off       42.9%          31.2%          33.9%
      down         off        off         off       88.0%          60.0%          37.1%
       up          off        off         on        83.4%           N/A            N/A
                          Table 2: Recognition Accuracy using MFCC

    Since MFCC achieves much better performance compared to LPCC, there is no reason to go on using LPCC as the front-end processor. Therefore, we performed further experiments in environmental compensation using MFCC.

6.3 Environmental Compensation Algorithms

    The main source of variation in the car environment is additive noise. But there are also differences of channel frequency response between the training and testing data, since we used different databases, including different microphones, for training and testing. We hoped that cepstral mean normalization would eliminate the personal differences in channel frequency response. For these reasons we investigated not only CDCN but also cepstral mean normalization. The simple cepstral mean normalization method compensates primarily for differences in the frequency response of the channel, while CDCN compensates simultaneously for the effects of linear filtering and additive noise.

6.3.1 Cepstral Mean Normalization

    We performed a series of experiments using cepstral mean normalization. Cepstral mean nor-
malization is applied mainly to reduce the linear filtering effects of the channel.

    The cepstral mean normalization was applied to all cepstral components except the c[0] coef-
ficient, i.e., the power. Since we normalize the c[0] coefficient to its maximum value in the VQ
stage, cepstral mean normalization has no effect on the c[0] coefficient.
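As a concrete illustration, the normalization amounts to subtracting the per-utterance mean from every cepstral coefficient except c[0]. The sketch below assumes a frames-by-coefficients array layout; it is an illustrative sketch, not the actual SPHINX implementation.

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Subtract the per-utterance mean from all cepstral coefficients
    except c[0], the power term, which is normalized elsewhere (in the
    VQ stage) and is therefore left untouched here.

    cepstra: array of shape (num_frames, num_coefficients).
    """
    normalized = cepstra.copy()
    # Per-coefficient mean over all frames, excluding column 0 (c[0]).
    mean = cepstra[:, 1:].mean(axis=0)
    normalized[:, 1:] -= mean
    return normalized
```

After normalization, every coefficient except c[0] has zero mean over the utterance, which removes a constant offset in the cepstral domain corresponding to the linear filtering of the channel.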




                 Condition                              Car Speed

    Window     Fan      Radio      Wiper     0 m.p.h.   30 m.p.h.   55 m.p.h.
      up       off       off        off       93.6%       86.0%       70.0%
      up       on        off        off       93.5%       78.7%       66.3%
      up       off      music       off       83.9%       68.1%       55.3%
      up       off     AM-talk      off       61.2%       54.2%       55.8%
     down      off       off        off       91.5%       73.0%       50.2%
      up       off       off        on        86.8%        N/A         N/A
     Table 3: Recognition Accuracy using MFCC + Cepstral Mean Normalization

    The cepstral mean normalization method is simple and very powerful. As shown in Table
3, cepstral mean normalization produces no degradation in performance under any condition
compared to the MFCC results (Table 2). The error reduction rate relative to the MFCC results ranges
from 20% to 64%. Even though the cepstral mean normalization technique is supposed to reduce channel
effects, it also works under conditions with a large amount of additive noise, e.g., running at 55 m.p.h.
and even with AM talk shows on.

    When AM talk shows are on, the performance when running at 55 m.p.h. is better than that when
running at 30 m.p.h. This occurs because running noise such as wind noise and tire noise masks the
radio sound, and insertion errors are reduced.

6.3.2 CDCN

    The CDCN procedure is more attractive than cepstral mean normalization because it can simul-
taneously compensate for the effects of additive noise and linear filtering. We performed a series
of experiments to determine the extent to which CDCN could provide benefit for recognition of
speech in the car environment.

    Table 4 presents the results of using MFCC with CDCN. Compared to Table 2, the results are
better than MFCC alone, except for the condition with the wipers on. Compared to Table 3, CDCN
achieves better performance when the additive noise is large, such as at a car speed of 30 or 55 m.p.h.
or when music is on, but not when the wipers or AM talk shows are on. CDCN restores frames with
wiper noise as speech frames, because the power of the wiper noise is large. Since the restored
frames of wiper noise are intermittent, SPHINX recognizes wiper noise as short words, "oh" and
"one". AM talk shows, on the other hand, contain a real human voice. CDCN tries to restore frames
with AM talk shows as speech frames, even though the noise power of AM talk shows is relatively small.
Another significant result is that recognition accuracy at 55 m.p.h. is much better than at 30 m.p.h.
when AM talk shows are on. This is probably because the sounds from the radio are masked by
running noise at high speeds.

    According to these results, CDCN actually takes care of both additive noise and channel filter-
ing effects. However, CDCN cannot compensate for intermittent dynamic noise or noise which
contains the human voice. Therefore, we should come up with a noise modeling technique or an
adaptive noise cancellation technique to suppress these dynamic noises.

                 Condition                              Car Speed

    Window     Fan      Radio      Wiper     0 m.p.h.   30 m.p.h.   55 m.p.h.
      up       off       off        off       92.6%       90.9%       81.4%
      up       on        off        off       92.5%       85.7%       78.4%
      up       off      music       off       90.1%       85.7%       72.2%
      up       off     AM-talk      off       60.1%       64.8%       72.0%
     down      off       off        off       92.4%       80.8%       67.4%
      up       off       off        on        78.4%        N/A         N/A
                 Table 4: Recognition Accuracy using MFCC + CDCN

6.3.3 Combination of Cepstral Mean Normalization and CDCN

    Since both cepstral mean normalization and CDCN produced very good results, we investigat-
ed the combined effect of both schemes. As shown in Table 5, the combination of cepstral mean
normalization and CDCN sometimes degrades performance compared to CDCN alone. This is es-
pecially true when AM talk shows are on, where the combination of both techniques slightly de-
grades performance. This degradation comes from an increase in insertion
errors.


    These results suggest that CDCN can take care of both additive noise and channel filtering ef-
fects properly, so the role of the cepstral mean normalization is almost negligible. When AM talk
shows are on, CDCN applied after cepstral mean normalization might restore more noise frames as
speech compared to CDCN alone.

                 Condition                              Car Speed

    Window     Fan      Radio      Wiper     0 m.p.h.   30 m.p.h.   55 m.p.h.
      up       off       off        off       92.4%       91.7%       81.1%
      up       on        off        off       94.0%       84.1%       80.3%
      up       off      music       off       90.9%       84.4%       74.8%
      up       off     AM-talk      off       54.2%       60.8%       68.3%
     down      off       off        off       92.4%       84.7%       69.6%
      up       off       off        on        77.1%        N/A         N/A
 Table 5: Recognition Accuracy using MFCC + Cepstral Mean Normalization + CDCN

    Since the combination of cepstral mean normalization and CDCN does not give us better
performance, there is no need to apply both techniques.

6.4 Histogram-based CDCN

    The CDCN algorithm uses fixed statistical parameters for the noise, γ and P[0], and the actual
parameters for incoming speech may be quite different from the fixed ones. As shown in the pre-
vious sections, the car environment cannot be represented by fixed parameter values. When the
windows are open while running, the noise power is broadly distributed compared to the idling con-
dition. Therefore, if we estimate the statistical parameters of noise for each incoming speech utterance
and use these parameters in CDCN, we could probably normalize the incoming speech more accurate-
ly. Also, when we use a different representation of speech like MFCC instead of the original LPCC
in the SPHINX system, we may need to estimate the statistical parameters for the new representa-
tion of speech, so that the CDCN algorithm becomes more general.

    The histogram-based CDCN method tries to estimate the values of γ, P[0], and the noise thresh-
old, which are fixed in CDCN, for each incoming speech utterance based on the power histogram.
We use a fixed value for the dynamic range, which is set to the maximum dynamic range in the
entire speech data, since the offset of the speech power from the noise should not vary over the entire
CDCN process.

    The power of the speech data is assumed to be composed of a Gaussian distribution of noise
and a number of Gaussian distributions of speech codewords, as in CDCN. According to our obser-
vation of the speech power histogram, there is a sharp peak at small power which represents the
center of the noise distribution and a broader distribution at higher power which represents the
speech components. If we can estimate the center of the noise in a robust manner, we can estimate
the statistical parameters of noise which are used in CDCN.

    To find the center of the noise in the power histogram, we first smooth the histogram by apply-
ing a rectangular window. We then search for the bin that has the first negative slope, starting from
the minimum power bin in the histogram. Then we search for the bin with maximum counts near
this region. Our assumption is that the bin of the first local maximum from the minimum power in
the histogram is the center of the noise. This method estimates the center of the noise in a robust
manner. We then estimate the values of γ, P[0], and the noise threshold. The noise threshold can be
found so that the mean value of the power between the minimum power in the utterance and the
minimum power plus the noise threshold becomes the center of the noise. The value of γ is obtained
by calculating the standard deviation of the Euclidean distance of the noise frames, in which the
power is between the minimum power in the utterance and the minimum power plus the noise
threshold. The value of P[0] is obtained by calculating the ratio of the number of noise frames to
the number of total frames.
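The estimation procedure above can be sketched as follows. The bin count, the smoothing width, the guard that skips near-empty tail bins before scanning, and the use of frame power (rather than cepstral Euclidean distances) for the spread estimate are all illustrative assumptions rather than details from the text; `gamma` stands in for the noise spread parameter γ.

```python
import numpy as np

def estimate_noise_parameters(frame_power, num_bins=64, win=3):
    """Estimate per-utterance noise statistics from a power histogram.

    Returns (noise_center, noise_threshold, gamma, p0):
      noise_center    - power at the first local maximum of the histogram
      noise_threshold - offset from minimum power bounding the noise frames
      gamma           - spread (std) of the selected noise frames
      p0              - fraction of frames classified as noise
    """
    counts, edges = np.histogram(frame_power, bins=num_bins)
    # Rectangular-window smoothing of the bin counts.
    smoothed = np.convolve(counts, np.ones(win) / win, mode="same")
    # Scan upward from the minimum-power bin for the first negative slope;
    # skip near-empty tail bins so a stray sample at the very minimum
    # does not masquerade as the noise peak (a robustness guard).
    start = int(np.argmax(smoothed >= 0.1 * smoothed.max()))
    peak = next((i - 1 for i in range(start + 1, num_bins)
                 if smoothed[i] < smoothed[i - 1]), int(np.argmax(smoothed)))
    noise_center = 0.5 * (edges[peak] + edges[peak + 1])

    # Choose the threshold so that the mean power of the frames below
    # (minimum power + threshold) matches the noise center.
    min_power = float(frame_power.min())
    threshold = edges[peak + 1] - min_power
    for _ in range(30):  # simple fixed-point refinement
        noise = frame_power[frame_power <= min_power + threshold]
        threshold = max(threshold + (noise_center - noise.mean()),
                        edges[1] - edges[0])

    noise = frame_power[frame_power <= min_power + threshold]
    gamma = float(noise.std())
    p0 = len(noise) / len(frame_power)
    return noise_center, threshold, gamma, p0
```

On a synthetic utterance with a dense noise mode and sparse high-power speech frames, the estimated center falls on the noise mode and p0 approximates the true fraction of noise frames.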

    When we applied the histogram-based CDCN, we took the following steps. First we estimated
the values of the parameters γ, P[0], and the noise threshold from the 146 training data sets, which are
used for training the CDCN universal codebook. Then we trained the CDCN universal codebook
using the newly estimated parameters. Since all the training data was used as if all the data was gen-
erated simultaneously, we did not estimate the parameters for each utterance in the training. Us-
ing this universal codebook and the parameters, we then normalized all the training sentences.
Finally, incoming testing sentences were normalized using the universal codebook and newly
estimated parameters for each sentence.

    As shown in Table 6, recognition accuracy degrades when the car is running, compared to
the original CDCN. Since the estimation of noise is based only on the power histogram, we may
also count speech frames as noise when the separation of noise and speech power is not good.
Then the estimates of the parameters γ and P[0] may not be very accurate. Therefore, the histo-
gram-based CDCN increases deletion and substitution errors while reducing insertion errors, be-
cause the histogram-based CDCN may mistake some speech frames for noise. To overcome this
deficiency we need to explore a segmentation scheme and identify the noise segments and the
speech segments. Then we should estimate the parameters based on this segment information. It is not
easy to identify such segments, however, because some speech frames are buried in the noise and
some noise frames are likely to be taken as speech frames when the noise power is high.

    Another possible reason for the degradation with the histogram-based CDCN is that the length of
each testing sentence may not be long enough to find the center of the noise in the power histogram.
Although the average length of the testing sentences is 5.3 sec (530 frames), it may not be sufficient
to obtain reasonable statistical estimates. We need to find out whether the recognition accuracy will in-
crease when we estimate the center of the noise and the parameters by taking several sentences into
account. Even though we can reduce the insertion errors by using the histogram-based CDCN, we
still have a lot of insertion errors when AM talk shows are on. CDCN can compensate for stationary
additive noise, but not for dynamic additive noise.

                 Condition                              Car Speed

    Window     Fan      Radio      Wiper     0 m.p.h.   30 m.p.h.   55 m.p.h.
      up       off       off        off       92.9%       87.6%       72.9%
      up       on        off        off       87.3%       78.3%       74.6%
      up       off      music       off       87.5%       64.9%       59.0%
      up       off     AM-talk      off       64.1%       65.9%       65.6%
     down      off       off        off       93.3%       72.5%       58.1%
      up       off       off        on        78.4%        N/A         N/A
         Table 6: Recognition Accuracy using Histogram-based CDCN

6.5 Summary

    In this chapter we have explored the effects of the car environment on the speech recognition
system, and we have observed the dramatic degradation in the running condition and in the pres-
ence of other additive noise sources. The use of Mel-frequency cepstral coefficients (MFCC) has
been considered, and it yields much better results compared to what is obtained by using the more
traditional LPCC front end.

    Two environmental compensation techniques have been considered, cepstral mean normaliza-
tion and CDCN. Both cepstral mean normalization and CDCN have proved to be very effective in
this task, although their performance under the conditions of AM talk shows and wipers is not very
good. We need to explore other approaches which could overcome these dynamic noise sources.

    We have tried histogram-based CDCN, which can estimate statistical parameters for each ut-
terance. It increases the number of deletion and substitution errors while reducing the insertion er-
rors in the presence of large stationary noise, though we need further investigation for more
improvement.


                                 Chapter 7
                      Noise Cancellation for Car Radio
    In this chapter we describe a noise cancellation technique which attempts to eliminate the effect
of car radios on speech recognition. In Section 7.1 we describe the newly-collected database for
this experiment. In Section 7.2 we describe the adaptive noise cancellation technique we applied.
In Section 7.3 we describe the results obtained with the noise cancellation algorithm.

7.1 Collection of Stereo Data
    We collected a new database which contained two-channel stereo data: one channel was speech data
corrupted by the car radio at various car speeds (0 m.p.h. [idling], 30 m.p.h., and 55 m.p.h.), and
the other was car radio data collected directly from a loudspeaker in the car. Both channels were
recorded simultaneously. The radio station was an AM station which carried monaural talk shows. The
speech contained 7-digit strings which were randomly generated with equal probabilities for all the
digits. The digit '0' had two pronunciations, "zero" and "oh", as in the Motorola car database. We
collected 10 sentences at every speed for each of 10 speakers (8 males and 2 females). All speakers
were in their 20s.

    Speech was recorded on a DAT recorder (TEAC DA-P20) using a uni-directional condenser mi-
crophone (Panasonic WM-55D103) located on the passenger's visor. The distance from the
speaker to the microphone was set to about 50 cm. The speech was sampled at 48 kHz by the DAT
and transmitted from the DAT to a computer as digital data. The speech was then downsampled
to 16 kHz, which is the standard sampling rate for the SPHINX system. The condenser microphone
has a close-talking characteristic which suppresses low-frequency components when the distance
from the sound source to the microphone is about 50 cm. We compensated for this by applying a
linear filter.
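The 48 kHz to 16 kHz conversion described above can be sketched as an anti-aliasing low-pass filter followed by 3:1 decimation. The filter length and the Hamming window below are illustrative choices, not details from the text.

```python
import numpy as np

def downsample_3to1(x, num_taps=63):
    """Downsample by a factor of 3 (e.g. 48 kHz -> 16 kHz).

    Applies a windowed-sinc low-pass filter with cutoff at the new
    Nyquist frequency, then keeps every third sample.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2
    # Ideal low-pass at 1/3 of the old Nyquist rate, Hamming-windowed.
    h = (np.sinc(n / 3.0) / 3.0) * np.hamming(num_taps)
    h /= h.sum()                          # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::3]
```

A 1 kHz tone sampled at 48 kHz, for instance, comes out as the same tone sampled at 16 kHz, since 1 kHz lies well inside the new 8 kHz passband.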

7.2 Adaptive Noise Cancellation
    As we have seen in Chapter 6, neither cepstral mean normalization nor CDCN can overcome
dynamic noise sources such as radio talk shows and wiper noise. Fortunately, the car radio signal
is originally an electrical signal, and it can be measured directly from the loudspeaker output as a
noise source. The wiper noise, on the other hand, is a mechanical noise, and it is difficult to record
pure wiper noise without any interference.


    We applied the Least-Mean-Square (LMS) algorithm (Widrow and Stearns [1985]) to cancel
the car radio signal. Figure 7-1 shows a block diagram of the noise cancellation module.

    [Figure 7-1: Block Diagram of the Noise Cancellation Module. Primary input d_k: speech plus
car-radio noise at the microphone; reference input x_k: the car-radio signal.]


    The LMS algorithm is a simple and computationally easy method for steepest-descent adaptive
filtering, and it uses a simplified estimate of the gradient as follows:

    e_k = d_k - x_k^T w_k

with x_k representing the vector of recent car-radio input samples, d_k representing the response of
the microphone (which includes both speech and the radio signal), and w_k the weight vector of the
FIR filter.

    We set the number of taps in the FIR filter to 150 based on informal listening.

    The update equation for the LMS algorithm is

    w_{k+1} = w_k + 2 μ e_k x_k

where μ is a gain constant that regulates the speed and stability of adaptation. Since the weight
changes at each iteration are based on imperfect gradient estimates, the adaptive process does not
follow the true line of steepest descent on the performance surface. However, the LMS algorithm
can be implemented in a practical manner without squaring, averaging, or differentiation.

    To assure the convergence of the weight vector, we need to restrict the gain constant μ such
that

    0 < μ < 1 / ((L + 1) (signal power))

where (L + 1) equals the number of taps of the filter.

    We used a value of μ equal to 0.05 times this upper bound in our experiments, based on informal
listening.
     7:     Cancellation CarRadio
Chapter Noise         for                                                                 Page34

    We also found that the estimated value of the speech, e, diverges when the speech is present, so we
added a gain-reduction mechanism that reduces all components of the weight vector by a factor of
0.1, i.e., w'_{k+1} = 0.1 w_{k+1}, where w'_{k+1} is the weight vector after gain reduction. This gain-reduc-
tion mechanism causes a perturbation in the weight-vector adaptation, but improves the stability of the
speech according to our informal listening.
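Putting the pieces of this section together, the canceller can be sketched as below. The number of taps (150) and the step-size scale (0.05 times the stability bound) follow the values quoted above; the divergence threshold `e_limit` used to trigger the 0.1 gain reduction is a hypothetical parameter, since the text does not give the exact trigger condition.

```python
import numpy as np

def lms_cancel(d, x, num_taps=150, mu_scale=0.05, e_limit=None):
    """LMS adaptive noise cancellation of the car-radio reference.

    d: microphone signal (speech + car radio), x: car-radio reference.
    Returns e, the enhanced-speech estimate e_k = d_k - x_k^T w_k.
    """
    # 0 < mu < 1 / ((L + 1) * signal power); use mu_scale times that bound.
    mu = mu_scale / (num_taps * np.mean(x ** 2))
    w = np.zeros(num_taps)
    e = np.zeros(len(d))
    for k in range(num_taps - 1, len(d)):
        xk = x[k - num_taps + 1:k + 1][::-1]  # newest reference sample first
        e[k] = d[k] - np.dot(w, xk)           # enhanced-speech estimate
        w = w + 2.0 * mu * e[k] * xk          # w_{k+1} = w_k + 2 mu e_k x_k
        if e_limit is not None and abs(e[k]) > e_limit:
            w = 0.1 * w                       # gain reduction on divergence
    return e
```

With a speech-free microphone signal that is a filtered copy of the reference, the weight vector converges toward the unknown acoustic path and the residual e shrinks toward zero.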

7.3 Recognition Results

    We have performed experiments using the adaptive noise cancellation technique on the newly-
collected database. Table 7 summarizes the recognition results.

                                                   W/O Noise       With Noise
           Processing                 Car Speed   Cancellation    Cancellation

              MFCC                     0 m.p.h.      45.1%           67.1%
                                      30 m.p.h.      37.9%           59.1%
                                      55 m.p.h.      36.0%           46.7%
              MFCC                     0 m.p.h.      42.6%           70.7%
  + Cepstral Mean Normalization       30 m.p.h.      47.3%           62.7%
                                      55 m.p.h.      40.5%           47.6%
              MFCC                     0 m.p.h.      47.5%           67.0%
             + CDCN                   30 m.p.h.      58.8%           62.2%
                                      55 m.p.h.      58.2%           62.2%
      Table 7: Recognition Accuracy with and without Noise Cancellation

    As shown in Table 7, the adaptive noise cancellation technique improves recognition accuracy.
A detailed analysis shows that insertion errors are reduced dramatically when we apply noise can-
cellation. When we apply environmental compensation algorithms, the recognition accuracy at 0
m.p.h. is lower than at 55 m.p.h. without noise cancellation, whereas the recognition accuracy at 0
m.p.h. is higher than at 55 m.p.h. with noise cancellation. This is because we can eliminate the car
radio signal effectively, so mostly the running noise remains. Thus, the recognition accuracy
with noise cancellation degrades as the car runs faster.

    We believe, however, that the LMS algorithm developed here is not well optimized. Every pa-
rameter used in this algorithm was chosen on the basis of our informal listening only. We should find
optimal parameters based on recognition performance. While we added the gain-reduction mech-
anism to assure the stability of the estimated speech, we think there may be more reasonable ways to
assure stability. For example, we could use two filters, an LMS filter and a stable filter which is
adapted during the absence of speech, switching from the LMS filter to the stable filter when the es-
timated speech seems to diverge.

7.4 Summary
    In this chapter we have described the adaptive noise cancellation technique we developed to
eliminate the car radio effect. We performed experiments using a newly collected database which
contained both speech corrupted by a car radio signal and the pure car radio signal. The adaptive
noise cancellation technique worked well in the speech recognition experiments.


                             Chapter 8
            Conclusions and Suggestions for Future Work
    In Section 8.1 we summarize the major findings of this report, and in Section 8.2 we offer
suggestions for future work.

8.1 Conclusions

    In this report we have performed several experiments to explore the effect of noise sources in
the automobile on the recognition accuracy of a speech recognition system. We have observed a
great degradation in recognition accuracy while running at high speeds, especially when the win-
dows are open. This running noise is rather stationary.

    The car radio, and especially talk shows, adversely affects recognition accuracy. Degradation
occurs even when the car is not moving. The wipers make intermittent noise that is recognized as
short words. These functional noises have transient characteristics.

    We tried a new spectral representation, MFCC, and got much better results than with LPCC. The general
trends associated with the noise sources are the same as in the baseline results: great degradation oc-
curs at high speeds and in the presence of radio talk shows.

    We also applied two environmental compensation algorithms, cepstral mean normalization and
CDCN. Cepstral mean normalization increases the recognition accuracy in every condition, even
though it is expected primarily to reduce channel effects. CDCN also reduces recognition errors
and obtains better results than cepstral mean normalization. CDCN, however, degrades recognition
accuracy in the presence of radio talk shows or wiper noise. A combination of both techniques does
not provide further improvements in recognition accuracy. This indicates that CDCN can take care
of both stationary additive noise and channel effects properly, and the role of cepstral mean nor-
malization is almost negligible.

    We then developed a histogram-based version of CDCN, which attempts to estimate accurate
statistical parameters of the noise for each utterance. Since the separation of noise and speech is not
good in the power histogram under low signal-to-noise ratio conditions, the estimation of parameters
may not be very accurate. Thus this technique increases deletion and substitution errors while re-
ducing insertion errors.

    Since environmental compensation algorithms cannot overcome dynamic noise like radio talk
shows and wipers, we investigated another approach to enhance speech, adaptive noise cancella-
tion, which is applied to eliminate radio sounds. Experimental results indi-
cate that the technique is very effective in reducing the interference of radio sounds. The dominant noise
in the enhanced speech is running noise, which can be eliminated by CDCN.

8.2 Suggestions for Future Work

    Though we have obtained encouraging results in this study, there is still room for better under-
standing and further improvement in the field of speech recognition in the automobile.

    The first approach to pursue is the further investigation of adaptive noise cancellation tech-
niques. Since the stability of the enhanced speech and the speed of adaptation are the most impor-
tant factors in adaptive noise cancellation, we should find a reasonable way to assure stability
without perturbing the adaptation. Stability is easily destroyed when the speech signal begins. Thus
we should hold the weight vector as fully adapted during the absence of speech, and use this weight
vector when stability seems to break.

    The second approach is the further study of CDCN. Even though we did not obtain better re-
sults from histogram-based CDCN, there should be a way to estimate the statistical parameters more
appropriately. We should develop a robust segmentation scheme to identify the speech frames and
noise frames. We should also find the proper length of utterance to estimate the statistical
parameters reliably.

    The third technique is the use of noise-word models, which were not examined in this study.
These models have the ability to provide some resilience to the effects of nonstationary noises. This
technique is a possible solution for wiper noise. There will be other noise sources that can be com-
pensated for by this technique, such as the turn signals and the horn.

    The fourth approach is to find a way to utilize information from the functions of the car, such as
the tachometer, speedometer, fan switch, turn signals, and horn, for speech recognition. This in-
formation may be useful to identify the noise sources, and the speech recognition system can adapt
the incoming speech to the noise models.

    We also believe that further study of noise sources in the automobile is necessary. It is almost
impossible to study all noise sources in the automobile, and so far the database we used was not
sufficient to represent the variability of the car environment. We also need a more analytical ap-
proach to the modeling of the noise sources. A few efforts toward this end have been undertaken, but
more thorough analysis will help the study of speech recognition in the automobile.


    Finally, we believe that further improvements in compensation techniques can be obtained if
the algorithms are applied at the level of phonetic models in the speech recognition system, such as the
decomposition of input speech into speech and noise by using HMMs. All of our work has taken
place at the waveform level, before speech is input to the system. Application of this compensation
at the HMM-state level is a promising approach for the future.


References
Acero, A. (1990). Acoustical and Environmental Robustness in Automatic Speech Recognition. Ph.D. Dissertation, ECE Department, Carnegie Mellon University.
Dal Degan, N., and Prati, C. (1988). Acoustic Noise Analysis and Speech Enhancement Techniques for Mobile Radio Applications. Signal Processing 15: 43-56.
Davis, S. B., and Mermelstein, P. (1980). Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4): 357-366.
Ephraim, Y., Wilpon, J. G., and Rabiner, L. R. (1987). A Linear Predictive Front-End Processor for Speech Recognition in Noisy Environments. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 1324-1327.
Gales, M. J. F., and Young, S. (1992). An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise. International Conference on Acoustics, Speech and Signal Processing (ICASSP): I-233-I-236.
Hermansky, H. (1990). Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America 87(4): 1738-1752.
Hermansky, H., Morgan, N., Bayya, A., and Kohn, P. (1991). Compensation for the Effect of the Communication Channel in Auditory-like Analysis of Speech (RASTA-PLP). Proc. EUROSPEECH '91: 1367-1370.
Juang, B. H. (1991). Speech Recognition in Adverse Environments. Computer Speech and Language 5: 275-294.
Lecomte, I., Lever, M., Boudy, J., and Tassy, A. (1989). Car Noise Processing for Speech Input. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 512-515.
Lee, K. F. (1989). Automatic Speech Recognition: The Development of the SPHINX System. Boston: Kluwer Academic Publishers.
Lim, J. S. (1983). Speech Enhancement. Englewood Cliffs: Prentice-Hall.
Mansour, D., and Juang, B. H. (1988). The Short-Time Modified Coherence Representation and Its Application for Noisy Speech Recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 525-528.
Mokbel, C., and Chollet, G. (1991). Word Recognition in the Car: Speech Enhancement/Spectral Transformations. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 925-928.
Oh, S., Viswanathan, V., and Papamichalis, P. (1992). Hands-Free Voice Communication in an Automobile with a Microphone Array. International Conference on Acoustics, Speech and Signal Processing (ICASSP): I-281-I-284.
Rabiner, L. R., and Juang, B. H. (1986). An Introduction to Hidden Markov Models. IEEE ASSP Magazine 3(1): 4-16.
Seneff, S. (1988). A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing. Journal of Phonetics 16: 55-76.
Shikano, K. (1986). Evaluation of LPC Spectral Matching Measures for Phonetic Unit Recognition. Technical Report CMU-CS-86-108, CS Department, Carnegie Mellon University.
Varga, A. P., and Moore, R. K. (1990). Hidden Markov Model Decomposition of Speech and Noise. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 845-848.
Ward, W. (1989). Modelling Non-verbal Sounds for Speech Recognition. Proc. Speech and Natural Language Workshop: 47-50.
Widrow, B., and Stearns, S. D. (1985). Adaptive Signal Processing. Englewood Cliffs: Prentice-Hall.
Zue, V., Glass, J., Goodine, D., Phillips, M., and Seneff, S. (1990). The SUMMIT Speech Recognition System: Phonological Modelling and Lexical Access. International Conference on Acoustics, Speech and Signal Processing (ICASSP): 49-52.
