Automatic Speech Recognition Studies

Guy Brown, Amy Beeston
and Kalle Palomäki




PERCEPTUAL COMPENSATION | EPSRC 12-MONTH PROGRESS MEETING   1
Overview
• Aims
• The articulation index (AI) corpus
• Phone recogniser
• Results on sir/stir subset of AI corpus
• Future plans




Aims
• Aim to develop a 'perceptual constancy' front-end for
  automatic speech recognition (ASR).
• Should be compatible with the findings of Watkins et al.,
  but also validated on a 'real world' ASR task.
     – wider vocabulary
     – range of reverberation conditions
     – variety of speech contexts
     – naturalistic speech, rather than interpolated stimuli
     – consider phonetic confusions in reverberation in general




Progress to date
• Current work has focused on implementing a
  baseline ASR system for the articulation index (AI)
  corpus, which meets the requirements for speech
  material stated on previous slide.
• So far we have results for phone recognition on a small
  test set, without any 'constancy' processing.
• Planning evaluation that compares phonetic
  confusions made by listeners and ASR on the same
  test.


The articulation index (AI) corpus
• Recorded by Jonathan Wright (University of
  Pennsylvania), available via LDC.
• Intended for speech recognition in noise experiments
  similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances
  are similar to those used by Watkins et al.:
     – English (American)
     – Target syllables are mostly nonsense, but some correspond to real
       words (including “sir” and “stir”)
     – Target syllables are embedded in a context sentence drawn from a
       limited vocabulary

Details of the AI corpus
• Includes all “valid” English diphone (CV, VC)
  syllables.
• Triphone syllables (CVC, CCV, VCC) chosen
  according to frequency in Switchboard corpus
     – correlated with syllable frequency in casual conversation.
• 12 male speakers, 8 female speakers.
• Approximately 2000 syllables common to all
  speakers.
• Small amount (10 min) of conversational data.
• All speech data sampled at 16 kHz.
AI corpus examples
• Target syllable preceded by two context words and
  followed by one context word:
     – CW1 CW2 SYL CW3
     – CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words
       respectively
• Examples:
          they recognise sir entirely
          people ponder stir second




Phone recogniser
• Monophone recogniser implemented and trained on
  the TIMIT corpus.
• Based on HTK scripts by Tony Robinson1.
• Front-end: speech encoded as 12 cepstral
  coefficients + energy + deltas + accelerations
  (39 features).
• Cepstral mean normalisation applied.
• 3 emitting states per phone model, observations
  modelled by 20-component Gaussian mixtures per state.
• Approximately 58% phone accuracy on the TIMIT test set.
1http://www.cantabResearch.com/HTKtimit.html
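As a concrete illustration of this front-end, the sketch below (plain NumPy, not the actual HTK configuration) appends delta and acceleration coefficients to 13 static features and applies per-utterance cepstral mean normalisation. The regression window length and the choice to normalise all 39 dimensions are assumptions for illustration.

```python
import numpy as np

def deltas(feats, window=2):
    """Standard regression deltas over +/- `window` frames (edge-padded)."""
    denom = 2 * sum(w * w for w in range(1, window + 1))
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    n = len(feats)
    out = np.zeros_like(feats)
    for w in range(1, window + 1):
        out += w * (padded[window + w:window + w + n]
                    - padded[window - w:window - w + n])
    return out / denom

def cmn(feats):
    """Cepstral mean normalisation: subtract the per-utterance mean."""
    return feats - feats.mean(axis=0)

static = np.random.randn(100, 13)        # 12 cepstra + log energy per frame
d = deltas(static)                       # velocity coefficients
a = deltas(d)                            # acceleration coefficients
full = cmn(np.hstack([static, d, a]))    # 100 frames x 39 features
```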



Training and testing
• Trained on TIMIT training set.
• Really needs adapting to the AI corpus material; work
  in progress.
• Removed allophones from TIMIT labels (as is usual)
  to give a 41-phone set.
• Short pause and silence models.
• For testing on AI corpus, word-level transcriptions
  were expanded into phone sequences using
  Switchboard-ICSI pronunciation dictionary.
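A minimal sketch of this expansion step, using a tiny hypothetical dictionary in place of the Switchboard-ICSI one (the entries below are illustrative, not taken from that dictionary):

```python
# Hypothetical mini pronunciation dictionary; the real system uses the
# Switchboard-ICSI dictionary (these entries are illustrative only).
PRON = {
    "they": ["dh", "ey"],
    "recognise": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "sir": ["s", "er"],
    "entirely": ["eh", "n", "t", "ay", "er", "l", "iy"],
}

def words_to_phones(transcript):
    """Expand a word-level transcript into a flat phone sequence."""
    phones = []
    for word in transcript.lower().split():
        phones.extend(PRON[word])
    return phones

print(words_to_phones("they recognise sir entirely"))
```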


Experiments
• Initial experiments done with a subset of AI corpus
  utterances in which the target syllable is “sir” or “stir”.
• Small test set of 40 utterances:

                           Male speaker        Female speaker
             “Sir”                12                    8
             “Stir”               12                    8




Experiment 1: Fletcher-style paradigm
• A recogniser grammar was used in which
     – The sets of context words CW1, CW2 and CW3 are specified;
     – Target syllable is any sequence of two or three phones.
• Corresponds to task in which listener knows that
  context words are drawn from a limited set.
• Recogniser grammar is a (rather unconventional) mix
  of word-level and phone-level labels.




Experiment 1: recogniser grammar
$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;

$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD |
OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE |
DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER
| CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW
| SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW |
WITNESS;

$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY |
DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE |
EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND |
THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY |
TYPICALLY | USUALLY | WISELY;

$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L |
M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;

(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
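To give a feel for how loosely constrained this grammar is, the sketch below counts the word sequences it admits, using the set sizes stated earlier (8, 51 and 44 context words; note that the $cw2 list above contains REPEAT twice) and the 41-phone set:

```python
# Set sizes as stated on the earlier slide: 8, 51 and 44 context words
# and a 41-phone inventory.
n_cw1, n_cw2, n_cw3 = 8, 51, 44
n_phn = 41

# "$phn $phn [$phn]" admits any two- or three-phone target syllable.
n_targets = n_phn**2 + n_phn**3
n_sentences = n_cw1 * n_cw2 * n_targets * n_cw3
print(n_targets, n_sentences)
```

With over a billion admissible sentences, almost all of the constraint falls on the context words rather than the target syllable.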




Experiment 1: results
• Overall 47.5% correct at word level (sir/stir).
• Context words were not correctly recognised in some
  cases, with a knock-on effect on recognition of the
  target syllable.
• Examples (reference → recogniser output):
        they imagine stir surely      → they imagine s t er surely    Correct
        they sense stir gladly        → they sense s er gladly        Deletion
        I evoke sir precisely         → I evoke s eh precisely        Substitution (/eh/ as in 'head')
        they recognize sir entirely   → they witness er n p daily     Incorrect context words


Experiment 2: constrained sir/stir
• A recogniser grammar was used in which
     – The sets of context words CW1, CW2 and CW3 are specified;
     – Target syllable is constrained to “sir” or “stir”;
     – Canonical pronunciation of “sir” and “stir” is assumed
       (i.e. “sir” = /s er/ and “stir” = /s t er/)
• Corresponds to Watkins-style task, except that
  context words vary and are drawn from a limited set.
• Utterances either presented clean or convolved with
  the left channel or right channel of the L-shaped room
  or corridor BRIRs.
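The convolution step can be sketched as follows (NumPy; the trimming to input length and the peak normalisation are our assumptions, not a documented part of the stimulus preparation):

```python
import numpy as np

def reverberate(clean, brir, channel=0):
    """Convolve a clean utterance with one channel of a binaural room
    impulse response (BRIR), trim to the input length, and peak-normalise.
    Trimming and normalisation are illustrative assumptions."""
    wet = np.convolve(clean, brir[:, channel])[:len(clean)]
    return wet / (np.max(np.abs(wet)) + 1e-12)

# Synthetic stand-ins for an AI corpus utterance and a measured BRIR
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                       # 1 s at 16 kHz
decay = np.exp(-np.arange(4000) / 800.0)                 # toy decay envelope
brir = rng.standard_normal((4000, 2)) * decay[:, None]   # stereo impulse response
wet = reverberate(clean, brir, channel=0)                # left channel
```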


Experiment 2: recogniser grammar
• Recogniser grammar was
    $test = SIR | STIR;
    ( !ENTER $cw1 $cw2 $test $cw3 !EXIT )

    with $cw1, $cw2 and $cw3 defined as before.




Results: L-shaped room, left channel
   Impulse response ID      % Correct   sir→sir  sir→stir  stir→sir  stir→stir
   clean.wav                   95.0        18        2         0        20
   outconv22feb31p5.wav        92.5        18        2         1        19
   outconv22feb63.wav          85.0        18        2         4        16
   outconv22feb125.wav         72.5        13        7         4        16
   outconv22feb250.wav         67.5         9       11         2        18
   outconv22feb500.wav         62.5        15        5        10       10
   outconv22feb1000.wav        65.0        14        6         8       12
   (counts are reference→recognised, over 20 utterances of each word)
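The % Correct column is simply the trace of each 2×2 confusion matrix over the 40 test utterances, e.g.:

```python
def percent_correct(confusion):
    """Word-level % correct from a 2x2 confusion matrix laid out as
    [[sir→sir, sir→stir], [stir→sir, stir→stir]]."""
    correct = confusion[0][0] + confusion[1][1]
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# clean.wav condition: 18 + 20 of 40 utterances correct
print(percent_correct([[18, 2], [0, 20]]))   # prints 95.0
```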


Results: L-shaped room, right channel
   Impulse response ID      % Correct   sir→sir  sir→stir  stir→sir  stir→stir
   clean.wav                   95.0        18        2         0        20
   outconv22feb31p5.wav        87.5        17        3         2        18
   outconv22feb63.wav          85.0        19        1         5        15
   outconv22feb125.wav         87.5        15        5         1        19
   outconv22feb250.wav         82.5        16        4         3        17
   outconv22feb500.wav         67.5        16        4         9        11
   outconv22feb1000.wav        65.0        14        6         8       12


Results: corridor, left channel
   Impulse response ID      % Correct   sir→sir  sir→stir  stir→sir  stir→stir
   clean.wav                   95.0        18        2         0        20
   outconv22feb31p5.wav        90.0        18        2         2        18
   outconv22feb63.wav          87.5        19        1         4        16
   outconv22feb125.wav         77.5        15        5         4        16
   outconv22feb250.wav         72.5        17        3         8        12
   outconv22feb500.wav         67.5        17        3        10       10
   outconv22feb1000.wav        57.5        14        6        11        9


Results: corridor, right channel
   Impulse response ID      % Correct   sir→sir  sir→stir  stir→sir  stir→stir
   clean.wav                   95.0        18        2         0        20
   outconv22feb31p5.wav        90.0        18        2         2        18
   outconv22feb63.wav          87.5        18        2         3        17
   outconv22feb125.wav         87.5        18        2         3        17
   outconv22feb250.wav         85.0        16        4         2        18
   outconv22feb500.wav         82.5        19        1         6        14
   outconv22feb1000.wav        60.0        15        5        11        9


Conclusions
• The phone recogniser works well when constrained
  to recognise "sir"/"stir" only (95% correct).
• Recognition rate falls as reverberation increases, as
  expected.
• The fall in performance is not only due to "stir" being
  reported as "sir", which is what human studies would
  lead us to expect.
• Some effects of BRIR channel on performance: the right
  channel of the corridor BRIR is less problematic,
  most likely due to a strong early reflection in the right
  channel for the 5 m condition.


Plans for next period: experiments
• The AI corpus lends itself to experiments in which
  target and context are varied as in Watkins et al.
  experiments.
• Suggestion:
     – Compare listener and ASR phone confusions under conditions in
       which the whole utterance is reverberated, and when reverberation
       is added to the target syllable only.
• Possible problems:
     – Relatively insensitive design? Will effect of reverberation be
       sufficient to show up as consistent phone confusions?
     – Are the contexts long enough? (some contexts are as short as 0.5 s)
     – As shown in baseline studies, recogniser does not necessarily
       make the same mistakes as human listeners.

AI corpus sir/stir stimuli
• Utterances similar to sir/stir format
     – Wider variety of speakers/contexts (but still limited vocabulary)
     – Targets mostly nonsense, but some real words (e.g. sir/stir)
     – Reverberated (by Amy) according to sir-stir paradigm

          [Stimulus conditions: near-near, near-far, far-far]

• Widening sir/stir paradigm towards ASR environment
     – Introduce different stop consonants first: s {t,p,k} ir
     – Look for confusion in place of articulation

          [Stimulus conditions: near-near, near-far, far-far]



Test words from AI corpus
   AI label   rhymes with    s      st      sp      sk      all four real?

   s_xq       hat            sa     sta     spa     ska     nonsense
   s_xe       bet            se     ste     spe     ske     nonsense
   s_xi       hit            si     sti     spi     ski     nonsense
   s_xr       bird           sir    stir    spur    skerr   3/4
   s_a        cot            so     sto     spo     sko     nonsense
   s_e        bait           say    stay    spay    skay    3/4
   s_i        bee            see    stee    spee    skee    2/4
   s_o        boat           sow    stow    spo     sko     2/4

 We could record our own: sigh, sty, spy, sky (sky is missing)

Questions for Tony
• Generally - would this sort of thing work?
• Is the initial delay in BRIR kept?
• How should the AI corpus signals be level-normalised
  when mixed reverberation distances are used?
• How to control the ordering of stimuli?
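One plausible answer to the level-normalisation question, offered as a sketch rather than a recommendation: scale every stimulus to a common RMS after reverberation, so that distance conditions do not also differ in overall level. The target value below is arbitrary.

```python
import numpy as np

def rms_normalise(signal, target_rms=0.05):
    """Scale a signal to a fixed RMS level (target value is arbitrary)."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal * (target_rms / (rms + 1e-12))

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 440 Hz tone, 1 s
y = rms_normalise(x)
```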




Plans: system development
• Currently the ASR system is trained on TIMIT; expect
  improvement if adapted to the AI corpus material.
• We only have word-level transcriptions for the AI corpus,
  so phone labels must be obtained by forced alignment.
• We will try the efferent model as a front end for
  recognition of reverberated speech, however:
     – it may not be sufficiently general, having been developed/tuned
       only for the sir/stir task
     – that said, we have shown elsewhere that efferent suppression is
       effective in improving ASR performance in additive noise
     – there is some relationship between the efferent model and
       successful engineering approaches


Plans: system development
• The current efferent model is related to the engineering
  approach of Thomas et al. (2008):
     – “the effect of reverberation is reduced when features are extracted
       from gain normalized temporal envelopes of long duration in narrow
       subbands”
• Our efferent model also does gain control over long-
  duration windows (and will work in narrow bands).
• The model currently produces a spectral
  representation, but could be modified to give cepstral
  features for ASR.
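A toy version of the shared idea, gain normalisation of a subband temporal envelope over a long window (the window length is an arbitrary choice here, and this is not the efferent model itself):

```python
import numpy as np

def gain_normalise_envelope(envelope, win=8000):
    """Divide a subband temporal envelope by its local mean over a long
    window (`win` samples, an assumed value). Gain variation slower than
    the window (e.g. the room response) is flattened, while faster
    speech modulation is preserved."""
    kernel = np.ones(win) / win
    local_gain = np.convolve(envelope, kernel, mode="same")
    return envelope / (local_gain + 1e-12)

rng = np.random.default_rng(1)
env = np.abs(rng.standard_normal(20000)) + 0.1   # toy positive envelope
flat = gain_normalise_envelope(env)
```

Because the local gain scales with the input, the output is invariant to any constant gain applied to the envelope.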


Plans: other approaches
• Parallel search over room acoustics and word
  models?
     – How would context effects be included in such a scheme?
     – On-line selection of word models trained in dry or reverberant
       conditions, according to context characteristics?
• Recognition within individual bands (i.e. train
  recogniser for each band and combine posterior
  probabilities)
     – May allow modelling of Watkins et al. 8-band results
     – Performance of multiband systems generally lower than
       conventional ASR
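The posterior-combination step might look like the following sketch, using a normalised geometric mean as one simple fusion rule (other rules are possible):

```python
import numpy as np

def combine_band_posteriors(band_posteriors):
    """Fuse per-band phone posterior vectors with a normalised geometric
    mean (average in the log domain), one simple multiband fusion rule."""
    logp = np.mean(np.log(np.asarray(band_posteriors) + 1e-12), axis=0)
    p = np.exp(logp)
    return p / p.sum()

# Two bands' posteriors over a toy 3-phone inventory
p = combine_band_posteriors([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
```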


                                  Lunch





								