					Stress-Accent and Vowel Quality
                              in
       The Switchboard Corpus
          Steven Greenberg and Leah Hitchcock
            International Computer Science Institute
             1947 Center Street, Berkeley, CA 94704

             http://www.icsi.berkeley.edu/~steveng




 NIST Workshop on Large Vocabulary Continuous Speech Recognition
           Maritime Institute of Technology, May 4, 2001
                    Take Home Messages
• There is an intimate relationship between vocalic identity, nucleic
    duration and stress accent in spontaneous dialogue (at least in
    the Switchboard corpus)
• Stressed syllables tend to have significantly longer nuclei than
     their unstressed counterparts, consistent with the findings
     reported by Silipo and Greenberg in previous years’ meetings
     regarding the OGI Stories corpus (telephone monologues)
• Certain vocalic classes exhibit a far greater dynamic range in
    duration than others
   – Diphthongs tend to be longer than monophthongs, BUT ...
   – The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of
     duration and dynamic range under stress (accent) similar to diphthongs
• The statistical patterns are consistent with the hypothesis that
    duration serves under many conditions as either a primary or
    secondary cue for vowel height (normally associated with the
    frequency of the first formant)
                   Take Home Messages
• Moreover, the stress-accent system in spontaneous (American)
    English appears to be closely associated with vocalic identity
• Low vowels are far more likely to be fully stressed than high vowels
    (with the mid vowels exhibiting an intermediate probability of
    being stressed)
• Thus, the identity of a vowel cannot be considered independently of
    stress-accent
• The two parameters are likely to be flip sides of the same Koine
• Although English is not generally considered to be a vowel-quantity
     language (as is Finnish), given the close relationship between
     stress-accent and duration, and between duration and vowel
     quality, there is some sense in which English (and perhaps other
     stress-accent languages) manifests certain properties of a
     “quantity” system
• Thus, vowel duration may be an important factor in disambiguating
    spoken language and therefore should be of interest to the
    speech recognition community
What is (usually) Meant by Prosodic Stress?
• Prosody is supposed to pertain to extra-phonetic cues in the
    acoustic signal
• The pattern of variation over a sequence of SYLLABLES
    pertaining to: syllabic DURATION, AMPLITUDE and PITCH
    (f0) variation over time (but the plot thickens, as we shall see)
     Why is Prosodic Stress Important?
• It supposedly provides important information about:
  Focus of the speaker’s attention and emphasis for the listener
  What is “new” and “important” information
  Emotional context of the utterance - surprise, sarcasm, shock, delight,
  anger, impatience, etc.
  Syntactic disambiguation, particularly at the clausal/sentential level
    e.g., interrogative, declarative forms
  Perceptual processing - parsing the utterance into “chunks” for reliable
    understanding

• Prosody provides a window onto the higher levels of language
    Can be useful for developing semantic-oriented models for speech
    understanding (“Information spotting”)

• Prosody affects pronunciation (and vice versa)
  Can be useful for modeling pronunciation variation in ASR
  Phonetic properties may be correlated with prosodic stress -
  THIS IS THE TOPIC FOR TODAY’S PRESENTATION
    The Nitty Gritty (a.k.a. the Corpus Material)
•   SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS (same as
          Phoneval-2000)
    –   Switchboard contains informal telephone dialogues
    –   54 minutes of material that had previously been phonetically
           transcribed (by highly trained phonetics students from
           UC-Berkeley)
    – 45.5 minutes of “pure” speech (filled pauses, junctures filtered out),
      consisting of:
         9,991 words, 13,446 syllables, 33,370 phonetic segments
    –   All of this material had been hand-segmented at either the phonetic-
              segment or syllabic level by the transcribers
    –   The syllabic-segmented material was subsequently segmented at the
          phonetic-segment level by a special-purpose neural network
          trained on 72 minutes of hand-segmented Switchboard material.
          This automatic segmentation was manually verified
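The corpus totals above imply some simple rates that are handy as a sanity check. The sketch below only does arithmetic on the figures quoted on the slide; the function name and the choice of derived quantities are illustrative assumptions, not part of the original analysis.

```python
# Totals quoted on the slide for the 45.5 minutes of "pure" speech.
SPEECH_MINUTES = 45.5
WORDS = 9_991
SYLLABLES = 13_446
SEGMENTS = 33_370


def corpus_rates(minutes, words, syllables, segments):
    """Return simple per-unit rates derived from the corpus totals.

    The set of rates is an illustrative choice, not taken from the slides.
    """
    seconds = minutes * 60.0
    return {
        "syllables_per_word": syllables / words,
        "segments_per_syllable": segments / syllables,
        "syllables_per_second": syllables / seconds,
        "mean_syllable_duration_ms": 1000.0 * seconds / syllables,
    }


rates = corpus_rates(SPEECH_MINUTES, WORDS, SYLLABLES, SEGMENTS)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

On these totals the material averages about 1.35 syllables per word and roughly 4.9 syllables per second.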
           Evaluation Material Details
• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
   – 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10% (mean = 4.75 s)
• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
• A WIDE RANGE OF DISCUSSION TOPICS
• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
[Figure: two bar charts of the number of utterances. Left, by dialect region
 (S_Mid, N_Mid, N_East, West, South, NYC, Other); right, by subjective
 difficulty (V_Easy, Easy, Medium, Hard, V_Hard)]
       Manual Transcription of Stress Accent
• 2 UC-Berkeley Linguistics students each transcribed the full 45
      minutes of material (i.e., there is 100% overlap between the 2)
• Three levels of stress-accent were marked for each syllabic nucleus
   – Fully stressed (78% concordance between transcribers)
   – Completely unstressed (85% interlabeler agreement)
   – An intermediate level of accent (neither fully stressed nor completely
       unstressed; ca. 60% concordance)
   – Hence, 95% concordance in terms of some level of stress
• The labels of the two transcribers were averaged
   – In those instances where there was disagreement, the magnitude of
       disparity was almost always (ca. 90%) one step. Usually,
       disagreement signaled a genuine ambiguity in stress accent
• The illustrations in this presentation are based solely on those
       data in which both transcribers concurred (i.e., fully stressed
       or completely unstressed)
• A table containing the complete set of data is in a paper
       submitted to Eurospeech (in the workshop notebook)
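The labelling procedure above (two transcribers, three levels, labels averaged, figures restricted to concordant cases) can be sketched as follows. The numeric coding of the levels (0, 0.5, 1) and all function names are assumptions made for illustration; the slides do not state how the levels were encoded.

```python
# Assumed numeric coding of the three stress-accent levels (not from the slides).
LEVELS = {"unstressed": 0.0, "intermediate": 0.5, "stressed": 1.0}


def average_labels(labels_a, labels_b):
    """Average the two transcribers' labels, nucleus by nucleus."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same nuclei")
    return [(LEVELS[a] + LEVELS[b]) / 2.0 for a, b in zip(labels_a, labels_b)]


def agreement_rate(labels_a, labels_b):
    """Fraction of nuclei on which both transcribers chose the same level."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


def concordant_indices(labels_a, labels_b):
    """Indices where both chose 'stressed' or both chose 'unstressed'."""
    return [
        i
        for i, (a, b) in enumerate(zip(labels_a, labels_b))
        if a == b and a in ("stressed", "unstressed")
    ]


# Hypothetical labels for five syllabic nuclei.
t1 = ["stressed", "unstressed", "intermediate", "stressed", "unstressed"]
t2 = ["stressed", "unstressed", "stressed", "intermediate", "unstressed"]

averaged = average_labels(t1, t2)
agreement = agreement_rate(t1, t2)
kept = concordant_indices(t1, t2)
print(averaged, agreement, kept)
```

With this coding, a one-step disagreement shows up as an averaged value of 0.25 or 0.75, which keeps the ambiguous cases distinguishable from the concordant ones.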
 The “Conventional Wisdom” on Stress-Accent
  "Pitch is widely regarded, at least in English, as the most salient
  determinant of prominence. In other words, when a syllable or word is
  perceived as 'stressed' or 'emphasized,' it is pitch height or a change in
  pitch, more than length or loudness that is likely to be mainly responsible
  (see, for example, Fry 1958, Gimson 1980, pp. 222-226, Lehiste 1976,
  Fudge, 1984, ch. 1)"
  Clark, J. and Yallop, C. (1990) An Introduction to Phonetics and Phonology. Oxford, Blackwell, p. 280.

"In fact, although it is clear that stressed syllables often have greater overall
acoustic intensity than weakly stressed ones, loudness seems to be the
least salient and least consistent of the three parameters of pitch, duration
and loudness - at least for purposes such as signaling stress" (ibid, p. 282)
“Thus, according to the ‘general consensus’ the important parameters are
 (in order) - PITCH, DURATION, LOUDNESS”
(the latter most closely correlated with TOTAL ENERGY, i.e., duration x
amplitude; cf. further on)
 OGI Stories - Pitch Doesn’t Cut the Mustard
• Although pitch range is the most important of the f0-related cues,
      it is not as good a predictor of stress as DURATION
[Figure: comparison of Amplitude, Pitch Range, Duration and Av. Pitch as
 predictors of stress]
  Total Energy is the Best Predictor of Stress
• Duration x Amplitude is superior to all other combination pairs
    of acoustic parameters. Pitch appears redundant with duration.
[Figure: comparison of parameter pairs as predictors of stress: Duration x
 Amplitude, Dur x Pitch Range, Dur x Pitch Av, Pitch Range x Average,
 Pitch Av x Amp, Pitch Range x Amp, and Duration]
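Total (integrated) energy is described here as duration x amplitude. A minimal sketch of that quantity follows, assuming each nucleus is available as a list of per-frame amplitude values at a fixed frame step. The 10 ms step, the function names and the sample values are all hypothetical.

```python
def integrated_energy(frame_amplitudes, frame_step_s=0.010):
    """Amplitude summed over the nucleus and scaled by the frame step."""
    return sum(frame_amplitudes) * frame_step_s


def duration_times_mean_amplitude(frame_amplitudes, frame_step_s=0.010):
    """Same quantity written as duration x mean amplitude."""
    duration_s = len(frame_amplitudes) * frame_step_s
    mean_amplitude = sum(frame_amplitudes) / len(frame_amplitudes)
    return duration_s * mean_amplitude


# Hypothetical nuclei: a long stressed one and a short unstressed one.
stressed = [0.8] * 12     # 120 ms at an amplitude of 0.8
unstressed = [0.7] * 5    # 50 ms at an amplitude of 0.7

e_stressed = integrated_energy(stressed)
e_unstressed = integrated_energy(unstressed)

amplitude_ratio = 0.8 / 0.7
energy_ratio = e_stressed / e_unstressed
print(f"amplitude ratio: {amplitude_ratio:.2f}")
print(f"energy ratio:    {energy_ratio:.2f}")
```

In this toy case the two nuclei differ only slightly in amplitude, but the energy ratio is much larger because duration enters the product.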
         A Brief Primer on Vocalic Acoustics
• Vowel quality is generally thought to be a function primarily of two
  articulatory properties - both related to the motion of the tongue
   – The front-back plane is most closely associated with the second
     formant frequency (or more precisely F2 - F1) and the volume of the
     front-cavity resonance
   – The height parameter is closely linked to the frequency of F1
• In the classic vowel “triangle” segments are positioned in terms of
      the tongue positions associated with their production, as follows:
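One way to make the layout concrete is a small lookup table. The sketch below assumes ARPAbet-style symbols and a conventional textbook classification of some monophthongs; it is not taken from the slide's figure, and [ao] is placed with the low vowels only because the slides group it that way.

```python
# Height and front-back class for some monophthongs (illustrative assumption,
# not data from the slides). [ao] is listed as low to match the slides' grouping.
VOWEL_TRIANGLE = {
    "iy": ("high", "front"),
    "ih": ("high", "front"),
    "eh": ("mid", "front"),
    "ae": ("low", "front"),
    "ah": ("mid", "central"),
    "aa": ("low", "back"),
    "ao": ("low", "back"),
    "uh": ("high", "back"),
    "uw": ("high", "back"),
}

# F1 rises as the tongue lowers, so low vowels get the largest rank.
F1_RANK = {"high": 0, "mid": 1, "low": 2}


def vowels_at_height(height):
    """Alphabetical list of vowels in the given height class."""
    return sorted(v for v, (h, _) in VOWEL_TRIANGLE.items() if h == height)


def has_higher_f1(vowel_a, vowel_b):
    """True if vowel_a is expected to have a higher F1, by height class only."""
    height_a = VOWEL_TRIANGLE[vowel_a][0]
    height_b = VOWEL_TRIANGLE[vowel_b][0]
    return F1_RANK[height_a] > F1_RANK[height_b]


low_vowels = vowels_at_height("low")
print(low_vowels)
print(has_higher_f1("ae", "iy"))
```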
   Duration/Amplitude/Int. Energy - Which?
• There are supposed to be large differences in the “intrinsic”
    amplitude and duration of vowels
• Could such differences be compensated for in terms of stress?
• Let’s take a closer look!
Amplitude Differences - Stressed/Unstressed
• There are very small differences in amplitude between stressed
  and unstressed nuclei
• The lax monophthongs tend to have a slightly larger dynamic
    range than diphthongs
Durational Differences - Stressed/Unstressed
• There is a large dynamic range in duration between stressed and
    unstressed nuclei
• Diphthongs and tense, low monophthongs tend to have a larger
    range than the lax monophthongs
Int. Energy Differences - Stressed/Unstressed
• There is a large dynamic range in integrated energy between
    stressed and unstressed nuclei
• Diphthongs and tense, low monophthongs tend to have a larger
    range than the lax monophthongs
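The "dynamic range" between stressed and unstressed nuclei can be sketched per vowel as below. Expressing it as the ratio of mean stressed to mean unstressed duration is an assumption (the slides do not define the measure), and the records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def duration_dynamic_range(nuclei):
    """Summarise stressed vs. unstressed duration per vowel.

    nuclei: iterable of (vowel, is_stressed, duration_ms).
    Returns {vowel: (mean_stressed, mean_unstressed, ratio)} for vowels
    that occur in both conditions. The ratio is an assumed definition.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for vowel, is_stressed, duration_ms in nuclei:
        buckets[vowel][is_stressed].append(duration_ms)

    summary = {}
    for vowel, groups in buckets.items():
        if groups[True] and groups[False]:
            stressed_mean = mean(groups[True])
            unstressed_mean = mean(groups[False])
            summary[vowel] = (stressed_mean, unstressed_mean,
                              stressed_mean / unstressed_mean)
    return summary


# Hypothetical durations in milliseconds.
records = [
    ("ay", True, 180.0), ("ay", True, 200.0),
    ("ay", False, 90.0), ("ay", False, 110.0),
    ("ih", True, 70.0), ("ih", True, 80.0),
    ("ih", False, 50.0), ("ih", False, 50.0),
]

ranges = duration_dynamic_range(records)
for vowel, (s, u, ratio) in ranges.items():
    print(f"{vowel}: stressed {s:.0f} ms, unstressed {u:.0f} ms, ratio {ratio:.2f}")
```

The same function applies unchanged to amplitude or integrated energy if those values are passed in place of duration.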
  Spatial Patterning of Duration and Amplitude
• Let’s return to the vowel triangle and see if it can shed light on
     certain patterns in the vocalic data
• Duration, amplitude and their product (integrated energy) will
     be plotted on a 2-D grid, where the x-axis will always be in terms
     of hypothetical front-back tongue position (and hence remain
     constant throughout the plots to follow)
• The y-axis will serve as the dependent measure, sometimes
  expressed in terms of duration, or amplitude, or their product
     Diphthongal Amplitude and Vowel Height
[Figure: all nuclei]
  Monophthongal Amplitude and Vowel Height
[Figure: all nuclei]
      Amplitude - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
     Diphthongal Duration and Vowel Height
[Figure: all nuclei]
   Monophthongal Duration and Vowel Height
[Figure: all nuclei]
     Duration - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
    Diphthongal Int. Energy and Vowel Height
[Figure: all nuclei]
 Monophthongal Int. Energy and Vowel Height
[Figure: all nuclei]
     Int. Energy - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
    Diphthongal Amplitude and Vowel Height
[Figure: stressed nuclei]
    Diphthongal Amplitude and Vowel Height
[Figure: unstressed nuclei]
 Monophthongal Amplitude and Vowel Height
[Figure: stressed nuclei]
 Monophthongal Amplitude and Vowel Height
[Figure: unstressed nuclei]
      Amplitude - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side; stressed (top) and
 unstressed (bottom)]
    Diphthongal Duration and Vowel Height
[Figure: stressed nuclei]
    Diphthongal Duration and Vowel Height
[Figure: unstressed nuclei]
   Monophthongal Duration and Vowel Height
[Figure: stressed nuclei]
   Monophthongal Duration and Vowel Height
[Figure: unstressed nuclei]
       Duration - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side; stressed (top) and
 unstressed (bottom)]
   Diphthongal Int. Energy and Vowel Height
[Figure: stressed nuclei]
   Diphthongal Int. Energy and Vowel Height
[Figure: unstressed nuclei]
 Monophthongal Int. Energy and Vowel Height
[Figure: stressed nuclei]
 Monophthongal Int. Energy and Vowel Height
[Figure: unstressed nuclei]
    Int. Energy - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side; stressed (top) and
 unstressed (bottom)]
                  Mystery Parameter
• There is one other parameter that shows an interesting pattern when
  plotted in a vowel triangle
• This is the proportion of stressed and unstressed nuclei
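The proportion of stressed nuclei per vowel-height class can be tallied as in the sketch below. The record layout, the function name and the counts are hypothetical; they only illustrate the computation, not the corpus results.

```python
from collections import Counter


def proportion_stressed(nuclei):
    """Fraction of fully stressed nuclei in each height class.

    nuclei: iterable of (height, label), with label 'stressed' or
    'unstressed' (concordant cases only, as in the figures).
    """
    totals = Counter()
    stressed = Counter()
    for height, label in nuclei:
        totals[height] += 1
        if label == "stressed":
            stressed[height] += 1
    return {height: stressed[height] / totals[height] for height in totals}


# Hypothetical counts chosen only to exercise the function.
sample = (
    [("low", "stressed")] * 3 + [("low", "unstressed")] * 1
    + [("mid", "stressed")] * 2 + [("mid", "unstressed")] * 2
    + [("high", "stressed")] * 1 + [("high", "unstressed")] * 3
)

proportions = proportion_stressed(sample)
for height in ("low", "mid", "high"):
    print(f"{height}: {proportions[height]:.2f}")
```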
Proportion of Stress Accent and Vowel Height
      Amplitude - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
     Duration - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
     Int. Energy - Monophthongs vs. Diphthongs
[Figure: diphthongs and monophthongs side by side, all nuclei]
                Summary and Conclusions
• There is an intimate relationship between vocalic identity, nucleic
    duration and stress accent in spontaneous dialogue (at least in
    the Switchboard corpus)
• Stressed syllables tend to have significantly longer nuclei than
     their unstressed counterparts, consistent with the findings
     reported by Silipo and Greenberg in previous years’ meetings
     regarding the OGI Stories corpus (telephone monologues)
• Certain vocalic classes exhibit a far greater dynamic range in
    duration than others
   – Diphthongs tend to be longer than monophthongs, BUT ...
   – The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of
     duration and dynamic range under stress (accent) similar to diphthongs
• The statistical patterns are consistent with the hypothesis that
    duration serves under many conditions as either a primary or
    secondary cue for vowel height (normally associated with the
    frequency of the first formant)
               Summary and Conclusions
• Moreover, the stress-accent system in spontaneous (American)
    English appears to be closely associated with vocalic identity
• Low vowels are far more likely to be fully stressed than high vowels
    (with the mid vowels exhibiting an intermediate probability of
    being stressed)
• Thus, the identity of a vowel cannot be considered independently of
    stress-accent
• Thus, vowel duration may be an important factor in disambiguating
    spoken language and therefore should be of interest to the
    speech recognition community

				