Landmark-Based Speech Recognition:
Spectrogram Reading, Support Vector Machines,
Dynamic Bayesian Networks, and Phonology

Mark Hasegawa-Johnson
jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA

Lecture 10. Phonology
• Foundations of distinctive feature theory
    – Writing, phonemes, and the place-manner distinction
    – A universal encoding of phonemes: Bell
    – A universal binary encoding of phonemes: Jakobson
• Distinctive features and speech perception
    – Information theory: Shannon
    – Independence of feature errors: Miller
    – Dependence of distinctive feature acoustics!!
• Distinctive features and speech articulation
    –   Rule-based modification of distinctive features: Halle
    –   Syllable boundaries and word boundaries
    –   Psychological processing, the lexicon, and surface phonology
    –   Spreading of autosegmental distinctive features: Goldsmith
• Unification of perceptual and articulatory accounts: Stevens
    –   The quantal theory of speech production
    –   Articulator-free features, articulator-bound features, and landmarks
    –   Language-specific features
    –   Redundant features
            A Brief History of Writing
                (Diamond, Guns, Germs, and Steel, 1999)

• Writing developed 5000 years ago independently in at least three places
   – Mesopotamia (modern Iraq): recording commercial transactions
   – Egypt (first dynasty, 3150 BCE): recording imperial conquest
   – China (Shang dynasty): recording divinations
• Symbols were meaning-based (word glyphs)
• A word glyph can also be used to record similar-sounding nonsense
  words or foreign words
   – “Narmer,” first pharaoh of Egypt, 3150 BCE
   – “Meiguo” = “America” in modern Chinese
• Phoenicians (3000 years ago) were the first to abandon word-glyphs in
  favor of a purely sound-based writing system
   – Phoenicians built trading empires based in Lebanon (Tyre and Sidon) and
     Carthage; their empires were independent until Roman conquest
   – “Phoenician” is the Greek name for them
   – The 22 symbols of their alphabet were based on word glyphs (e.g.,
      “aleph”=“ox”), but they only used the 22 that were necessary to write all
     sounds of their language, and discarded the remainder
   – “Phoneme” comes from Greek phōnēma, “sound” (not, despite the resemblance, from “Phoenician”)
 A Few Phonemic & Syllabic Writing
            Systems
• Semitic, related to Phoenician symbols:
   –   Hebrew (א, ב, ג, ד, ה – alef, bet, gimel, dalet, he)
   –   Greek (a, b, g, d, e – alpha, beta, gamma, delta, epsilon)
   –   Cyrillic (а, б, в, г, д – a, be, ve, ghe, de)
   –   Latin (a, b, c, d, e, …)
   –   Arabic (ا, ب, ت, ج, ح – alef, beh, teh, jeem, hah)
• Nubian (modern Sudan; independently developed in
  response to Egyptian writing)
• Indian: Hindi, Punjabi, Bengali, Gujarati, Grantha, …
• East Asian: Korean (hangul), Japanese (hiragana,
  katakana), Tagalog (baybayin)
• American Indian: Cherokee (independently designed in
  response to English)
   Manner/Place Distinction: Hangul
            (King Sejong of Joseon, Hunmin jeongeum, 1446)

• Origin of Hangul
   – Attributed to: King Sejong of Joseon (Korea), 15th century
   – Goal: Literacy for commoners who could not read Chinese
   – Therefore: System should be as easy as possible to memorize
• Innovation: manner/place distinction
   – Place of articulation encoded using symbols representing the shape
     of the articulator
       • 5 places of articulation: labial (/m/:ㅁ) alveolar (/n/:ㄴ), dental? (/s/:ㅅ),
         velar (/g/:ㄱ), glottal (/ŋ/:ㅇ)
   – Manner of articulation is changed from nasal/fricative to stop/affricate
     by adding one stroke
       • /m/→/b/, /n/→/d/, /s/→/ǰ/
    – An obstruent is made unvoiced by adding a second stroke
       • /b/→/p/, /d/→/t/, /ǰ/→/č/, /ŋ/→/h/
       • Exception: /g/=base symbol for velar place, /k/ has just one extra stroke
   Universal Encoding of Phonemes
   Using a Manner/Place Encoding
                   (Bell, Visible Speech, 1867)

• Goal: an international phonetic alphabet
  – If any language in the world distinguishes two sounds,
    then Bell’s alphabet should distinguish them
• Proposed encoding:
  – Base symbol = half-moon or “C” shape
  – Place encoding: angle of the symbol
  – Manner and voicing: extra strokes across the symbol
• (End result: system was scrapped because it was
  too expensive to typeset using movable-type
  printing. Modern IPA was developed instead).
         Universal Binary Encoding of
                 Phonemes
                       (Jakobson, Fant, and Halle, 1952)
• Jakobson’s minimalist program:
    – An 8-way place distinction is composed of exactly 3 binary distinctive
      features
    – Distinctive features are not arbitrary; they are based on physical properties
      of the sound (its articulation or its acoustics)
• Contribution of the engineer (Fant):
    – Distinctive features are based on spectral shape
• Jakobson’s consonantal distinctive features (in most cases, the first value
  listed is the more frequently used one):
    – Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
    – Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or
      one narrow peak, like /č,š,ţ,ş,k,h/)
    – Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy
      frication, like /f,θ,h/)
    – Voiced vs. Unvoiced (/b,v,δ,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
    – Sonorant vs. Obstruent (unobstructed voicing, like /m,n,ng/, vs. obstructed
      voicing, like /p,b,f,v,t,d,s,z,k,g/)
       Binary Encoding of Phonemes
                       (Jakobson, Fant, and Halle, 1952)
• Jakobson’s minimalist program:
    – All phoneme distinctions are binary. For example, if there are 10 places of
      articulation in the world’s languages, there must be ceil(log2(10))=4 binary
      distinctive features to encode place of articulation.
    – Distinctive features are not arbitrary; they are based on physical properties
      of the sound (its articulation or its acoustics)
• Apparent contribution of the engineer (Fant):
    – Distinctive features are based on spectral shape
• Jakobson’s consonantal distinctive features:
    – Acute vs. Grave (spectrum tilts up, like /θ,t,s,č,š/, or down, like /p,f,ţ,ş,k,h/)
    – Diffuse vs. Compact (broad distribution of energy, like /m,p,f,θ,n,t,s/, or one
      narrow peak, like /č,š,ţ,ş,k,h/)
    – Strident vs. Mellow (high-energy frication, like /s,š,ş/, or low-energy
      frication, like /f,θ,h/)
    – Voiced vs. Unvoiced (/b,v,δ,d,z,ǰ,ž,g/ vs. /p,f,θ,t,s,č,š,k/)
    – Nasal vs. Non-nasal (unobstructed voicing, like /m,n,ng/, vs. obstructed
      voicing, like /p,b,f,v,t,d,s,z,k,g/)
    – … etcetera. There were 12 features in the original set.
    Speech Perception: Distinctive Features as an Encoding for a Communications Channel
              Information Theory
                       (Shannon, 1948)

• A Mathematical Definition of the word
  “Information”
  – “Entropy” = the degree to which the outcome of
    some random process is impossible to predict
     • H = − ∫p(x)log2(p(x))dx = −E[log2p(x)]
     • Why it makes sense: the entropy of a coin thrown
       twice is 2X the entropy of a coin thrown once
       (p(x1,x2)=p(x1)p(x2), so H(x1,x2)=H(x1)+H(x2))
     • One “bit” = the amount of entropy in one unbiased
       coin toss
  – “Information” = the amount by which your
    uncertainty about x is reduced if you know y:
     • I(x,y) = H(x)−H(x|y)
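As a concrete illustration of these definitions (not part of the original lecture), here is a minimal Python sketch of entropy and mutual information for discrete distributions; the coin-toss check reproduces the additivity argument above.

```python
# Minimal sketch of the definitions above, using only the standard library.
from math import log2

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits, for a discrete distribution."""
    return -sum(px * log2(px) for px in p if px > 0)

# One unbiased coin toss = 1 bit; two independent tosses = 2 bits,
# because p(x1,x2) = p(x1)p(x2) implies H(x1,x2) = H(x1) + H(x2).
assert abs(entropy([0.5, 0.5]) - 1.0) < 1e-12
assert abs(entropy([0.25, 0.25, 0.25, 0.25]) - 2.0) < 1e-12

def mutual_information(p_xy):
    """I(X;Y) = H(X) - H(X|Y), with the joint given as a dict {(x, y): prob}."""
    xs = {x for x, _ in p_xy}
    ys = {y for _, y in p_xy}
    p_x = [sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in xs]
    p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in ys}
    # H(X|Y) = sum_y p(y) H(X | Y=y)
    h_x_given_y = sum(
        p_y[y] * entropy([p_xy[(x, y)] / p_y[y] for x in xs if (x, y) in p_xy])
        for y in ys if p_y[y] > 0
    )
    return entropy(p_x) - h_x_given_y

# Noiseless channel: y == x, so knowing y removes all 1 bit of uncertainty.
assert abs(mutual_information({(0, 0): 0.5, (1, 1): 0.5}) - 1.0) < 1e-12
```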
               Information Theory
                          (Shannon, 1948)

• “Channel capacity” = the rate at which information
  can be conveyed through a noisy channel, in
  bits/second
  – CAB = maxA(I(x,x*)), where x = information input to the channel per second,
    A(x) = symbols transmitted over the channel, B(A(x)) = symbols recovered at
    the receiver, and x* = maximum a posteriori estimate of x given B(A(x)).
• Shannon’s theorem:
  – Given a decoding machine with enough memory, it is
    always possible to achieve channel capacity using an
    A(x) of the following form:
     • Encode x using the smallest possible number of bits
     • Add redundancy until bit rate is reduced to CAB, then transmit.
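For a concrete example of channel capacity, here is a rough sketch using the standard textbook binary symmetric channel (an assumption of this sketch, not a model from the lecture): with crossover probability p, its capacity is C = 1 − H(p) bits per use, and Shannon's theorem says any rate below C is achievable with a good enough code.

```python
# Capacity of a binary symmetric channel (textbook illustration).
from math import log2

def binary_entropy(p):
    """H(p) for a biased coin, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

print(bsc_capacity(0.0))   # 1.0 bit/use: a noiseless channel
print(bsc_capacity(0.11))  # ~0.5 bit/use: half of each transmitted bit survives
print(bsc_capacity(0.5))   # 0.0: output is independent of input
```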
   Miller’s Linguistic Interpretation of
           Information Theory
                         (Miller and Nicely, 1955)

• The “channel” for speech = acoustic channel
   – Information can be conveyed over 32 critical bands of the human ear
   – Rate at which information can be transmitted in each band depends
     on SNR in band
• The minimal binary encoding of speech = distinctive
  features
• The “encoding” is the speech production mechanism: slow
  movements of the tongue and lips add redundancy matched
  to the noise conditions of the channel
   – When the channel is bad, speaker slows down and speaks clearly
   – When channel is good, the main source of error is pronunciation
     variability, i.e., talker leaves out phonemes that are only important for
     redundancy
 In the Perceptual Space, Distinctive
   Feature Errors are Independent
                     (Miller and Nicely, 1955)
• Experimental Method:
   – Subjects listen to nonsense syllables mixed with noise (white noise
     or BPF)
   – Subjects write the consonant they hear
• Results:
           p(q*|q,SNR,BPF) ≈ ∏i p(fi*|fi,SNR,BPF)

                  q* = consonant label heard by the listener
                  q  = true consonant label
                  F* = [f1*,…,f6*] = perceived distinctive feature labels
                  F  = [f1,…,f6]   = true distinctive feature labels
             features: [±nasal, ±voiced, ±fricated, ±strident, ±lips, ±blade]
     Consonant Confusions at -6dB SNR
      P    T    K     F     TH   S     SH    B     D    G    V     DH   Z    ZH    M     N
 P    80   43   64    17    14   6     2     1     1         1     1               2
 T    71   84   55    5     9    3     8     1                     1               1
 K    66   76   107   12    8    9     4                           1               1
 F    18   12   9     175   48   11    1     7     2    1    2     2
 TH   19   17   16    104   64   32    7     5     4    5    6     4    5
 S    8    5    4     23    39   107   45    4     2    3    1     1    3    2           1
 SH   1    6    3     4     6    29    195         3                                     1
 B    1               5     4    4           136   10   9    47    16   6    1     5     4
 D                                     8     5     80   45   11    20   20   26    1
 G                          2                3     63   66   3     19   37   56          3
 V                    2          2           48    5    5    145   45   12         4
 DH                         6                31    6    17   86    58   21   5     6     4
 Z                               1     1     17    20   27   16    28   94   44          1
 ZH                                          1     26   18   3     8    45   129         2
 M    1                                      4               4     1    3          177   46
 N                          4                1     5    2          7    1    6     47    163


Distinctive Features: ±nasal, ±voiced, ±fricative, ±strident
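The independence claim above can be checked numerically. The sketch below is an illustration only: it uses a toy feature table and made-up confusion counts rather than the Miller–Nicely inventory and data. It estimates per-feature confusion probabilities p(fi*|fi) from a phone-level confusion matrix and then rebuilds the phone-level confusion as a product over features.

```python
# Hedged sketch: test how well per-feature confusions predict phone confusions.
# The feature table and counts below are toy assumptions, not the lecture's data.
from collections import defaultdict

FEATURES = {  # consonant -> {feature: +1 or -1}
    "p": {"voiced": -1, "nasal": -1, "labial": +1},
    "b": {"voiced": +1, "nasal": -1, "labial": +1},
    "m": {"voiced": +1, "nasal": +1, "labial": +1},
    "t": {"voiced": -1, "nasal": -1, "labial": -1},
}

def feature_confusions(counts):
    """Estimate p(f* | f) for each feature from counts[(spoken, heard)]."""
    cells = defaultdict(float)   # (feature, true value, heard value) -> count
    for (q, q_star), n in counts.items():
        for f in FEATURES[q]:
            cells[(f, FEATURES[q][f], FEATURES[q_star][f])] += n
    probs = {}
    for f in next(iter(FEATURES.values())):
        for v in (-1, +1):
            total = cells[(f, v, -1)] + cells[(f, v, +1)]
            for v_star in (-1, +1):
                probs[(f, v, v_star)] = cells[(f, v, v_star)] / total if total else 0.0
    return probs

def independence_prediction(q, q_star, probs):
    """p(q*|q) under the independence model: a product over the features."""
    p = 1.0
    for f, v in FEATURES[q].items():
        p *= probs[(f, v, FEATURES[q_star][f])]
    return p

# Toy usage: heavy /p/-/t/ confusion, little voicing or nasality confusion.
counts = {("p", "p"): 80, ("p", "t"): 60, ("p", "b"): 5, ("p", "m"): 1,
          ("t", "t"): 90, ("t", "p"): 50, ("b", "b"): 95, ("m", "m"): 99}
probs = feature_confusions(counts)
print(independence_prediction("p", "t", probs))  # sizeable: only place differs
print(independence_prediction("p", "m", probs))  # small: voicing and nasality differ
```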
In the Acoustic Space, Distinctive
  Features are Not Independent
         (Volaitis and Miller, 1992)
•   [±voiced] for English stops is mainly cued by voice onset time (VOT).

•   VOT is also a cue for place of articulation: velar > alveolar > labial.

•   p(place,voicing|VOT) ≠ p(place|VOT) p(voicing|VOT)
    Speech Production: Distinctive features explain the related pronunciations of morphologically related words
                   The English Plural
• Three standard plural endings:
       – Cat → cats (/s/)
       – Dog → dogs (/z/)
       – Dish → dishes (/Əz/)
• Observation: you can predict the correct plural by
  observing the last phoneme of the word
• Algorithm #1: Create a lookup table

 Use this plural…    …if the last phoneme of the word is:
 /s/                 /p, t, k, f, θ/
 /z/                 /b, d, g, v, δ, m, n, ŋ, r, l, y, w/, or any vowel
 /Əz/                /s, č, š, z, ǰ, ž/
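A minimal sketch of Algorithm #1, transcribing the lookup table above into code. The ASCII phoneme spellings (T for θ, D for δ, S/Z/C/J for š/ž/č/ǰ, N for ŋ, "V" for any vowel, "@z" for the /Əz/ suffix) are an assumption of this sketch, not the lecture's notation.

```python
# Algorithm #1: a literal lookup table (toy ASCII phoneme spellings, see above).
VOWEL = "V"  # stand-in symbol meaning "the last phoneme is any vowel"

PLURAL_TABLE = {
    "s":  set("ptkf") | {"T"},                      # /p,t,k,f,θ/
    "z":  set("bdgvmnrlyw") | {"D", "N", VOWEL},    # /b,d,g,v,δ,m,n,ŋ,r,l,y,w/, vowels
    "@z": {"s", "C", "S", "z", "J", "Z"},           # /s,č,š,z,ǰ,ž/
}

def plural_lookup(last_phoneme):
    for suffix, phonemes in PLURAL_TABLE.items():
        if last_phoneme in phonemes:
            return suffix
    raise ValueError(f"no table entry for {last_phoneme!r}")

print(plural_lookup("t"))   # 's'   cat  -> cats
print(plural_lookup("g"))   # 'z'   dog  -> dogs
print(plural_lookup("S"))   # '@z'  dish -> dishes
```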
                 The English Plural
         (Chomsky and Halle, Sound Pattern of English, 1968)

•   Algorithm #2: Create 3 ordered rules (sketched in code below)
    1. Use /Əz/ if the last phoneme is [+strident]
       – PLURAL → /Əz/ | [+strident] __
    2. Otherwise, use /s/ if the last phoneme is [-voiced]
       – PLURAL → /s/ | [-voiced] __
    3. Otherwise, use /z/.
       – PLURAL → /z/
•   General form of phonological rules: α becomes β in the context γ __ δ:
    – α → β | γ __ δ
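And a sketch of Algorithm #2: the same decision expressed as three ordered rules over distinctive features, using the same toy ASCII phoneme spellings as above. The feature values assigned below are standard textbook assignments assumed for this sketch; note that, following the lecture's usage, [+strident] here means the sibilants, so /f/ and /θ/ are [-strident].

```python
# Algorithm #2: three ordered, feature-based rules replace the whole lookup table.
FEATURES = {
    # phoneme: (strident, voiced); vowels and sonorants are treated as voiced.
    "p": (False, False), "t": (False, False), "k": (False, False),
    "f": (False, False), "T": (False, False),                      # /f, θ/
    "b": (False, True),  "d": (False, True),  "g": (False, True),
    "v": (False, True),  "D": (False, True),  "m": (False, True),
    "n": (False, True),  "N": (False, True),  "r": (False, True),
    "l": (False, True),  "y": (False, True),  "w": (False, True),
    "V": (False, True),                                            # any vowel
    "s": (True, False),  "S": (True, False),  "C": (True, False),  # /s, š, č/
    "z": (True, True),   "Z": (True, True),   "J": (True, True),   # /z, ž, ǰ/
}

def plural_suffix(last_phoneme):
    strident, voiced = FEATURES[last_phoneme]
    if strident:          # Rule 1: PLURAL -> /@z/ | [+strident] __
        return "@z"
    if not voiced:        # Rule 2: PLURAL -> /s/  | [-voiced] __
        return "s"
    return "z"            # Rule 3 (elsewhere): PLURAL -> /z/

assert plural_suffix("t") == "s"    # cat  -> cats
assert plural_suffix("g") == "z"    # dog  -> dogs
assert plural_suffix("S") == "@z"   # dish -> dishes
```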
               Stress Shift in English
•   Every English word has one stressed syllable. Usually
    but not always it is the antepenultimate syllable (3rd from
    the end) if the word has 3 or more syllables:
    –   Professor -- (an exception to the antepenult rule)
    –   Establishment
•   Some suffixes cause the stress to shift:
    –   Professorial
    –   Establishmentarian
•   Some don’t
    –   Professorship
    –   Establishmentarianism (stays where it was before ism)
•   Prefixes never cause a stress shift
    –   Unprofessorial
    –   Antidisestablishmentarianism
    Morphemes: Roots, Suffixes, and
              Prefixes
•   Suffixes and prefixes are examples of “bound
    morphemes”
•   “Morpheme” = a phoneme sequence with a meaning
    –   Root words (establish)
    –   Words that are part of compounds, e.g., rain|fall
    –   Suffixes and prefixes
•   “Bound morpheme” = a morpheme that never occurs on its own as a word
    –   In English: most suffixes and prefixes
    –   In Chinese: perhaps some words that used to be independent,
        but now only appear as bound morphemes, e.g., “hu” in
        “hutong”?
    Morpheme and Word Boundaries
            (Chomsky and Halle, Sound Pattern of English, 1968)

•   Chomsky & Halle’s model of speech production:
    1. Morphemes of a sentence are sequenced
        –    the | anti | dis | establish | ment | arian | s | went | home
    2. Boundaries between roots and their type-1 suffixes are erased
        –    the | anti | dis | establishmentarian | s | went | home
    3. Type 1 rules, like the “antepenultimate stress rule,” are applied
        –    the | anti | dis | establishmentarian | s | went | home
    4. Boundaries between roots and type-2 suffixes are erased.
       Recent proposal: some function words like “the” also attached at
       this point, creating a unit called a “prosodic word:”
        –    the-antidisestablishmentarians | went | home
    5. Type 2 rules apply:
         –    the-antidisestablishmentarians | went | home
    6. All remaining boundaries are erased, and speech is produced
        –    the-antidisestablishmentarians | went | home
    Psychological Processing and the
                Lexicon
•   Experiment: play a word. Subject hits “W” if it’s a
    meaningful word (“empower”), “N” if it’s a nonsense word
    (“empriffle”).
•   Measure the subject’s reaction time. Adjust for word
    length.
•   Result: subjects recognize two-morpheme words
    (“dis+like”) just as fast as one-morpheme words
    (“power”).
•   Apparent conclusion: Chomsky & Halle’s rules are
    usually “precompiled,” not applied in real time.
    –   If a listener has heard the word frequently enough, it is stored in
        her mental lexicon as a whole word.
    –   If it’s not stored in her mental lexicon, she tries to figure it out by
        morphological parsing (“zookeepership”).
    –   If morphological parsing fails, then she concludes it’s a nonsense
        word.
                Inter-Word Phonology
•   Many words change form in particular contexts
    –   “this ship” → “thish ship”
•   Usually: Only a very small number of phonological rule types can
    apply across word boundaries.
    –   Place assimilation: /s/ in “this” takes the palatal place of /sh/ in “ship”
    –   Manner assimilation: “in the”→“in ne” with a dental /n/
    –   In Chinese: tone sandhi (tones change because of tone context)
•   These changes occur very frequently
•   Perhaps: MOST of the changed forms are very uncommon, so they
    are not stored in the mental lexicon, but SOME of the most common
    and most severely reduced forms may actually get their own lexical
    entry so the listener can respond more quickly:
    –   “did you” → “didja”
    –   “I don’t know” → “ǣəo”
    –   These may be comparable to “multiword” lexical entries in a speech
        recognition lexicon
              Autosegmental Phonology
                               (Goldsmith, 1975)

  •    Inter-word phonological rules all have a simple form: manner or
       place assimilation
  •    Hypothesis: instructions to the speech articulators are arranged in
       “autosegmental tiers,” i.e., on a kind of musical score with
       asynchronous rows
  •    Assimilation = feature spreading

   [Diagram: two autosegmental “scores” for the /s/–/sh/ sequence in “this ship.”
   Before spreading, each segment is linked to its own tier values: both are
   [-nasal][+strident][+blade], but /s/ is [+anterior] while /sh/ is [-anterior].
   After spreading, the [+blade] and [-anterior] autosegments of /sh/ are linked
   to both segments, so the /s/ surfaces as /sh/: “thish ship.”]
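A rough data-structure sketch of the tier picture above. The representation is an assumption chosen for illustration, not Goldsmith's formalism: each segment carries its feature values, and regressive place assimilation is modelled by delinking the left segment's [anterior]/[blade] values and spreading the right segment's in their place.

```python
# Hedged sketch: feature spreading across a word boundary ("this ship").
def spread_place_leftward(segments):
    """segments: list of per-segment feature dicts; returns an assimilated copy."""
    out = [dict(seg) for seg in segments]
    for i in range(len(out) - 1):
        left, right = out[i], out[i + 1]
        if left["anterior"] != right["anterior"]:
            # delink the left segment's place features, relink the right's
            left["anterior"] = right["anterior"]
            left["blade"] = right["blade"]
    return out

this_ship = [
    {"nasal": False, "strident": True, "blade": True, "anterior": True},   # /s/
    {"nasal": False, "strident": True, "blade": True, "anterior": False},  # /sh/
]
print(spread_place_leftward(this_ship))
# both segments now carry [-anterior]: "this ship" -> "thish ship"
```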
    Quantal Theory: Distinctive features are not “just perceptual” or “just articulatory;”
    they arise from the relationship between articulation and perception
 The Speech Chain
(Stevens, Speech Communication, 1999)

   [Diagram: the speech chain. Discrete inputs (words, phonemes, distinctive
   features) → speech production planning → continuous muscle activation levels
   a(t) → speech production → acoustic signal x(t) → auditory perception →
   auditory nerve signals y(t) → speech perception → discrete outputs (words,
   or “nonsense”).]
    Nonlinearities in the Speech Chain
•    The mappings P:a(t)→x(t) (speech production) and
     R:x(t)→y(t) (perception) are highly nonlinear
    –   We have very good models, going back to 1940s
    –   Pick a particular vector a(t); we can usually estimate ∇P(a(t)), the
        local gradient of P. (Likewise R).
•    There are many sigmoidal nonlinearities in both P and R.
    –   Articulator position, a(t), can vary quite a bit without affecting the
        acoustics, as long as a(t) stays within the stable region (a<a1* or
        a>a2*)
    –   If a(t) crosses the transition region, acoustics change a lot!
                 [Figure: sigmoidal mapping from articulation a(t) to acoustics x(t);
                 flat “stable regions” for a < a1* and a > a2*, separated by a steep
                 transition region between a1* and a2*.]
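A toy numerical illustration of the figure above; the particular sigmoid and its parameters are assumptions of this sketch, not a real articulatory-to-acoustic model. Inside a stable region, large articulatory changes barely move the acoustics; across the transition region, a tiny change moves them a lot.

```python
# Toy quantal mapping: acoustics as a sigmoid function of an articulatory parameter.
import math

def P(a, a_mid=0.0, width=0.05):
    """Nearly flat for a far below or far above a_mid; steep near a_mid."""
    return 1.0 / (1.0 + math.exp(-(a - a_mid) / width))

# Large change inside a stable region: acoustics barely move.
print(abs(P(-0.6) - P(-0.3)))   # ~0.002
# Small change across the transition region: acoustics jump.
print(abs(P(-0.05) - P(0.05)))  # ~0.46
```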
The Quantal Theory of Speech Production
                               (Stevens, 1989)


•    The distinction between a>a2* and a<a1* is a robust
     distinction
    – In the case of P: robust to minor pronunciation variability
    – In the case of R: robust also to lots of added noise
    –   Therefore:
        •   Put this distinction into Shannon’s communication alphabet in order to
            maximize the mutual information I(y,a) between produced speech and
            perceived speech.
                  [Figure: the same sigmoidal articulation-to-acoustics mapping,
                  with stable regions a < a1* and a > a2*.]
The Quantal Theory of Speech Production
                           (Stevens, 1989)

•   Hypothesis: Every binary distinctive feature, in every
    language of the world, is a distinction (a>a2* vs. a<a1*)
    near a sigmoidal nonlinearity of P, or a distinction (x>x2*
    vs. x<x1*) near a sigmoidal nonlinearity of R.
•   Different languages choose different nonlinearities to
    focus on, but the number of useful sigmoids in P and R is
    finite, thus the number of distinctive features in the world’s
    languages is finite.
               [Figure: the same sigmoidal articulation-to-acoustics mapping,
               with stable regions a < a1* and a > a2*.]
                             Examples
•   Feature [anterior]: a nonlinearity of P
    –   “Alveolar ridge” is a sigmoid-shaped bump in hard palate
    –   Moving tongue tip back 1cm, over the alveolar ridge, increases front
        cavity length by 2cm, causing big change in the front cavity resonance
        frequency
    –   Result: /s/ → /š/, [+anterior] → [−anterior]
•   Feature [sonorant]: a nonlinearity of P
    –   Opening the soft palate by just 2mm during /d/ allows sonorant voicing to
        continue throughout closure
    –   Result: energy increases 20-30dB during closure
    –   /d/→/n/, [−sonorant] → [+sonorant]
•   Feature [back]: a nonlinearity of R
    –   When |F2-F1|<3 Bark, both formants excite the same neurons, causing the
        perception of a single broad formant peak
    –   When |F2-F1|>3 Bark, 2 distinct formant peaks are perceived
    –   |F2-F1|<3 Bark: [+back] vowels /a,o,u/
    –   |F3-F2|<3 Bark: [−low,−back] vowels /i,e/
    –   All formant peaks distinct: [+low,−back] vowels /ae,E/
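As a worked version of the [back] example, here is a sketch of the 3-Bark test. The Hz-to-Bark formula is Traunmüller's (1990) approximation, and the formant values are rough textbook numbers; both are assumptions of this sketch rather than material from the slide.

```python
# Sketch of the |F2 - F1| < 3 Bark criterion for [+back] vowels.
def hz_to_bark(f_hz):
    """Traunmueller (1990) approximation of the Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def sounds_back(f1_hz, f2_hz, threshold_bark=3.0):
    """True if F1 and F2 are close enough to merge into one perceived peak."""
    return abs(hz_to_bark(f2_hz) - hz_to_bark(f1_hz)) < threshold_bark

print(sounds_back(700, 1100))   # True:  roughly /a/-like formants, F2 - F1 < 3 Bark
print(sounds_back(300, 2300))   # False: roughly /i/-like formants, F2 far above F1
```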
       What About Unused Sigmoids?
•   If a language doesn’t use a particular sigmoid, its
    listeners can do one of two things:
    1. Learn to ignore that distinction.
    2. Use that sigmoid to “enhance” some other distinction.
          1. Perceptual Magnet Effect
                              (Kuhl, 1992)
•   Perceptual magnet effect shows that, from infancy, babies learn a
    neural map that enhances linguistically useful sigmoids, and smooths
    out the less useful sigmoids.
•   Experiment: ask listeners to determine whether vowels are “same” or
    “different.”
•   Result: accuracy is best near the boundary between phonemes in
    your own language. Location of boundary is native-language-
    dependent.
            2. Enhancement Features
                (Stevens and Keyser, Language, 1989)

•   Example: [voiced] in English.
    – Based on [voiced] distinction in Latin, which is a
      distinction between stops with/without voicing during
      closure.
    – Long VOT enhances the perceptual sense that a stop is devoiced; therefore,
      over some period of history, [-voiced] stops became [+aspirated].
        •   Some languages distinguish [+voiced,+aspirated],
            [+voiced,−aspirated], [−voiced,−aspirated], and
            [−voiced,+aspirated] stops. Such languages could not use
            [+aspirated] to enhance the perception of [−voiced].
    – In modern English, the “enhancing” feature is so strong
      that the “primary” feature (closure voicing) is often
      dropped.
    Articulator-Bound and Articulator-Free
                   Features
•    Articulator-bound features can only be
     implemented by one articulator
     –   [anterior] is bound to the tongue blade
     –   [front] is bound to the tongue body
     –   [voiced] is bound to the vocal folds
     –   [nasal] is bound to the soft palate
•    Articulator-free features can be implemented by
     the lips, tongue blade, or tongue body
     – Key articulator-free features: [sonorant,continuant]
         •   [+sonorant,+continuant] = vowel or glide
         •   [+sonorant,-continuant] = nasal
         •   [-sonorant,-continuant] = stop
         •   [-sonorant,+continuant] = fricative
                     Landmarks
•   “Primary articulator” of a consonant is the
    articulator that implements [−sonorant] or
    [−continuant] for that phoneme
    – Lips, tongue blade, or tongue body
    – “Implements [−continuant]” = articulator closes
      completely
•   At the moment when the primary articulator
    closes, there is a BIG change in the acoustics
    – [+sonorant]→[−sonorant]: 10dB at low freqs
    – [+continuant]→[−continuant]: 10dB at high freqs
•   This particular nonlinear change is called an
    “acoustic landmark”
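A hedged sketch of the landmark idea: scan framed audio for places where low-band or high-band energy changes abruptly between adjacent frames. The band edges, framing, and the ~10 dB threshold are assumptions chosen for illustration; this is not the detector used in the actual landmark-based system.

```python
# Illustrative landmark detector: abrupt band-energy changes between frames.
import numpy as np

def band_energy_db(frames, sr, lo_hz, hi_hz):
    """Per-frame energy in dB within [lo_hz, hi_hz); frames has shape (n, frame_len)."""
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    band = (freqs >= lo_hz) & (freqs < hi_hz)
    return 10.0 * np.log10(spec[:, band].sum(axis=1) + 1e-10)

def landmark_frames(frames, sr, jump_db=10.0):
    """Frame indices where low-band (<400 Hz) or high-band (>2 kHz) energy
    jumps by at least jump_db relative to the previous frame."""
    low = band_energy_db(frames, sr, 0.0, 400.0)
    high = band_energy_db(frames, sr, 2000.0, sr / 2.0)
    jump = np.maximum(np.abs(np.diff(low)), np.abs(np.diff(high)))
    return np.where(jump >= jump_db)[0] + 1
```

In practice one would frame the waveform with a short hop (e.g., ~10 ms) before calling landmark_frames; the returned indices mark candidate consonant closures and releases.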
                            Summary
• Brief history:
    – Writing: ~5000 years, Phonemes: ~3000 years
    – Manner-place notation: ~500 years, Binary encoding: ~50 years
• Speech perception:
    – Errors in the perception of different distinctive features are independent
    – … even though the acoustic correlates of different distinctive features are
      NOT independent.
• Speech production:
    – Phonemic relationships among morphologically related words can be
      explained using distinctive features
    – Inter-word phonology is simple but universal: spreading of manner or
      place features on autosegmental tiers
• Quantal theory:
    –   Distinctive features arise from sigmoids in the articulation-to-acoustics and
        acoustics-to-auditory mappings
    –   Each language chooses a subset of those sigmoids
    –   Other sigmoids are ignored, or else used for feature enhancement
    –   A very important sigmoid: primary articulator closure produces landmarks
