

									Toward Rich Phonology

              Robert Port
    Linguistics, Cognitive Science
          Indiana University
             August, 2006

ESCA Experimental Linguistics, Athens

      The standard view of language

1. `Language is a cognitive symbol system …’
    • discrete tokens
    • static
    • serially ordered
    • perfectly recognized and produced
    • with associated meanings

The basic units are the phonological segment and the word.

2. `… used for real-time processing of language.’
     Speech production = `encoding’
     Speech perception = `decoding’
     Memory = keeping symbols `active’

     Linguistic processing = `reading’ and `writing’ symbols

• This assumption underlies everything most linguists do:
       Chomsky-Halle, optimality theory, etc.
• Underlies the International Phonetic Alphabet
• Underlies most talk about language by psychologists.

But human memory for words is far richer than this

                    Evidence for Rich Sensory Memory
Look first at Vision: Visual memory is detailed and depends on massive memory for
       Posner-Keele (1968) random dot patterns for categorization

         [Figure: random dot patterns, Prototype A and Prototype B]

    Experiment: A random dot pattern serves as a prototype. It is not shown to subjects. Only noisy
       variants are shown.
    – Subjects are trained to classify dot patterns distorted from 2 prototypes. Measure accuracy
       and RT.
    – Results: After training, the prototype (though unseen) is recognized well.
      BUT performance is best on the actual training stimuli.
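
A minimal sketch of how stimuli for such a prototype-distortion experiment could be generated. The dot count, field size, noise level and number of variants are illustrative assumptions, not the values used by Posner and Keele, and the function names are hypothetical.

```python
import random

def make_prototype(n_dots, rng, size=50.0):
    """Random dot pattern: n_dots (x, y) positions in a size-by-size field (assumed layout)."""
    return [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_dots)]

def distort(prototype, rng, noise=2.0):
    """Noisy variant: every dot is jittered by Gaussian noise (assumed distortion rule)."""
    return [(x + rng.gauss(0, noise), y + rng.gauss(0, noise)) for x, y in prototype]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
prototypes = {"A": make_prototype(9, rng), "B": make_prototype(9, rng)}

# Subjects only ever see the distorted variants, never the prototypes themselves.
training_set = [
    (distort(dots, rng), label)
    for label, dots in prototypes.items()
    for _ in range(4)
]
```

The prototypes stay hidden in `prototypes`; only `training_set` would be shown during training.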

   `Exemplar Memory’ for Visual Images or Other Items

Modelers of categorization, recall and recognition (eg, Hintzman,
    Nosofsky, Shiffrin) get best results by:
    1.   storing all exemplars (ie, all presented tokens)
    2.   computing similarity of a new token to all items in memory
    3.   responding with, eg, the category of the closest matches.
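
The three steps above can be sketched as follows. This is only an illustration in the spirit of exemplar models, not the implementation of any of the cited modelers: the exponential similarity function, the `sensitivity` parameter and the two-dimensional feature vectors are all assumptions.

```python
import math
from collections import defaultdict

def similarity(a, b, sensitivity=1.0):
    """Similarity falls off exponentially with Euclidean distance (assumed form)."""
    return math.exp(-sensitivity * math.dist(a, b))

class ExemplarMemory:
    def __init__(self):
        self.exemplars = []  # step 1: every presented token is kept

    def store(self, features, category):
        self.exemplars.append((tuple(features), category))

    def classify(self, features, sensitivity=1.0):
        # Step 2: compare the new token with every item in memory.
        evidence = defaultdict(float)
        for stored, category in self.exemplars:
            evidence[category] += similarity(features, stored, sensitivity)
        # Step 3: respond with the category that collects the most similarity.
        return max(evidence, key=evidence.get)

# Made-up tokens in an arbitrary 2-D feature space.
memory = ExemplarMemory()
for token in [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]:
    memory.store(token, "A")
for token in [(1.0, 0.9), (0.8, 1.0), (0.9, 1.1)]:
    memory.store(token, "B")

print(memory.classify((0.1, 0.1)))   # A
print(memory.classify((0.95, 1.0)))  # B
```

No prototype or rule is ever computed; the category response falls out of the stored tokens alone.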

In vision we remember detail well on a single exposure!
    1.   If I showed 50 photographs, 1 second each, you would probably
         recognize a repeated picture - even later in the day.
    2.   You can remember details from this morning – what you ate, who you
         saw, what the cafeteria looked like, what your plate of food looked
         like, etc.

Humans can do ``one-trial learning’’ of many
  coincidental events.
   – No generalizations involved.
   – No `training’ required

This skill is found in all mammals
but is best in primates.
Called `episodic memory.’
                            (See review by Mark Gluck, Trends Cog Sci)

 What mechanism could store ``random’’ co-occurrences?

 The hippocampus and neighboring regions are essential for:
        •   autobiographical memory,
        •   picture recognition,
        •   maze learning,
        •   associating a smell with an event.

Linguistic events could be first stored as episodes and gradually be
   incorporated into long-term memory.
But linguists (including me) have assumed words
       do not have a rich episodic memory
Words were assumed to be perceptually identified and stored using an
  abstract code, a representation that is:

      1. speaker independent [You and I produce identical transcriptions]
      2. segmented into ordered, context-free parts [Segments represent all that is
         necessary to specify words]
      3. rate invariant [The same symbols apply at different speaking rates]
      4. low dimensional [Jakobson: 12 bits/segment.
         Chomsky-Halle: ~40 bits/segment.
         At 15-20 segments/sec, this implies < 1000 bits/sec.]
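
The bit-rate bound in item 4 is simple multiplication, spelled out here as a check. The per-segment figures and the 15-20 segments/sec range are the ones quoted above; the ~40 bits for Chomsky-Halle is treated as exactly 40 for the arithmetic.

```python
# Bits needed to specify one segment under each feature system (figures from the slide).
BITS_PER_SEGMENT = {"Jakobson": 12, "Chomsky-Halle": 40}
SEGMENTS_PER_SEC = (15, 20)  # speaking-rate range quoted above

def bit_rate_range(bits_per_segment, rates=SEGMENTS_PER_SEC):
    """Return (low, high) bits/sec for the given bits per segment."""
    return tuple(bits_per_segment * rate for rate in rates)

rates = {name: bit_rate_range(bits) for name, bits in BITS_PER_SEGMENT.items()}
for name, (low, high) in rates.items():
    print(f"{name}: {low}-{high} bits/sec")
# Jakobson: 180-240 bits/sec
# Chomsky-Halle: 600-800 bits/sec
```

Even the larger feature system stays under 1000 bits/sec at the fastest rate.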

Presumably this code of symbols is used in long-term and short-term memory.

            Linguistics has always considered
           phonemes to be `psychologically real’

• Saussure `The phoneme is an acoustico-motor image’

• Troubetzkoy and Jakobson took the phoneme as the
  psychological counterpart of a letter.

• IPA assumes letters capture what is important.

• Chomsky-Halle: ``Utterances are sequences of discrete
  segments that are complexes of phonetic features.’’
                                  Sound Pattern of English, p 5.

Nearly all linguists would agree with Morris Halle:

    ``It is unlikely that the information about the phonic
        shape of words is stored in the memory of speakers in
        acoustic form resembling a spectrogram’’
                                 1985, in Fromkin, ed.

    If true, then speech could not have `auditory episodic memory.’

              Evidence for rich memory of words
Palmeri, Goldinger, Pisoni (1993)
    – Ss heard a continuous list of words
      spoken by 2, 6, 12 or 20 voices.
    – Asked to recognize repetitions after lag
      of 2, 4, 8, 16, 32 or 64 words.

Of course, performance declined with
   greater lag
But no effect of number of talkers on performance.

It doesn’t matter how many voices
    because the information is always
    stored – automatically!
Words are stored episodically, just like
  visual images and everyday events.

         What does rich memory store about speech?

We apparently store:

1.   full auditory detail
2.   speaker’s voice info
3.   emotional features
4.   word identity
5.   speaker ID
6.   semantic features
7.   other context features
8.   frequency-of-occurrence
9.   (orthographic spelling)

We store as much detail as possible – at least for a while.

[Figure: Formant tracks of short Australian vowels in CVCs, multiple tokens,
 male speaker: dead, Dad, Dodd, dud]

         Any further evidence? Yes, lots!

Of course, an abstract linguistic description (using features,
   segments, orthographic letters, etc) is stored as well for educated
   speakers. For these, symbolic description tends to dominate our
   conscious experience of language.

         Other Evidence for Rich Language Memory

2. Dialect details, gradual dialect change and speaker
  idiosyncrasies. (W. Labov, Betty Phillips)
   – How could we learn small differences in pronunciation if not
     recorded in memory?

   [Figure: Vowels vary in systematic ways. NJ speaker. Labov 2005]

3. Frequent speech patterns are different from
    infrequent patterns. (eg, Joan Bybee, B. Phillips)

    R. Port will not begin a word with a flapped T – normally.
                  I want a tomáto          - no flap on initial t
                  Where is my tobácco?

    But it’s OK with a high-frequency word (and phrase).
               I want to go todáy      - t usually flapped!
                I’ll see you tomórrow.

Suggests each hearing of a word leaves a long-lasting record. Exemplar
     memory automatically records frequency.
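
A small sketch of why no separate frequency counter is needed: if every heard token leaves its own record, frequency is just the number of records. The token list, the `flapped` flag and the function names are invented for illustration and are not data from the talk.

```python
episodes = []  # one record per token heard

def hear(word, flapped):
    """Store a single heard token along with one phonetic detail (initial t flapped or not)."""
    episodes.append({"word": word, "flapped": flapped})

def frequency(word):
    """Frequency is not stored anywhere; it is read off the stored episodes."""
    return sum(1 for e in episodes if e["word"] == word)

def flap_rate(word):
    """Share of stored tokens of `word` with a flapped initial t, or None if never heard."""
    tokens = [e["flapped"] for e in episodes if e["word"] == word]
    return sum(tokens) / len(tokens) if tokens else None

# Invented exposure: a frequent word, often flapped, and a rarer word, never flapped.
for flapped in [True, True, False, True, True, True, False, True]:
    hear("today", flapped)
for flapped in [False, False]:
    hear("tomato", flapped)

print(frequency("today"), flap_rate("today"))    # 8 0.75
print(frequency("tomato"), flap_rate("tomato"))  # 2 0.0
```

Both the count and the typical pronunciation of each word are recoverable from the same pile of records.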

4.     Speech perception uses rich context-sensitive cues, not abstract
     invariant cues.

     [Figure: sound spectrogram of di and du, male speaker]

Liberman (1968) was troubled that /di/ and /du/ have no acoustic invariant
   corresponding to /d/ - the ``coarticulation problem’’
Rich memory sweeps it away. We remember big chunks.
     So every di is stored independently of all the du.

(Note that speech recognition systems pay no attention to segments! They always use
   spectral trajectories.)

5-A.   Letters don’t fit speech well, if you look close.

            mom          seem
3 letters → 3 steady state gestures → 3 acoustic shapes

       s I l             m a m             s i m
   But this model does not work for:
        – glides (w, r, l, y)
        – stops (d, t, b, p, g, k) which have closing, closure, release phases
          and aspiration intervals
        – diphthongs (may, my, mow, Boyd)
        – affricates (tsh, dzh)

5-B.   Letters don’t fit speech well, if you look close.
The V-to-C continuum has arbitrary cuts

       a u       a           aU     a                awa

The VOT continuum – arbitrary cuts

        da            da    da      ta         tha        tha

             voiced          unaspirated      aspirated

5-C.   Letters don’t fit speech well, if you look close.

Overlap of gestures is common but ignored.

 cap                    camp        but also      camp

5-D.   Letters don’t fit speech well, if you look close.

   Timing patterns are critical to word specification.

       fuzzy   ≠     fussy           budding     =? butting

5-E.   Letters don’t fit speech well, if you look close.

``Incomplete neutralization’’
   Sounds not different enough to permit reliable ID may still be different. But
   they are not discrete. (Port & Crawford; Warner et al)

        • German                 Bunde-bunte but Bund-bunt (same?)

        • American English       mad-mat
                                         madder-matter (same?)
                                         budding-butting (same?)

        This situation should be impossible if words had letter-like spellings!

6. Non-alphabet-literates should find segments very
non-intuitive. Look at ``phonological awareness tasks.’’

   6A. Studies of illiterates in Portugal (by Morais et al, Cognition

       Tasks like:
           segment addition: add /p/ to syllable (urso → purso )
           segment deletion: delete /p/ (purso → urso)

       Word-word condition and Nonword-nonword condition
       30 Illiterate subjects and 30 Reading subjects
       15 training trials with feedback
       10 W-W test trials and 10 Nw-Nw test trials

[Figure: number of subjects by number of correct responses,
 Illiterates (n=30) vs. Readers (n=30)]

6B. Matching experiments on Chinese who are literate in
  Chinese orthography with and without alphabet
  experience showed the same results (C. Reed et al,

6C. Reading skill correlates highly with `phonological
      awareness’ (Y. Liberman et al 1976)
               But which comes first?

   Ziegler and Goswami (2005, Psych Bull’tn)
          ``Phoneme awareness only develops once children
          are taught to read and write, irrespective of age.’’ (p. 14)

   `Phonological awareness’ is mostly a result of literacy training.

   (I say `mostly’ because, eg, obviously the inventors of alphabetic
       writing must have had awareness of phonetic segments.)

     Review: Evidence of Rich Memory for Words

1. Recognition memory experiments show phonetic detail is stored.

2. Gradual dialect variation (in space, time or social context)
   implies memory for rich phonetic detail – no discrete
   phonetic jumps.

3. Frequency influences both perception and production.

4. Speech perception uses auditory trajectories – not abstract
   invariants or static patterns.

5. Letters do not code most of what is important for perception
   – including timing, gestures and overlapping contrasts.

6. Only people with alphabet training think speech is made from

Conclusions so far
   1.   An alphabetic description of speech is possible only for those
        with alphabet-based education.
   2.   Linguistic fragments (words, phrases, etc) form clouds of
        trajectories in a high-D space with some category labels (and
        sometimes orthographic spellings). Each unit is a distribution.

   3.   Language may resemble a symbol system, but
        cannot be one.

   With such rich memory, why is phonology necessary?

             Toward Rich Phonology
• If memory is rich, why are words partially similar
  to each other?

       Pete, pate, pet, pat      beat, bait, bet, bat
       seal, sale, sell, Sal     zeal, -- , Szell, --

That is,
if language skills do not rely on low-D descriptions,
     – Why is there phonological structure?
     – Why does a low-D description almost work?

Phonology is part of our ambient culture – our
  phonological culture.
  – The child is exposed to a corpus of utterances
  – Linguistics studies this corpus – ie, the data that are
    stimulation for language learning
  – The linguist observes patterns in this linguistic culture
     • Lexico-phrasal patterns
     • Phonological patterns

1.   Language learning and language memory do not
     require low-D descriptions. High-D works fine.
2.   But low-D phonological structure may lead to
     improvements in perception, learning, memory, etc.
3.   Languages (as cultural products) evolve to maintain
     approximate low-D descriptions (ie, phonology).
4.   Rich linguistics and phonology should study these
     patterns using all descriptions available:
       •   alphabetical (eg, narrow or broad transcription)
       •   acoustic (eg, spectra, formants, durations, etc)
       •   and maybe neural (eg, fMRI)
5.   The units of phonology are not units in speakers’
     heads. Only masses of exemplars. Phonology can be
     found only in the corpus of utterances by speakers.
