Subphonemic detail is used in
spoken word recognition:
Temporal Integration at
Two Time Scales
Bob McMurray
Grateful Thanks to:
Advisors: Dick Aslin, Mike Tanenhaus
Collaborators: Meghan Clayards, David Gow
Committee: Joyce McDonough, David Knill, Christopher Brown
Saviors in the Lab: Julie Markant, Dana Subik
People who put up with me: Kate Pirog, Kathy Corser, Bette McCormick, Andrea Lathrop, Jennifer Gillis
Meaningful stimuli are almost always temporal.
Scene Perception: build stable representation
across multiple eye-movements, attention shifts.
Music: series of notes. Temporal properties (order
and rhythm) are fundamental.
Language as Temporal Integration
Temporal Integration fundamental to language, as it
appears in the world.
•Word: Ordered series of articulations.
•Sentence: Sequence of words.
•A Language: Series of utterances.
Phonology, syntax extracted from this series of
utterances.
How are abstract representations formed?
Stimuli do not change arbitrarily.
At any point in time, subtle, perceptual cues tell the
system something about the change itself.
Enable an active integration process.
Anticipating future events
Retain partial present representations.
Resolve prior ambiguity.
Word recognition is an ideal arena:
• Substantial perceptual information available.
• Multiple timescales for integration.
But:
Early evidence suggested that this
perceptual information is not maintained.
Overview
1) Continuous perceptual variation affects word
recognition.
2) A new framework for word recognition.
3) Integrating speech cues in online recognition.
4) Long-term temporal integration: development.
5) The use of continuous detail during development.
6) Conclusions
Speech and Word Recognition
Acoustic input -> Sublexical Units -> Lexicon
Speech Perception
• Categorization of acoustic input into sublexical units.
Sublexical Units: /a/ /la/ /ip/ /b/ /l/ /p/
Word Recognition
• Identification of target word from active sublexical units.
Word Recognition as temporal ambiguity resolution
• Information arrives sequentially
• At early points in time, signal is temporarily
ambiguous.
[Figure: as the input unfolds from "ba…" to "…kery", the candidates basic, barrier, barricade, bait, and baby are crossed out, leaving bakery.]
• Later arriving information disambiguates the word.
Current models of spoken word recognition
• Immediacy: Hypotheses formed from the earliest
moments of input.
• Activation Based: Lexical candidates (words)
receive activation to the degree they match the
input.
• Parallel Processing: Multiple items are active in
parallel.
• Competition: Items compete with each other for
recognition.
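The four properties above can be illustrated with a toy sketch. It is not any published model: the position-by-position letter matching, the normalization rule, and the treatment of the five slide candidates as the whole lexicon are all assumptions made for illustration.

```python
def activations(heard, lexicon):
    """Relative activation of each word given the input heard so far."""
    raw = {}
    for word in lexicon:
        # Activation grows with the number of positions where the word
        # matches the input so far (letters stand in for phonemes here).
        matches = sum(1 for a, b in zip(heard, word) if a == b)
        raw[word] = matches / max(len(heard), 1)
    total = sum(raw.values())
    # Competition: candidates share a fixed pool of activation, so one
    # word's gain is another word's loss.
    return {w: (v / total if total else 0.0) for w, v in raw.items()}


# Hypothetical lexicon: just the candidates shown on the slide.
LEXICON = ["beach", "butter", "bump", "putter", "dog"]

# Immediacy and parallelism: every candidate is scored from the first segment on.
for heard in ["b", "bu", "but", "butt", "butte", "butter"]:
    act = activations(heard, LEXICON)
    ranked = sorted(act.items(), key=lambda kv: -kv[1])
    print(heard, [(w, round(a, 2)) for w, a in ranked])
```

In this sketch beach, butter, and bump tie at the first segment, and putter gains activation once its rhyme arrives but never overtakes butter.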
Input: b... u… tt… e… r
[Figure: activation over time of the candidates beach, butter, bump, putter, and dog.]
These processes have been well defined for a
phonemic representation of the input.
But there may be considerably less ambiguity in the
signal if we consider subphonemic information.
Example: subphonemic effects of motor processes.
Coarticulation
Any action reflects future actions as it unfolds.
Example: Coarticulation
Movements of articulators (lips, tongue…) during
speech reflect current, future and past events.
Yields subtle subphonemic variation in speech that
reflects temporal organization.
[Figure: net vs. neck.]
Sensitivity to these perceptual details might
yield earlier disambiguation.
These processes have largely been ignored
because of a history of evidence that perceptual
variability gets discarded.
Example: Categorical Perception
Categorical Perception
[Figure: identification (% /pa/) and discrimination plotted against VOT, from B to P; both vertical axes run from 0 to 100.]
• Sharp identification of tokens on a continuum.
• Discrimination poor within a phonetic category.
Subphonemic variation in VOT is discarded in favor
of a discrete symbol (phoneme).
Evidence against the strong form of Categorical
Perception comes from a variety of
psychophysical-type tasks:
Discrimination Tasks
Pisoni and Tash (1974)
Pisoni & Lazarus (1974)
Carney, Widin & Viemeister (1977)
Training
Samuel (1977)
Pisoni, Aslin, Perey & Hennessy (1982)
Goodness Ratings
Miller (1997)
Massaro & Cohen (1983)
Does within-category acoustic detail
systematically affect higher level
language?
Is there a gradient effect of
subphonemic detail on lexical
activation?
McMurray, Aslin & Tanenhaus (2002)
A gradient relationship would yield systematic effects
of subphonemic information on lexical activation.
If this gradiency is useful for temporal integration, it
must be preserved over time.
Need a design sensitive to both acoustic detail and
detailed temporal dynamics of lexical activation.
Acoustic Detail
Use a speech continuum: more steps yield a
better picture of the acoustic mapping.
KlattWorks: generate synthetic continua from
natural speech.
9-step VOT continua (0-40 ms)
6 pairs of words.
beach/peach bale/pale bear/pear
bump/pump bomb/palm butter/putter
6 fillers.
lamp leg lock ladder lip leaf
shark shell shoe ship sheep shirt
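As a quick check on the design, the continuum steps and the number of experimental tokens follow from the figures above. This is a minimal sketch that assumes the 9 steps are evenly spaced; the slides state only the range and the number of steps.

```python
# Assumption: the 9 steps are evenly spaced across the 0-40 ms range.
N_STEPS = 9
VOT_MIN_MS, VOT_MAX_MS = 0, 40

step_ms = (VOT_MAX_MS - VOT_MIN_MS) / (N_STEPS - 1)
vot_steps = [VOT_MIN_MS + i * step_ms for i in range(N_STEPS)]

pairs = ["beach/peach", "bale/pale", "bear/pear",
         "bump/pump", "bomb/palm", "butter/putter"]

# One synthetic token per continuum step per word pair.
n_experimental_tokens = N_STEPS * len(pairs)

print(vot_steps)
print(n_experimental_tokens)
```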
Temporal Dynamics
How do we tap on-line recognition?
With an on-line task: Eye-movements
Subjects hear spoken language and manipulate
objects in a visual world.
Visual world includes set of objects with interesting
linguistic properties.
a beach, a peach and some unrelated items.
Eye-movements to each object are monitored
throughout the task.
Tanenhaus, Spivey-Knowlton, Eberhard & Sedivy, 1995
Why use eye-movements and visual world paradigm?
•Relatively natural task.
•Eye-movements generated very fast (within 200ms
of first bit of information).
•Eye movements time-locked to speech.
•Subjects aren’t aware of eye-movements.
•Fixation probability maps onto lexical activation.
Task
A moment to view the items.
Example stimulus: Bear
Repeat 1080 times.
Identification Results
[Figure: proportion /p/ responses as a function of VOT (0 to 40 ms), from B to P.]
High agreement across subjects and items for category boundary.
By subject: 17.25 +/- 1.33 ms
By item: 17.24 +/- 1.24 ms
Task
[Figure: fixation record for trials 1 to 5 plotted over time, with a 200 ms marker.]
Target = Bear
Competitor = Pear
Unrelated = Lamp, Ship
Task
[Figure: fixation proportion over time (0 to 2000 ms), one panel for VOT = 0 and one for VOT = 40.]
More looks to competitor than unrelated items.
Task
Given that
• the subject heard bear
• clicked on "bear"…
How often was the subject looking at the "pear"?
[Figure: predicted target and competitor fixation proportions over time, for categorical results vs. a gradient effect.]
Results
[Figure: competitor fixations over time since word onset (0 to 2000 ms), one panel per response, one curve per VOT step (0 to 40 ms in 5 ms steps).]
Long-lasting gradient effect: seen throughout
the timecourse of processing.
[Figure: competitor fixations as a function of VOT (0 to 40 ms), one panel per response, with the category boundary marked.]
Area under the curve:
Clear effects of VOT B: p=.017* P: p.1
Distance from Category Boundary P: p=.027
Summary: Gradiency
Across continua, looks to competitors validated gradient
hypothesis.
        Continuum   Vowel   Finding
B/P     P=.0015     .006    Replicate prior work; 2D gradiency
B/W     .001        .05     Extend gradiency to FT Slope; 2D gradiency
R/L     .001        >.1     Extend gradiency to F3; validate methods
D/G     .017        >.1     Extend gradiency to place; validate methods
Results: Temporal Dynamics
When do effects occur?
VOT / FTStep effect co-occurs with vowel length.
(Sublexical Integration)
VOT / FTStep precedes vowel length.
(Lexical locus)
Compute 3 effect sizes at each 20 ms time slice.
•VOT / FTStep: Regression slope of competitor
fixations as a function of VOT.
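The per-slice regression slope described above can be sketched as follows. This is an illustrative implementation, not the analysis code behind the slides: the function name, the array layout, and the synthetic numbers are assumptions.

```python
import numpy as np


def vot_slopes(fixations, vot_distance):
    """Least-squares slope of competitor fixations on VOT, one per time slice.

    fixations:    array (n_time_slices, n_vot_steps) of fixation proportions
    vot_distance: array (n_vot_steps,) of VOT values (e.g. distance from boundary)
    """
    y = np.asarray(fixations, dtype=float)
    x = np.asarray(vot_distance, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean(axis=1, keepdims=True)
    # Slope M_t = cov(x, y_t) / var(x), computed for every slice at once.
    return (y_centered @ x_centered) / (x_centered @ x_centered)


# Synthetic illustration only (not data from the experiment):
# slice 0 has no VOT effect, slice 1 has a slope of 0.002 per ms.
vot_distance = np.array([-25, -20, -15, -10, -5])
synthetic = np.vstack([
    np.full(5, 0.05),
    0.10 + 0.002 * vot_distance,
])
slopes = vot_slopes(synthetic, vot_distance)
print(slopes)
```

Each row of the input corresponds to one 20 ms slice, so the returned vector is the time course of the VOT effect size M.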
Time = 720 ms…
[Figure: left, competitor fixations over time, one curve per VOT distance from boundary (-25 to -5); right, competitor fixations against distance from boundary (VOT), with the fitted line Y = M720x + B.]
Time = 740 ms…
[Figure: the same regression at the next time slice, with the fitted line Y = M740x + B.]
Compute 3 effect sizes at each 20 ms time slice.
•Vowel Length: Difference (D) between fixations
after hearing long vs. short vowel.
Time = 340 ms…
[Figure: competitor fixations after long vs. short vowels; L - S = D.]
•Repeat for each time slice, subject.
Compute 3 effect sizes at each 20 ms time slice.
•Unrelated: Difference between looks to target
after an experimental vs. filler stimulus.
Information available from the earliest moments
of processing: subjects should show early effect.
Does analysis have sufficient power?
Resulting dataset…
Subject   Time   Unrelated   VOT (M)   Vowel (D)
1         20     0.02076     -0.0023   0.0094
          40     0.02446     -0.0016   0.0095
          60     0.02916     -0.0008   0.0108
          …
          2000   0.99871     0.06021   0.123
2         20     0.05642     0.0014    0.0091
          40     0.07126     0.0018    0.0088
          60     0.08926     0.0029    0.0104
          …
          2000   0.99261     0.0604    0.1223
…
Results: Temporal Dynamics
Model 1: Sublexical integration
Effect of VOT / FTStep appears at same time as
Vowel Length
Model 2: Lexical Locus
Effect of VOT / FTStep precedes Vowel Length
[Figure: one timeline per model, each showing VOT followed by Vowel Length. Labels: Sublexical Rep. (phonemes); Partial representation retained...; More complete representation…; The Lexicon.]
B/P: Effects on looks to Competitor
[Figure: looks to competitor, combined (b/p). Normalized effect size over time (0 to 1200 ms) for Vowel, VOT, and UR.]
Little sequentiality: vowel length and VOT
effects appear at same time.
[Figure: looks to competitor (b/p), separate B and P panels. Normalized effect size over time (0 to 1200 ms) for Vowel, VOT, and UR.]
Some sequentiality on voiced side.
None on voiceless.
B/P Summary
Limited sequentiality of effects supports some kind
of sublexical integration.
•Voiced: ~sequential effects.
•Voiceless: effect of VOT simultaneous with
vowel length.
VOT requires at least some portion of the vowel for
lexical interpretation.
•Voiceless sounds need "more".
•Consistent with prior measurement and
perceptual work.
B/W: Effects on looks to Competitor
[Figure: looks to competitor, combined (b/w). Normalized effect size over time (0 to 1200 ms) for Vowel, Step, and UR.]
Clearly sequential: FTStep effects appear
before vowel length.
[Figure: looks to competitor (b/w), separate B and W panels. Normalized effect size over time (0 to 1200 ms) for Vowel, Step, and UR.]
Clear sequentiality on both sides.
B/W Summary
Manner of Articulation
•Clear sequential effects on competitor.
•Support lexical locus of temporal integration.
Formant transition slope may not work similarly to VOT.
•Is VOT the right cue for voicing?
•What was actually manipulated?
FTSlope vs. Transition Duration
Experiment 1 Conclusions
Gradient effect on lexical activation extended to
•Multi-dimensional categories
VOT & Vowel Length
FTStep & Vowel Length
•Additional phonetic dimensions
B/W: Manner of articulation
R/L: Laterality
D/G: Place of Articulation
Temporal Integration:
•VOT effect precedes vowel length only for voiced
sounds:
Some vowel required to interpret VOT.
•FTStep effect precedes vowel length.
Supports lexical integration.
Experiment 2
Lexical activation can play a role in integrating
multiple phonemic cues.
How long is the information available?
How is information at multiple levels integrated?
Misperception
What if a stimulus was misperceived?
Competitor still active
-- easy to activate it the rest of the way.
Competitor completely inactive
-- system will “garden-path”.
P(misperception) varies with distance from boundary.
Gradient activation allows the system to hedge its bets.
barricade vs. parakeet: /beIkeId/ vs. /peIkit/
Input: p/b eI k i t…
[Figure: activation of parakeet and barricade over time, for a categorical lexicon and for gradient sensitivity.]
Methods
10 Pairs of b/p items.
Voiced Voiceless Overlap
Bumpercar Pumpernickel 6
Barricade Parakeet 5
Bassinet Passenger 5
Blanket Plankton 5
Beachball Peachpit 4
Billboard Pillbox 4
Drain Pipes Train Tracks 4
Dreadlocks Treadmill 4
Delaware Telephone 4
Delicatessen Television 4
10 Pairs of b/p items.
• 0 – 35 ms VOT continua.
20 Filler items (lemonade, restaurant, saxophone…)
Option to click "X" (Mispronounced).
26 Subjects
1240 Trials over two days.
Identification Results
[Figure: response rate (Voiced, Voiceless, NW) as a function of VOT (0 to 35 ms), one panel for the Barricade to Parricade continuum and one for the Barakeet to Parakeet continuum.]
Significant target responses even at extreme.
Graded effects of VOT on correct response rate.
Eye Movement Results
[Figure: fixations to target over time, one curve per VOT (0 to 35 ms in 5 ms steps); panels Barricade -> Parricade and Parakeet -> Barakeet.]
Faster activation of target as VOTs approach
lexical endpoint.
• Even within the non-word range.
Phonetic “Garden-Path”
"Garden-path" effect:
Difference between looks to each target
(b vs. p) at same VOT.
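The difference score defined above can be written out directly. This is a minimal sketch: the function name, the dictionary layout, and the placeholder numbers are assumptions for illustration and are not values from the experiment.

```python
def garden_path_effect(looks_voiced_item, looks_voiceless_item):
    """Difference in target fixations between the two items of a pair at each VOT.

    Each argument maps VOT (ms) -> proportion of fixations to that item's target.
    Only VOT values present in both mappings are compared.
    """
    shared = sorted(set(looks_voiced_item) & set(looks_voiceless_item))
    return {vot: looks_voiced_item[vot] - looks_voiceless_item[vot] for vot in shared}


# Placeholder proportions, invented purely to show the calculation.
barricade = {0: 0.8, 35: 0.5}
parakeet = {0: 0.5, 35: 0.8}
effect = garden_path_effect(barricade, parakeet)
print(effect)
```

A positive value means more looks to the voiced item's target at that VOT, and a negative value means more looks to the voiceless item's target.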
[Figure: fixations to target over time for Barricade and Parakeet, one panel at VOT = 0 (/b/) and one at VOT = 35 (/p/).]
[Figure: garden-path effect (Barricade - Parakeet) in target fixations as a function of VOT (0 to 35 ms).]
GP Effect:
Gradient effect of VOT.
Target: p.2
>.1
0 for unneeded categories.
Overgeneralization
• large σ
• costly: lose phonetic distinctions…
Undergeneralization
• small σ
• not as costly: maintain distinctiveness.
To increase likelihood of successful learning:
• err on the side of caution.
• start with small σ
39,900 Models Run
[Figure: P(Success) as a function of starting σ (0 to 60) for the 2 Category Model and the 3 Category Model.]
Small σ
Sparseness coefficient: % of space not strongly mapped to any category.
[Figure: categories along VOT, with unmapped space marked.]
[Figure: average sparseness coefficient over training epochs (0 to 12000); starting σ = .5-1.]
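One way to compute a sparseness coefficient of this kind is sketched below. The slides do not say how "strongly mapped" is operationalized, so the peak-normalized Gaussian response, the 0.1 threshold, the VOT grid, and the category means are all assumptions.

```python
import numpy as np


def sparseness_coefficient(means, sigmas, vot_grid, threshold=0.1):
    """Fraction of the VOT grid where no category responds above threshold.

    Assumes Gaussian categories; each category's response is scaled so that
    it equals 1.0 at its own mean.
    """
    grid = np.asarray(vot_grid, dtype=float)[:, None]
    mu = np.asarray(means, dtype=float)[None, :]
    sd = np.asarray(sigmas, dtype=float)[None, :]
    response = np.exp(-0.5 * ((grid - mu) / sd) ** 2)
    strongly_mapped = response.max(axis=1) >= threshold
    return 1.0 - strongly_mapped.mean()


# Hypothetical two-category space: means at 0 and 50 ms VOT.
vot_grid = np.linspace(-50, 100, 301)
sparse_small = sparseness_coefficient([0, 50], [1, 1], vot_grid)
sparse_large = sparseness_coefficient([0, 50], [30, 30], vot_grid)
print(sparse_small, sparse_large)
```

Under these assumptions narrow categories leave most of the grid unmapped, while wide categories cover all of it.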
Start with large σ
[Figure: average sparseness coefficient over training epochs (0 to 12000); starting σ = .5-1 and 20-40.]
Intermediate starting σ
[Figure: average sparseness coefficient over training epochs (0 to 12000); starting σ = .5-1, 3-11, 12-17, and 20-40.]
Limitations
1) Occasionally model leaves sparse regions at the end
of learning.
• Competition/Choice framework:
Additional competition or selection mechanisms
during processing: categorization despite
incomplete information.
2) Multi-dimensional categories
1-D: 3 parameters / category
2-D: 5 parameters / category
3-D: 21 parameters / category
• Incorporating cue/model-reliability may
reduce dimensionality.
Non-parametric approach?
[Figure: categories along VOT.]
•Competitive Hebbian Learning
(Rumelhart & Zipser, 1986).
•Not constrained by a particular
equation: can fill space better.
•Similar properties in terms of
starting σ and sparseness.
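The general idea of competitive learning can be sketched with a simplified winner-take-all rule on a single VOT dimension. This is a generic point-prototype variant, not the Rumelhart & Zipser formulation and not the model behind these slides; the learning rate, number of epochs, and synthetic bimodal input are assumptions.

```python
import numpy as np


def competitive_learning(samples, n_units=2, lr=0.05, n_epochs=3, seed=0):
    """Winner-take-all learning: only the closest prototype moves toward each input."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, dtype=float)
    # Start each unit at a randomly chosen input value.
    prototypes = rng.choice(samples, size=n_units, replace=False)
    for _ in range(n_epochs):
        for x in rng.permutation(samples):
            winner = np.argmin(np.abs(prototypes - x))
            # Only the winning unit is updated.
            prototypes[winner] += lr * (x - prototypes[winner])
    return np.sort(prototypes)


# Synthetic bimodal VOT input (hypothetical modes at 0 and 50 ms).
data_rng = np.random.default_rng(1)
vot = np.concatenate([data_rng.normal(0, 5, 500), data_rng.normal(50, 5, 500)])
protos = competitive_learning(vot, n_units=2)
print(protos)
```

No distributional form is fitted here: the units simply settle wherever the input is dense, which is the sense in which such a learner is not constrained by a particular equation.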
Model Conclusions
To avoid overgeneralization…
…better to start with small estimates for σ.
Small or even medium starting σ's lead to sparse
category structure during infancy: much of
phonetic space is unmapped.
Sparse categories:
Similar temporal integration to exp 2
Retain ambiguity (and partial
representations) until more input is available.
Infant Summary
Infants show graded sensitivity to subphonemic detail.
/b/-results: regions of unmapped phonetic space.
Statistical approach provides support for sparseness.
• Given current learning theories, sparseness
results from optimal starting parameters.
Empirical test will require a two-alternative task.
• AEM: train infants to make eye-movements in
response to stimulus identity.
Conclusions
Infant and adult word learning are sensitive to
subphonemic detail.
Sensitivity is important to adult and developing
word recognition systems.
1) Short term cue integration.
2) Long term phonology learning.
In both cases, partially ambiguous material is
retained until more data arrives.
The Future?
Change is the law of life. And those who look only to
the past or present are certain to miss the future.
-- John F. Kennedy
The Future?
Change is the law of life. And those [Word
Recognition Systems] who look only to the
past or present are certain to miss the future
[Acoustic Material].
-- John F. Kennedy-[McMurray]
Subphonemic cues signal upcoming events.
Can the system use the information to prepare
itself for future material?
The Last Word
Spoken language is defined by change.
But the information to cope with it is
in the signal.
Within-category acoustic variation is
signal, not noise.
• Infants make anticipatory eye-movements along
predicted trajectory, in response to stimulus identity.
• Two alternatives allow us to distinguish between
category boundary and unmapped space.