Outline for Today
LSA 352 Speech Recognition and Synthesis
Dan Jurafsky Lecture 7: Baum Welch/Learning and Disfluencies
IP Notice: Some of these slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al, and Bryan Pellom. I’ll try to accurately give credit on each slide.
LSA 352 Summer 2007
1
Learning
EM for HMMs (the “Baum-Welch Algorithm”) Embedded Training Training mixture gaussians
Disfluencies
LSA 352 Summer 2007
2
The Learning Problem
Input to Baum-Welch
O Q unlabeled sequence of observations vocabulary of hidden states
Baum-Welch = Forward-Backward Algorithm (Baum 1972) Is a special case of the EM or ExpectationMaximization algorithm (Dempster, Laird, Rubin) The algorithm will let us train the transition probabilities A= {aij} and the emission probabilities B={bi(o t)} of the HMM
For ice-cream task
O = {1,3,2.,,,.} Q = {H,C}
LSA 352 Summer 2007
3
LSA 352 Summer 2007
4
Starting out with Observable Markov Models
How to train? Run the model on the observation sequence O. Since it’s not hidden, we know which states we went through, hence which transitions and observations were used. Given that information, training:
B = {b k(ot)}: Since every state can only generate one observation symbol, observation likelihoods B are all 1.0 A = {aij}:
Extending Intuition to HMMs
For HMM, cannot compute these counts directly from observed sequences Baum-Welch intuitions:
Iteratively estimate the counts.
– Start with an estimate for a ij and bk , iteratively improve the estimates
Get estimated probabilities by:
– computing the forward probability for an observation – dividing that probability mass among all the different paths that contributed to this forward probability
aij =
$
q #Q
C(i " j) C(i " q)
LSA 352 Summer 2007
5
LSA 352 Summer 2007
6
!
1
The Backward algorithm
We define the backward probability as follows:
The Backward algorithm
We compute backward prob by induction:
" t (i) = P(ot +1,ot +2 ,...oT ,| qt = i,#)
This is the probability of generating partial observations Ot+1T from time t+1 to the end, given that the HMM is in state i at time t and of course given Φ.
!
LSA 352 Summer 2007
7
LSA 352 Summer 2007
8
Inductive step of the backward algorithm
(figure inspired by Rabiner and Juang)
Computation of βt(i) by weighted sum of all successive values βt+1
Intuition for re-estimation of aij
We will estimate ^aij via this intuition:
ˆ aij = expected number of transitions from state i to state j expected number of transitions from state i
Assume we had some estimate of probability that a given transition i->j was taken at time t in observation sequence. If we knew this probability for each time t, we could sum over all t to get expected value (count) for i->j.
Numerator intuition:
!
LSA 352 Summer 2007
9
LSA 352 Summer 2007
10
Re-estimation of aij
Let ξt be the probability of being in state i at time t and state j at time t+1, given O1..T and model Φ:
Computing not-quite-ξ
The four components of P(qt = i,qt +1 = j,) | " ) : #, $ ,aij and b j (ot )
" t (i, j) = P(qt = i,qt +1 = j | O, # )
We can compute ξ from not-quite-ξ, which is:
!
not _ quite _ " t (i, j) = P(qt = i,qt +1 = j,O | # )
!
!
LSA 352 Summer 2007
11
LSA 352 Summer 2007
12
2
From not-quite-ξ to ξ
We want:
From not-quite-ξ to ξ
We want:
" t (i, j) = P(qt = i,qt +1 = j | O, # )
not _ quite _ " t (i, j) = P(qt = i,qt +1 = j,O | # )
Which we compute as follows: We’ve got:
" t (i, j) = P(qt = i,qt +1 = j | O, # )
not _ quite _ " t (i, j) = P(qt = i,qt +1 = j,O | # )
Since: We’ve got:
!
!
LSA 352 Summer 2007
!
We need:
!
13
" t (i, j) =
not _ quite _ " t (i, j) P(O | # )
LSA 352 Summer 2007
14
!
From not-quite-ξ to ξ From ξ to aij
ˆ aij = expected number of transitions from state i to state j expected number of transitions from state i
" t (i, j) =
not _ quite _ " t (i, j) P(O | # )
!
The expected number of transitions from state i to state j is the sum over all t of ξ The total expected number of transitions out of state i is the sum over all transitions out of state i Final formula for reestimated aij
!
LSA 352 Summer 2007
15
LSA 352 Summer 2007
16
Re-estimating the observation likelihood b
Computing gamma
expected number of times in state j and observing symbol v k ˆ b j (v k ) = expected number of times in state j
We’ll need to know the probability of being in state j at time t:
!
LSA 352 Summer 2007
17
LSA 352 Summer 2007
18
3
Summary
The ratio between the expected number of transitions from state i to j and the expected number of all transitions from state i
Summary: Forward-Backward Algorithm
1) 2) 3) 4) Intialize Φ=(A,B) Compute α, β, ξ Estimate new Φ’=(A,B) Replace Φ with Φ’
5) If not converged go to 2
The ratio between the expected number of times the observation data emitted from state j is v k, and the expected number of times any observation is emitted from state j
LSA 352 Summer 2007
19
LSA 352 Summer 2007
20
The Learning Problem: Caveats
Network structure of HMM is always created by hand
no algorithm for double-induction of optimal structure and probabilities has been able to beat simple handbuilt structures. Always Bakis network = links go forward in time Subcase of Bakis net: beads-on-string net:
Complete Embedded Training
Setting all the parameters in an ASR system Given:
training set: wavefiles & word transcripts for each sentence Hand-built HMM lexicon
Uses:
Baum-Welch algorithm
We’ll return to this after we’ve introduced GMMs
Baum-Welch only guaranteed to return local max, rather than global optimum
LSA 352 Summer 2007
21
LSA 352 Summer 2007
22
Baum-Welch for Mixture Models
By analogy with ξ earlier, let’s define the probability of being in state j at time t with the kth mixture component accounting for ot:
N
How to train mixtures?
Choose M (often 16; or can tune M dependent on amount of training observations) • Then can do various splitting or clustering algorithms • One simple method for “splitting”: 1) Compute global mean µ and global variance 2) Split into two Gaussians, with means µ±ε (sometimes ε is 0.2σ) • 3) Run Forward-Backward to retrain 4) Go to 2 until we have 16 mixtures
Now,
%#
" tm ( j) =
i=1
t$1
( j)aij c jm b jm (ot )& j (t)
# F (T)
T T tm
T
#"
µ jm = !
tm t=1 T M
( j)ot c jm =
tk
#"
t=1 T M
( j) " jm =
tk
%#
t=1
tm
( j)(ot $ µ j )(ot $ µ j )T
T M tm
# #"
t=1 k=1
( j)
# #"
t=1 k=1
( j)
% %#
t=1 k=1
( j)
LSA 352 Summer 2007
23
LSA 352 Summer 2007
24
!
!
!
4
Embedded Training
Components of a speech recognizer:
Feature extraction: not statistical Language model: word transition probabilities, trained on some other corpus Acoustic model:
– Pronunciation lexicon: the HMM structure for each word, built by hand – Observation likelihoods bj(ot) – Transition probabilities aij
Embedded training of acoustic model
If we had hand-segmented and hand-labeled training data With word and phone boundaries We could just compute the
B: means and variances of all our triphone gaussians A: transition probabilities
And we’d be done! But we don’t have word and phone boundaries, nor phone labeling
LSA 352 Summer 2007
25
LSA 352 Summer 2007
26
Embedded training
Instead:
We’ll train each phone HMM embedded in an entire sentence We’ll do word/phone segmentation and alignment automatically as part of training process
Embedded Training
LSA 352 Summer 2007
27
LSA 352 Summer 2007
28
Initialization: “Flat start”
Transition probabilities:
set to zero any that you want to be “structurally zero” Set the rest to identical values
– The γ probability computation includes previous value of aij, so if it’s zero it will never change
Embedded Training
Now we have estimates for A and B So we just run the EM algorithm During each iteration, we compute forward and backward probabilities Use them to re-estimate A and B Run EM til converge
Likelihoods:
initialize µ and σ of each state to global mean and variance of all training data
LSA 352 Summer 2007
29
LSA 352 Summer 2007
30
5
Viterbi training
Baum-Welch training says:
We need to know what state we were in, to accumulate counts of a given output symbol ot We’ll compute ξI(t), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o t .
Forced Alignment
Computing the “Viterbi path” over the training data is called “forced alignment” Because we know which word string to assign to each observation sequence. We just don’t know the state sequence. So we use aij to constrain the path to go through the correct words And otherwise do normal Viterbi Result: state sequence!
Viterbi training says:
Instead of summing over all possible paths, just take the single most likely path Use the Viterbi algorithm to compute this “Viterbi” path Via “forced alignment”
LSA 352 Summer 2007
31
LSA 352 Summer 2007
32
Viterbi training equations
Viterbi Baum-Welch
Viterbi Training
Much faster than Baum-Welch But doesn’t work quite as well But the tradeoff is often worth it.
ˆ aij =
n ij ni
For all pairs of emitting states, 1 <= i, j <= N
!
n (s.t.ot = v k ) ˆ b j (v k ) = j nj
Where nij is number of frames with transition from i to j in best path And nj is number of frames where state j is occupied
LSA 352 Summer 2007
33
LSA 352 Summer 2007
34
!
Viterbi training (II)
Equations for non-mixture Gaussians
Log domain
In practice, do all computation in log domain Avoids underflow
Instead of multiplying lots of very small probabilities, we add numbers that are not so small.
µi =
1 " ot s.t. qt = i N i t=1
T
Single multivariate Gaussian (diagonal Σ) compute:
"i 2 =
!
1 T #(ot $ µ i )2 s.t. qt = i N i t =1
In log space:
Viterbi training for mixture Gaussians is more complex, generally just assign each observation to 1 mixture
!
LSA 352 Summer 2007
35
LSA 352 Summer 2007
36
6
Log domain
Repeating:
State Tying:
Young, Odell, Woodland 1994
With some rearrangement of terms
Where:
The steps in creating CD phones. Start with monophone, do EM training Then clone Gaussians into triphones Then build decision tree and cluster Gaussians Then clone and train mixtures (GMMs
Note that this looks like a weighted Mahalanobis distance!!! Also may justify why we these aren’t really probabilities (point estimates); these are really just distances.
LSA 352 Summer 2007
37
LSA 352 Summer 2007
38
Summary: Acoustic Modeling for LVCSR.
Increasingly sophisticated models For each state:
Gaussians Multivariate Gaussians Mixtures of Multivariate Gaussians
Outline
Disfluencies Characteristics of disfluences Detecting disfluencies MDE bakeoff Fragments
Where a state is progressively:
CI Phone CI Subphone (3ish per phone) CD phone (=triphones) State-tying of CD phone
Forward-Backward Training Viterbi training
LSA 352 Summer 2007
39
LSA 352 Summer 2007
40
Disfluencies
Disfluencies: standard terminology (Levelt)
Reparandum: thing repaired Interruption point (IP): where speaker breaks off Editing phase (edit terms): uh, I mean, you know Repair: fluent continuation
LSA 352 Summer 2007
41
LSA 352 Summer 2007
42
7
Why disfluencies?
Need to clean them up to get understanding
Does American airlines offer any one-way flights [uh] oneway fares for 160 dollars? Delta leaving Boston seventeen twenty one arriving Fort Worth twenty two twenty one forty
Counts (from Shriberg, Heeman)
Sentence disfluency rate
ATIS: 6% of sentences disfluent (10% long sentences) Levelt human dialogs: 34% of sentences disfluent Swbd: ~50% of multiword sentences disfluent TRAINS: 10% of words are in reparandum or editing phrase
Might help in language modeling
Disfluencies might occur at particular positions (boundaries of clauses, phrases)
Word disfluency rate
SWBD: ATIS: AMEX 6% 0.4% 13%
Annotating them helps readability Disfluencies cause errors in nearby words
– (human-human air travel)
LSA 352 Summer 2007
43
LSA 352 Summer 2007
44
Prosodic characteristics of disfluencies
Nakatani and Hirschberg 1994 Fragments are good cues to disfluencies Prosody:
Pause duration is shorter in disfluent silence than fluent silence F0 increases from end of reparandum to beginning of repair, but only minor change Repair interval offsets have minor prosodic phrase boundary, even in middle of NP:
– Show me all n- | round-trip flights | from Pittsburgh | to Atlanta
Syntactic Characteristics of Disfluencies
Hindle (1983) The repair often has same structure as reparandum Both are Noun Phrases (NPs) in this example:
So if could automatically find IP, could find and correct reparandum!
LSA 352 Summer 2007
45
LSA 352 Summer 2007
46
Disfluencies and LM
Clark and Fox Tree Looked at “um” and “uh”
“uh” includes “er” (“er” is just British non-rhotic dialect spelling for “uh”)
Um versus uh: delays
(Clark and Fox Tree)
Different meanings
Uh: used to announce minor delays
– Preceded and followed by shorter pauses
Um: used to announce major delays
– Preceded and followed by longer pauses
LSA 352 Summer 2007
47
LSA 352 Summer 2007
48
8
Utterance Planning
The more difficulty speakers have in planning, the more delays Consider 3 locations:
I: before intonation phrase: hardest II: after first word of intonation phrase: easier III: later: easiest
Delays at different points in phrase
And then uh somebody said, . [I] but um -- [II] don’t you think there’s evidence of this, in the twelfth - [III] and thirteenth centuries?
LSA 352 Summer 2007
49
LSA 352 Summer 2007
50
Disfluencies in language modeling
Should we “clean up” disfluencies before training LM (i.e. skip over disfluencies?)
Filled pauses
– Does United offer any [uh] one-way fares?
Does deleting disfluencies improve LM perplexity?
Stolcke and Shriberg (1996) Build two LMs
Raw Removing disfluencies
Repetitions
– What what are the fares?
See which has a higher perplexity on the data.
Filled Pause Repetition Deletion
Deletions
– Fly to Boston from Boston
Fragments (we’ll come back to these)
– I want fl- flights to Boston.
LSA 352 Summer 2007
51
LSA 352 Summer 2007
52
Change in Perplexity when Filled Pauses (FP) are removed
LM Perplexity goes up at following word:
Filled pauses tend to occur at clause boundaries
Word before FP is end of previous clause; word after is start of new clause;
Best not to delete FP
Removing filled pauses makes LM worse!! I.e., filled pauses seem to help to predict next word. Why would that be?
Some of the different things we’re doing [uh] there’s not time to do it all “there’s” is very likely to start a sentence So P(there’s|uh) is better estimate than P(there’s|doing)
Stolcke and Shriberg 1996
LSA 352 Summer 2007
53
LSA 352 Summer 2007
54
9
Suppose we just delete medial FPs
Experiment 2:
Parts of SWBD hand-annotated for clauses Build FP-model by deleting only medial FPs Now prediction of post-FP word (perplexity) improves greatly! Siu and Ostendorf found same with “you know”
What about REP and DEL
S+S built a model with “cleaned-up” REP and DEL Slightly lower perplexity But exact same word error rate (49.5%) Why?
Rare: only 2 words per 100 Doesn’t help because adjacent words are misrecognized anyhow!
LSA 352 Summer 2007
55
LSA 352 Summer 2007
56
Stolcke and Shriberg conclusions wrt LM and disfluencies
Disfluency modeling purely in the LM probably won’t vastly improve WER But
Disfluencies should be considered jointly with sentence segmentation task Need to do better at recognizing disfluent words themselves Need acoustic and prosodic features
WER reductions from modeling disfluencies+background events
Rose and Riccardi (1999)
LSA 352 Summer 2007
57
LSA 352 Summer 2007
58
HMIHY Background events
Out of 16,159 utterances:
Filled Pauses Word Fragments Hesitations Laughter Lipsmack Breath Non-Speech Noise Background Speech Operater Utt. Echoed Prompt 7189 1265 792 163 2171 8048 8834 3585 5112 5353
LSA 352 Summer 2007
59
Phrase-based LM
“I would like to make a collect call” “a [wfrag]” [brth] “[brth] and I”
LSA 352 Summer 2007
60
10
Rose and Riccardi (1999) Modeling “LBEs” does help in WER
ASR Word Accuracy System Configuration HMM LM Greeting (HMIHY) 58.7 60.8 60.8
LSA 352 Summer 2007
More on location of FPs
Peters: Medical dictation task
Monologue rather than dialogue In this data, FPs occurred INSIDE clauses Trigram PP after FP: 367 Trigram PP after word: 51
Test Corpora Card Number
Stolcke and Shriberg (1996b)
wk FP wk+1: looked at P(wk+1|wk) Transition probabilities lower for these transitions than normal ones
Baseline LBE LBE
Baseline Baseline LBE
87.7 88.1 89.8
Conclusion:
People use FPs when they are planning difficult things, so following words likely to be unexpected/rare/difficult
61
LSA 352 Summer 2007
62
Detection of disfluencies
Nakatani and Hirschberg Decision tree at wi-wj boundary
Detection/Correction
Bear, Dowding, Shriberg (1992) System 1: Hand-written pattern matching rules to find repairs
– – – – Look for identical sequences of words Look for syntactic anomalies (“a the”, “to from”) 62% precision, 76% recall Rate of accurately correcting: 57%
pause duration Word fragments Filled pause Energy peak within wi Amplitude difference between wi and wj F0 of wi F0 differences Whether wi accented
Results:
78% recall/89.2% precision
LSA 352 Summer 2007
63
LSA 352 Summer 2007
64
Using Natural Language Constraints
Gemini natural language system Based on Core Language Engine Full syntax and semantics for ATIS Coverage of whole corpus:
70% syntax 50% semantics
Using Natural Language Constraints
Gemini natural language system Run pattern matcher For each sentence it returned
Remove fragment sentences Leaving 179 repairs, 176 false positives Parse each sentence
– If succeed: mark as false positive – If fail:
run pattern matcher, make corrections Parse again If succeeds, mark as repair If fails, mark no opinion
LSA 352 Summer 2007
65
LSA 352 Summer 2007
66
11
NL Constraints
Syntax Only
Precision: 96%
Recent work: EARS Metadata Evaluation (MDE)
A recent multiyear DARPA bakeoff Sentence-like Unit (SU) detection:
find end points of SU Detect subtype (question, statement, backchannel)
Syntax and Semantics
Correction: 70%
Edit word detection:
Find all words in reparandum (words that will be removed)
Filler word detection
Filled pauses (uh, um) Discourse markers (you know, like, so) Editing terms (I mean)
Interruption point detection
LSA 352 Summer 2007
67
LSA 352 Summer 2007 Liu et al 2003
68
Kinds of disfluencies
Repetitions
I * I like it
MDE transcription
Conventions:
./ for statement SU boundaries, <> for fillers, [] for edit words, * for IP (interruption point) inside edits
Revisions
We * I like it
Restarts (false starts)
It’s also * I like it
And wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./
LSA 352 Summer 2007
69
LSA 352 Summer 2007
70
MDE Labeled Corpora
MDE Algorithms
Use both text and prosodic features At each interword boundary
Extract Prosodic features (pause length, durations, pitch contours, energy contours) Use N-gram Language model Combine via HMM, Maxent, CRF, or other classifier
CTS Training set (words) Test set (words) STT WER (%) SU % Edit word % Filler word % 484K 35K 14.9 13.6 7.4 6.8
BN 182K 45K 11.7 8.1 1.8 1.8
LSA 352 Summer 2007
71
LSA 352 Summer 2007
72
12
State of the art: SU detection
2 stage
Decision tree plus N-gram LM to decide boundary Second maxent classifier to decide subtype
State of the art: Edit word detection
Multi-stage model
– HMM combining LM and decision tree finds IP – Heuristics rules find onset of reparandum – Separate repetition detector for repeated words
Current error rates:
Finding boundaries
– 40-60% using ASR – 26-47% using transcripts
One-stage model
CRF jointly finds edit region and IP BIO tagging (each word has tag whether is beginning of edit, inside edit, outside edit)
Error rates:
43-50% using transcripts 80-90% using ASR
LSA 352 Summer 2007
73
LSA 352 Summer 2007
74
Using only lexical cues
3-way classification for each word
Edit, filler, fluent
Rules learned
Label Label Label Label Label Label all fluent filled pauses as fillers the left side of a simple repeat as an edit “you know” as fillers fluent well’s as filler fluent fragments as edits “I mean” as a filler
Using TBL
Templates: Change
– – – – Word X from L1 to L2 Word sequence X Y to L1 Left side of simple repeat to L1 Word with POS X from L1 to L2 if followed by word with POS Y
LSA 352 Summer 2007
75
LSA 352 Summer 2007
76
Error rates using only lexical cues
CTS, using transcripts
Edits: 68% Fillers: 18.1%
Conclusions: Lexical Cues Only
Can do pretty well with only words
(As long as the words are correct)
Broadcast News, using transcripts
Edits 45% Fillers 6.5%
Much harder to do fillers and fragments from ASR output, since recognition of these is bad
Using speech:
Broadcast news filler detection from 6.5% error to 57.2%
Other systems (using prosody) better on CTS, not on Broadcast News
LSA 352 Summer 2007
77
LSA 352 Summer 2007
78
13
Fragments
Incomplete or cut-off words:
Leaving at seven fif- eight thirty uh, I, I d-, don't feel comfortable You know the fam-, well, the families I need to know, uh, how- how do you feel… Uh yeah, yeah, well, it- it- that’s right. And it-
Fragment glottalization
Uh yeah, yeah, well,
it- it-
that’s right. And it-
SWBD: around 0.7% of words are fragments (Liu 2003) ATIS: 60.2% of repairs contain fragments (6% of corpus sentences had a least 1 repair) Bear et al (1992) Another ATIS corpus: 74% of all reparanda end in word fragments (Nakatani and Hirschberg 1994)
LSA 352 Summer 2007
79
LSA 352 Summer 2007
80
Why fragments are important
Frequent enough to be a problem:
Only 1% of words/3% of sentences But if miss fragment, likely to get surrounding words wrong (word segmentation error). So could be 3 or more times worse (3% words, 9% of sentences): problem!
How fragments are dealt with in current ASR systems
In training, throw out any sentences with fragments In test, get them wrong Probably get neighboring words wrong too! !!!!!
Useful for finding other repairs
In 40% of SRI-ATIS sentences containing fragments, fragment occurred at right edge of long repair 74% of ATT-ATIS reparanda ended in fragments
Sometimes are the only cue to repair
“leaving at eight thirty”
LSA 352 Summer 2007
81
LSA 352 Summer 2007
82
Cues for fragment detection
49/50 cases examined ended in silence >60msec; average 282ms (Bear et al) 24 of 25 vowel-final fragments glottalized (Bear et al)
Glottalization: increased time between glottal pulses
Cues for fragment detection
Nakatani and Hirschberg (1994) Word fragments tend to be content words:
Lexical Class Content Function Untranscribed Token 121 12 155 Percent 42% 4% 54%
75% don’t even finish the vowel in first syllable (i.e., speaker stopped after first consonant) (O’Shaughnessy)
LSA 352 Summer 2007
83
LSA 352 Summer 2007
84
14
Cues for fragment detection
Nakatani and Hirschberg (1994) 91% are one syllable or less
Syllables 0 1 2 3 Tokens 113 149 25 1 Percent 39% 52% 9% 0.3%
Cues for fragment detection
Nakatani and Hirschberg (1994) Fricative-initial common; not vowel-initial
Class Stop Vowel Fric
% words 23% 25% 33%
% frags 23% 13% 45%
% 1-C frags 11% 0% 73%
LSA 352 Summer 2007
85
LSA 352 Summer 2007
86
Liu (2003): Acoustic-Prosodic detection of fragments
Prosodic features
Duration (from alignments)
– Of word, pause, last-rhyme-in word – Normalized in various ways
Liu (2003): Acoustic-Prosodic detection of fragments
Voice Quality Features
Jitter
– A measure of perturbation in pitch period Praat computes this
Spectral tilt
– Overall slope of spectrum – Speakers modify this when they stress a word
F0 (from pitch tracker)
– Modified to compute stylized speaker-specific contours
Energy
– Frame-level, modified in various ways
Open Quotient
– Ratio of times in which vocal folds are open to total length of glottal cycle – Can be estimated from first and second harmonics – Creaky voice (laryngealization) vocal folds held together, so short open quotient
LSA 352 Summer 2007
87
LSA 352 Summer 2007
88
Liu (2003)
Use Switchboard 80%/20% Downsampled to 50% frags, 50% words Generated forced alignments with gold transcripts Extract prosodic and voice quality features Train decision tree
Liu (2003) results
Precision 74.3%, Recall 70.1%
hypothesis complete reference complete fragment 109 43 fragment 35 101
LSA 352 Summer 2007
89
LSA 352 Summer 2007
90
15
Liu (2003) features
Features most queried by DT
Feature jitter Energy slope difference between current and following word Ratio between F0 before and after boundary Average OQ Position of current turn Pause duration % .272 .241 .238 .147 0.084 0.018
LSA 352 Summer 2007
91
Liu (2003) conclusion
Very preliminary work Fragment detection is good problem that is understudied!
LSA 352 Summer 2007
92
Fragments in other languages
Mandarin (Chu, Sung, Zhao, Jurafsky 2006) Fragments cause similar errors as in English:
Fragments in Mandarin
Mandarin fragments unlike English; no glottalization. Instead: Mostly (unglottalized) repetitions
I- I was asking… He very- lived very well
So: best features are lexical, rather than voice quality
LSA 352 Summer 2007
93
LSA 352 Summer 2007
94
Summary
Disfluencies Characteristics of disfluences Detecting disfluencies MDE bakeoff Fragments
LSA 352 Summer 2007
95
16