Novel Speech Recognition
Models for Arabic
The Arabic Speech Recognition Team
JHU Workshop Final Presentations
August 21, 2002
Arabic ASR Workshop Team
Senior Participants:
Katrin Kirchhoff, UW
Jeff Bilmes, UW
John Henderson, MITRE
Mohamed Noamany, BBN
Pat Schone, DoD
Rich Schwartz, BBN
Graduate Students:
Sourin Das, JHU
Gang Ji, UW
Undergraduate Students:
Melissa Egan, Pomona College
Feng He, Swarthmore College
Affiliates:
Dimitra Vergyri, SRI
Daben Liu, BBN
Nicolae Duta, BBN
Ivan Bulyko, UW
Mari Ostendorf, UW
“Arabic”
• Dialects (Gulf Arabic, Egyptian Arabic, Levantine Arabic, North-African Arabic): used for informal conversation
• Modern Standard Arabic (MSA): cross-regional standard, used for formal communication
Arabic ASR: Previous Work
• dictation: IBM ViaVoice for Arabic
• Broadcast News: BBN TIDESOnTap
• conversational speech: 1996/1997 NIST
CallHome Evaluations
• little work compared to other languages
• few standardized ASR resources
Arabic ASR: State of the Art
(before WS02)
• BBN TIDESOnTap: 15.3% WER
• BBN CallHome system: 55.8% WER
• WER on conversational speech noticeably
higher than for other languages
(e.g., 30% WER for English CallHome)
→ focus on recognition of conversational
Arabic
Problems for Arabic ASR
• language-external problems:
– data sparsity, only 1 (!) standardized corpus of
conversational Arabic available
• language-internal problems:
– complex morphology, large number of possible
word forms
(similar to Russian, German, Turkish,…)
– differences between written and spoken
representation: lack of short vowels and other
pronunciation information
(similar to Hebrew, Farsi, Urdu, Pashto,…)
Corpus: LDC ECA CallHome
• phone conversations between family members/friends
• Egyptian Colloquial Arabic (Cairene dialect)
• high degree of disfluencies (9%), out-of-vocabulary words
(9.6%), foreign words (1.6%)
• noisy channels
• training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20
calls (1.5 hrs)
• very small amount of data for language modeling (150K words)!
MSA - ECA differences
• Phonology:
– /th/ → /s/ or /t/: thalatha - talata ('three')
– /dh/ → /z/ or /d/: dhahab - dahab ('gold')
– /zh/ → /g/: zhadeed - gideed ('new')
– /ay/ → /e:/: Sayf - Seef ('summer')
– /aw/ → /o:/: lawn - loon ('color')
• Morphology:
– inflections: yatakallamu - yitkallim ('he speaks')
• Vocabulary:
– different terms: TAwila - tarabeeza ('table')
• Syntax:
– word order differences: SVO - VSO
Workshop Goals
Improvements to Arabic ASR through:
• developing novel models to better exploit available data
• developing techniques for using out-of-corpus data
Research areas: factored language modeling, automatic romanization, integration of MSA text data
Factored Language Models
• complex morphological structure leads to
large number of possible word forms
• break up word into separate components
• build statistical n-gram models over individual
morphological components rather than
complete word forms
Automatic Romanization
• Arabic script lacks short vowels and other
pronunciation markers
• comparable English example
th fsh stcks f th nrth tlntc hv bn dpletd
the fish stocks of the north atlantic have been depleted
• lack of vowels results in lexical ambiguity;
affects acoustic and language model training
• try to predict vowelization automatically from
data and use result for recognizer training
Out-of-corpus text data
• no corpora of transcribed conversational
speech available
• large amounts of written (Modern Standard
Arabic) data available (e.g. Newspaper text)
• Can MSA text data be used to improve
language modeling for conversational
speech?
• Try to integrate data from newspapers,
transcribed TV broadcasts, etc.
Recognition Infrastructure
• baseline system: BBN recognition system
• N-best list rescoring
• Language model training: SRI LM toolkit with
significant additions implemented during this
workshop
• Note: no work on acoustic modeling, speaker
adaptation, noise robustness, etc.
• two different recognition approaches:
grapheme-based vs. phoneme-based
Summary of Results (WER)
Grapheme-based recognizer:
• Baseline: 59.0%
• Automatic romanization: 57.9%
• True romanization: 54.9%
Phone-based recognizer:
• Random: 62.7%
• Baseline: 55.8%
• Additional CallHome data: 55.1%
• Language modeling: 53.8%
• Oracle: 46%
Novel research
• new strategies for language modeling based on
morphological features
• new graph-based backoff schemes allowing wider
range of smoothing techniques in language modeling
• new techniques for automatic vowel insertion
• first investigation of use of automatically vowelized
data for ASR
• first attempt at using MSA data for language
modeling for conversational Arabic
• morphology induction for Arabic
Key Insights
• Automatic romanization improves grapheme-
based Arabic recognition systems
• trend: morphological information helps in
language modeling
• needs to be confirmed on larger data set
• Using MSA text data does not help
• We need more data!
Resources
• significant add-on to SRILM toolkit for general
factored language modeling
• techniques/software for automatic romanization
of Arabic script
• part-of-speech tagger for MSA & tagged text
Outline of Presentations
• 1:30 - 1:45: Introduction (Katrin Kirchhoff)
• 1:45 - 1:55: Baseline system (Rich Schwartz)
• 1:55 - 2:20: Automatic romanization (John Henderson,
Melissa Egan)
• 2:20 - 2:35: Language modeling - overview (Katrin Kirchhoff)
• 2:35 - 2:50: Factored language modeling (Jeff Bilmes)
• 2:50 - 3:05: Coffee Break
• 3:05 - 3:10: Automatic morphology learning (Pat Schone)
• 3:15 - 3:30: Text selection (Feng He)
• 3:30 - 4:00: Graduate student proposals (Gang Ji, Sourin Das)
• 4:00 - 4:30: Discussion and Questions
Thank you!
• Fred Jelinek, Sanjeev Khudanpur, Laura Graham
• Jacob Laderman + assistants
• Workshop sponsors
• Mark Liberman, Chris Cieri, Tim Buckwalter
• Kareem Darwish, Kathleen Egan
• Bill Belfield & colleagues from BBN
• Apptek
BBN Baseline System
for Arabic
Richard Schwartz, Mohamed Noamany,
Daben Liu, Bill Belfield, Nicolae Duta
JHU Workshop
August 21, 2002
BBN BYBLOS System
• Rough'n'Ready / OnTAP / OASIS system
• Version of BYBLOS optimized for
Broadcast News
• OASIS system fielded in Bangkok and
Amman
• Real-Time operation with 1-minute
delay
• 10%-20% WER, depending on data
BYBLOS Configuration
• 3-passes of recognition
– Forward Fast-match uses PTM models and
approximate bigram search
– Backward pass uses SCTM models and
approximate trigram search, creates N-best.
– Rescoring pass uses cross-word SCTM models and
trigram LM
• All runs in real time
– Minimal difference from running slowly
Use for Arabic Broadcast News
• Transcriptions are in normal Arabic script,
omitting short vowels and other diacritics.
• We used each Arabic letter as if it were a
phoneme.
• This allowed addition of large text corpora for
language modeling.
Initial BN Baseline
• 37.5 hours of acoustic training
• Acoustic training data (230K words) used for
LM training
• 64K-word vocabulary (4% OOV)
• Initial word error rate (WER) = 31.2%
Speech Recognition Performance
System (all real-time results) WER (%)
Baseline 31.2
+ 145M word LM (Al Hayat) 26.6
+ System Improvements (MLLR and tuning) 21.0
+ 128k Lexicon (OOV reduced to 2%) 20.4
+ Additional 20 hours acoustic data 19.1
+ 290M word LM + improved lexicon 17.3
+ New scoring (remove hamza from alif) 15.3
Call Home Experiments
• Modified OnTAP system to make it more
appropriate for Call Home data.
• Added features from LVCSR research to
OnTAP system for Call Home data.
• Experiments:
– Acoustic training: 80 conversations (15 hours)
• Transcribed with diacritics
– Acoustic training data (150K words) used for LM
– Real-time
Using OnTAP system for Call Home
System WER (%)
Baseline for OASIS 64.1
+ Bypass BN segmenter 63.4
+ Cepstral Mean Subtraction on conversations 62.4
+ Incremental MLLR on whole conversation 61.8
+ 1-level CMS (instead of 2) 60.8
Additions from LVCSR
System WER (%)
Baseline for OASIS 60.8
+ VTL on training and decoding (unoptimized) 59.0
+ LPC Smoothing with 40 poles 58.7
+ ‘split-init training’ 58.1
+ HLDA (not used for workshop) 56.6
+ Modified backoff (not used for workshop) 56.0
Output Provided for Workshop
• OASIS was run on various sets of training as needed
• Systems were run either for Arabic script phonemes
or 'Romanized' phonemes (with diacritics).
• In addition to workshop participants, others at BBN
provided assistance and worked on workshop
problems.
• Output provided for workshop was N-best sentences
– with separate scores for HMM, LM, #words, #phones,
#silences
– Due to high error rate (56%), the oracle error rate for 100
N-best was about 46%.
• Unigram lattices were also provided, with oracle error
rate of 15%
Phoneme HMM Topology Experiment
• The phoneme HMM topology was increased
for the Arabic script system from 5 states to
10 states in order to accommodate a
consonant and possible vowel.
• The gain was small (0.3% WER)
OOV Problem
• OOV Rate is 10%
– 50% is morphological variants of words in the
training set
– 10% is Proper names
– 40% is other unobserved words
• Tried adding words from BN and from
morphological transducer
– Added too many words with too small gain
Use BN to Reduce OOV
• Can we add words from BN to reduce OOV?
• BN text contains 1.8M distinct words.
• Adding entire 1.8M words reduces OOV from
10% to 3.9%.
• Adding top 15K words reduces OOV to 8.9%
• Adding top 25K words reduces OOV to 8.4%.
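The OOV figures above come from measuring how many test tokens fall outside the recognizer vocabulary before and after adding the most frequent words from another corpus. A minimal sketch of that measurement follows; the function names and the toy word lists are illustrative assumptions, not the BBN tooling.

```python
from collections import Counter

def oov_rate(test_tokens, vocab):
    """Fraction of test tokens that are not in the vocabulary."""
    misses = sum(1 for tok in test_tokens if tok not in vocab)
    return misses / len(test_tokens)

def add_top_k(vocab, extra_tokens, k):
    """Extend the vocabulary with the k most frequent words of another corpus."""
    top = [word for word, _ in Counter(extra_tokens).most_common(k)]
    return set(vocab) | set(top)

# Toy data (hypothetical): a 2-word vocabulary and a 4-token test set.
vocab = {"ana", "inta"}
test_tokens = ["ana", "inta", "huwwa", "kitab"]
extra_corpus = ["kitab", "kitab", "kitab", "bayt"]

before = oov_rate(test_tokens, vocab)
after = oov_rate(test_tokens, add_top_k(vocab, extra_corpus, k=1))
print(before, after)
```

The trade-off reported on the slide is visible even here: every added word enlarges the lexicon, but only the ones that actually occur in the test data lower the OOV rate.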
Use Morphological Transducer
• Use LDC Arabic transducer to expand verbs to
all forms
– Produces > 1M words
• Reduces OOV to 7%
Language Modeling Experiments
Described in other talks
• Searched for available dialect transcriptions
• Combine BN (300M words) with CH (230K)
• Use BN to define word classes
• Constrained back-off for BN+CH
Autoromanization of
Arabic Script
Melissa Egan and John Henderson
Autoromanization (AR) goal
• Expand Arabic script representation to include short
vowels and other pronunciation information.
• Phenomena not typically marked in non-diacritized script
include:
– Short vowels {a, i, u}
– Repeated consonants (shadda)
– Extra phonemes for Egyptian Arabic {f/v, j/g}
– Grammatical marker that adds an 'n' to the pronunciation
(tanween)
• Example
Non-diacritized form: ktb – write
Expansions:
kitab – book
aktib – I write
kataba – he wrote
kattaba – he caused to write
AR motivation
• Romanized text can be used to produce better output
from an ASR system.
– Acoustic models will be able to better disambiguate based
on extra information in text.
– Conditioning events in LM will contain more information.
• Romanized ASR output can be converted to script for
alternative WER measurement.
• Eval96 results (BBN recognizer, 80 conv. train)
– script recognizer: 61.1 WERG (grapheme)
– romanized recognizer: 55.8 WERR (roman)
AR data
CallHome Arabic from LDC
Conversational speech transcripts (ECA) in both
script and a roman specification that includes
short vowels, repeats, etc.
set               conversations   words   used for
asrtrain          80              135K    Romanizer testing
dev               20              35K     Romanizer testing
eval96 (asrtest)  20              15K     Romanizer testing
eval97            20              18K     Romanizer training
h5_new            20              18K     Romanizer training
Data format
• Script without and with diacritics
• CallHome in script and roman forms
Script: AlHmd_llh kwIsB w AntI AzIk
Roman: ilHamdulillA kuwayyisaB~ wi inti izzayyik
• Our task: predict the roman form from the script form
Autoromanization (AR) WER baseline
• Train on 32K words in eval97+h5_new
• Test on 137K words in ASR_train+h5_new
Status in train   portion in test   error in test   % of total test error
unambig.          68.0%             1.8%            6.2%
ambig.            15.5%             13.9%           10.8%
unknown           16.5%             99.8%           83.0%
total             100%              19.9%           100.0%
Biggest potential error reduction would come from predicting
romanized forms for unknown words.
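The total error and the per-class shares in the baseline table follow from weighting each class's error rate by its portion of the test set. The sketch below redoes that arithmetic from the rounded numbers on the slide, so the last digit can differ slightly from the published values; the variable names are illustrative.

```python
# (portion of test tokens in %, error rate within the class in %), from the slide
classes = {
    "unambig.": (68.0, 1.8),
    "ambig.": (15.5, 13.9),
    "unknown": (16.5, 99.8),
}

# Contribution of each class to the overall error, in percentage points.
contribution = {name: portion * err / 100.0 for name, (portion, err) in classes.items()}
total_error = sum(contribution.values())

# Share of the total error attributable to each class, in %.
share = {name: 100.0 * c / total_error for name, c in contribution.items()}

print(round(total_error, 2))
for name in classes:
    print(name, round(share[name], 1))
```

Unknown words make up about a sixth of the test tokens but account for roughly five sixths of the errors, which is why the following slides concentrate on them.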
AR “knitting” example
1. Find close known word
unknown: tbqwA
known: ybqwA
2. Record ops required to make roman from known
known: y bqwA
kn.roman: yibqu
ops: ciccrd
3. Construct new roman using same ops
unknown: t bqwA
ops: ciccrd
new roman: tibqu
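The three knitting steps can be sketched as follows. Reading the op string "ciccrd" as copy / insert / copy / copy / replace / delete is an assumption that happens to reproduce the slide's example; the payload characters for insert and replace, the helper names, and the use of Python's `difflib` for the "find close known word" step are likewise illustrative choices rather than the workshop implementation.

```python
import difflib

def closest_known(unknown, known_scripts):
    """Step 1: pick the most similar known script form (None if nothing is close)."""
    matches = difflib.get_close_matches(unknown, known_scripts, n=1)
    return matches[0] if matches else None

def apply_ops(script, ops):
    """Step 3: replay recorded ops on a script form to build a roman form.

    ops is a list of (op, char): 'c' copies the next script character,
    'i' inserts char, 'r' replaces the next script character with char,
    'd' deletes the next script character.
    """
    out, pos = [], 0
    for op, char in ops:
        if op == "c":
            out.append(script[pos])
            pos += 1
        elif op == "i":
            out.append(char)
        elif op == "r":
            out.append(char)
            pos += 1
        elif op == "d":
            pos += 1
    return "".join(out)

# Step 2 (recorded by hand here): ops that turn known 'ybqwA' into 'yibqu'.
OPS = [("c", None), ("i", "i"), ("c", None), ("c", None), ("r", "u"), ("d", None)]

known = closest_known("tbqwA", ["ybqwA", "ktb"])
print(known, apply_ops("tbqwA", OPS))
```

Because the ops are positional, the differing first consonant of the unknown word is simply copied through, while the vowel insertion and the final w/A handling are inherited from the known word.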
Experiment 1 (best match)
Observed patterns in the known short/long pairs:
Some characters in the short forms are consistently
found with particular, non-identical characters in the
long forms.
Example rule:
A → a
Experiment 2 (rules)
Environments in which 'w' occurs in training dictionary long forms:
Env     Freq
C _ V   149
V _ #   8
# _ V   81
C _ #   5
V _ V   121
V _ C   118
Environments in which 'u' occurs in training dictionary long forms:
Env     Freq
C _ C   1179
C _ #   301
# _ C   29
• Some output forms depend on output context.
• Rule:
– 'u' occurs only between two non-vowels.
– 'w' occurs elsewhere.
• Accurate for 99.7% of the instances of 'u' and 'w' in the
training dictionary long forms. A similar rule may be
formulated for 'i' and 'y'.
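The 99.7% figure can be checked directly from the two environment tables. The sketch below encodes the rule ('u' between two non-vowels, where both a consonant C and a word boundary # count as non-vowels) and scores it against the listed frequencies; the data layout and function names are illustrative.

```python
# Environment counts from the slide: (left context, right context) -> frequency
W_ENV = {("C", "V"): 149, ("V", "#"): 8, ("#", "V"): 81,
         ("C", "#"): 5, ("V", "V"): 121, ("V", "C"): 118}
U_ENV = {("C", "C"): 1179, ("C", "#"): 301, ("#", "C"): 29}

def predict(left, right):
    """Rule from the slide: 'u' only between two non-vowels, 'w' elsewhere."""
    return "u" if left != "V" and right != "V" else "w"

def rule_accuracy():
    correct = total = 0
    for label, envs in (("w", W_ENV), ("u", U_ENV)):
        for (left, right), freq in envs.items():
            total += freq
            if predict(left, right) == label:
                correct += freq
    return correct / total

print(round(100 * rule_accuracy(), 1))
```

The only counterexamples in the tables are the 5 instances of 'w' in the C _ # environment.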
Experiment 3 (local model)
• Move to more data-driven model
– Found some rules manually.
– Look for all of them, systematically.
• Use best-scoring candidate for replacement
– Environment likelihood score
– Character alignment score
Known long: H a n s A h a
Known short: H A n s A h A
input: H A m D y h A
result: H a m D I h a
Experiment 4 (n-best)
• Instead of generating romanized form using the single
best short form in the dictionary, generate romanized
forms using top n best short forms.
Example (n = 5)
Character error rate (CER)
• Measurement of insertions, deletions, and substitutions in
character strings should more closely track phoneme error
rate.
• More sensitive than WER
– Stronger statistics from same data
• Test set results
– Baseline 49.89 character error rate (CER)
– Best model 24.58 CER
– Oracle 2-best list 17.60 CER suggests more room for gain.
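A minimal sketch of a character error rate computation, assuming CER is the summed character-level edit distance (insertions, deletions, substitutions, each with cost 1) divided by the number of reference characters. How the workshop scoring treated spaces or normalization is not stated on the slide, so this is an illustration of the metric rather than the exact scorer.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def cer(references, hypotheses):
    """Character error rate over paired reference/hypothesis strings."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars

print(cer(["tibqu"], ["yibqu"]))
```

A word with one wrong vowel counts as a full error under WER but only as one character under CER, which is why CER gives finer-grained feedback on the same data.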
Summary of performance (dev set)
Model                                      Accuracy   CER
Baseline                                   8.4%       41.4%
Knitting                                   16.9%      29.5%
Knitting + best match + rules              18.4%      28.6%
Knitting + local model                     19.4%      27.0%
Knitting + local model + n-best (n = 25)   30.0%      23.1%
Varying the number of dictionary matches
[Figure: accuracy and CER (y-axis: performance, 18 to 30) as a function of the number of dictionary matches (x-axis: 0 to 200)]
ASR scenarios
1) Have a script recognizer, but want to
produce romanized form.
→ postprocessing ASR output
2) Have a small amount of romanized data
and a large amount of script data
available for recognizer training.
→ preprocessing ASR training set
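ASR experiments
[Diagram: two pipelines.
Preprocessing: script training data → AR → roman training data → ASR → roman result (WERR); roman-to-script conversion (R2S) → script result (WERG).
Postprocessing: script training data → ASR → script result (WERG); AR → roman result (WERR).]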
Experiment: adding script data
• Script LM training data could be acquired from found text.
• Script transcription is cheaper than roman transcription.
• Simulate a preponderance of script by training AR on a separate set.
• ASR is then trained on output of AR.
[Diagram: AR train (40 conv), ASR train (100 conv), future training set]
Eval 96 experiments, 80 conv
Config WERR WERG
script baseline N/A 59.8
post processing 61.5 59.8
preprocessing 59.9 59.2 (-0.6)
Roman baseline 55.8 55.6 (-4.2)
Bounding experiment
• No overlap between ASR train and AR train.
• Poor pronunciations for “made-up” words.
Eval 96 experiments, 100 conv
Config WERR WERG
script baseline N/A 59.0
postprocessing 60.7 59.0
preprocessing 58.5 57.5 (-1.5)
Roman baseline 55.1 54.9 (-4.1)
More realistic experiment
• 20 conversation overlap between ASR train
and AR train.
• Better pronunciations for “made-up” words.
Remaining challenges
• Correct “dangling tails” in short matches
• Merge unaligned characters
Bigram translation model
$$r^* = \arg\max_r \; p(s, d_s)\, p(r \mid s, d_l) = \arg\max_r \; p(s, d_s)\, p(r, s, d_l), \qquad d = (d_s, d_l)$$
$$p(r, s, d_l) = \prod_i p(r_i \mid r_{i-1})\, p(s_j \mid r_i)\, p(d_{l_k} \mid r_i)$$
input s: t b q w A, scored by p(s_j | r_i)
output r: □ t i b q u □, scored by p(r_i | r_{i-1})
kn. roman d_l: y i b q u, scored by p(d_{l_k} | r_i)
Future work
• Context provides information for
disambiguating both known and unknown
words
– Bigrams for unknown words will also be unknown,
use part of speech tags or morphology.
• Acoustics
– Use acoustics to help disambiguate vowels?
– Provide n-best output as alternative
pronunciations for ASR training.
Factored Language
Modeling
Katrin Kirchhoff, Jeff Bilmes, Dimitra Vergyri,
Pat Schone, Gang Ji, Sourin Das
Arabic morphology
• structure of Arabic derived words: fa-sakan-tu
particle: fa-; root: s-k-n; pattern: a-a; affix: -tu
LIVE + past + 1st-sg-past + part: “so I lived”
Arabic morphology
• ~5000 roots
• several hundred patterns
• dozens of affixes
→ large number of possible word forms
→ problems training robust language model
→ large number of OOV words
Vocabulary Growth - full word forms
[Figure: vocabulary size (0 to 16000) vs. number of word tokens (10k to 120k) for CallHome English and Arabic]
Vocabulary Growth - stemmed words
[Figure: vocabulary size (0 to 16000) vs. number of word tokens (10k to 120k) for CallHome; four curves: EN words, AR words, EN stems, AR stems]
Particle model
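• Break words into sequences of stems + affixes:
$$W \rightarrow \mu_1, \mu_2, \ldots, \mu_M$$
• Approximate probability of word sequence by probability of particle sequence (μ_t is the t-th particle):
$$P(W_1, W_2, \ldots, W_N) \approx \prod_{t=n}^{T} P(\mu_t \mid \mu_{t-1}, \mu_{t-2}, \ldots, \mu_{t-n+1})$$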
Factored Language Model
• Problem: how can we estimate P(Wt|Wt-1,Wt-2,...) ?
• Solution: decompose W into its morphological
components: affixes, stems, roots, patterns
• words can be viewed as bundles of features
Pt-2 Pt-1 Pt patterns
Rt-2 Rt-1 Rt roots
At-2 At-1 At affixes
St-2 St-1 St stems
Wt-2 Wt-1 Wt words
Statistical models for factored
representations
• Class-based LM:
$$P(W_t \mid W_{t-1}, W_{t-2}) \approx P(W_t \mid F_t)\, P(F_t \mid F_{t-1}, F_{t-2})$$
• Single-stream LM:
$$P(F_t \mid F_{t-1}, F_{t-2}, \ldots, F_1) \approx P(F_t \mid F_{t-1}, F_{t-2})$$
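Full Factored Language Model
assume $w_i \equiv \{a_i, r_i, \pi_i\}$ where
w = word, r = root, π = pattern, a = affixes
$$\begin{aligned}
P(w_i \mid w_{i-1}, w_{i-2}) &= P(a_i, r_i, \pi_i \mid a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \\
&= P(a_i \mid r_i, \pi_i, a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \\
&\quad \times P(r_i \mid \pi_i, a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \\
&\quad \times P(\pi_i \mid a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2})
\end{aligned}$$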
• Goal: find appropriate conditional independence
statements to simplify this model.
Experimental Infrastructure
• All language models tested using nbest
rescoring
• two baseline word-based LMs:
– B1: BBN LM, WER 55.1%
– B2: WS02 baseline LM, WER 54.8%
• combination of baselines: 54.5%
• new language models were used in
combination with one or both baseline LMs
• log-linear score combination scheme
Log-linear combination
For m information sources, each producing a
maximum-likelihood estimate for W:
$$P(W \mid I) = \frac{1}{Z(I)} \prod_{i=1}^{m} P_i(W \mid I_i)^{k_i}$$
I: total information available
I_i: the i'th information source
k_i: weight for the i'th information source
Discriminative combination
• We optimize the combination weights jointly with the
language model and insertion penalty to directly
minimize WER of the maximum likelihood hypothesis.
• The normalization factor can be ignored since it is the
same for all alternative hypotheses.
• Used the simplex optimization method on the 100-
bests provided by BBN (optimization algorithm
available in the SRILM toolkit).
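A minimal sketch of how the log-linear combination is used when rescoring an N-best list. In the log domain the product becomes a weighted sum of log scores, and Z(I) is dropped because it is the same for every hypothesis. The hypothesis layout, the score values, and the function names are illustrative assumptions; the weight search itself (simplex optimization against WER) is not shown.

```python
def combined_score(log_probs, weights, n_words, insertion_penalty):
    """Unnormalized log-linear score: sum_i k_i * log P_i(W | I_i) + penalty * #words."""
    weighted = sum(k * lp for k, lp in zip(weights, log_probs))
    return weighted + insertion_penalty * n_words

def rescore(nbest, weights, insertion_penalty=0.0):
    """Return the hypothesis with the highest combined score."""
    return max(
        nbest,
        key=lambda h: combined_score(h["log_probs"], weights,
                                     len(h["words"]), insertion_penalty),
    )

# Toy 2-best list; log_probs = [acoustic, word LM, morphological class LM] (made-up values).
nbest = [
    {"words": ["a", "b"], "log_probs": [-10.0, -5.0, -6.0]},
    {"words": ["a", "c"], "log_probs": [-9.0, -8.0, -9.0]},
]

print(rescore(nbest, weights=[1.0, 1.0, 1.0])["words"])
print(rescore(nbest, weights=[1.0, 0.0, 0.0])["words"])
```

With all sources weighted equally the language models overrule the acoustic preference for the second hypothesis; with the language model weights set to zero the acoustic score decides. Tuning these weights is what the discriminative combination step does.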
Word decomposition
• Linguistic decomposition (expert knowledge)
• automatic morphological decomposition: acquire
morphological units from data without using human
knowledge
• assign words to classes based not on characteristics
of word form but based on distributional properties
(Mostly) Linguistic Decomposition
• Stems/morph class: information from LDC CH lexicon:
$atamna → $atam:verb+past-1st-plural (stem : morph. tag)
• roots: determined by K. Darwish's morphological
analyzer for MSA
$atam → $tm
• pattern: determined by subtracting root from stem
$atam → CaCaC
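One way to read "subtracting root from stem" is to walk through the stem and replace each root consonant, in order, by the placeholder C. The sketch below does exactly that and reproduces the slide's example; the function name and the error handling are assumptions, and real stems with weak or assimilated root letters would need more care.

```python
def extract_pattern(stem, root):
    """Replace the root consonants in the stem (in order) with 'C'."""
    pattern = []
    j = 0  # index of the next root consonant to match
    for ch in stem:
        if j < len(root) and ch == root[j]:
            pattern.append("C")
            j += 1
        else:
            pattern.append(ch)
    if j != len(root):
        raise ValueError("root consonants not found in order in stem")
    return "".join(pattern)

print(extract_pattern("$atam", "$tm"))
```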
Automatic Morphology
• Classes defined by morphological components
derived from data
• no expert knowledge
• based on statistics of word forms
• more details in Pat's presentation
Data-driven Classes
• Word clustering based on distributional statistics
• Exchange algorithm (Martin et al., 1998)
– initially assign words to individual clusters
– temporarily move each word to every other cluster, compute
change in perplexity (class-based trigram)
– keep assignment that minimizes perplexity
– stop when class assignment no longer changes
• bottom-up clustering (SRI toolkit)
– initially assign words to individual clusters
– successively merge pairs of clusters with highest average
mutual information
– stop at specified number of classes
Results
• Best word error rates obtained with:
– particle model: 54.0% (B1 + particle LM)
– class-based models: 53.9% (B1+Morph+Stem)
– automatic morphology: 54.3% (B1+B2+Rule)
– data-driven classes: 54.1% (B1+SRILM, 200
classes)
• combination of best models: 53.8%
Conclusions
• Overall improvement in WER gained from language
modeling (1.3%) is significant
• individual differences between LMs are not significant
• but: adding morphological class models always helps
language model combination
• morphological models get the highest weights in
combination (in addition to word-based LMs)
• trend needs to be verified on larger data set
application to script-based system?
Factored Language Models
and Generalized Graph
Backoff
Jeff Bilmes, Katrin Kirchhoff
University of Washington, Seattle &
JHU-WS02 ASR Team
Outline
• Language Models, Backoff, and Graphical
Models
• Factored Language Models (FLMs) as
Graphical Models
• Generalized Graph Backoff algorithm
• New features to SRI Language Model
Toolkit (SRILM)
Standard Language Modeling
• Example: standard 4-gram (three words of history)
$$P(w_t \mid h_t) = P(w_t \mid w_{t-1}, w_{t-2}, w_{t-3})$$
[Diagram: word chain W_{t-4}, W_{t-3}, W_{t-2}, W_{t-1}, W_t]
Typical Backoff in LM
• In typical LM, there is one natural (temporal) path to back off along.
• Well motivated since information often decreases with word distance.
Backoff path: W_t | W_{t-1}, W_{t-2}, W_{t-3} → W_t | W_{t-1}, W_{t-2} → W_t | W_{t-1} → W_t
Factored LM: Proposed Approach
• Decompose words into smaller morphological
or class-based units (e.g., morphological
classes, stems, roots, patterns, or other
automatically derived units).
• Produce probabilistic models over these units
to attempt to improve WER.
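Example with Words, Stems, and
Morphological classes
[Diagram: three parallel streams over time: morphological classes M_{t-3} ... M_t, stems S_{t-3} ... S_t, words W_{t-3} ... W_t]
$$P(w_t \mid s_t, m_t)\; P(s_t \mid m_t, w_{t-1}, w_{t-2})\; P(m_t \mid w_{t-1}, w_{t-2})$$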
Example with Words, Stems, and
Morphological classes
[Diagram: three parallel streams over time: morphological classes M_{t-3} ... M_t, stems S_{t-3} ... S_t, words W_{t-3} ... W_t]
$$P(w_t \mid w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})$$
In general
[Diagram: three parallel factor streams over time: F^k_{t-3}, F^k_{t-2}, F^k_{t-1}, F^k_t for k = 1, 2, 3]
General Factored LM
• A word is equivalent to collection of factors.
$$\{w_t\} \equiv \{f_t^{1:K}\}, \qquad f^k = \text{the } k\text{th factor}$$
• E.g., if K=3
$$\begin{aligned}
P(w_t \mid w_{t-1}, w_{t-2}) &= P(f_t^1, f_t^2, f_t^3 \mid f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \\
&= P(f_t^1 \mid f_t^2, f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \\
&\quad \times P(f_t^2 \mid f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \\
&\quad \times P(f_t^3 \mid f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3)
\end{aligned}$$
• Goal: find appropriate conditional independence
statements to simplify this sort of model while
keeping perplexity and WER low. This is the
structure learning problem in graphical models.
The General Case
[Diagram sequence: the grid of factors F^k_{t-3} ... F^k_t for k = 1, 2, 3; a child factor F_i with parents F_A1, F_A2, F_A3; and the successive subsets of parents (pairs, then single parents, then none) obtained by dropping one parent at a time]
A Backoff Graph (BG)
Fi | FA1 , FA2 , FA3
Fi | FA1 , FA2 Fi | FA1 , FA3 Fi | FA2 , FA3
Fi | FA1 Fi | FA2 Fi | FA3
Fi
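The backoff graph is the lattice of all subsets of the parent set: each node conditions on some subset, and each edge drops exactly one parent. A short sketch that enumerates the levels and the children of a node; the names `backoff_graph` and `children` are illustrative.

```python
from itertools import combinations

def backoff_graph(parents):
    """Levels of the backoff graph, from all parents down to no parents."""
    levels = []
    for k in range(len(parents), -1, -1):
        levels.append([frozenset(c) for c in combinations(parents, k)])
    return levels

def children(node):
    """Nodes reachable from `node` by dropping exactly one parent."""
    return [node - {p} for p in node]

levels = backoff_graph(["A1", "A2", "A3"])
for level in levels:
    print(["Fi | " + ",".join(sorted(n)) if n else "Fi" for n in level])
```

With three parents the graph has 1 + 3 + 3 + 1 = 8 nodes and 3! = 6 distinct top-to-bottom paths, compared with the single temporal path of a standard word n-gram.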
Example: 4-gram Word Generalized
Backoff
W_t | W_{t-1}, W_{t-2}, W_{t-3}
W_t | W_{t-1}, W_{t-2}    W_t | W_{t-1}, W_{t-3}    W_t | W_{t-2}, W_{t-3}
W_t | W_{t-1}    W_t | W_{t-2}    W_t | W_{t-3}
W_t
How to choose backoff path?
Four basic strategies
1.Fixed path (based on what seems
reasonable (e.g., temporal
constraints))
2.Generalized all-child backoff
3.Constrained multi-child backoff
4.Child combination rules
Choosing a fixed back-off path
Fi | FA1 , FA2 , FA3
Fi | FA1 , FA2 Fi | FA2 , FA3 Fi | FA1 , FA3
Fi | FA1 Fi | FA2 Fi | FA3
Fi
Generalized Backoff
$$P_{BO}(f \mid f_{P1}, f_{P2}) =
\begin{cases}
d_{N(f, f_{P1}, f_{P2})}\, \dfrac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})} & \text{if } N(f, f_{P1}, f_{P2}) > 0 \\[2ex]
\alpha(f_{P1}, f_{P2})\, g(f, f_{P1}, f_{P2}) & \text{otherwise}
\end{cases}$$
• In typical backoff, we drop 2nd parent and use
conditional probability.
$$g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P1})$$
• More generally, g() can be any positive function, but
need new algorithm for computing backoff weight
(BOW).
Computing BOWs
$$\alpha(f_{P1}, f_{P2}) = \frac{1 - \displaystyle\sum_{f:\, N(f, f_{P1}, f_{P2}) > 0} d_{N(f, f_{P1}, f_{P2})}\, \frac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})}}{\displaystyle\sum_{f:\, N(f, f_{P1}, f_{P2}) = 0} g(f, f_{P1}, f_{P2})}$$
• Many possible choices for g() functions (next
few slides)
• Caveat: certain g() functions can make the
LM much more computationally costly than
standard LMs.
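The following sketch shows the generalized backoff probability and its backoff weight on toy counts. Several things here are assumptions made for illustration only: a single constant discount factor d stands in for the count-dependent d_N, the g() function takes the larger of two add-one-smoothed single-parent estimates (in the spirit of a "max" rule), and the event data are made up. It is not the SRILM extension itself.

```python
from collections import Counter

VOCAB = ["a", "b", "c"]
# Toy training events (f, parent1, parent2); hypothetical data.
EVENTS = [("a", "x", "y"), ("a", "x", "y"), ("b", "x", "y"),
          ("c", "x", "z"), ("b", "w", "y")]

n3 = Counter(EVENTS)                                   # N(f, fP1, fP2)
n_ctx = Counter((p1, p2) for _, p1, p2 in EVENTS)      # N(fP1, fP2)
n_f_p1 = Counter((f, p1) for f, p1, _ in EVENTS)       # N(f, fP1)
n_p1 = Counter(p1 for _, p1, _ in EVENTS)              # N(fP1)
n_f_p2 = Counter((f, p2) for f, _, p2 in EVENTS)       # N(f, fP2)
n_p2 = Counter(p2 for _, _, p2 in EVENTS)              # N(fP2)

def lower(f, parent, n_fp, n_p):
    """Add-one smoothed P(f | one parent); stands in for a lower backoff-graph node."""
    return (n_fp[(f, parent)] + 1) / (n_p[parent] + len(VOCAB))

def g_max(f, p1, p2):
    """Assumed g(): the larger of the two single-parent estimates (always positive)."""
    return max(lower(f, p1, n_f_p1, n_p1), lower(f, p2, n_f_p2, n_p2))

def p_bo(f, p1, p2, g=g_max, d=0.8):
    """Generalized backoff probability P_BO(f | p1, p2)."""
    if n3[(f, p1, p2)] > 0:
        return d * n3[(f, p1, p2)] / n_ctx[(p1, p2)]
    seen = [x for x in VOCAB if n3[(x, p1, p2)] > 0]
    unseen = [x for x in VOCAB if n3[(x, p1, p2)] == 0]
    seen_mass = sum(d * n3[(x, p1, p2)] / n_ctx[(p1, p2)] for x in seen)
    alpha = (1.0 - seen_mass) / sum(g(x, p1, p2) for x in unseen)  # backoff weight
    return alpha * g(f, p1, p2)

print(sum(p_bo(f, "x", "y") for f in VOCAB))
print(sum(p_bo(f, "q", "r") for f in VOCAB))
```

The denominator of the backoff weight sums g() over every unseen f in the context, which is the source of the extra computational cost mentioned in the caveat: with standard backoff that sum can be obtained from the lower-order model's seen events, but for an arbitrary g() it has to be computed explicitly. The sketch also ignores the edge case in which every vocabulary item has been seen in the context.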
g() functions
• Standard backoff
$$g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P1})$$
• Max counts
$$g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{Pj^*}), \qquad j^* = \arg\max_j N(f, f_{Pj})$$
• Max normalized counts
$$j^* = \arg\max_j \frac{N(f, f_{Pj})}{N(f_{Pj})}$$
More g() functions
• Max backoff graph node.
$$g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{Pj^*}), \qquad j^* = \arg\max_j P_{BO}(f \mid f_{Pj})$$
Fi | FA1 , FA2 , FA3
Fi | FA1 , FA2 Fi | FA2 , FA3 Fi | FA1 , FA3
Fi | FA1 Fi | FA2 Fi | FA3
Fi
How to choose backoff path?
Four basic strategies
1.Fixed path (based on what seems
reasonable (time))
2.Generalized all-child backoff
3.Constrained multi-child backoff
• Same as before, but choose a subset of
possible paths a-priori
4.Child combination rules
• Combine child node via combination
function (mean, weighted avg., etc.)
Significant Additions to
Stolcke's SRILM, the SRI Language
Modeling Toolkit
• New features added to SRILM including
– Can specify an arbitrary number of graphical-
model based factorized models to train, compute
perplexity, and rescore N-best lists.
– Can specify any (possibly constrained) set of
backoff paths from top to bottom level in BG.
– Different smoothing (e.g., Good-Turing, Kneser-
Ney, etc.) or interpolation methods may be used
at each backoff graph node
– Supports the generalized backoff algorithms with
18 different possible g() functions at each BG
node.
Example with Words, Stems, and
Morphological classes
[Diagram: three parallel streams over time: morphological classes M_{t-3} ... M_t, stems S_{t-3} ... S_t, words W_{t-3} ... W_t]
$$P(w_t \mid s_t, m_t)\; P(s_t \mid m_t, w_{t-1}, w_{t-2})\; P(m_t \mid w_{t-1}, w_{t-2})$$
How to specify a model
Backoff paths:
W_t | S_t, M_t → W_t | S_t → W_t
M_t | W_{t-1}, W_{t-2} → M_t | W_{t-1} → M_t
S_t | M_t, W_{t-1}, W_{t-2} → S_t | M_t, W_{t-1} → S_t | M_t → S_t
Model specification:
    ## word given stem morph
    W : 2 S(0) M(0)
      S0,M0 M0 wbdiscount gtmin 1 interpolate
      S0 S0 wbdiscount gtmin 1
      0 0 wbdiscount gtmin 1

    ## morph given word word
    M : 2 W(-1) W(-2)
      W1,W2 W2 kndiscount gtmin 1 interpolate
      W1 W1 kndiscount gtmin 1 interpolate
      0 0 kndiscount gtmin 1

    ## stem given morph word word
    S : 3 M(0) W(-1) W(-2)
      M0,W1,W2 W2 kndiscount gtmin 1 interpolate
      M0,W1 W1 kndiscount gtmin 1 interpolate
      M0 M0 kndiscount gtmin 1
      0 0 kndiscount gtmin 1
Summary
• Language Models, Backoff, and Graphical
Models
• Factored Language Models (FLMs) as
Graphical Models
• Generalized Graph Backoff algorithm
• New features to SRI Language Model
Toolkit (SRILM)
Coffee Break
Back in 10 minutes
Knowledge-Free Induction
of Arabic Morphology
Patrick Schone
21 August 2002
Why induce Arabic
morphology?
(1) Has not been done before
(2) If it can be done, and if it has value in LM,
it can generalize across languages without
needing an expert
Original Algorithm
(Schone & Jurafsky, '00/'01)
Looking for word inflections on words w/ Fr>9
Use a character tree to find word pairs with
similar beginnings/ endings
Ex: car/cars , car/cares, car/caring
Use Latent Semantic Analysis to induce
semantic vectors for each word,
then compare word-pair semantics
Use frequencies of word stems/rules to improve
the initial semantic estimates
Algorithmic Expansions
IR-Based Minimum Edit Distance
Trie-based approach could be a problem for Arabic:
Templates => $aGlaB: { $aGlaB il$AGil $aGlu $AGil }
Result: 3576 words in CallHome lexicon w/ 50+ relationships!
• Use Minimum Edit Distance to find the relationships (can be weighted)
• Use information-retrieval based approach to facilitate search for MED candidates
    ∙  $  A  G  i  l
∙   0  1  2  3  4  5
$   1  0  1  2  3  4
a   2  1  2  3  4  5
G   3  2  3  2  3  4
l   4  3  4  3  4  3
a   5  4  5  4  5  4
B   6  5  6  5  6  5
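The matrix on this slide can be reproduced with a standard dynamic-programming edit distance in which insertions and deletions cost 1 and a substitution costs 2. The substitution cost is an inference from the numbers in the matrix, not something the slide states, and the function name is illustrative.

```python
def med_matrix(source, target, sub_cost=2):
    """Full minimum-edit-distance table; rows follow source, columns follow target."""
    rows = [list(range(len(target) + 1))]
    for i, s in enumerate(source, 1):
        row = [i]
        for j, t in enumerate(target, 1):
            row.append(min(
                rows[-1][j] + 1,                               # delete s
                row[j - 1] + 1,                                # insert t
                rows[-1][j - 1] + (0 if s == t else sub_cost)  # match or substitute
            ))
        rows.append(row)
    return rows

matrix = med_matrix("$aGlaB", "$AGil")
for label, row in zip("∙$aGlaB", matrix):
    print(label, row)
```

The bottom-right cell is the distance between the two word forms. Weighted variants, as mentioned on the slide, would only change the three cost terms inside `min`.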
Algorithmic Expansions
Agglomerative Clustering Using Rules & Stems
Rule          #Word Pairs w/ Rule
* => il+*     1178
* => *u       635
* => *i       455
*i => *u      377
* => fa+*     375
* => bi+*     366
…
Stem          #Word Pairs w/ Stem
Gayyar        507
xallaS        503
makallim$     468
qaddim        434
itgawwiz      332
tkallim       285
…
Do bottom-up clustering, where weight between
two words is Ct(Rule)*Ct(PairedStem)^(1/2)
Algorithmic Expansions
Updated Transitivity
If X~Y and Y~Z and |X^Y|>2 and X^Y il+NULL)
• Generate only if initial and final n-characters of stem have
been seen before.
Condition                 Number proposed   Coverage   Observed as words
Rule only                 993398            41.3%      0.1%
Rule+1-char stem agree    98864             25.0%      1.1%
Rule+2-char stem agree    35092             14.9%      1.8%
Text Selection for
Conversational Arabic
Feng He
ASR (Arabic Speech Recognition) Team
JHU Workshop
Motivation
• Group goal: Conversational Arabic Speech
Recognition.
• One of the Problems: not enough training
data to build a Language Model – most
available text is in MSA (Modern Standard
Arabic) or a mixture of MSA and
conversational Arabic.
• One Solution: Select from mixed text
segments that are conversational, and use
them in training.
Task: Text Selection
– Use POS-based language models because they have
been shown to better indicate differences in
styles, such as formal vs. conversational.
– Method:
1. Train a POS (part-of-speech) tagger on
available data
2.Train POS-based language models on formal
vs conversational data
3.Tag new data
4.Select segments from new data that are
closest to conversational model by using
scores from POS-based language models.
Data
• For building the Tagger and Language Models
– Arabic Treebank: 130K words of hand-
tagged Newspaper text in MSA.
– Arabic CallHome: 150K words of
transcribed phone conversations. Tags are
only in the Lexicon.
• For Text Selection
– Al Jazeera: 9M words of transcribed TV
broadcasts. We want to select segments
that are closer to conversational Arabic,
such as talk-shows and interviews.
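Implementation
• Model (bigram): each tag t_i depends on the previous tag t_{i-1} and emits the word w_i
$$T^* = \arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}$$
$$P(W \mid T)\, P(T) = \prod_i P(w_i \mid t_i)\, P(t_i \mid t_{0:i-1}) \approx \prod_i P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$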
About unknown words:
• These are words that are not seen in training
data, but appear in test data.
• Assume unknown words behave like
singletons (words that appear only once in
training data).
• This is done by duplicating training data with
singletons replaced by special token. Then
train tagger on both the original and
duplicate.
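The singleton trick can be sketched as a small preprocessing step: count word frequencies, make a copy of the training data in which every word seen exactly once is replaced by a special token, and train on the original plus the copy. The token name `<unk>` and the data layout (lists of (word, tag) pairs) are assumptions for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the special unknown-word token

def add_unknown_copy(tagged_sentences):
    """Return the original data plus a duplicate with singletons replaced by UNK."""
    counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    duplicate = [
        [(UNK if counts[word] == 1 else word, tag) for word, tag in sent]
        for sent in tagged_sentences
    ]
    return tagged_sentences + duplicate

train = [[("the", "DT"), ("dog", "NN")], [("the", "DT"), ("cat", "NN")]]
augmented = add_unknown_copy(train)
print(augmented[2])
```

At tagging time, any word absent from the training vocabulary would be mapped to the same token, so its emission probabilities are those learned from the singletons.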
Tools:
GMTK (Graphical Model Toolkit)
Algorithms:
Training: EM training – set parameters so
that joint probability of hidden states and
observations is maximized.
Decoding (tagging): Viterbi – find hidden
state sequence that maximizes joint
probability of hidden state and observations.
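A minimal Viterbi decoder for the bigram tagging model, written in plain Python rather than GMTK. The two-tag model and its probabilities are made up for illustration; in the workshop setup these parameters would come from EM or supervised training.

```python
import math

def log_table(table):
    """Convert a nested dict of probabilities to log probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}

def viterbi(words, tags, log_init, log_trans, log_emit):
    """Most likely tag sequence under P(w_i | t_i) P(t_i | t_{i-1})."""
    # best[i][t]: log probability of the best tag sequence for words[:i+1] ending in t
    best = [{t: log_init[t] + log_emit[t].get(words[0], -math.inf) for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans[p][t])
            scores[t] = best[-1][prev] + log_trans[prev][t] + log_emit[t].get(w, -math.inf)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

TAGS = ["N", "V"]
INIT = {"N": math.log(0.8), "V": math.log(0.2)}
TRANS = log_table({"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}})
EMIT = log_table({"N": {"dogs": 0.7, "run": 0.3}, "V": {"dogs": 0.1, "run": 0.9}})

print(viterbi(["dogs", "run"], TAGS, INIT, TRANS, EMIT))
```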
Experiments
Exp 1: Data: first 100K of English Penn Treebank. Trigram
model. Sanity check.
Exp 2: Data: Arabic Treebank. Trigram model.
Exp 3: Data: Arabic Treebank and CallHome. Trigram
model.
The above three experiments all used 10-fold cross
validation, and are unsupervised.
Exp 4: Data: Arabic Treebank. Supervised trigram model.
Exp 5: Data: Arabic Treebank and CallHome. Partially
supervised training using Treebank's tagged data. Test on
portion of Treebank not used in training. Trigram model.
Results
Experiment                               Accuracy   Accuracy of OOV   Baseline
1 – tri, en                              92.7       37.9              79.3 – 95.5
2 – tri, ar, tb                          79.5       19.3              75.9
3 – tri, ar, tb+ch                       74.6       17.6              75.9
4 – tri, ar, tb, sup                     90.9       56.5              90.0
5 – repeat 3 with partial supervision    83.4       43.6              90.0
Building Language Models and Text
Selection
• Use existing scripts to build formal and
conversational language models from
tagged Arabic Treebank and CallHome data.
• Text selection: use log likelihood ratio
$$\mathrm{Score}(S_i) = \log \frac{P(S_i \mid C)^{1/N_i}\, P(C)}{P(S_i \mid F)^{1/N_i}\, P(F)}$$
S_i: the i-th sentence in data set
C: conversational language model
F: formal language model
N_i: length of S_i
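In the log domain the selection score is the length-normalized difference of the two language-model log-likelihoods plus the log prior ratio. A small sketch, assuming the two sentence log-likelihoods have already been computed by the POS-based models and that P(F) = 1 - P(C); the function name and the example numbers are illustrative.

```python
import math

def selection_score(logp_conv, logp_formal, n_tokens, prior_conv=0.5):
    """Length-normalized log likelihood ratio of conversational vs. formal models."""
    normalized = (logp_conv - logp_formal) / n_tokens
    prior_term = math.log(prior_conv) - math.log(1.0 - prior_conv)
    return normalized + prior_term

# Hypothetical sentence of 10 tags: log P(S|C) = -20, log P(S|F) = -30.
print(selection_score(-20.0, -30.0, 10))
```

Sentences would then be ranked by this score and the top-scoring ones kept, up to the desired amount of text. Positive scores indicate that the conversational model explains the tag sequence better than the formal one.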
Score Distribution
[Figure: two histograms of the sentence scores. Vertical axes: percentage and log count; horizontal axes: log likelihood ratio.]
Assessment
• A subset of Al Jazeera equal in size to Arabic
CallHome (150K words) is selected and
added to the training data for the speech recognition
language model.
• No reduction in perplexity.
• Possible reasons: Al Jazeera has no
conversational Arabic, or has only
conversational Arabic of a very different style.
Text Selection Work
Done at BBN
Rich Schwartz
Mohamed Noamany
Daben Liu
Nicolae Duta
Search for Dialect Text
• We have an insufficient amount of CH text for
estimating a LM.
• Can we find additional data?
• Many words are unique to dialect text.
• Searched Internet for 20 common dialect
words.
• Most of the data found were jokes or chat
rooms – very little data.
Search BN Text for Dialect Data
• Search BN text for the same 20 dialect words.
• Found less data than the CH corpus contains
• Each occurrence was typically an isolated
lapse by the speaker into dialect, followed
quickly by a recovery to MSA for the rest of
the sentence.
Combine MSA text with CallHome
• Estimate separate models for MSA text (300M
words) and CH text (150K words).
• Use SRI toolkit to determine single optimal
weight for the combination, using deleted
interpolation (EM)
– Optimal weight for MSA text was 0.03
• Insignificant reduction in perplexity and WER
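The weight estimation can be illustrated with a short EM loop. This is a sketch of the general technique, not the SRI toolkit's implementation; the function name and the list-based inputs (per-word probabilities from each model on held-out text) are assumptions.

```python
def em_interpolation_weight(p_a, p_b, weight=0.5, iterations=50):
    """Estimate lambda in  p(w) = lambda * p_a(w) + (1 - lambda) * p_b(w).
    p_a[i] and p_b[i] are the two models' probabilities of the i-th
    held-out word given its history."""
    for _ in range(iterations):
        # E-step: posterior probability that model A generated each word.
        posteriors = [weight * a / (weight * a + (1 - weight) * b)
                      for a, b in zip(p_a, p_b)]
        # M-step: the new weight is the average posterior.
        weight = sum(posteriors) / len(posteriors)
    return weight


# Toy held-out set where model B (think: CH) fits far better than model A (MSA).
msa_weight = em_interpolation_weight([0.001] * 4, [0.1] * 4)
```

When one model explains the held-out text much better, the weight of the other collapses toward zero, which is the pattern behind the 0.03 weight reported above.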
Classes from BN
Hypothesis:
• Even if MSA ngrams are different, perhaps the
classes are the same.
Experiment:
• Determine classes (using SRI toolkit) from BN+CH
data.
• Use CH data to estimate ngrams of classes and / or
p(w | class)
• Combine resulting model with CH word trigram
Result:
• No gain
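For reference, a class-based trigram is commonly factored as below. This is the standard formulation with each word assigned to a single class, stated here as an assumption about the setup; $c_i$ denotes the class of word $w_i$.

```latex
p(w_i \mid w_{i-2}, w_{i-1}) \;\approx\; p(c_i \mid c_{i-2}, c_{i-1})\; p(w_i \mid c_i)
```

Under this factorization the classes come from the larger BN+CH data, while both factors on the right can be estimated from CH alone.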
Hypothesis Test Constrained Back-Off
Hypothesis:
• In combining BN and CH, if a probability is different,
could be for 2 reasons:
– CH has insufficient training
– BN and CH truly have different probabilities (likely)
Algorithm:
• Interpolate BN and CH, but limit the probability
change to be as much as would be likely due to
insufficient training.
• Ngram count cannot change by more than its sqrt
Result:
• No gain
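One way to read the square-root constraint is sketched below. The blending rule, the parameter `bn_weight`, and the assumption that the BN count has already been scaled to the CallHome corpus size are all illustrative; the slides do not specify these details.

```python
import math


def constrained_count(ch_count, bn_scaled_count, bn_weight=0.5):
    """Blend a CallHome n-gram count with a scaled BN count, then clip the
    result so it stays within sqrt(ch_count) of the CallHome count."""
    blended = (1 - bn_weight) * ch_count + bn_weight * bn_scaled_count
    slack = math.sqrt(ch_count)  # change plausibly due to sampling noise
    return min(max(blended, ch_count - slack), ch_count + slack)


clipped = constrained_count(16, 40)    # blend is 28, clipped to 16 + 4
unclipped = constrained_count(16, 18)  # blend is 17, inside the allowed band
```

Under this reading, an n-gram never seen in CallHome has zero slack, so BN evidence alone cannot introduce it.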
Learning & Using
Factored Language
Models
Gang Ji
Speech, Signal, and Language Interpretation
University of Washington
August 21, 2002
Outline
• Factored Language Models (FLMs)
overview
• Part I: automatically finding FLM structure
• Part II: first-pass decoding in ASR with
FLMs using graphical models
Factored Language Models
• Along with words, consider factors as
components of the language model
• Factors can be words, stems, morphs, patterns,
roots, which might contain complementary
information about language
• FLMs also provide new possibilities for
designing LMs (e.g., multiple back-off paths)
• Problem: We don't know the best model,
and the search space is huge!
Factored Language Models
• How to learn FLMs
– Solution 1: do it by hand using expert
linguistic knowledge
– Solution 2: data driven; let the data
help to decide the model
– Solution 3: combine both linguistic
and data driven techniques
Factored Language Models
• A Proposed Solution:
– Learn FLMs using evolution-inspired search
algorithm
• Idea: Survival of the fittest
– A collection (generation) of models
– In each generation, only good ones survive
– The survivors produce the next generation
Evolution-Inspired Search
• Selection: choose the good LMs
• Combination: retain useful characteristics
• Mutation: some small change in the next generation
Evolution-Inspired Search
• Advantages
– Can quickly find a good model
– Retain goodness of the previous generation while
covering significant portion of the search space
– Can run in parallel
• How to judge the quality of each model?
– Perplexity on a development set
– WER after rescoring on a development set
– Complexity-penalized perplexity
Evolution-Inspired Search
• Three steps form new models.
– Selection (based on perplexity, etc)
• E.g. Stochastic universal sampling: models are
selected in proportion to their “fitness”
– Combination
– Mutation
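Stochastic universal sampling, mentioned above as one selection scheme, can be sketched as follows. The function name and the idea of passing in a fitness list are assumptions; since lower perplexity is better, a fitness such as inverse perplexity would have to be computed first.

```python
import random


def stochastic_universal_sampling(fitness, n_select, rng=random):
    """Pick n_select indices with probability proportional to fitness,
    using a single random offset and equally spaced pointers."""
    total = sum(fitness)
    step = total / n_select
    start = rng.uniform(0, step)
    chosen, cumulative, i = [], fitness[0], 0
    for k in range(n_select):
        pointer = start + k * step
        # Advance to the model whose fitness interval contains the pointer.
        while cumulative < pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen


survivors = stochastic_universal_sampling([1.0, 0.0, 3.0], 4, random.Random(0))
```

Compared with drawing each survivor independently, the evenly spaced pointers keep the number of copies of each model close to its expected share.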
Moving from One Generation to Next
• Combination Strategies
– Inherit structures horizontally
– Inherit structures vertically
– Random selection
• Mutation
– Add/remove edges randomly
– Change back-off/smoothing strategies
Combination according to Frames
[Figure: model structures over factors F1, F2, F3 at time frames t−2, t−1, t, combined frame by frame into a new structure.]
Combination according to Factors
[Figure: model structures over factors F1, F2, F3 at time frames t−2, t−1, t, combined factor by factor into a new structure.]
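The two combination schemes and the edge mutation can be sketched on a simple encoding. Everything here is an assumption made for illustration: a model is a grid of booleans `model[factor][lag]` saying whether that factor at that time lag is used as a conditioning variable, and all function names are hypothetical.

```python
import random


def combine_by_factor(parent_a, parent_b, rng=random):
    """Child takes each factor's row (all time lags) from one parent."""
    return [list(rng.choice((row_a, row_b)))
            for row_a, row_b in zip(parent_a, parent_b)]


def combine_by_frame(parent_a, parent_b, rng=random):
    """Child takes each time lag's column (all factors) from one parent."""
    n_lags = len(parent_a[0])
    source = [rng.choice((parent_a, parent_b)) for _ in range(n_lags)]
    return [[source[lag][f][lag] for lag in range(n_lags)]
            for f in range(len(parent_a))]


def mutate(model, rate=0.1, rng=random):
    """Flip each edge (add or remove it) with probability `rate`."""
    return [[(not e) if rng.random() < rate else e for e in row]
            for row in model]


dense = [[True, True] for _ in range(3)]     # 3 factors, 2 time lags
empty = [[False, False] for _ in range(3)]
child_factor = combine_by_factor(dense, empty, random.Random(1))
child_frame = combine_by_frame(dense, empty, random.Random(1))
```

Back-off and smoothing choices, which the mutation step may also change, are left out of this encoding.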
Outline
• Factored Language Models (FLMs)
overview
• Part I: automatically finding FLM structure
• Part II: first-pass decoding with FLMs
Problem
• May be difficult to improve WER just by
rescoring n-best lists
• More gains can be expected from using better
models in first-pass decoding
• Solution:
1. do first-pass decoding using FLMs
2. Since FLMs can be viewed as graphical models, use
GMTK (most existing tools don't support general
graph-based models)
3. To speed up inference, use generalized graphical-
model-based lattices.
FLMs as Graphical Models
[Figure: graphical model with factor nodes F1, F2, F3 and a Word node, connected to the graph for the acoustic model.]
FLMs as Graphical Models
• Problem: decoding can be expensive!
• Solution: multi-pass graphical lattice
refinement
– In first-pass, generate graphical lattices using
a simple model (i.e., more independencies)
– Rescore the lattices using a more complicated
model (fewer independencies) but on much
smaller search space
Example: Lattices in a Markov Chain
[Figure: example lattice over the states of a Markov chain.]
This is the same as a word-based lattice
Lattices in General Graphs
[Figure: example lattices over the variables of a general graph.]
Research Plan
• Data
– Arabic CallHome data
• Tools
– Tools for evolution-inspired search
• mostly developed during the workshop
– Training/Rescoring FLMs
• Modified SRI LM toolkit: developed during this
workshop
– Multi-pass decoding
• Graphical models toolkit (GMTK): developed in the last
workshop
Summary
• Factored Language Models (FLMs)
overview
• Part I: automatically finding FLM structure
• Part II: first-pass decoding of FLMs using
GMTK and graphical lattices
Minimum Divergence Adaptation
of a MSA-Based Language Model
to Egyptian Arabic
A proposal by
Sourin Das
JHU Workshop Final Presentation
August 21, 2002
Motivation for LM Adaptation
• Transcripts of spoken Arabic are expensive to obtain;
MSA text is relatively inexpensive (AFP newswire,
ELRA Arabic data, Al Jazeera, …)
– MSA text ought to help; after all it is Arabic
• However there are considerable dialectal differences
– Inferences drawn from Callhome knowledge or data ought
to overrule those from MSA whenever the inferences drawn
from them disagree: e.g. estimates of N-gram probabilities
– Cannot interpolate models or merge data naïvely
– Need to instead fall back to MSA knowledge only when the
Callhome model or data is “agnostic” about an inference
Motivation for LM Adaptation
• The minimum K-L divergence framework provides a
mechanism to achieve this effect
– First estimate a language model Q* from MSA text only
– Then find a model P* which matches all major Callhome
statistics and is close to Q*.
• Anecdotal evidence: MDI methods successfully used
to adapt models based on NABN text to SWBD: a 2%
WER reduction in LM95 from a 50% baseline WER.
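The two-step recipe above corresponds to the standard minimum discrimination information problem, written out here as background. The symbols $\phi_m$ (the constrained Callhome features), $\theta_m$ (their weights) and $\tilde{E}_{\mathrm{CH}}$ (empirical Callhome expectations) are notation assumed for this sketch.

```latex
P^* = \arg\min_{P} D(P \,\|\, Q^*)
\quad \text{subject to} \quad
E_P[\phi_m] = \tilde{E}_{\mathrm{CH}}[\phi_m] \ \ \text{for all } m,
\qquad
D(P \,\|\, Q^*) = \sum_x P(x) \log \frac{P(x)}{Q^*(x)}

P^*(x) = \frac{Q^*(x)\, \exp\big[\sum_m \theta_m \phi_m(x)\big]}
              {\sum_{x'} Q^*(x')\, \exp\big[\sum_m \theta_m \phi_m(x')\big]}
```

Wherever no Callhome constraint is active, $P^*$ keeps the shape of $Q^*$, which is the fall-back behaviour the motivation asks for.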
An Information Geometric View
[Figure: the space of all language models, containing the uniform distribution, the set of models satisfying Callhome marginals, and the set of models satisfying MSA-text marginals. Marked points: MaxEnt Callhome LM, MaxEnt MSA-text LM, Min Divergence Callhome LM.]
A Parametric View of MaxEnt Models
• The MSA-text based MaxEnt LM is the ML estimate
among exponential models of the form
$$Q(x) = Z^{-1}(\lambda,\mu)\,\exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x)\Big]$$
• The Callhome based MaxEnt LM is the ML estimate
among exponential models of the form
$$P(x) = Z^{-1}(\mu,\nu)\,\exp\Big[\sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big]$$
• Think of the Callhome LM as being from the family
$$P(x) = Z^{-1}(\lambda,\mu,\nu)\,\exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big]$$
where we set $\lambda = 0$ based on the MaxEnt principle.
• One could also be agnostic about the values of the $\lambda_i$'s,
since no examples with $f_i(x) > 0$ are seen in Callhome
– Features (e.g. N-grams) from MSA-text which are not seen
in Callhome always have $f_i(x) = 0$ in Callhome training data
A Pictorial “Interpretation” of the
Minimum Divergence Model
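[Figure: nested sets of exponential models, labelled as follows.]
The ML model for MSA text:
$Q^*(x) = Z^{-1}(\lambda^*,\mu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^* g_j(x)\big]$
Subset of all exponential models with $\lambda = \lambda^*$:
$P(x) = Z^{-1}(\lambda^*,\mu,\nu)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\big]$
The ML model for Callhome, with $\lambda = \lambda^*$ instead of $\lambda = 0$:
$P^*(x) = Z^{-1}(\lambda^*,\mu^{**},\nu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^{**} g_j(x) + \sum_k \nu_k^* h_k(x)\big]$
Subset of all exponential models with $\nu = 0$:
$Q(x) = Z^{-1}(\lambda,\mu)\,\exp\big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x)\big]$
All exponential models of the form
$P(x) = Z^{-1}(\lambda,\mu,\nu)\,\exp\big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\big]$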
Details of Proposed Research (1):
A Factored LM for MSA text
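• Notation: $W$ = romanized word, $\omega$ = script, $S$ = stem, $R$ = root, $M$ = tag
$$Q(\omega_i \mid \omega_{i-1},\omega_{i-2}) = Q(\omega_i \mid \omega_{i-1},\omega_{i-2},S_{i-1},S_{i-2},M_{i-1},M_{i-2},R_{i-1},R_{i-2})$$
• Examine all $\binom{8}{2} = 28$ trigram "templates" of two variables from the
history with $\omega_i$.
– Set observations w/counts above a threshold as features
• Examine all $\binom{8}{1} = 8$ bigram "templates" of one variable from the
history with $\omega_i$.
– Set observations w/counts above a threshold as features
• Build a MaxEnt model (use Jun Wu's toolkit)
$$Q(\omega_i \mid \omega_{i-1},\omega_{i-2}) = Z^{-1}(\lambda,\mu)\,\exp\big[\lambda_1 f_1(\omega_i,\omega_{i-1},S_{i-2}) + \lambda_2 f_2(\omega_i,M_{i-1},M_{i-2}) + \dots + \lambda_i f_i(\omega_i,\omega_{i-1}) + \dots + \mu_j g_j(\omega_i,R_{i-1}) + \dots + \mu_J g_J(\omega_i)\big]$$
• Build the romanized language model
$$Q(W_i \mid W_{i-1},W_{i-2}) = U(W_i \mid \omega_i)\,Q(\omega_i \mid \omega_{i-1},\omega_{i-2})$$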
A Pictorial “Interpretation” of the
Minimum Divergence Model
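[Figure: nested sets of exponential models, labelled as follows.]
The ML model for MSA text:
$Q^*(x) = Z^{-1}(\lambda^*,\mu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^* g_j(x)\big]$
The ML model for Callhome, with $\lambda = \lambda^*$ instead of $\lambda = 0$:
$P^*(x) = Z^{-1}(\lambda^*,\mu^{**},\nu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^{**} g_j(x) + \sum_k \nu_k^* h_k(x)\big]$
All exponential models of the form
$P(x) = Z^{-1}(\lambda,\mu,\nu)\,\exp\big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\big]$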
Details of Proposed Research (2):
Additional Factors in Callhome LM
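$$P(W_i \mid W_{i-1},W_{i-2}) = P(W_i,\omega_i \mid W_{i-1},W_{i-2},\omega_{i-1},\omega_{i-2},S_{i-1},S_{i-2},M_{i-1},M_{i-2},R_{i-1},R_{i-2})$$
• Examine all $\binom{10}{2} = 45$ trigram "templates" of two variables from the
history with $W$ or $\omega$.
– Set observations w/counts above a threshold as features
• Examine all $\binom{10}{1} = 10$ bigram "templates" of one variable from the
history with $W$ or $\omega$.
– Set observations w/counts above a threshold as features
• Compute a Min Divergence model of the form
$$P(W_i \mid W_{i-1},W_{i-2}) = Z^{-1}(\lambda,\mu,\nu)\,\exp\big[\lambda_1 f_1(\omega_i,\omega_{i-1},S_{i-2}) + \lambda_2 f_2(\omega_i,M_{i-1},M_{i-2}) + \dots + \lambda_i f_i(\omega_i,\omega_{i-1}) + \dots + \mu_j g_j(\omega_i,R_{i-1}) + \dots + \mu_J g_J(\omega_i)\big]$$
$$\times\ \exp\big[\nu_1 h_1(W_i,W_{i-1},S_{i-2}) + \nu_2 h_2(\omega_i,W_{i-1},S_{i-2}) + \dots + \nu_k h_k(\omega_i,\omega_{i-1}) + \dots + \nu_K h_K(W_i)\big]$$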
Research Plan and Conclusion
• Use baseline Callhome results from WS02
– Investigate treating romanized forms of a script
form as alternate pronunciations
• Build the MSA-text MaxEnt model
– Feature selection is not critical; use high cutoffs
• Choose features for the Callhome model
• Build and test the minimum divergence model
– Plug in induced structure
– Experiment with subsets of MSA text
A Pictorial “Interpretation” of the
Minimum Divergence Model
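[Figure: nested sets of exponential models, labelled as follows.]
The ML model for MSA text:
$Q^*(x) = Z^{-1}(\lambda^*,\mu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^* g_j(x)\big]$
The ML model for Callhome, with $\lambda = \lambda^*$ instead of $\lambda = 0$:
$P^*(x) = Z^{-1}(\lambda^*,\mu^{**},\nu^*)\,\exp\big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^{**} g_j(x) + \sum_k \nu_k^* h_k(x)\big]$
All exponential models of the form
$P(x) = Z^{-1}(\lambda,\mu,\nu)\,\exp\big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\big]$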