Novel Speech Recognition Models for Arabic

The Arabic Speech Recognition Team

JHU Workshop Final Presentations

August 21, 2002

Arabic ASR Workshop Team

Senior Participants:
Katrin Kirchhoff, UW
Jeff Bilmes, UW
John Henderson, MITRE
Mohamed Noamany, BBN
Pat Schone, DoD
Rich Schwartz, BBN

Graduate Students:
Sourin Das, JHU
Gang Ji, UW

Undergraduate Students:
Melissa Egan, Pomona College
Feng He, Swarthmore College

Affiliates:
Dimitra Vergyri, SRI
Daben Liu, BBN
Nicolae Duta, BBN
Ivan Bulyko, UW
Mari Ostendorf, UW

“Arabic”

• Dialects (Gulf Arabic, Egyptian Arabic, Levantine Arabic, North-African Arabic): used for informal conversation
• Modern Standard Arabic (MSA): cross-regional standard, used for formal communication

Arabic ASR: Previous Work

• dictation: IBM ViaVoice for Arabic

• Broadcast News: BBN TIDESOnTap

• conversational speech: 1996/1997 NIST

CallHome Evaluations



• little work compared to other languages

• few standardized ASR resources

Arabic ASR: State of the Art

(before WS02)



• BBN TIDESOnTap: 15.3% WER

• BBN CallHome system: 55.8% WER

• WER on conversational speech noticeably higher than for other languages
(e.g., 30% WER for English CallHome)

→ focus on recognition of conversational Arabic

Problems for Arabic ASR

• language-external problems:

– data sparsity, only 1 (!) standardized corpus of

conversational Arabic available

• language-internal problems:

– complex morphology, large number of possible

word forms

(similar to Russian, German, Turkish,…)

– differences between written and spoken

representation: lack of short vowels and other

pronunciation information

(similar to Hebrew, Farsi, Urdu, Pashto,…)

Corpus: LDC ECA CallHome

• phone conversations between family members/friends

• Egyptian Colloquial Arabic (Cairene dialect)

• high degree of disfluencies (9%), out-of-vocabulary words

(9.6%), foreign words (1.6%)

• noisy channels

• training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20

calls (1.5 hrs)

• very small amount of data for language modeling (150K words)!

MSA - ECA differences

• Phonology:
– /th/ → /s/ or /t/: thalatha - talata ('three')
– /dh/ → /z/ or /d/: dhahab - dahab ('gold')
– /zh/ → /g/: zhadeed - gideed ('new')
– /ay/ → /e:/: Sayf - Seef ('summer')
– /aw/ → /o:/: lawn - loon ('color')

• Morphology:
– inflections: yatakallamu - yitkallim ('he speaks')

• Vocabulary:
– different terms: TAwila - tarabeeza ('table')

• Syntax:
– word order differences: SVO - VSO

Workshop Goals

Improvements to Arabic ASR through

• developing novel models to better exploit available data
• developing techniques for using out-of-corpus data

Resulting research strands:
• Factored language modeling
• Automatic romanization
• Integration of MSA text data

Factored Language Models



• complex morphological structure leads to

large number of possible word forms

• break up word into separate components

• build statistical n-gram models over individual

morphological components rather than

complete word forms

Automatic Romanization

• Arabic script lacks short vowels and other

pronunciation markers

• comparable English example

th fsh stcks f th nrth tlntc hv bn dpltd

the fish stocks of the north atlantic have been depleted



• lack of vowels results in lexical ambiguity;

affects acoustic and language model training

• try to predict vowelization automatically from

data and use result for recognizer training

Out-of-corpus text data

• no corpora of transcribed conversational

speech available

• large amounts of written (Modern Standard

Arabic) data available (e.g. Newspaper text)

• Can MSA text data be used to improve

language modeling for conversational

speech?

• Try to integrate data from newspapers,

transcribed TV broadcasts, etc.

Recognition Infrastructure

• baseline system: BBN recognition system

• N-best list rescoring

• Language model training: SRI LM toolkit with

significant additions implemented during this

workshop

• Note: no work on acoustic modeling, speaker

adaptation, noise robustness, etc.

• two different recognition approaches:

grapheme-based vs. phoneme-based

Summary of Results (WER)

Grapheme-based recognizer:
• Baseline: 59.0%
• Automatic romanization: 57.9%
• True romanization: 54.9%

Phone-based recognizer:
• Random: 62.7%
• Baseline: 55.8%
• Additional CallHome data: 55.1%
• Language modeling: 53.8%
• Oracle: 46%

Novel research

• new strategies for language modeling based on

morphological features

• new graph-based backoff schemes allowing wider

range of smoothing techniques in language modeling

• new techniques for automatic vowel insertion

• first investigation of use of automatically vowelized

data for ASR

• first attempt at using MSA data for language

modeling for conversational Arabic

• morphology induction for Arabic

Key Insights

• Automatic romanization improves grapheme-

based Arabic recognition systems

• trend: morphological information helps in

language modeling

• needs to be confirmed on larger data set

• Using MSA text data does not help

• We need more data!

Resources



• significant add-on to SRILM toolkit for general

factored language modeling

• techniques/software for automatic romanization

of Arabic script

• part-of-speech tagger for MSA & tagged text

Outline of Presentations

• 1:30 - 1:45: Introduction (Katrin Kirchhoff)

• 1:45 - 1:55: Baseline system (Rich Schwartz)

• 1:55 - 2:20: Automatic romanization (John Henderson,

Melissa Egan)

• 2:20 - 2:35: Language modeling - overview (Katrin Kirchhoff)

• 2:35 - 2:50: Factored language modeling (Jeff Bilmes)

• 2:50 - 3:05: Coffee Break

• 3:05 - 3:10: Automatic morphology learning (Pat Schone)

• 3:15 - 3:30: Text selection (Feng He)

• 3:30 - 4:00: Graduate student proposals (Gang Ji, Sourin Das)

• 4:00 - 4:30: Discussion and Questions

Thank you!

• Fred Jelinek, Sanjeev Khudanpur, Laura Graham

• Jacob Laderman + assistants

• Workshop sponsors

• Mark Liberman, Chris Cieri, Tim Buckwalter

• Kareem Darwish, Kathleen Egan

• Bill Belfield & colleagues from BBN

• Apptek

BBN Baseline System

for Arabic

Richard Schwartz, Mohamed Noamany,

Daben Liu, Bill Belfield, Nicolae Duta

JHU Workshop

August 21, 2002

BBN BYBLOS System

• Rough'n'Ready / OnTAP / OASIS system

• Version of BYBLOS optimized for Broadcast News

• OASIS system fielded in Bangkok and Amman

• Real-Time operation with 1-minute

delay

• 10%-20% WER, depending on data

BYBLOS Configuration



• 3-passes of recognition

– Forward Fast-match uses PTM models and

approximate bigram search

– Backward pass uses SCTM models and

approximate trigram search, creates N-best.

– Rescoring pass uses cross-word SCTM models and

trigram LM

• All runs in real time

– Minimal difference from running slowly

Use for Arabic Broadcast News



• Transcriptions are in normal Arabic script,

omitting short vowels and other diacritics.

• We used each Arabic letter as if it were a

phoneme.

• This allowed addition of large text corpora for

language modeling.

Initial BN Baseline



• 37.5 hours of acoustic training

• Acoustic training data (230K words) used for

LM training

• 64K-word vocabulary (4% OOV)



• Initial word error rate (WER) = 31.2%

Speech Recognition Performance





System (all real-time results)                WER (%)
Baseline                                      31.2
+ 145M word LM (Al Hayat)                     26.6
+ System Improvements (MLLR and tuning)       21.0
+ 128k Lexicon (OOV reduced to 2%)            20.4
+ Additional 20 hours acoustic data           19.1
+ 290M word LM + improved lexicon             17.3
+ New scoring (remove hamza from alif)        15.3

Call Home Experiments



• Modified OnTAP system to make it more

appropriate for Call Home data.

• Added features from LVCSR research to

OnTAP system for Call Home data.

• Experiments:

– Acoustic training: 80 conversations (15 hours)

• Transcribed with diacritics

– Acoustic training data (150K words) used for LM

– Real-time

Using OnTAP system for Call Home









System                                              WER (%)
Baseline for OASIS                                  64.1
+ Bypass BN segmenter                               63.4
+ Cepstral Mean Subtraction on conversations        62.4
+ Incremental MLLR on whole conversation            61.8
+ 1-level CMS (instead of 2)                        60.8

Additions from LVCSR





System                                              WER (%)
Baseline for OASIS                                  60.8
+ VTL on training and decoding (unoptimized)        59.0
+ LPC Smoothing with 40 poles                       58.7
+ 'split-init training'                             58.1
+ HLDA (not used for workshop)                      56.6
+ Modified backoff (not used for workshop)          56.0

Output Provided for Workshop

• OASIS was run on various sets of training as needed

• Systems were run either for Arabic script phonemes
or 'Romanized' phonemes – with diacritics.

• In addition to workshop participants, others at BBN

provided assistance and worked on workshop

problems.

• Output provided for workshop was N-best sentences

– with separate scores for HMM, LM, #words, #phones,

#silences

– Due to high error rate (56%), the oracle error rate for 100

N-best was about 46%.

• Unigram lattices were also provided, with oracle error

rate of 15%

Phoneme HMM Topology Experiment



• The phoneme HMM topology was increased

for the Arabic script system from 5 states to

10 states in order to accommodate a

consonant and possible vowel.

• The gain was small (0.3% WER)

OOV Problem



• OOV Rate is 10%

– 50% is morphological variants of words in the

training set

– 10% is Proper names

– 40% is other unobserved words

• Tried adding words from BN and from

morphological transducer

– Added too many words with too small gain

Use BN to Reduce OOV

• Can we add words from BN to reduce OOV?

• BN text contains 1.8M distinct words.

• Adding entire 1.8M words reduces OOV from

10% to 3.9%.

• Adding top 15K words reduces OOV to 8.9%

• Adding top 25K words reduces OOV to 8.4%.

Use Morphological Transducer



• Use LDC Arabic transducer to expand verbs to

all forms

– Produces > 1M words

• Reduces OOV to 7%

Language Modeling Experiments



Described in other talks

• Searched for available dialect transcriptions

• Combine BN (300M words) with CH (230K)

• Use BN to define word classes

• Constrained back-off for BN+CH

Autoromanization of

Arabic Script

Melissa Egan and John Henderson

Autoromanization (AR) goal

• Expand Arabic script representation to include short

vowels and other pronunciation information.



• Phenomena not typically marked in non-diacritized script

include:

– Short vowels {a, i, u}

– Repeated consonants (shadda)

– Extra phonemes for Egyptian Arabic {f/v,j/g}

– Grammatical marker that adds an 'n' to the pronunciation (tanween)



• Example

Non-diacritized form: ktb – write

Expansions: kitab – book

aktib – I write

kataba – he wrote

kattaba – he caused to write

AR motivation

• Romanized text can be used to produce better output

from an ASR system.

– Acoustic models will be able to better disambiguate based

on extra information in text.

– Conditioning events in LM will contain more information.





• Romanized ASR output can be converted to script for

alternative WER measurement.



• Eval96 results (BBN recognizer, 80 conv. train)

– script recognizer: 61.1 WERG (grapheme)

– romanized recognizer: 55.8 WERR (roman)

AR data



CallHome Arabic from LDC

Conversational speech transcripts (ECA) in both

script and a roman specification that includes

short vowels, repeats, etc.



set                      conversations   words
Romanizer testing:
  asrtrain               80              135K
  dev                    20              35K
  eval96 (asrtest)       20              15K
Romanizer training:
  eval97                 20              18K
  h5_new                 20              18K

Data format

• Script without and with diacritics









• CallHome in script and roman forms



Script: AlHmd_llh kwIsB w AntI AzIk

Roman: ilHamdulillA kuwayyisaB~ wi inti izzayyik

our task

Autoromanization (AR) WER baseline

• Train on 32K words in eval97+h5_new

• Test on 137K words in ASR_train+h5_new



Status in train   Portion of test   Error in test   % of total error
unambig.          68.0%             1.8%            6.2%
ambig.            15.5%             13.9%           10.8%
unknown           16.5%             99.8%           83.0%
total             100%              19.9%           100.0%



Biggest potential error reduction would come from predicting

romanized forms for unknown words.
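
The last two columns of the baseline table follow from the first two. The sketch below redoes that arithmetic in Python from the slide's numbers; the variable names are illustrative, and small differences from the slide (for example 19.8 versus 19.9) presumably come from rounding in the published inputs.

```python
# (share of test tokens in %, error rate on those tokens in %), copied from the slide
rows = {
    "unambiguous": (68.0, 1.8),
    "ambiguous": (15.5, 13.9),
    "unknown": (16.5, 99.8),
}

# Contribution of each class to the overall error rate, in percentage points.
contrib = {name: share * err / 100.0 for name, (share, err) in rows.items()}
total_error = sum(contrib.values())

# Share of the total error attributable to each class, in percent.
share_of_error = {name: 100.0 * c / total_error for name, c in contrib.items()}

print(f"total error: {total_error:.1f}%")
for name in rows:
    print(f"{name}: {share_of_error[name]:.1f}% of all errors")
```

Unknown words make up about a sixth of the test tokens but account for more than four fifths of the errors, which is the motivation for the methods that follow.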

AR “knitting” example

1. Find close known word
   unknown: tbqwA
   known: ybqwA

2. Record ops required to make roman from known
   known: y bqwA
   kn.roman: yibqu
   ops: ciccrd

3. Construct new roman using same ops
   unknown: t bqwA
   ops: ciccrd
   new roman: tibqu
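
The three knitting steps can be sketched in Python as follows. This is a minimal illustration, not the workshop implementation: it reads the op codes as copy (c), insert (i), replace (r) and delete (d), derives them with a plain unit-cost edit-distance alignment, and assumes the unknown word has the same length as the known script form. The function names `align` and `knit` are hypothetical. Tie-breaking in the alignment may order the final replace and delete differently from the slide, but the romanization produced is the same.

```python
def align(src, tgt):
    """Edit operations turning src into tgt, as (op, target_char) pairs."""
    n, m = len(src), len(tgt)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one minimum-cost path back to the origin
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append(("c" if src[i - 1] == tgt[j - 1] else "r", tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("d", ""))
            i -= 1
        else:
            ops.append(("i", tgt[j - 1]))
            j -= 1
    return ops[::-1]


def knit(unknown, known_script, known_roman):
    """Apply the script-to-roman ops of a known word to an unknown word."""
    if len(unknown) != len(known_script):
        raise ValueError("this sketch assumes equal-length script forms")
    out, pos = [], 0
    for op, ch in align(known_script, known_roman):
        if op == "c":      # copy the unknown word's own character
            out.append(unknown[pos])
            pos += 1
        elif op == "r":    # replace with the character used in the known roman form
            out.append(ch)
            pos += 1
        elif op == "i":    # insert (typically a short vowel)
            out.append(ch)
        else:              # delete the script character
            pos += 1
    return "".join(out)


print(knit("tbqwA", "ybqwA", "yibqu"))
```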

Experiment 1 (best match)



Observed patterns in the known short/long pairs:

Some characters in the short forms are consistently

found with particular, non-identical characters in the

long forms.



Example rule:

A → a

Experiment 2 (rules)

Environments in which 'w' occurs in training dictionary long forms:
Env     Freq
C _ V   149
V _ #   8
# _ V   81
C _ #   5
V _ V   121
V _ C   118

Environments in which 'u' occurs in training dictionary long forms:
Env     Freq
C _ C   1179
C _ #   301
# _ C   29

• Some output forms depend on output context.
• Rule:
– 'u' occurs only between two non-vowels.
– 'w' occurs elsewhere.
• Accurate for 99.7% of the instances of 'u' and 'w' in the training dictionary long forms. A similar rule may be formulated for 'i' and 'y'.
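
As a check on the 99.7% figure, the rule can be applied to the environment counts listed on the slide. The sketch below is illustrative only: environments are written as (left, right) pairs with C for consonant, V for vowel and # for word boundary, and `predict` is a hypothetical name for the rule.

```python
# Environment counts from the slide: (left context, right context) -> frequency
w_env = {("C", "V"): 149, ("V", "#"): 8, ("#", "V"): 81,
         ("C", "#"): 5, ("V", "V"): 121, ("V", "C"): 118}
u_env = {("C", "C"): 1179, ("C", "#"): 301, ("#", "C"): 29}


def predict(env):
    """'u' only between two non-vowels (consonant or boundary), 'w' elsewhere."""
    return "w" if "V" in env else "u"


correct = sum(n for env, n in w_env.items() if predict(env) == "w")
correct += sum(n for env, n in u_env.items() if predict(env) == "u")
total = sum(w_env.values()) + sum(u_env.values())
accuracy = correct / total

print(f"{correct}/{total} = {100 * accuracy:.1f}%")
```

The only exceptions in these counts are the five instances of 'w' in the environment C _ #.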

Experiment 3 (local model)

• Move to more data-driven model

– Found some rules manually.

– Look for all of them, systematically.

• Use best-scoring candidate for replacement

– Environment likelihood score

– Character alignment score





Known long: H a n s A h a

Known short: H A n s A h A

input: H A m D y h A

result: H a m D I h a

Experiment 4 (n-best)

• Instead of generating romanized form using the single

best short form in the dictionary, generate romanized

forms using top n best short forms.

Example (n = 5)

Character error rate (CER)

• Measurement of insertions, deletions, and substitutions in

character strings should more closely track phoneme error

rate.



• More sensitive than WER

– Stronger statistics from same data





• Test set results

– Baseline 49.89 character error rate (CER)

– Best model 24.58 CER

– Oracle 2-best list 17.60 CER suggests more room for gain.
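
A minimal sketch of how a character error rate of this kind can be computed, assuming the usual definition (insertions, deletions and substitutions divided by the number of reference characters). The function names and the toy word pairs are illustrative and not taken from the workshop code.

```python
def edit_distance(ref, hyp):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs, hyps):
    """Character errors summed over all words, divided by reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def word_accuracy(refs, hyps):
    """Fraction of words romanized exactly right."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


refs = ["kitab", "tibqu"]
hyps = ["kitab", "tibqw"]
print(cer(refs, hyps), word_accuracy(refs, hyps))
```

In the toy example a single wrong character costs half of the word accuracy but only a tenth of the characters, which is the sense in which CER gives finer-grained statistics from the same data.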

Summary of performance (dev set)



                                             Accuracy   CER
Baseline                                     8.4%       41.4%
Knitting                                     16.9%      29.5%
Knitting + best match + rules                18.4%      28.6%
Knitting + local model                       19.4%      27.0%
Knitting + local model + n-best (n = 25)     30.0%      23.1%

Varying the number of dictionary matches



[Figure: accuracy and CER plotted against the number of dictionary matches (x-axis 0 to 200; y-axis "performance", 18 to 30).]

ASR scenarios



1) Have a script recognizer, but want to produce romanized form.
→ postprocessing ASR output

2) Have a small amount of romanized data and a large amount of script data available for recognizer training.
→ preprocessing ASR training set

ASR experiments



Preprocessing: Script train → AR → Roman → ASR → Roman result (WERR) → R2S → Script result (WERG)

Postprocessing: Script train → ASR → Script result (WERG) → AR → Roman result (WERR)

Experiment: adding script data

• Script LM training data could be acquired from found text.
• Script transcription is cheaper than roman transcription.
• Simulate a preponderance of script by training AR on a separate set.
• ASR is then trained on output of AR.

[Diagram: AR train (40 conv), ASR train (100 conv), future training set.]

Eval 96 experiments, 80 conv



Config WERR WERG

script baseline N/A 59.8

postprocessing 61.5 59.8

preprocessing 59.9 59.2 (-0.6)

Roman baseline 55.8 55.6 (-4.2)





Bounding experiment

• No overlap between ASR train and AR train.

• Poor pronunciations for “made-up” words.

Eval 96 experiments, 100 conv

Config WERR WERG

script baseline N/A 59.0

postprocessing 60.7 59.0

preprocessing 58.5 57.5 (-1.5)

Roman baseline 55.1 54.9 (-4.1)





More realistic experiment

• 20 conversation overlap between ASR train

and AR train.

• Better pronunciations for “made-up” words.

Remaining challenges

• Correct “dangling tails” in short matches









• Merge unaligned characters

Bigram translation model

\[ r^* = \arg\max_r \sum_{d=(d_s,d_l)} p(s,d)\, p(r \mid s,d) \]

\[ \phantom{r^*} = \arg\max_r \sum_{d=(d_s,d_l)} p(s,d)\, p(r, s, d) \]

\[ p(r, s, d_l) = \prod_i p(r_i \mid r_{i-1})\, p(s_j \mid r_i)\, p(d_{l_k} \mid r_i) \]

input s:         t b q w A         (linked to r by p(s_j | r_i))
output r:        □ t i b q u □     (chained by p(r_i | r_{i-1}))
kn. roman d_l:   y i b q u         (linked to r by p(d_{l_k} | r_i))

Future work

• Context provides information for

disambiguating both known and unknown

words

– Bigrams for unknown words will also be unknown,

use part of speech tags or morphology.

• Acoustics

– Use acoustics to help disambiguate vowels?

– Provide n-best output as alternative

pronunciations for ASR training.

Factored Language

Modeling



Katrin Kirchhoff, Jeff Bilmes, Dimitra Vergyri,

Pat Schone, Gang Ji, Sourin Das

Arabic morphology

• structure of Arabic derived words

Example: fa-sakan-tu
– particle: fa-
– root: s-k-n
– pattern: the vowels a-a interleaved with the root (sakan)
– affix: -tu

LIVE + past + 1st-sg-past + part: “so I lived”

Arabic morphology

• ~5000 roots
• several hundred patterns
• dozens of affixes
→ large number of possible word forms
→ problems training robust language model
→ large number of OOV words

Vocabulary Growth - full word forms



[Figure: vocabulary size (0 to 16,000) as a function of the number of word tokens (10k to 120k) for CallHome English and Arabic.]

Vocabulary Growth - stemmed words

[Figure: vocabulary size (0 to 16,000) as a function of the number of word tokens (10k to 120k) for CallHome EN words, AR words, EN stems, and AR stems.]

Particle model



• Break words into sequences of stems +

affixes:

\[ W \rightarrow \mu_1, \mu_2, \ldots, \mu_M \]

• Approximate probability of word sequence by probability of particle sequence

\[ P(W_1, W_2, \ldots, W_N) \approx \prod_{t=n}^{T} P(\mu_t \mid \mu_{t-1}, \mu_{t-2}, \ldots, \mu_{t-n+1}) \]

Factored Language Model

• Problem: how can we estimate P(Wt|Wt-1,Wt-2,...) ?

• Solution: decompose W into its morphological

components: affixes, stems, roots, patterns

• words can be viewed as bundles of features



Pt-2 Pt-1 Pt   patterns
Rt-2 Rt-1 Rt   roots
At-2 At-1 At   affixes
St-2 St-1 St   stems
Wt-2 Wt-1 Wt   words

Statistical models for factored

representations

• Class-based LM:

\[ P(W_t \mid W_{t-1}, W_{t-2}) \approx P(W_t \mid F_t)\, P(F_t \mid F_{t-1}, F_{t-2}) \]

• Single-stream LM:

\[ P(F_t \mid F_{t-1}, F_{t-2}, \ldots, F_1) \approx P(F_t \mid F_{t-1}, F_{t-2}) \]

Full Factored Language Model



assume \( w_i \equiv (a_i, r_i, \pi_i) \), where
w = word, r = root, \(\pi\) = pattern, a = affixes

\[ P(w_i \mid w_{i-1}, w_{i-2}) = P(a_i, r_i, \pi_i \mid a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \]

\[ = P(a_i \mid r_i, \pi_i, a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \]
\[ \times\, P(r_i \mid \pi_i, a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \]
\[ \times\, P(\pi_i \mid a_{i-1}, r_{i-1}, \pi_{i-1}, a_{i-2}, r_{i-2}, \pi_{i-2}) \]



• Goal: find appropriate conditional independence

statements to simplify this model.

Experimental Infrastructure

• All language models tested using nbest

rescoring

• two baseline word-based LMs:

– B1: BBN LM, WER 55.1%

– B2: WS02 baseline LM, WER 54.8%

• combination of baselines: 54.5%

• new language models were used in

combination with one or both baseline LMs

• log-linear score combination scheme

Log-linear combination

For m information sources, each producing a maximum-likelihood estimate for W:

\[ P(W \mid I) = \frac{1}{Z(I)} \prod_{i=1}^{m} P_i(W \mid I_i)^{k_i} \]

I: total information available
I_i: the i'th information source
k_i: weight for the i'th information source
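
In N-best rescoring the log-linear combination reduces to a weighted sum of log scores, because Z(I) is the same for every competing hypothesis. The sketch below illustrates this with invented scores and weights; `combined_score` and `rescore` are hypothetical names, and the three scores per hypothesis stand in for, say, an acoustic score and two language model scores.

```python
def combined_score(log_scores, weights):
    """log of prod_i P_i(W | I_i)^{k_i}, leaving out the constant log Z(I)."""
    return sum(k * s for k, s in zip(weights, log_scores))


def rescore(nbest, weights):
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda hyp: combined_score(hyp[1], weights))[0]


# Invented N-best entries: (hypothesis, [log score from each information source])
nbest = [
    ("hyp A", [-120.0, -35.0, -33.0]),
    ("hyp B", [-121.0, -31.0, -30.0]),
]

print(rescore(nbest, [1.0, 1.0, 1.0]))  # both LMs fully weighted
print(rescore(nbest, [1.0, 0.1, 0.1]))  # LMs down-weighted
```

Changing the weights changes which hypothesis wins, which is why the weights are worth tuning against WER.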

Discriminative combination

• We optimize the combination weights jointly with the

language model and insertion penalty to directly

minimize WER of the maximum likelihood hypothesis.



• The normalization factor can be ignored since it is the

same for all alternative hypotheses.



• Used the simplex optimization method on the 100-

bests provided by BBN (optimization algorithm

available in the SRILM toolkit).

Word decomposition

• Linguistic decomposition (expert knowledge)



• automatic morphological decomposition: acquire

morphological units from data without using human

knowledge



• assign words to classes based not on characteristics

of word form but based on distributional properties

(Mostly) Linguistic Decomposition

• Stems/morph class: information from LDC CH lexicon:
$atamna   $atam:verb+past-1st-plural   (stem : morph. tag)

• roots: determined by K. Darwish's morphological analyzer for MSA
$atam → $tm

• pattern: determined by subtracting root from stem
$atam → CaCaC
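
One way to read "subtracting root from stem" is to replace each root consonant in the stem, matched left to right, by the placeholder C. The sketch below implements that reading; the greedy matching strategy and the function name `pattern` are assumptions, not the workshop's procedure.

```python
def pattern(stem, root):
    """Replace the root consonants in the stem (matched in order) by 'C'."""
    out, k = [], 0
    for ch in stem:
        if k < len(root) and ch == root[k]:
            out.append("C")  # this character belongs to the root
            k += 1
        else:
            out.append(ch)   # this character belongs to the pattern
    if k != len(root):
        raise ValueError(f"root {root!r} not contained in stem {stem!r}")
    return "".join(out)


print(pattern("$atam", "$tm"))
```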

Automatic Morphology



• Classes defined by morphological components

derived from data

• no expert knowledge

• based on statistics of word forms

• more details in Pat's presentation

Data-driven Classes

• Word clustering based on distributional statistics

• Exchange algorithm (Martin et al., 1998)
– initially assign words to individual clusters
– temporarily move each word to all other clusters, compute change in perplexity (class-based trigram)

– keep assignment that minimizes perplexity

– stop when class assignment no longer changes

• bottom-up clustering (SRI toolkit)

– initially assign words to individual clusters

– successively merge pairs of clusters with highest average

mutual information

– stop at specified number of classes

Results

• Best word error rates obtained with:

– particle model: 54.0% (B1 + particle LM)

– class-based models: 53.9% (B1+Morph+Stem)

– automatic morphology: 54.3% (B1+B2+Rule)

– data-driven classes: 54.1% (B1+SRILM, 200

classes)

• combination of best models: 53.8%

Conclusions

• Overall improvement in WER gained from language

modeling (1.3%) is significant

• individual differences between LMs are not significant

• but: adding morphological class models always helps

language model combination

• morphological models get the highest weights in

combination (in addition to word-based LMs)

• trend needs to be verified on larger data set
→ application to script-based system?

Factored Language Models

and Generalized Graph

Backoff

Jeff Bilmes, Katrin Kirchhoff

University of Washington, Seattle &

JHU-WS02 ASR Team

Outline

• Language Models, Backoff, and Graphical

Models

• Factored Language Models (FLMs) as

Graphical Models

• Generalized Graph Backoff algorithm

• New features to SRI Language Model

Toolkit (SRILM)

Standard Language Modeling

• Example: standard tri-gram



\[ P(w_t \mid h_t) \approx P(w_t \mid w_{t-1}, w_{t-2}, w_{t-3}) \]

[Diagram: word chain W_{t-4}, W_{t-3}, W_{t-2}, W_{t-1}, W_t]

Typical Backoff in LM

• In typical LM, there is one natural (temporal) path to back off along.
• Well motivated since information often decreases with word distance.

Backoff path: W_t | W_{t-1}, W_{t-2}, W_{t-3} → W_t | W_{t-1}, W_{t-2} → W_t | W_{t-1} → W_t

Factored LM: Proposed Approach

• Decompose words into smaller morphological

or class-based units (e.g., morphological

classes, stems, roots, patterns, or other

automatically derived units).

• Produce probabilistic models over these units

to attempt to improve WER.

Example with Words, Stems, and

Morphological classes



M_{t-3}  M_{t-2}  M_{t-1}  M_t
S_{t-3}  S_{t-2}  S_{t-1}  S_t
W_{t-3}  W_{t-2}  W_{t-1}  W_t

\[ P(w_t \mid s_t, m_t)\, P(s_t \mid m_t, w_{t-1}, w_{t-2})\, P(m_t \mid w_{t-1}, w_{t-2}) \]

Example with Words, Stems, and

Morphological classes



M_{t-3}  M_{t-2}  M_{t-1}  M_t
S_{t-3}  S_{t-2}  S_{t-1}  S_t
W_{t-3}  W_{t-2}  W_{t-1}  W_t

\[ P(w_t \mid w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2}) \]

In general



F^3_{t-3}  F^3_{t-2}  F^3_{t-1}  F^3_t
F^2_{t-3}  F^2_{t-2}  F^2_{t-1}  F^2_t
F^1_{t-3}  F^1_{t-2}  F^1_{t-1}  F^1_t

General Factored LM

• A word is equivalent to collection of factors.



\[ \{w_t\} \equiv \{f_t^{1:K}\}, \qquad f^k = \text{the } k\text{th factor} \]

• E.g., if K=3

\[ P(w_t \mid w_{t-1}, w_{t-2}) = P(f_t^1, f_t^2, f_t^3 \mid f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \]

\[ = P(f_t^1 \mid f_t^2, f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \]
\[ \times\, P(f_t^2 \mid f_t^3, f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \]
\[ \times\, P(f_t^3 \mid f_{t-1}^1, f_{t-1}^2, f_{t-1}^3, f_{t-2}^1, f_{t-2}^2, f_{t-2}^3) \]



• Goal: find appropriate conditional independence

statements to simplify this sort of model while

keeping perplexity and WER low. This is the

structure learning problem in graphical models.

The General Case



F^3_{t-3}  F^3_{t-2}  F^3_{t-1}  F^3_t
F^2_{t-3}  F^2_{t-2}  F^2_{t-1}  F^2_t
F^1_{t-3}  F^1_{t-2}  F^1_{t-1}  F^1_t

The General Case









FA1, FA2, FA3 → Fi

The General Case

FA1, FA2, FA3 → Fi

FA1, FA2 → Fi     FA1, FA3 → Fi     FA2, FA3 → Fi

FA1 → Fi     FA2 → Fi     FA3 → Fi

Fi

A Backoff Graph (BG)

Fi | FA1 , FA2 , FA3





Fi | FA1 , FA2 Fi | FA1 , FA3 Fi | FA2 , FA3





Fi | FA1 Fi | FA2 Fi | FA3









Fi
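
The backoff graph for a set of parents can be enumerated mechanically: each node is a subset of the parents, and each edge drops exactly one parent. The sketch below builds the graph for three parents and counts the distinct top-to-bottom backoff paths. It is an illustration of the structure only; `backoff_graph` and `count_paths` are hypothetical names, not SRILM functions.

```python
from itertools import combinations


def backoff_graph(parents):
    """Levels of the graph (largest context first) and the one-parent-dropped edges."""
    levels = [list(combinations(parents, k)) for k in range(len(parents), -1, -1)]
    edges = {
        node: [tuple(p for p in node if p != drop) for drop in node]
        for level in levels
        for node in level
    }
    return levels, edges


def count_paths(node, edges):
    """Number of distinct backoff paths from this node down to the empty context."""
    if not node:
        return 1
    return sum(count_paths(child, edges) for child in edges[node])


levels, edges = backoff_graph(["FA1", "FA2", "FA3"])
for level in levels:
    print(["Fi | " + ", ".join(node) if node else "Fi" for node in level])
print(count_paths(("FA1", "FA2", "FA3"), edges))
```

With three parents there are already six possible fixed paths, which is why the choice of backoff path becomes a modeling decision in its own right.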

Example: 4-gram Word Generalized

Backoff

W_t | W_{t-1}, W_{t-2}, W_{t-3}

W_t | W_{t-1}, W_{t-2}     W_t | W_{t-1}, W_{t-3}     W_t | W_{t-2}, W_{t-3}

W_t | W_{t-1}     W_t | W_{t-2}     W_t | W_{t-3}

W_t

How to choose backoff path?



Four basic strategies

1. Fixed path (based on what seems reasonable, e.g., temporal constraints)
2. Generalized all-child backoff
3. Constrained multi-child backoff
4. Child combination rules

Choosing a fixed back-off path



Fi | FA1 , FA2 , FA3



Fi | FA1 , FA2 Fi | FA2 , FA3 Fi | FA1 , FA3





Fi | FA1 Fi | FA2 Fi | FA3



Fi


Generalized Backoff

\[ P_{BO}(f \mid f_{P1}, f_{P2}) = \begin{cases} d_{N(f, f_{P1}, f_{P2})}\, \dfrac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})} & \text{if } N(f, f_{P1}, f_{P2}) > 0 \\[2ex] \alpha(f_{P1}, f_{P2})\, g(f, f_{P1}, f_{P2}) & \text{otherwise} \end{cases} \]

• In typical backoff, we drop 2nd parent and use conditional probability.

\[ g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P1}) \]

• More generally, g() can be any positive function, but need new algorithm for computing backoff weight (BOW).

Computing BOWs

\[ \alpha(f_{P1}, f_{P2}) = \frac{1 - \sum_{f:\, N(f, f_{P1}, f_{P2}) > 0} d_{N(f, f_{P1}, f_{P2})}\, \dfrac{N(f, f_{P1}, f_{P2})}{N(f_{P1}, f_{P2})}}{\sum_{f:\, N(f, f_{P1}, f_{P2}) = 0} g(f, f_{P1}, f_{P2})} \]





• Many possible choices for g() functions (next

few slides)

• Caveat: certain g() functions can make the

LM much more computationally costly than

standard LMs.
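
The following sketch shows why the backoff weight has to be recomputed for an arbitrary g(): alpha divides the discounted-away probability mass by the sum of g() over unseen values, so the distribution still sums to one whatever positive g() is plugged in. Everything here is a toy under stated assumptions: a constant multiplicative discount d = 0.8 replaces the count-dependent d_N, the g() functions are add-one smoothed single-parent estimates rather than recursive P_BO values, and all names and counts are invented. It is not the SRILM implementation.

```python
from collections import Counter


def make_pbo(counts, vocab, g, d=0.8):
    """Return P_BO(f | p1, p2) for counts keyed by (f, p1, p2) and a positive g()."""
    ctx = Counter()
    for (f, p1, p2), n in counts.items():
        ctx[(p1, p2)] += n

    def p_bo(f, p1, p2):
        n_ctx = ctx[(p1, p2)]
        if n_ctx == 0:  # context never seen: use normalized g() directly
            return g(f, p1, p2) / sum(g(v, p1, p2) for v in vocab)
        if counts[(f, p1, p2)] > 0:  # seen: discounted relative frequency
            return d * counts[(f, p1, p2)] / n_ctx
        seen_mass = sum(d * counts[(v, p1, p2)] / n_ctx for v in vocab)
        unseen_g = sum(g(v, p1, p2) for v in vocab if counts[(v, p1, p2)] == 0)
        alpha = (1.0 - seen_mass) / unseen_g  # backoff weight (BOW)
        return alpha * g(f, p1, p2)

    return p_bo


def single_parent_estimates(counts, vocab):
    """Add-one smoothed estimates of P(f | p1) and P(f | p2), used as g() here."""
    c1, c2, n1, n2 = Counter(), Counter(), Counter(), Counter()
    for (f, p1, p2), n in counts.items():
        c1[(f, p1)] += n
        n1[p1] += n
        c2[(f, p2)] += n
        n2[p2] += n
    size = len(vocab)
    est1 = lambda f, p1, p2: (c1[(f, p1)] + 1) / (n1[p1] + size)
    est2 = lambda f, p1, p2: (c2[(f, p2)] + 1) / (n2[p2] + size)
    return est1, est2


vocab = ["a", "b", "c", "d"]
counts = Counter({("a", "x", "y"): 3, ("b", "x", "y"): 1, ("c", "x", "z"): 2})
est1, est2 = single_parent_estimates(counts, vocab)

g_standard = est1                                               # drop the 2nd parent
g_max = lambda f, p1, p2: max(est1(f, p1, p2), est2(f, p1, p2))  # best single parent

p_std = make_pbo(counts, vocab, g_standard)
p_max = make_pbo(counts, vocab, g_max)
print([round(p_std(v, "x", "y"), 3) for v in vocab])
print([round(p_max(v, "x", "y"), 3) for v in vocab])
```

Note that evaluating alpha requires a sum over the vocabulary for each context, which is one concrete source of the extra computational cost mentioned above.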

g() functions



• Standard backoff

\[ g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{P1}) \]

• Max counts

\[ g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{Pj^*}), \qquad j^* = \arg\max_j N(f, f_{Pj}) \]

• Max normalized counts

\[ j^* = \arg\max_j \frac{N(f, f_{Pj})}{N(f_{Pj})} \]

More g() functions

• Max backoff graph node.

\[ g(f, f_{P1}, f_{P2}) = P_{BO}(f \mid f_{Pj^*}), \qquad j^* = \arg\max_j P_{BO}(f \mid f_{Pj}) \]

Fi | FA1 , FA2 , FA3

Fi | FA1 , FA2     Fi | FA2 , FA3     Fi | FA1 , FA3

Fi | FA1     Fi | FA2     Fi | FA3

Fi


How to choose backoff path?

Four basic strategies

1. Fixed path (based on what seems reasonable (time))
2. Generalized all-child backoff
3. Constrained multi-child backoff
• Same as before, but choose a subset of possible paths a priori
4. Child combination rules
• Combine child nodes via combination function (mean, weighted avg., etc.)

Significant Additions to Stolcke's SRILM, the SRI Language Modeling Toolkit

• New features added to SRILM including

– Can specify an arbitrary number of graphical-

model based factorized models to train, compute

perplexity, and rescore N-best lists.

– Can specify any (possibly constrained) set of

backoff paths from top to bottom level in BG.

– Different smoothing (e.g., Good-Turing, Kneser-

Ney, etc.) or interpolation methods may be used

at each backoff graph node

– Supports the generalized backoff algorithms with

18 different possible g() functions at each BG

node.

Example with Words, Stems, and

Morphological classes



M_{t-3}  M_{t-2}  M_{t-1}  M_t
S_{t-3}  S_{t-2}  S_{t-1}  S_t
W_{t-3}  W_{t-2}  W_{t-1}  W_t

\[ P(w_t \mid s_t, m_t)\, P(s_t \mid m_t, w_{t-1}, w_{t-2})\, P(m_t \mid w_{t-1}, w_{t-2}) \]

How to specify a model

Backoff path: W_t | S_t, M_t → W_t | S_t → W_t

    ## word given stem morph
    W : 2 S(0) M(0)
    S0,M0 M0 wbdiscount gtmin 1 interpolate
    S0 S0 wbdiscount gtmin 1
    0 0 wbdiscount gtmin 1

Backoff path: M_t | W_{t-1}, W_{t-2} → M_t | W_{t-1} → M_t

    ## morph given word word
    M : 2 W(-1) W(-2)
    W1,W2 W2 kndiscount gtmin 1 interpolate
    W1 W1 kndiscount gtmin 1 interpolate
    0 0 kndiscount gtmin 1

Backoff path: S_t | M_t, W_{t-1}, W_{t-2} → S_t | M_t, W_{t-1} → S_t | M_t → S_t

    ## stem given morph word word
    S : 3 M(0) W(-1) W(-2)
    M0,W1,W2 W2 kndiscount gtmin 1 interpolate
    M0,W1 W1 kndiscount gtmin 1 interpolate
    M0 M0 kndiscount gtmin 1
    0 0 kndiscount gtmin 1

Summary

• Language Models, Backoff, and Graphical

Models

• Factored Language Models (FLMs) as

Graphical Models

• Generalized Graph Backoff algorithm

• New features to SRI Language Model

Toolkit (SRILM)

Coffee Break



Back in 10 minutes

Knowledge-Free Induction

of Arabic Morphology



Patrick Schone

21 August 2002

Why induce Arabic

morphology?

(1) Has not been done before

(2) If it can be done, and if it has value in LM,

it can generalize across languages without

needing an expert

Original Algorithm

(Schone & Jurafsky, '00/'01)



Looking for word inflections on words w/ Fr>9



Use a character tree to find word pairs with

similar beginnings/ endings

Ex: car/cars , car/cares, car/caring



Use Latent Semantic Analysis to induce

semantic vectors for each word,

then compare word-pair semantics



Use frequencies of word stems/rules to improve

the initial semantic estimates

Algorithmic Expansions

IR-Based Minimum Edit Distance

Trie-based approach could be a problem for Arabic:
Templates => $aGlaB: { $aGlaB il$AGil $aGlu $AGil }
Result: 3576 words in CallHome lexicon w/ 50+ relationships!

• Use Minimum Edit Distance to find the relationships (can be weighted)
• Use information-retrieval based approach to facilitate search for MED candidates

     ∙  $  A  G  i  l
∙    0  1  2  3  4  5
$    1  0  1  2  3  4
a    2  1  2  3  4  5
G    3  2  3  2  3  4
l    4  3  4  3  4  3
a    5  4  5  4  5  4
B    6  5  6  5  6  5
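
The table on this slide is reproduced by a standard dynamic-programming edit distance in which insertions and deletions cost 1 and a substitution costs 2 (that is, one deletion plus one insertion). The sketch below computes the full table for the slide's word pair; the substitution cost is inferred from the numbers shown, and `med_table` is an illustrative name.

```python
def med_table(a, b, sub_cost=2):
    """Minimum-edit-distance table; rows follow a, columns follow b."""
    table = [list(range(len(b) + 1))]
    for i, x in enumerate(a, 1):
        row = [i]
        for j, y in enumerate(b, 1):
            row.append(min(
                table[i - 1][j] + 1,                               # delete x
                row[j - 1] + 1,                                    # insert y
                table[i - 1][j - 1] + (0 if x == y else sub_cost)  # match or substitute
            ))
        table.append(row)
    return table


table = med_table("$aGlaB", "$AGil")
for label, row in zip("∙$aGlaB", table):
    print(label, row)
print("distance:", table[-1][-1])
```

A weighted variant would replace the constant costs with per-character costs, for example making vowel changes cheaper than consonant changes.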

Algorithmic Expansions

Agglomerative Clustering Using Rules & Stems



Rule            #Word Pairs w/ Rule
* => il+*       1178
* => *u         635
* => *i         455
*i => *u        377
* => fa+*       375
* => bi+*       366
…

Stem            #Word Pairs w/ Stem
Gayyar          507
xallaS          503
makallim$       468
qaddim          434
itgawwiz        332
tkallim         285
…

Do bottom-up clustering, where weight between two words is Ct(Rule)*Ct(PairedStem)^(1/2)

Algorithmic Expansions

Updated Transitivity









If X~Y and Y~Z and |X^Y|>2 and X^Y il+NULL)

• Generate only if initial and final n-characters of stem have

been seen before.



                          Number proposed   Coverage   Observed as words
Rule only                 993398            41.3%      0.1%
Rule+1-char stem agree    98864             25.0%      1.1%
Rule+2-char stem agree    35092             14.9%      1.8%

Text Selection for

Conversational Arabic

Feng He

ASR (Arabic Speech Recognition) Team

JHU Workshop

Motivation

• Group goal: Conversational Arabic Speech

Recognition.

• One of the Problems: not enough training

data to build a Language Model – most

available text is in MSA (Modern Standard

Arabic) or a mixture of MSA and

conversational Arabic.

• One Solution: Select from mixed text

segments that are conversational, and use

them in training.

Task: Text Selection

– Use POS-based language models because they have been shown to better indicate differences in style, such as formal vs. conversational.
– Method:
1. Train a POS (part-of-speech) tagger on available data
2. Train POS-based language models on formal vs. conversational data
3. Tag new data
4. Select segments from new data that are closest to the conversational model, using scores from the POS-based language models.

Data

• For building the Tagger and Language Models

– Arabic Treebank: 130K words of hand-

tagged Newspaper text in MSA.

– Arabic CallHome: 150K words of

transcribed phone conversations. Tags are

only in the Lexicon.

• For Text Selection

– Al Jazeera: 9M words of transcribed TV

broadcasts. We want to select segments

that are closer to conversational Arabic,

such as talk-shows and interviews.

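Implementation

• Model (bigram): [Figure: graphical model with tag nodes t_{i-1}, t_i and word node w_i]

$$T^* = \arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)}$$

$$P(W \mid T)\,P(T) = \prod_i P(w_i \mid t_i)\,P(t_i \mid t_{0:i-1}) \approx \prod_i P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$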

About unknown words:

• These are words that are not seen in training

data, but appear in test data.

• Assume unknown words behave like

singletons (words that appear only once in

training data).

• This is done by duplicating the training data with singletons replaced by a special token, then training the tagger on both the original and the duplicate.
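The singleton trick above can be sketched as follows. This is a minimal illustration, not the workshop's code: the token name `<unk>`, the function name, and the toy corpus are all assumptions.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the special unknown-word token


def add_singleton_copy(tagged_sentences):
    """Return the original sentences plus a copy with singleton words mapped to UNK."""
    counts = Counter(word for sent in tagged_sentences for word, _ in sent)
    duplicate = [
        [(UNK if counts[word] == 1 else word, tag) for word, tag in sent]
        for sent in tagged_sentences
    ]
    # The tagger is then trained on the original and the duplicate together.
    return tagged_sentences + duplicate


# Toy corpus (invented for illustration): "cat" and "dog" are singletons.
corpus = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
augmented = add_singleton_copy(corpus)
```

At test time, any word missing from the training vocabulary would be mapped to the same token, so it inherits the tag distribution learned from singletons.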

Tools:

GMTK (Graphical Model Toolkit)



Algorithms:

Training: EM training – iteratively re-estimate parameters to increase the likelihood of the observations, summing over hidden state sequences.



Decoding (tagging): Viterbi – find hidden

state sequence that maximizes joint

probability of hidden state and observations.
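For reference, Viterbi decoding for a bigram tagger can be sketched in a few lines. This is a generic textbook version, not GMTK's implementation; the probability tables and the two-tag example are invented for illustration.

```python
import math


def viterbi(words, tags, log_init, log_trans, log_emit):
    """Return the tag sequence maximizing the joint log-probability of tags and words."""
    # best[i][t]: best log-score of any tag path ending in tag t at position i
    best = [{t: log_init[t] + log_emit[t][words[0]] for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans[p][t])
            scores[t] = best[-1][prev] + log_trans[prev][t] + log_emit[t][w]
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Toy model (all numbers are made up).
tags = ["N", "V"]
log_init = {"N": math.log(0.8), "V": math.log(0.2)}
log_trans = logs({"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}})
log_emit = logs({"N": {"dogs": 0.9, "bark": 0.1}, "V": {"dogs": 0.2, "bark": 0.8}})
best_tags = viterbi(["dogs", "bark"], tags, log_init, log_trans, log_emit)
```

The sketch assumes every word has an emission entry for every tag; a real tagger would route unseen words through the unknown-word token first.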

Experiments

Exp 1: Data: first 100K of English Penn Treebank. Trigram

model. Sanity check.

Exp 2: Data: Arabic Treebank. Trigram model.

Exp 3: Data: Arabic Treebank and CallHome. Trigram

model.

The above three experiments all used 10-fold cross-validation and are unsupervised.



Exp 4: Data: Arabic Treebank. Supervised trigram model.

Exp 5: Data: Arabic Treebank and CallHome. Partially supervised training using Treebank's tagged data. Test on portion of Treebank not used in training. Trigram model.

Results



Experiment                                 Accuracy    Accuracy of OOV    Baseline
1 – tri, en                                92.7        37.9               79.3 – 95.5
2 – tri, ar, tb                            79.5        19.3               75.9
3 – tri, ar, tb+ch                         74.6        17.6               75.9
4 – tri, ar, tb, sup                       90.9        56.5               90.0
5 – repeat 3 with partial supervision      83.4        43.6               90.0

Building Language Models and Text

Selection



• Use existing scripts to build formal and

conversational language models from

tagged Arabic Treebank and CallHome data.

• Text selection: use log likelihood ratio

$$\mathrm{Score}(S_i) = \log \frac{P(S_i \mid C)^{1/N_i}\,P(C)}{P(S_i \mid F)^{1/N_i}\,P(F)}$$

S_i: the i-th sentence in the data set

C: conversational language model

F: formal language model

N_i: length of S_i
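In log space the score reduces to a length-normalized difference of sentence log-probabilities plus a prior term. A minimal sketch, assuming the two sentence log-probabilities have already been computed by the POS-based language models and that P(F) = 1 − P(C) (the function and parameter names are hypothetical):

```python
import math


def score(logp_conv, logp_formal, n_words, prior_conv=0.5):
    """Length-normalized log likelihood ratio of one sentence.

    logp_conv   : log P(S_i | C), natural log
    logp_formal : log P(S_i | F), natural log
    n_words     : N_i, the sentence length
    prior_conv  : P(C); P(F) is assumed to be 1 - P(C)
    """
    prior_term = math.log(prior_conv / (1.0 - prior_conv))
    return (logp_conv - logp_formal) / n_words + prior_term


# A 4-word sentence that the conversational model prefers by 4 nats in total.
example = score(-10.0, -14.0, 4)
```

Positive scores indicate sentences that look more conversational than formal; with equal priors the prior term vanishes.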

Score Distribution

[Figure: two plots of the sentence score distribution, with y-axes "percentage" and "log count", both against log likelihood ratio on the x-axis]

Assessment



• A subset of Al Jazeera equal in size to Arabic

CallHome (150K words) is selected, and

added to training data for speech recognition

language model.

• No reduction in perplexity.

• Possible reasons: Al Jazeera has no

conversational Arabic, or has only

conversational Arabic of a very different style.

Text Selection Work

Done at BBN

Rich Schwartz

Mohamed Noamany

Daben Liu

Nicolae Duta

Search for Dialect Text



• We have an insufficient amount of CH text for

estimating a LM.

• Can we find additional data?

• Many words are unique to dialect text.

• Searched Internet for 20 common dialect

words.

• Most of the data found were jokes or chat

rooms – very little data.

Search BN Text for Dialect Data



• Search BN text for the same 20 dialect words.

• Found less data than in CH.

• Each occurrence was typically an isolated

lapse by the speaker into dialect, followed

quickly by a recovery to MSA for the rest of

the sentence.

Combine MSA text with CallHome



• Estimate separate models for MSA text (300M

words) and CH text (150K words).

• Use SRI toolkit to determine single optimal

weight for the combination, using deleted

interpolation (EM)

– Optimal weight for MSA text was 0.03

• Insignificant reduction in perplexity and WER
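The weight estimation can be illustrated with a small EM loop. This is a generic sketch of two-model interpolation, not the SRI toolkit's code; it assumes each model's probability for every held-out word is already available, and the function name and toy numbers are invented.

```python
def em_interpolation_weight(p_a, p_b, iters=50, lam=0.5):
    """Estimate the weight lam that maximizes
    sum_i log(lam * p_a[i] + (1 - lam) * p_b[i]) on held-out data.

    p_a, p_b: per-word probabilities assigned by models A and B.
    """
    for _ in range(iters):
        # E-step: posterior probability that each word was generated by model A
        post = [lam * a / (lam * a + (1 - lam) * b) for a, b in zip(p_a, p_b)]
        # M-step: the new weight is the average posterior
        lam = sum(post) / len(post)
    return lam


# Toy held-out data where model B (think: CH) always fits far better than A (MSA).
weight_a = em_interpolation_weight([0.01, 0.01, 0.01], [0.5, 0.5, 0.5])
# When both models agree everywhere, the weight does not move.
weight_tie = em_interpolation_weight([0.1, 0.2], [0.1, 0.2])
```

A very small optimal weight for one component, like the 0.03 reported above for MSA, means that model rarely explains held-out CH words better than the CH model does.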

Classes from BN



Hypothesis:

• Even if MSA ngrams are different, perhaps the

classes are the same.

Experiment:

• Determine classes (using SRI toolkit) from BN+CH

data.

• Use CH data to estimate ngrams of classes and / or

p(w | class)

• Combine resulting model with CH word trigram

Result:

• No gain

Hypothesis Test Constrained Back-Off



Hypothesis:

• In combining BN and CH, if a probability is different, it could be for two reasons:

– CH has insufficient training data

– BN and CH truly have different probabilities (likely)

Algorithm:

• Interpolate BN and CH, but limit the probability change to no more than would be likely due to insufficient training.

• An n-gram count cannot change by more than its square root

Result:

• No gain
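One way to read the square-root constraint is as a clamp on each adjusted n-gram count. The sketch below is only an interpretation: the slides do not say how BN evidence is scaled or added, so the additive form, the `bn_weight` parameter, and the function name are assumptions.

```python
import math


def constrained_count(ch_count, bn_count, bn_weight):
    """Add weighted BN evidence to a CH n-gram count, but never move the
    CH count by more than its own square root."""
    limit = math.sqrt(ch_count)
    proposed = ch_count + bn_weight * bn_count
    return min(max(proposed, ch_count - limit), ch_count + limit)


# A CH count of 100 may move by at most 10, however much BN data there is.
capped = constrained_count(100, 5000, 0.03)
# A small BN contribution passes through unchanged.
uncapped = constrained_count(100, 100, 0.03)
```

The square root is roughly the standard deviation of a count under a Poisson assumption, which is presumably why it serves as the bound on changes attributable to sparse training data.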

Learning & Using

Factored Language

Models

Gang Ji

Speech, Signal, and Language Interpretation

University of Washington

August 21, 2002

Outline

• Factored Language Models (FLMs)

overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding in ASR with

FLMs using graphical models

Factored Language Models

• Along with words, consider factors as

components of the language model

• Factors can be words, stems, morphs, patterns,

roots, which might contain complementary

information about language

• FLMs also provide new possibilities for designing LMs (e.g., multiple back-off paths)

• Problem: We don't know the best model, and the search space is huge!

Factored Language Models

• How to learn FLMs

– Solution 1: do it by hand using expert

linguistic knowledge

– Solution 2: data driven; let the data

help to decide the model

– Solution 3: combine both linguistic

and data driven techniques

Factored Language Models

• A Proposed Solution:

– Learn FLMs using evolution-inspired search

algorithm

• Idea: Survival of the fittest

– A collection (generation) of models

– In each generation, only good ones survive

– The survivors produce the next generation

Evolution-Inspired Search









• Selection: choose the good LMs

• Combination: retain useful characteristics

• Mutation: some small change in next generation

Evolution-Inspired Search



• Advantages

– Can quickly find a good model

– Retain goodness of the previous generation while

covering significant portion of the search space

– Can run in parallel

• How to judge the quality of each model?

– Perplexity on a development set

– Rescore WER on development set

– Complexity-penalized perplexity

Evolution-Inspired Search



• Three steps form new models.

– Selection (based on perplexity, etc)

• E.g. Stochastic universal sampling: models are

selected in proportion to their “fitness”

– Combination

– Mutation
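Stochastic universal sampling, the selection scheme named above, can be sketched as follows. This is the generic algorithm, not the workshop's tool. It assumes fitness values are non-negative with larger meaning better, so a perplexity would first have to be converted (for example to its inverse).

```python
import random


def stochastic_universal_sampling(fitness, n, rng=random):
    """Pick n indices with probability proportional to fitness, using n
    evenly spaced pointers and a single random offset."""
    total = sum(fitness)
    step = total / n
    start = rng.uniform(0, step)
    chosen, cumulative, i = [], fitness[0], 0
    for k in range(n):
        pointer = start + k * step
        # Advance to the model whose fitness interval contains the pointer.
        while cumulative < pointer and i < len(fitness) - 1:
            i += 1
            cumulative += fitness[i]
        chosen.append(i)
    return chosen


# Four equally fit models, four slots: each model is selected exactly once.
even = stochastic_universal_sampling([1.0, 1.0, 1.0, 1.0], 4, random.Random(0))
# A model with zero fitness is not selected.
skewed = stochastic_universal_sampling([0.0, 10.0], 3, random.Random(0))
```

Compared with drawing n independent samples, the evenly spaced pointers guarantee that a model holding a fraction f of the total fitness is picked about f·n times, which reduces the chance of losing good models by bad luck.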

Moving from One Generation to Next



• Combination Strategies

– Inherit structures horizontally

– Inherit structures vertically

– Random selection

• Mutation

– Add/remove edges randomly

– Change back-off/smoothing strategies

Combination according to Frames

[Figure: graphs over factors F1, F2, F3 at times t−2, t−1, t; three graphs in the top row and one in the bottom row]

Combination according to Factors



[Figure: graphs over factors F1, F2, F3 at times t−2, t−1, t; three graphs in the top row and one in the bottom row]

Outline



• Factored Language Models (FLMs)

overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding with FLMs

Problem

• May be difficult to improve WER just by

rescoring n-best lists

• More gains can be expected from using better

models in first-pass decoding

• Solution:

1. do first-pass decoding using FLMs

2. Since FLMs can be viewed as graphical models, use GMTK (most existing tools don't support general graph-based models)

3. To speed up inference, use generalized graphical-

model-based lattices.

FLMs as Graphical Models



[Figure: graphical model with factor nodes F1, F2, F3 and a Word node, attached to the graph for the acoustic model]

FLMs as Graphical Models



• Problem: decoding can be expensive!

• Solution: multi-pass graphical lattice

refinement

– In first-pass, generate graphical lattices using

a simple model (i.e., more independencies)

– Rescore the lattices using a more complicated

model (fewer independencies) but on much

smaller search space

Example: Lattices in a Markov Chain









[Figure: lattice over the states of a Markov chain; node labels not recoverable]





This is the same as a word-based lattice

Lattices in General Graphs









[Figure: lattices over the variables of a general graph; node labels not recoverable]

Research Plan

• Data

– Arabic CallHome data

• Tools

– Tools for evolution-inspired search

• mostly developed during the workshop

– Training/Rescoring FLMs

• Modified SRI LM toolkit: developed during this

workshop

– Multi-pass decoding

• Graphical models toolkit (GMTK): developed in last

workshop

Summary

• Factored Language Models (FLMs)

overview

• Part I: automatically finding FLM structure

• Part II: first-pass decoding of FLMs using

GMTK and graphical lattices

Minimum Divergence Adaptation

of an MSA-Based Language Model

to Egyptian Arabic

A proposal by

Sourin Das

JHU Workshop Final Presentation

August 21, 2002

Motivation for LM Adaptation

• Transcripts of spoken Arabic are expensive to obtain;

MSA text is relatively inexpensive (AFP newswire,

ELRA Arabic data, Al Jazeera, …)

– MSA text ought to help; after all it is Arabic



• However there are considerable dialectal differences

– Inferences drawn from Callhome knowledge or data ought

to overrule those from MSA whenever the inferences drawn

from them disagree: e.g. estimates of N-gram probabilities

– Cannot interpolate models or merge data naïvely

– Need to instead fall back to MSA knowledge only when the

Callhome model or data is “agnostic” about an inference

Motivation for LM Adaptation



• The minimum K-L divergence framework provides a

mechanism to achieve this effect

– First estimate a language model Q* from MSA text only

– Then find a model P* which matches all major Callhome

statistics and is close to Q*.





• Anecdotal evidence: MDI methods successfully used

to adapt models based on NABN text to SWBD: a 2%

WER reduction in LM95 from a 50% baseline WER.
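Stated as an optimization problem, the standard minimum divergence (MDI) setup looks as follows. The notation is generic and chosen here for illustration: $\phi_m$ stands for any Callhome feature whose statistic is to be matched, $\tilde{E}_{\mathrm{CH}}$ for the empirical Callhome expectation, and $\theta_m$ for the associated weight.

```latex
\begin{aligned}
P^* &= \arg\min_{P}\; D(P \,\|\, Q^*)
     = \arg\min_{P} \sum_x P(x) \log \frac{P(x)}{Q^*(x)} \\
    &\quad \text{subject to } E_P[\phi_m] = \tilde{E}_{\mathrm{CH}}[\phi_m] \text{ for every selected feature } \phi_m, \\[4pt]
P^*(x) &= \frac{1}{Z(\theta)}\, Q^*(x)\, \exp\Big[\sum_m \theta_m \phi_m(x)\Big].
\end{aligned}
```

The solution is the MSA model reweighted by an exponential correction. When $Q^*$ is uniform, the problem reduces to ordinary maximum entropy estimation on Callhome alone.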

An Information Geometric View



The Uniform Distribution

Models satisfying

Callhome marginals







MaxEnt Callhome LM



MaxEnt MSA-text LM Min Divergence Callhome LM







Models satisfying

MSA-text marginals



The Space of all Language Models

A Parametric View of MaxEnt Models

• The MSA-text based MaxEnt LM is the ML estimate among exponential models of the form

$$Q(x) = Z^{-1}(\lambda,\mu)\, \exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x)\Big]$$

• The Callhome based MaxEnt LM is the ML estimate among exponential models of the form

$$P(x) = Z^{-1}(\mu,\nu)\, \exp\Big[\sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big] \quad (*)$$

• Think of the Callhome LM as being from the family

$$P(x) = Z^{-1}(\lambda,\mu,\nu)\, \exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big]$$

where we set $\lambda = 0$ based on the MaxEnt principle.

• One could also be agnostic about the values of the $\lambda_i$'s, since no examples with $f_i(x) > 0$ are seen in Callhome

– Features (e.g. N-grams) from MSA-text which are not seen in Callhome always have $f_i(x) = 0$ in Callhome training data

A Pictorial “Interpretation” of the

Minimum Divergence Model





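The ML model for MSA text

$$Q^*(x) = Z^{-1}(\lambda^*,\mu^*)\, \exp\Big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^* g_j(x)\Big]$$

Subset of all exponential models with $\lambda = \lambda^*$

$$P(x) = Z^{-1}(\lambda^*,\mu,\nu)\, \exp\Big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big]$$

The ML model for Callhome, with $\lambda = \lambda^*$ instead of $\lambda = 0$.

$$P^*(x) = Z^{-1}(\lambda^*,\mu^{**},\nu^*)\, \exp\Big[\sum_i \lambda_i^* f_i(x) + \sum_j \mu_j^{**} g_j(x) + \sum_k \nu_k^* h_k(x)\Big]$$

Subset of all exponential models with $\nu = 0$

$$Q(x) = Z^{-1}(\lambda,\mu)\, \exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x)\Big]$$

All exponential models of the form

$$P(x) = Z^{-1}(\lambda,\mu,\nu)\, \exp\Big[\sum_i \lambda_i f_i(x) + \sum_j \mu_j g_j(x) + \sum_k \nu_k h_k(x)\Big]$$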

Details of Proposed Research (1):

A Factored LM for MSA text

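• Notation: W = romanized word, α = script, S = stem, R = root, M = tag

$$Q(\alpha_i \mid \alpha_{i-1}, \alpha_{i-2}) = Q(\alpha_i \mid \alpha_{i-1}, \alpha_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})$$

• Examine all C(8,2) = 28 trigram "templates" of two variables from the history with α_i.

– Set observations w/counts above a threshold as features

• Examine all C(8,1) = 8 bigram "templates" of one variable from the history with α_i.

– Set observations w/counts above a threshold as features

• Build a MaxEnt model (use Jun Wu's toolkit)

$$Q(\alpha_i \mid \alpha_{i-1}, \alpha_{i-2}) = Z^{-1}(\lambda,\mu)\, \exp\big[\lambda_1 f_1(\alpha_i, \alpha_{i-1}, S_{i-2}) + \lambda_2 f_2(\alpha_i, M_{i-1}, M_{i-2}) + \dots + \lambda_i f_i(\alpha_i, \alpha_{i-1}) + \dots + \mu_j g_j(\alpha_i, R_{i-1}) + \dots + \mu_J g_J(\alpha_i)\big]$$

• Build the Romanized language model

$$Q(W_i \mid W_{i-1}, W_{i-2}) = U(W_i \mid \alpha_i)\, Q(\alpha_i \mid \alpha_{i-1}, \alpha_{i-2})$$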
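The template counts on this slide can be checked by enumeration. In the sketch below the variable labels are placeholders: `A` is used for the script-form variable, and the helper name `templates` is hypothetical.

```python
from itertools import combinations

# The eight history variables: script form, stem, tag and root at positions i-1 and i-2.
history = [
    "A[i-1]", "A[i-2]",
    "S[i-1]", "S[i-2]",
    "M[i-1]", "M[i-2]",
    "R[i-1]", "R[i-2]",
]


def templates(target, history, k):
    """All feature templates pairing the predicted variable with k history variables."""
    return [(target,) + combo for combo in combinations(history, k)]


trigram_templates = templates("A[i]", history, 2)  # two history variables each
bigram_templates = templates("A[i]", history, 1)   # one history variable each
```

Each template would then be instantiated on the training text, keeping as features only those observed value combinations whose counts exceed the threshold.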


Details of Proposed Research (2):

Additional Factors in Callhome LM

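$$P(W_i \mid W_{i-1}, W_{i-2}) = P(W_i, \alpha_i \mid W_{i-1}, W_{i-2}, \alpha_{i-1}, \alpha_{i-2}, S_{i-1}, S_{i-2}, M_{i-1}, M_{i-2}, R_{i-1}, R_{i-2})$$

• Examine all C(10,2) = 45 trigram "templates" of two variables from the history with W or α (the script form).

– Set observations w/counts above a threshold as features

• Examine all C(10,1) = 10 bigram "templates" of one variable from the history with W or α.

– Set observations w/counts above a threshold as features

• Compute a Min Divergence model of the form

$$P(W_i \mid W_{i-1}, W_{i-2}) = Z^{-1}(\lambda,\mu,\nu)\, \exp\big[\lambda_1 f_1(\alpha_i, \alpha_{i-1}, S_{i-2}) + \lambda_2 f_2(\alpha_i, M_{i-1}, M_{i-2}) + \dots + \lambda_i f_i(\alpha_i, \alpha_{i-1}) + \dots + \mu_j g_j(\alpha_i, R_{i-1}) + \dots + \mu_J g_J(\alpha_i)\big] \times \exp\big[\nu_1 h_1(W_i, W_{i-1}, S_{i-2}) + \nu_2 h_2(\alpha_i, W_{i-1}, S_{i-2}) + \dots + \nu_k h_k(\alpha_i, \alpha_{i-1}) + \dots + \nu_K h_K(W_i)\big]$$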

Research Plan and Conclusion



• Use baseline Callhome results from WS02

– Investigate treating romanized forms of a script

form as alternate pronunciations

• Build the MSA-text MaxEnt model

– Feature selection is not critical; use high cutoffs

• Choose features for the Callhome model

• Build and test the minimum divergence model

– Plug in induced structure

– Experiment with subsets of MSA text


