
Learning Arabic Morphology With Information Theory

Paul Rodrigues & Damir Ćavar
Computational Linguistics Lab, Department of Linguistics, Indiana University
prrodrig@indiana.edu

41st Chicago Linguistic Society, Chicago, IL. April 8th, 2005

TOC
• The Arabic Morphology Problem
• Computational Solutions
• Statistical Solution Hypothesis
• A Statistical Solution
• Results
• Conclusion

Arabic Roots
• 3 or 4 radicals
• Represents a semantic field
• ktb = write
• (kataba, he wrote) كَتَبَ
• (kitAbun, a book) كِتَابٌ
• (maktabun, an office) مَكْتَبٌ

Arabic Roots - Verbs Forms
• 15 Forms
• I. (kataba, he wrote) كَتَبَ
• II. (kattaba, he made someone write) كَتَّبَ
• ...

Arabic Roots - Prosodic Template

• The vowels of each of those forms control voice.
• (kataba, active) كَتَبَ
• (kutiba, passive) كُتِبَ

...

• ...most of the time
• (kutub; noun: piece of writing) كُتُب

Concatenative

• Particles
• (al-kitAb-u) الْكِتَابُ
• (wa-al-kitAb-u) وَالْكِتَابُ

Concatenative

• Case Endings
• (al-kitAb-u) الْكِتَابُ
• (wa-al-kitAb-u) وَالْكِتَابُ

Concatenative

• Gender

Concatenative

• Reduplication
• (mishmish, apricot) مِشْمِش
• (waswasa, whisper) وَسْوَسَ
Though reduplication in Arabic is not productive, it is interesting for theoretical linguistics.

Examples from (McCarthy, 1979)

Roots
• Is the root a morpheme?
• ROOT: McCarthy (1979, 1981), Prunet et al. (2000), Davis & Zawaydeh (1999, 2001), etc.
• STEM: McOmber (1995), Ratcliffe (1997), Benmamoun (1999), etc.
• Summary: Davis (2001)
• Either way, roots are useful for lexicography, data retrieval, spell checking, etc.

Arabic Verb Roots
Group   Name          Sound radicals                               Weak (glides)      Glottal stop
Sound   Sound         3                                            0                  0
        Hamzated      2                                            0                  1
        Doubled       usually 3, at least 2 (same mid and final)   0 or 1 (initial)   0
Weak    Assimilated   2                                            1 (initial)        0
        Hollow        2                                            1 (mid)            0
        Defective     2                                            1 (final)          0
        Doubly Weak   1                                            2                  0

(Mace, 1998: 26-103)

Semitic Computational Morphology

• Arabic: Beesley (*), Buckwalter (*), Karttunen (*), Kay (*), Kiraz (*)
• Finite State
• Two-level (Koskenniemi, 1983; Karttunen, 1983)
• Rule-based
• Dictionary-based

Statistical Computational Morphology
Statistical Morphology Learning
• Linguistica: Goldsmith (*), Hu & Goldsmith, Belkin & Goldsmith (http://linguistica.uchicago.edu)
• ABUGI: Ćavar et al. (*) (http://jones.ling.indiana.edu/~prrodrig)
• Creutz & Lagus (2001): MDL and EM approach
• Schone & Jurafsky (2000): LSA approach

Intermission: Linguistica and ABUGI
• Similarities
• Statistical morphological parser
• MDL and MI
• Differences
• ABUGI is an incremental learning model.
• ABUGI doesn’t have prior knowledge of morphological types.
• ABUGI’s current version decides on the most significant evidence and votes on the parse.

Statistical Semitic Morphology
• An approach for Hebrew (Daya et al., 2003)
• Ezra Daya, Dan Roth & Shuly Wintner: “Learning Hebrew Roots: Learning with Linguistic Constraints.” 2003.
• 30 roots
• Hidden Markov Models
• Up to 83% on a single radical; up to 59.83% overall.

Why?
Possible Roots in a Word

[Chart: number of possible roots (0-500) by word length (3-15 characters).]
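A quick sanity check on the scale of this chart: if a candidate root is any in-order choice of three letter positions, a word of length n has C(n, 3) candidates. A minimal sketch (the closed form is our reading of the chart, not stated on the slides):

from math import comb

# Candidate triliteral roots in a word of length n, assuming a candidate
# is any in-order choice of 3 letter positions (not necessarily adjacent).
for n in range(3, 16):
    print(n, comb(n, 3))  # n=15 yields 455, matching the chart's scale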

A Statistical Approach to Semitic Morphological Parsing
• Learning Model
• Statistical
• Guides the learner.
• Makes predictions about unseen data.
• Alignment-based
• Incremental
• We can track the learning progress and analyze our algorithm.

Constraints
• Constraints were used to reduce the search space.
• Considers only triliteral roots.
• Customizable max distance (n) within which to search for the end radicals (R1 n R3).
• Customizable max distance (m) between radicals (R1 m R2 m R3). (A candidate-generation sketch follows.)
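A minimal sketch of candidate generation under these two constraints; the function name and the concrete defaults for n and m are illustrative, since the slides leave both customizable:

from itertools import combinations

def candidate_roots(word, n=5, m=3):
    """Candidate triliteral roots of `word`: in-order letter triples whose
    end radicals span at most n positions (R1 n R3) and whose adjacent
    radicals are at most m positions apart (R1 m R2 m R3)."""
    candidates = set()
    for i, j, k in combinations(range(len(word)), 3):
        if k - i <= n and j - i <= m and k - j <= m:
            candidates.add(word[i] + word[j] + word[k])
    return candidates

print(sorted(candidate_roots("wakataba")))  # toy Latin-script stand-in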

Innate Assumptions
• Morphology:
• Concatenative
• Interdigitation
• Reduplication
• Concatenative morphology only occurs outside the first and last characters of the root.
• We assume that frequency and collocation can be used to learn these morphemes.

We assume
• Input is word-by-word, incremental.
• Autosegmental orthography
• Word formation is by concatenation and interdigitation. Goldsmith (1976), McCarthy (1979)

Parsing the Roots
• Started with (Elghamry, 2004).
• Evaluates statistical evidence for and against the inclusion of a radical as part of the root.
• Reported 90% precision on an Al-Jazeera corpus, hand-evaluated.

Roots - Positive Evidence
$$E_1 = \frac{Freq_{Root_1}}{Freq_{Affix}} + \frac{Freq_{Root_2}}{Freq_{Affix}} + \frac{Freq_{Root_3}}{Freq_{Affix}}$$

$$E_3 = \frac{Freq_1}{Freq_{Root_1}} + \frac{Freq_2}{Freq_{Root_2}} + \frac{Freq_3}{Freq_{Root_3}}$$

Roots - Negative Evidence
$$E_4 = \frac{Freq_{Affix_1}}{Freq_1} + \frac{Freq_{Affix_2}}{Freq_2} + \frac{Freq_{Affix_3}}{Freq_3}$$

$$E_5 = \frac{Freq_{Affix_1} + Freq_{Affix_2} + Freq_{Affix_3}}{NumberOfLettersInThePossibleAffixes}$$

Roots - Evidence Combined
$$E = \frac{E_1 + E_3}{E_4 + E_5}$$

• And maximize this over the word.
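How the terms might combine in code; the slides abbreviate the frequency terms heavily, so every argument below is our interpretation (per-radical counts in root position, in affix position, and overall, plus the whole-affix frequency and affix letter count):

def combined_evidence(freq_root, freq_affix, freq_total,
                      freq_affix_whole, affix_letters):
    """E = (E1 + E3) / (E4 + E5), evaluated for one candidate root.
    freq_root[i]     -- count of radical i occurring in root position
    freq_affix[i]    -- count of radical i occurring in affix material
    freq_total[i]    -- overall count of radical i
    freq_affix_whole -- count of the candidate affix string
    affix_letters    -- number of letters in the possible affixes"""
    e1 = sum(r / freq_affix_whole for r in freq_root)
    e3 = sum(t / r for t, r in zip(freq_total, freq_root))
    e4 = sum(a / t for a, t in zip(freq_affix, freq_total))
    e5 = sum(freq_affix) / affix_letters
    return (e1 + e3) / (e4 + e5)

The learner would evaluate this for every candidate root in the word and keep the maximum.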

Root Replacement

• Once we have the root, we can find the concatenative morphology.
• yaXu
• wayaXu
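The slides only show the output shape, so the helper below assumes the span from the first through the last radical collapses into a single X (a sketch, not the authors' implementation):

def mask_root(word, root):
    """Collapse the span from the first through the last radical into X,
    leaving only the concatenative material around the root."""
    positions, start = [], 0
    for radical in root:
        start = word.index(radical, start)  # first in-order match
        positions.append(start)
        start += 1
    return word[:positions[0]] + "X" + word[positions[-1] + 1:]

print(mask_root("yaktubu", "ktb"))    # -> yaXu
print(mask_root("wayaktubu", "ktb"))  # -> wayaXu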

Hypothesis Generation
• Alignment-Based Learning (van Zaanen, 2001)
• For every new word, generate only the hypotheses based upon alignment with previously learned morphemes (concatenation and interdigitation).
• Substitutability and complementarity (Harris, 1955, 1961)

Minimum Description Length
• Learning of grammar is equivalent to compression.
• Reduce memory.
• Decrease recall time.
• MDL Principle: Minimize the size of the grammar and the size of the data described by the grammar. (Grünwald, 1998) A toy computation is sketched below.
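A toy two-part description length, assuming a flat 5-bit cost per grammar character and a uniform code over grammar entries (the slides do not specify a coding scheme):

import math

def description_length(grammar, data_tokens):
    """Bits to write the grammar (a morpheme set) plus bits to encode the
    data as a sequence of pointers into that grammar."""
    grammar_bits = sum(5 * len(morpheme) for morpheme in grammar)
    pointer_bits = math.log2(len(grammar))
    return grammar_bits + pointer_bits * len(data_tokens)

# Factoring shared material out of the lexicon shrinks the description:
whole = {"wakataba", "wakutiba", "kataba", "kutiba"}
parts = {"wa", "kataba", "kutiba"}
print(description_length(whole, ["wakataba", "wakutiba", "kataba", "kutiba"]))   # 148.0
print(description_length(parts, ["wa", "kataba", "wa", "kutiba", "kataba", "kutiba"]))  # ~79.5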

Relative Entropy

• Constraint 1: Minimize the Relative Entropy (MacKay, 2003)

$$RE = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}$$
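A direct transcription of the formula, with P and Q as dictionaries over the same outcomes:

import math

def relative_entropy(p, q):
    """D(P||Q) = sum over x of P(x) * log2(P(x) / Q(x))."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

print(relative_entropy({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}))  # 0.0
print(relative_entropy({"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}))  # ~0.53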

Mutual Information

• Constraint 2: Maximize the Mutual Information (MI)

$$MI(xy) = \log_2 \frac{P(xy)}{P(x) \cdot P(y)}$$
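The formula above is the pointwise MI of a pair xy; transcribed directly:

import math

def pmi(p_xy, p_x, p_y):
    """log2(P(xy) / (P(x) * P(y))): positive when x and y co-occur more
    often than independence would predict."""
    return math.log2(p_xy / (p_x * p_y))

print(pmi(0.02, 0.1, 0.1))  # log2(2.0) = 1.0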

Integration
• Find the optimal parse by maximizing the satisfaction of the constraints:
• Minimize the Description Length
• Maximize the Mutual Information
• Minimize the Relative Entropy
(Ćavar et al., 2004)

Algorithm
foreach word {
    1. Find the root.
    2. Everything within the root is the template.
    3. Replace the root with an X.
    4. Use MDL to create concatenative hypotheses.
    5. foreach hypothesis {
        1. Compute MI for the hypothesis.
        2. Compute RE for the hypothesis.
    }
    6. Pick the best one (max MI, min RE).
    7. That is the split.
}
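The same loop as a Python sketch; the five helper callables stand in for the components sketched on earlier slides and are passed in rather than assumed to exist:

def parse_word(word, find_root, mask_root, mdl_hypotheses, mi_score, re_score):
    """One incremental step of the learner, following the loop above."""
    root = find_root(word)                    # 1. evidence-maximizing root
    masked = mask_root(word, root)            # 2.-3. root span collapses to X
    hypotheses = mdl_hypotheses(masked)       # 4. concatenative split candidates
    scored = [(mi_score(h), -re_score(h), h)  # 5. score each hypothesis
              for h in hypotheses]
    best = max(scored, key=lambda t: t[:2])   # 6. max MI, min RE
    return root, best[2]                      # 7. the chosen split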

Corpus
• Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002)
• Used 60,000 morphosyntactically correct words, generated randomly.
• This ensures that our roots are triliteral, and that we know the root.

Radical Precision
[Chart: Random order, triliteral; 20-word moving average; n=60k; avg. word length 6.01. Y: proportion correct (0-1); X: words seen (steps of 20).]

Root Precision
[Chart: Random order, root; 20-word moving average; n=60k; avg. word length 6.01. Y: proportion correct (0-1); X: words seen (steps of 5).]

Individual Radicals
[Chart: Random data, individual radicals; 20-word moving average; n=60k; avg. word length 6.01. One series per radical, with a 50-period moving average and polynomial trend line.]

Ascending Length Radical Learning
[Chart: Ascending word length; 20-word moving average; n=60k; avg. word length 6.01. Y: proportion correct (0-1); X: words seen (steps of 20).]

Ascending Length Individual Roots
[Chart: Ascending word length, individual radicals; 20-word moving average; n=60k; avg. word length 6.01. One series per radical, with a 50-period moving average.]

Descending Length Radical Learning
[Chart: Descending word length, triliteral; 20-word moving average; n=60k; avg. word length 6.01. Y: proportion correct (0-1); X: words seen (steps of 20).]

Descending Length Individual Roots
[Chart: Descending word length, individual radicals; 20-word moving average; n=60k; avg. word length 6.01. One series per radical, with a 50-period moving average.]

Normalized, Unvoweled Data
[Chart: Normalized triliteral; 20-word moving average; n=50k; avg. word length 6.01. Y: proportion correct (0-1); X: words seen (steps of 20).]

Individual Roots, Normalized Data
[Chart: Random normalized, individual radicals; n=50k; avg. word length 6.01. One series per radical, with a 50-period moving average and polynomial trend line.]

Why the Discrepancy?
• Why the discrepancy between Elghamry’s results and ours?
• His is a two-pass, all-at-once algorithm, while ours is incremental learning.
• He used unvoweled text; ours is voweled, a closer correlate to speech.
• He used a shorter text.

Average Word Length
• Avg. word length, Al-Jazeera = 4.96 (unvoweled; sample corpus of two articles dated 03/27/05)
• Avg. word length, our corpus = 6.01 (voweled; Random60k)
• Avg. word length, Arabic Treebank = 8.25 (voweled; p2v2, Al-Hayat)

“Zipf’s Laws”

• The longer the word, the more content it carries.
• Shorter words are more frequent.
• Clustering by length and frequency reveals distinct categories of open- and closed-class words.

Conclusions
• The generated output is:
• Consonant template
• Vowel template
• If you adopt interdigitation, you have two tiers: one highly frequent, one less frequent. The more frequent tier is “functional,” and the less frequent one defines the semantic fields.

A Conclusion

• Can this statistical effect be used to draw theoretical conclusions about the morphemic status of the root?

Conclusions
• Morphology offers linguistic cues.
• Word type information can be derived on the basis of morphological paradigms. Brants (2000), Lee et al. (2002), Hu & Goldsmith (2005), Ćavar et al. (2005)
• Roots allow induction of semantic fields.
• We have a usable morphological parser for Arabic that does not need a dictionary.

What’s Next
• Parse when confident.
• Adapt the algorithm for discontinuous morphemes of variable length.
• Automatically cluster types (from the concatenative morphology) and semantic themes (from the root and template) (De Roeck & Al-Fares).
• Evaluation on Amharic and Hebrew.

What’s Next: Practical

• Test phonological constraints of the root system.
• Train on what we know to be the roots.

Bibliography 1
Ćavar, Damir, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues & Giancarlo Schrementi. (2004). On Induction of Morphology Grammars and its Role in Bootstrapping. Proceedings of the 9th Conference on Formal Grammar (FGNancy). Nancy, France, August 2004.
Benmamoun, Elabbas. (1999). Arabic Morphology: The Central Role of the Imperfective. Lingua 108: 175-201.
Elghamry, Khaled. (2004). A Constraint-based Algorithm for the Identification of Arabic Roots. Proceedings of the 1st Midwest Computational Linguistics Colloquium.
Elghamry, Khaled & Damir Ćavar. (2004). Bootstrapping Cues for Cue-based Bootstrapping. Manuscript, Indiana University.
Davis, Stuart & Bushra Adnan Zawaydeh. (1999). A Descriptive Analysis of Hypocoristics in Colloquial Arabic. Languages and Linguistics 3: 83-98.
Grünwald, Peter. (1998). The Minimum Description Length Principle and Reasoning under Uncertainty. Ph.D. dissertation, Universiteit van Amsterdam.
Goldsmith, John. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27(2): 153-198.
Ćavar, Damir, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues & Giancarlo Schrementi. (2005). 29th Penn Linguistics Colloquium. (To appear.)
Daya, Ezra, Dan Roth & Shuly Wintner. (2003). Learning Hebrew Roots: Learning with Linguistic Constraints.
Harris, Zellig S. (1955). From Phonemes to Morphemes. Language 31(2): 190-222.

Bibliography 2
Harris, Zellig S. (1961). Structural Linguistics. University of Chicago Press: Chicago. (Published in 1951 under the title Methods in Structural Linguistics.)
Kiraz, George Anton. (2001). Computational Nonlinear Morphology: With Emphasis on Semitic Languages. Studies in Natural Language Processing. Cambridge University Press: Cambridge, U.K.
Mace, John. (1998). Arabic Grammar: A Reference Guide. Edinburgh University Press: Edinburgh.
MacKay, David J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press: Cambridge.
McCarthy, John. (1979). Formal Problems in Semitic Phonology and Morphology. Ph.D. dissertation, MIT.
McOmber, Michael. (1995). Morpheme Edges and Arabic Infixation. In Mushira Eid (ed.), Perspectives on Arabic Linguistics VII: 173-189.
Prunet, Jean-François, Renée Béland & Ali Idrissi. (2000). The Mental Representation of Semitic Words. Linguistic Inquiry 31: 609-648.
Ratcliffe, Robert. (1997). Prosodic Templates in a Word-Based Morphological Analysis of Arabic. In Mushira Eid & Robert Ratcliffe (eds.), Perspectives on Arabic Linguistics X: 147-171.
van Zaanen, Menno M. (2001). Bootstrapping Structure into Language: Alignment-Based Learning. Ph.D. dissertation, The University of Leeds.


								