									Arabic, Farsi and Urdu Text Normalization
for Natural Language Processing

Zina Saadi, Computational Linguist and
Middle Eastern Language Specialist
June 7, 2007


                  Proprietary Information of Basis Technology Corp.

Why Normalize Text?

  Orthographic variations among Arabic, Farsi and Urdu writers cause
  inaccuracies in Natural Language Processing (NLP) applications

  Unicode makes cross-lingual variations possible by assigning separate
  code points to look-alike letters used in different languages

Presentation Outline

  What is Arabic Unicode?

  Arabic Unicode Orthographic Variations
     Ambiguous Variations
     Interchangeable Variations
     Cross-Lingual Variations
     Handling in NLP


  Experiment
     How important is text normalization?


  Summary & Questions


What is Arabic Unicode?
  Arabic Presentation Forms-A
       (U+FB50–U+FDFF)
      Legacy; kept for compatibility only

  Arabic Presentation Forms-B
       (U+FE70–U+FEFF)
      Legacy; kept for compatibility only

   Forms A and B assign one code point
   to each contextual glyph shape

   One-to-one relationship between code point and glyph

  Arabic Unicode (U+0600–U+06FF)
      Currently used

   One code point per logical letter (one-to-many:
   each letter is rendered with several contextual glyphs)

Comparing All Three Types

              Arabic Unicode Variations for Yeh

          Glyph position   Final    Medial   Initial   Isolated
          Form A                    FBFF     FBFE
          Form B           FEF2     FEF4     FEF3      FEF1
          0600-06FF        064A     (Arabic Unicode Yeh)
          0600-06FF        06CC     (Arabic Unicode Farsi Yeh)
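
The relationship between the three representations can be checked with Unicode compatibility normalization (NFKC), which maps each presentation-form code point back to its logical letter. The sketch below uses only Python's standard `unicodedata` module; the function name `fold_presentation_forms` is made up for illustration, and NFKC is one possible way to do this folding, not necessarily the method used in this work. Note that NFKC also rewrites other compatibility characters (for example ligatures), which may or may not be wanted.

```python
import unicodedata

# Code points taken from the Yeh comparison table:
# Form B shapes of Arabic Yeh and Form A shapes of Farsi Yeh.
FORM_B_YEH = ["\uFEF1", "\uFEF2", "\uFEF3", "\uFEF4"]
FORM_A_FARSI_YEH = ["\uFBFE", "\uFBFF"]


def fold_presentation_forms(text: str) -> str:
    """Map presentation-form code points to logical letters using NFKC."""
    return unicodedata.normalize("NFKC", text)


for ch in FORM_B_YEH + FORM_A_FARSI_YEH:
    folded = fold_presentation_forms(ch)
    print(f"U+{ord(ch):04X} -> U+{ord(folded):04X} {unicodedata.name(folded)}")
```

All four Form B shapes fold to U+064A and both Form A shapes fold to U+06CC, so the distinction between Arabic Yeh and Farsi Yeh survives this step and has to be handled separately.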


Arabic Example

  ﺛﻤﺎﻧﻴﺔ ﺁﻻف ﻮﺴﺗﻤﺎﺌﺔ ﻮ ﺧﻤﺴﺔ ﻮ ﺗﺴﻌﻮن ﻛﺗﺎﺑﺎ
      Arabic Presentation Form B

  ثمانية آلاف وستمائة و خمسة و تسعون كتابا
      Unicode Arabic Letter Yeh

  ثمانیة آلاف وستمائة و خمسة و تسعون کتابا
      Unicode Arabic Letter Farsi Yeh

Arabic Unicode Orthographic Variations

     Ambiguous Variants (affect meaning and NLP)

     Interchangeable Variants (do not affect meaning, but affect NLP)

     Cross-Lingual Variants (do not affect meaning, but affect NLP)

     How should such variations be handled in NLP?




Ambiguous Variants
    Arabic Example:

      قال لى على الكردى …
          Written without dots on the Yeh; two readings are possible:

      قال لي على الكردي …
          “He told me about the Kurdish person …”   (Preposition: على)

      قال لي علي الكردي …
          “Ali Al-Kurdi told me …”   (Proper noun: علي, spelling error: على)

      Missing dots on the Yeh change the reading of the sentence

Ambiguous Variants (Cont.)

  They create ambiguity

        Arabic: (Buckwalter 2004)

            Uses of ي vs. ى (Yeh vs. Alif Maqsura)

            Uses of ه vs. ة (Heh vs. Taa Marbuta)

            Initial Alif Variations: (ا أ إ ٱ آ) (plain Alif, Alif with Hamza
            above or below, Alif Wasla, and Alif Madda)
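
Because these pairs change meaning, a system cannot simply collapse them; it can instead generate every candidate spelling and let a lexicon or the context choose. The following is a minimal sketch of that expansion. The function name, and the restriction of Yeh/Heh alternation to word-final position and of Alif alternation to word-initial position, are assumptions made for illustration.

```python
from itertools import product
from typing import List

# Ambiguous pairs: each letter maps to the spellings it may stand for.
FINAL_ALTERNATIVES = {
    "\u064A": "\u064A\u0649",  # Yeh          -> Yeh or Alif Maqsura
    "\u0649": "\u0649\u064A",  # Alif Maqsura -> Alif Maqsura or Yeh
    "\u0647": "\u0647\u0629",  # Heh          -> Heh or Taa Marbuta
    "\u0629": "\u0629\u0647",  # Taa Marbuta  -> Taa Marbuta or Heh
}
# Plain Alif, Hamza above, Hamza below, Wasla, Madda.
ALIF_FORMS = "\u0627\u0623\u0625\u0671\u0622"


def candidate_spellings(word: str) -> List[str]:
    """Return the word itself plus every spelling reachable via the pairs."""
    if not word:
        return [word]
    options = [[ch] for ch in word]
    if word[0] in ALIF_FORMS:
        options[0] = [word[0]] + [a for a in ALIF_FORMS if a != word[0]]
    last = len(word) - 1
    if word[last] in FINAL_ALTERNATIVES:
        options[last] = list(FINAL_ALTERNATIVES[word[last]])
    return ["".join(chars) for chars in product(*options)]


# The undotted spelling yields both the preposition and the proper noun.
print(candidate_spellings("\u0639\u0644\u0649"))
```

The original spelling always comes first in the returned list, so a caller that wants to preserve the writer's form can do so while still looking up the alternatives.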




Interchangeable Variants
  Different users use different spellings
     Uses of ی vs. ئ

         Farsi: معاون رییس جمهور vs. رئیس جمهوری آمریکا (Mehr-News)

         Urdu: عہد شیخ محمد بن زاید آل نہیان vs. خلیفہ بن زائد آل نہیان (Mehr-News)

     Uses of و vs. ؤ

         Arabic: ضيفنا دايفد كيزيوت أو داوود vs. المهندس داؤود يحيى (Al-Jazeera)

         Farsi: با همکاری موسسه آموزشی vs. با همکاری مؤسسه پژوهشی (Mehr-News)

         Urdu: بندگان مومنین پر مہربانی vs. روزہ دار مؤمنین کی اطاعت (Mehr-News)

Cross-Lingual Variations
 Adapted from “Notes on Unicode Arabic usage” (Kew 2005)

 Group  Unicode   Arabic Unicode Name      Languages       Glyphs (isolated, initial, medial, final)
 Kaf    U+0643    Letter Kaf               Arabic          ك  كـ  ـكـ  ـك
 Kaf    U+06AA    Letter Swash Kaf         Sindhi          ڪ  ڪـ  ـڪـ  ـڪ
 Kaf    U+06A9    Letter Keheh             Urdu, Farsi     ک  کـ  ـکـ  ـک
 Heh    U+0647    Letter Heh               Arabic, Farsi   ه  هـ  ـهـ  ـه
 Heh    U+06D5    Letter AE                Kurdish         ە  ـە  (isolated and final only)
 Heh    U+06C1    Letter Heh Goal          Urdu            ہ  ہـ  ـہـ  ـہ
 Heh    U+06BE    Letter Heh Doachashmee   Urdu, Sindhi    ھ  ھـ  ـھـ  ـھ
 Yeh    U+064A    Letter Yeh               Arabic          ي  يـ  ـيـ  ـي
 Yeh    U+0649    Letter Alif Maksura      Arabic          ى  ىـ  ـىـ  ـى
 Yeh    U+06CC    Letter Farsi Yeh         Urdu, Farsi     ی  یـ  ـیـ  ـی

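
A table like this translates naturally into a per-language character map. The sketch below covers two target languages with Python's built-in `str.translate`. The table names, the function name, and the specific target letters are assumptions for illustration; in particular, mapping Farsi Yeh to Yeh for Arabic ignores words where Alif Maksura is intended, and mapping AE or the Urdu Heh letters to Heh is a simplification.

```python
# Cross-lingual letters mapped to the letter used in the target language.
# These choices are illustrative assumptions, not a complete rule set.
CROSS_LINGUAL_MAPS = {
    "ar": str.maketrans({
        "\u06A9": "\u0643",  # Keheh            -> Kaf
        "\u06AA": "\u0643",  # Swash Kaf        -> Kaf
        "\u06C1": "\u0647",  # Heh Goal         -> Heh
        "\u06BE": "\u0647",  # Heh Doachashmee  -> Heh
        "\u06D5": "\u0647",  # AE               -> Heh
        "\u06CC": "\u064A",  # Farsi Yeh        -> Yeh
    }),
    "fa": str.maketrans({
        "\u0643": "\u06A9",  # Kaf              -> Keheh
        "\u06AA": "\u06A9",  # Swash Kaf        -> Keheh
        "\u064A": "\u06CC",  # Yeh              -> Farsi Yeh
        "\u0649": "\u06CC",  # Alif Maksura     -> Farsi Yeh
        "\u06C1": "\u0647",  # Heh Goal         -> Heh
        "\u06BE": "\u0647",  # Heh Doachashmee  -> Heh
    }),
}


def normalize_cross_lingual(text: str, lang: str) -> str:
    """Replace cross-lingual variants with the letters of the target language."""
    return text.translate(CROSS_LINGUAL_MAPS[lang])


# "your book" typed with Keheh becomes the Arabic spelling with Kaf.
print(normalize_cross_lingual("\u06A9\u062A\u0627\u0628\u06A9", "ar"))
```

Other target languages, such as Urdu, would follow the same pattern with their own table.
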
Orthographic Variations Handling in NLP
  Ambiguous Variations:
     Preserve and allow for both possibilities

         على can refer to either على (Preposition) or علي (Proper Noun)
         مدرسه can refer to either مدرسه (his teacher) or مدرسة (school)

  Interchangeable Variations:
     Normalize to the most common variant (Dictionary-Normalization)
     Allow for both variations (Name Matching: بن زاید آل نہیان vs. بن زائد آل نہیان)

  Cross-Lingual Variations:
     Normalize all variations to the character(s) used in the language of the text
     Arabic:
         کتابک and ڪتابڪ are normalized to كتابك (your book)

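
The two strategies for interchangeable variants can be sketched as follows. For name matching, both spellings are folded to a shared key so that either one matches; for dictionary normalization, the variant seen most often is chosen. The function names and the frequency source are hypothetical, and the fold assumes Farsi or Urdu text whose Yeh has already been normalized to U+06CC.

```python
from collections import Counter
from typing import Iterable

# Seated Hamza letters folded onto their plain counterparts (matching only).
SEATED_HAMZA_FOLD = str.maketrans({
    "\u0626": "\u06CC",  # Yeh with Hamza above -> Farsi Yeh
    "\u0624": "\u0648",  # Waw with Hamza above -> Waw
})


def match_key(name: str) -> str:
    """Key under which interchangeable spellings of a name compare equal."""
    return name.translate(SEATED_HAMZA_FOLD)


def same_name(a: str, b: str) -> bool:
    """True if two spellings differ only by interchangeable variants."""
    return match_key(a) == match_key(b)


def most_common_variant(variants: Iterable[str], counts: Counter) -> str:
    """Pick the spelling that occurs most often in the supplied counts."""
    return max(variants, key=lambda v: counts[v])


zayed_plain = "\u0632\u0627\u06CC\u062F"  # Zayed spelled with Farsi Yeh
zayed_hamza = "\u0632\u0627\u0626\u062F"  # Zayed spelled with seated Hamza
print(same_name(zayed_plain, zayed_hamza))
```

Keeping the fold in a separate key, rather than rewriting the text, leaves the original spelling available for display.
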
Questions Asked

  How much Arabic, Farsi and Urdu orthographic variation exists
  in real-world text?

  Will handling variations improve accuracy?

Experiment Overview

  Collected online Arabic, Farsi and Urdu text

  Measured occurrences of letters

  Applied normalization:
     1.   Normalize Arabic Unicode Forms A & B
     2.   Normalize Cross-Lingual Variants
     3.   Normalize Interchangeable Variants
     4.   Allow for Ambiguous Variants

  Dictionary lookup
     A normalized word is only returned if it exists in the dictionary

  Measured occurrences of normalized letters

  Compared results

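
The flow above can be sketched in a few lines: count the tracked letters, normalize each word, keep the normalized form only when the dictionary contains it, and count again. Everything named here is hypothetical, including the toy dictionary and word list. Returning the original word when lookup fails is an assumption, and the single Keheh-to-Kaf rule stands in for the full set of normalization steps.

```python
import unicodedata
from collections import Counter
from typing import Iterable, Set

KEHEH_TO_KAF = str.maketrans({"\u06A9": "\u0643"})  # one rule, for brevity


def normalize(word: str) -> str:
    """Fold presentation forms with NFKC, then map Keheh to Kaf."""
    return unicodedata.normalize("NFKC", word).translate(KEHEH_TO_KAF)


def lookup(word: str, dictionary: Set[str]) -> str:
    """Return the normalized word only if the dictionary contains it."""
    candidate = normalize(word)
    return candidate if candidate in dictionary else word


def letter_counts(words: Iterable[str], tracked: str) -> Counter:
    """Count occurrences of the tracked code points across all words."""
    return Counter(ch for w in words for ch in w if ch in tracked)


# Toy data: "book" typed with Keheh, plus a string that is not a word.
dictionary = {"\u0643\u062A\u0627\u0628"}
words = ["\u06A9\u062A\u0627\u0628", "\u06A9\u06A9"]

before = letter_counts(words, "\u0643\u06A9")
after = letter_counts([lookup(w, dictionary) for w in words], "\u0643\u06A9")
print("pre: ", dict(before))
print("norm:", dict(after))
```

Gating on the dictionary means a variant letter is only rewritten when the result is a known word, which is why some variant counts in a corpus can remain above zero after normalization.
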
Selected Corpora
  News-Content Corpora (March 2007)
     Deutsche Welle (Germany)
     Mehr News (Iran)

        Arabic (250MB)
        Farsi (250MB)
        Urdu (150MB)

Experiment Hypotheses
  Since the data is recent
       => no Arabic Unicode presentation forms A or B

  Arabic: more Perso-Arabic cross-lingual variations
     in Mehr-News (Iranian source) than in DW (German source)

  Fewer cross-lingual variations
     in online news (Mehr-News) than in DW (originally radio news)

  Handling orthographic variations
       => increased dictionary-lookup accuracy?

Arabic Findings
  Ambiguous Variants:
     Initial (أ إ ٱ آ) vs. initial ا
     ي vs. ى
     ه vs. ة

  Interchangeable Variants:
     Allowed a second variant for seated Hamza

  Normalized all cross-lingual and forms A and B variants

            Mehr-News            DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643     124087    139253     157499    157501     ك
  06A9      15166         0          2         0     ک
  0647     341730    224357     185050    166146     ه
  06C1          0         0          0         0     ہ
  06BE          0         0          0         0     ھ
  064A     729320    859204     722751    726144     ي
  0649      49917     75847      64012     60619     ى
  06CC     155829         1          8         0     ی
  0626      72256     72270      35034     35042     ئ
  0627    1953453   1972314    1374428   1489435     ا
  0623      30552     19741     181004     70091     أ
  0625       6765      6765      57679     57679     إ
  0622      11663      3589       7071      2961     آ

Farsi Findings
  Interchangeable Variants:
     Normalized:
         Initial (أ إ ٱ) to initial ا
         Final ة to ه or ت
     Allowed a second variant:
         For ئ vs. ی (06CC)
         For ؤ vs. و

  Normalized all cross-lingual and forms A and B variants

            Mehr-News            DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643          0         0     244964        37     ك
  06A9     727461    727461     164909    409834     ک
  0647    1819302   1821741     940097    940099     ه
  06BE          0         0          0         0     ھ
  064A      25096         0     597317       223     ي
  0649       2446         0     381692        42     ى
  06CC    3147172   3175414     517649   1497030     ی
  0626      71914     71214      24626     23971     ئ
  0627    4723101   4728545    2193525   2196343     ا
  0623       5444         1       2824         0     أ
  0625          1         0          0         0     إ
  0622     202245    202245     135544    135544     آ

Urdu Findings
  Interchangeable Variants:
     Normalized:
         Initial (أ إ ٱ) to initial ا
         Final ة to ه or ت
     Allowed a second variant:
         For ئ vs. ی (06CC)
         For ؤ vs. و

  Normalized all cross-lingual and forms A and B variants

            Mehr-News            DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643     111947        50        138         0     ك
  06A9      78421    237741      66911     67049     ک
  0647       6608         1        141         0     ه
  06C1     145239    181415      41134     41219     ہ
  06BE      26902     34956      16058     16114     ھ
  064A     175111       204        219         0     ي
  0649          6         1          0         0     ى
  06CC     140550    405744     108316    110623     ی
  0626      31409     36095      14056     11968     ئ
  0627     404576    508836     123346    123346     ا
  0623          3         3          0         0     أ
  0625          1         1          0         0     إ
  0622      10243     13276       7244      7244     آ

Overall Experiment Findings
  Hypothesis: “no Arabic Unicode presentation forms A or B”
     Finding: forms A and B did occur, with higher occurrences in Urdu text (Mehr-News)

Overall Experiment Findings (Cont.)

  Arabic: more Perso-Arabic cross-lingual variations
     in Mehr-News (Iranian source) than in DW (German source)

            AR Mehr-News         AR DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643     124087    139253     157499    157501     ك
  06A9      15166         0          2         0     ک
  064A     729320    859204     722751    726144     ي
  0649      49917     75847      64012     60619     ى
  06CC     155829         1          8         0     ی

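
The size of the difference between the two Arabic sources can be read directly off the pre-normalization counts. The short calculation below computes, for each source, the share of Kaf-like and Yeh-like letters that were typed with the Perso-Arabic code point. The counts are copied from the Arabic table; the grouping of letters and the function name are illustrative choices.

```python
# Pre-normalization counts from the Arabic findings, keyed by code point.
PRE = {
    "AR Mehr-News": {"0643": 124087, "06A9": 15166,
                     "064A": 729320, "0649": 49917, "06CC": 155829},
    "AR DW-News": {"0643": 157499, "06A9": 2,
                   "064A": 722751, "0649": 64012, "06CC": 8},
}


def cross_lingual_share(counts, native, foreign):
    """Fraction of a letter group written with the non-Arabic code point."""
    total = sum(counts[code] for code in native) + counts[foreign]
    return counts[foreign] / total


for source, counts in PRE.items():
    kaf = cross_lingual_share(counts, ["0643"], "06A9")
    yeh = cross_lingual_share(counts, ["064A", "0649"], "06CC")
    print(f"{source}: Keheh {kaf:.4%} of Kaf letters, "
          f"Farsi Yeh {yeh:.4%} of Yeh letters")
```

In Mehr-News roughly 10.9% of Kaf-like letters were Keheh, against a negligible share in DW-News. The Mehr-News Kaf and Keheh counts (124087 + 15166) also sum exactly to the normalized Kaf count of 139253.
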
Overall Experiment Findings (Cont.)
  Hypothesis: “Fewer cross-lingual variations in Mehr-News”
     Arabic and Urdu: Mehr-News contained more
     cross-lingual variants than DW News
     Farsi: Mehr-News contained fewer
     cross-lingual variants than DW News

            AR Mehr-News         AR DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643     124087    139253     157499    157501     ك
  06A9      15166         0          2         0     ک
  064A     729320    859204     722751    726144     ي
  0649      49917     75847      64012     60619     ى
  06CC     155829         1          8         0     ی

            FA Mehr-News         FA DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643          0         0     244964        37     ك
  06A9     727461    727461     164909    409834     ک
  064A      25096         0     597317       223     ي
  0649       2446         0     381692        42     ى
  06CC    3147172   3175414     517649   1497030     ی

            UR Mehr-News         UR DW-News
  Code      Pre       Norm       Pre       Norm      Char
  0643     111947        50        138         0     ك
  06A9      78421    237741      66911     67049     ک
  0647       6608         1        141         0     ه
  06C1     145239    181415      41134     41219     ہ
  06BE      26902     34956      16058     16114     ھ
  064A     175111       204        219         0     ي
  06CC     140550    405744     108316    110623     ی

Overall Experiment Findings (Cont.)
  Handling ambiguous, interchangeable and cross-lingual variants
      => Errors reduced: (Arabic: 92%), (Farsi: 71%), (Urdu: 79%)

  Arabic text contains more orthographic variants:

     Unicode added Farsi- and Urdu-specific characters
     Arabic is the 5th most widely spoken language
     Arabic is taught in most Islamic countries
       => Writers with multiple language backgrounds

  (Figure labels: Urdu, Farsi, Arabic Characters)

Summary

  Orthographic variations in Arabic affect accuracy in NLP
  Cross-lingual variations are possible because:
     Unicode allows for unique mapping of characters in many languages
     Keyboard layouts allow for multi-mapping
         Example: Yeh representation on a Dari keyboard

  Normalization and generating possible variants increase accuracy

Normalization Applications at Basis
  Arabic Editor
  Rosette Name Technology
  Rosette Linguistics Platform (RLP)




References
  Tim Buckwalter (2004). Issues in Arabic Orthography and Morphology
     Analysis. Proceedings of the Workshop on Computational Approaches to
     Arabic Script-based Languages, COLING 2004, Geneva, August 28, 2004.
  Peter Constable (2000). Unicode Character Encoding of Archived Linguistic
     Data. Paper presented at the Workshop on Web-Based Language
     Documentation and Description, Philadelphia, December 12-15, 2000.
  S. Hussain (2004). Letter to Sound Rules for Urdu Text to Speech System.
     Proceedings of the Workshop on Computational Approaches to Arabic
     Script-based Languages, COLING 2004, Geneva, Switzerland.
  Richard Ishida (2004). Urdu script notes. Retrieved on March 20, 2007
     from: http://people.w3.org/rishida/scripts/urdu/urdu-in-unicode.html
  Jonathan Kew (2005). Notes on some Unicode Arabic characters:
     recommendations for usage, 2005-04-21.
  Zina Saadi (2007). Orthographic Unicode Variations in Arabic: A Case Study
     of Character Occurrences in News Corpora. 21st Arabic Linguistic
     Symposium. Provo, Utah.
  The Unicode Standard, version 5.0.0.
  Shah Mahmood Ghazi Watt (2003). Computer Locale Requirements for
     Afghanistan. UNDP, 2003. Retrieved on March 14, 2007 from:
     http://www.evertype.com/standards/af/af-locales.pdf

Questions

            Thank You!    شكراً    متشكّرم

                 zina@basistech.com