					   A Rule-Based Morphological Analyzer of Arabic Words

                                 ARAFAT AWAJAN
                             Computer Science Department
                   Princess Sumaya University College for Technology
                                Royal Scientific Society
                                 Amman - JORDAN



Abstract: - This paper describes a rule-based technique for analyzing the morphology of
Arabic words. The proposed 'Morphological Analyzer' processes the input word in order to
determine its lexical form. The lexical form of the majority of Arabic words consists of a root
and a morphological pattern. The analyzer applies a set of predefined rules in order to analyze
the morphology of Arabic words as they appear in real text. It is able to recognize
diacriticized, undiacriticized or partially diacriticized Arabic words generated from N-letter
roots. In order to determine the possible meanings of a word, the Morphological Analyzer
also provides some useful attributes of the word such as its type, gender, tense and number.
The proposed Morphological Analyzer is a general-purpose technique that can be integrated
into larger scale systems such as automatic translation applications, text summarization
applications, text correction applications, web search engines, automatic vowelization of
Arabic text applications and other natural language processing applications.

Keywords: - Natural Language Processing, Arabic Word Recognition, Lexical Form, Roots,
Morphological Patterns, Morphological Analyzer.


1. Introduction
For the Arabic language, as for many other languages, the morphological features of a word
provide crucial information for understanding text and extracting information. In fact, the
possible meanings of individual words depend mainly on their morphology and their position
in a sentence. Therefore, the possible meanings of a word must be determined first in order
to understand text written in a natural language.
A number of research papers concerning the morphological analysis of words have been
published for various natural languages, particularly the European languages [2] [5] [6] [8]
[9]. Descriptions of real systems for analyzing the morphology of these languages are also
available [7]. These works show that the complexity of the morphological analysis of words
varies from one natural language to another. Fewer articles and research papers have been
published on the morphological analysis of Arabic words. Most of these published works
ignore the presence of diacritics in the Arabic text or limit the analysis to words
generated from 3-letter roots [1] [3] [4].
The rule-based Morphological Analyzer presented in this paper has the objective of finding
the lexical form and the possible meanings of each word in a text written in Arabic. The
proposed analyzer is being developed to analyze Arabic words as they appear in real text. It
can be applied to diacriticized, undiacriticized or partially diacriticized Arabic words.
Furthermore, it allows the morphological analysis of words generated from roots of variable
length.
Our approach is based on the use of the specific features and structures that the Arabic
language uses for generating words. It applies a set of predefined rules specific to the
Arabic language in order to extract the lexical structure of the word, which generally
consists of a root and a morphological pattern. A classical lexicon is used to verify the
correctness of the analyzed word and to determine the meanings it could take.
The morphological information that our technique is able to extract gives vital support to
the different fields and applications of Natural Language Processing. The purpose of the
Morphological Analyzer is to preprocess Arabic text in order to prepare it for automated
treatment such as human-machine interaction, translation, text summarization, text
correction, automatic vowelization of Arabic text, web search engines and other applications
of natural language processing.

2. The Morphology of Arabic Words
In many European languages words are constructed from basic units called morphemes by adding
suffixes and prefixes. A morpheme is the primitive unit of meaning in a language. For
example, the meaning of the English word 'friendly' is derivable from the meaning of the
noun 'friend' and the suffix '-ly' that transforms a noun into an adjective [3]. In such
cases the morphological analysis is based on the elimination of affixes and the extraction
of the basic morpheme of a word. Special treatment is always needed to deal with the
irregularities present in almost all natural languages.
The morphology of the Arabic language is based on the Semitic root-and-pattern scheme of
forming words. Therefore, the majority of words are generated from basic entities called
roots or radicals according to a predefined list of patterns called morphological balances
or patterns [1] [3] [4]. The roots are constructed mainly from 3 letters, although 4- and
5-letter roots exist too. The morphological patterns represent the major spelling rules of
Arabic words. This mechanism of Arabic word generation is called 'AL-ISHTIQAQ'. It is
performed by adding letters and/or diacritical marks to the roots. These additional letters
and diacritical marks may be added at the beginning, in the middle or at the end of the
root. In this paper, a morphological pattern is represented by the additional parts, their
positions and the slots where the letters of a root can be inserted. The character "*"
represents the slots of the root's letters. Figure 1 contains examples that illustrate the
"AL-ISHTIQAQ" mechanism; it presents words generated from the same root "K T B" according to
different morphological patterns. It is important to note the role that diacritics play in
fixing the meaning of the first and second words of Figure 1.

                    Example of words generated from the same root
                                "K T B" ( ك ت ب )

    The generated words     Their meaning in English     Morphological pattern used
                                                         for building the word
    كَتَبَ                    (He) wrote                   *َ*َ*َ
    كُتِبَ                    (It is) written              *ُ*ِ*َ
    كتاب                    Book
    كاتب                    Writer
                            (They are) writing


                         Figure 1. AL-ISHTIQAQ Mechanism
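
The slot-filling idea behind Figure 1 can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the helper name apply_pattern is hypothetical, and the three pattern strings are assumptions written out from the standard spelling of the words "(he) wrote", "(it is) written" and "book".

```python
# Illustrative sketch: fill the '*' slots of a pattern with the letters of a root.
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"  # the three short-vowel marks
ALEF = "\u0627"
KTB = "\u0643\u062a\u0628"  # the root K T B


def apply_pattern(root: str, pattern: str) -> str:
    """Replace each '*' in the pattern with the next root letter, in order."""
    if pattern.count("*") != len(root):
        raise ValueError("the pattern needs exactly one slot per root letter")
    letters = iter(root)
    return "".join(next(letters) if ch == "*" else ch for ch in pattern)


# Assumed patterns (hypothetical encodings of the patterns in Figure 1).
wrote = apply_pattern(KTB, "*" + FATHA + "*" + FATHA + "*" + FATHA)
written = apply_pattern(KTB, "*" + DAMMA + "*" + KASRA + "*" + FATHA)
book = apply_pattern(KTB, "*" + KASRA + "*" + FATHA + ALEF + "*")

# 'wrote' and 'written' share their consonants and differ only in diacritics.
print(wrote, written, book)
```

The same root yields three different words, and the first two can only be told apart by their diacritical marks, which is the point the text makes about Figure 1.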

All classes of words (verbs, nouns, adjectives and adverbs) can be generated from roots
according to the appropriate patterns. The pattern used for generating a word determines its
various attributes such as gender (masculine/feminine), number (singular/plural), tense
(past, present, imperative), mood, etc. Figure 2 presents an example that shows the
importance of the standard Arabic morphological patterns in fixing the meaning of a word.
Based on the above, an Arabic word can be represented lexically by its root, along with its
morphological pattern. The latter belongs to a finite set of limited size. A pattern is
defined by a set of additive letters and/or a set of diacritical marks and their positions
in the generated word.

3. Arabic Language Features and Challenges
The formation of Arabic words presents specific features and challenges that must be taken
into consideration when fixing the rules used by the morphological analyzer. The first
challenge is that some letters of the root may be dropped or modified during the generation
of words from roots. The analyzer has to rebuild the original root letters by retrieving the
missing or modified letters of the word.
The second challenge is the presence of eight different types of diacritical marks, used
chiefly to represent short vowels. In written text they are considered as special letters
where each one is assigned a single code, as with normal letters. In fully diacriticized
text a diacritical mark is added after each consonant of the word. These diacritical marks
play a very important role in fixing the meaning of words. In fact, two different patterns
may have the same sequence of consonants, with one distinguished from the other solely by
the diacritical marks. These marks are classified into the following categories [1]:
      • Three diacritical marks to indicate the short vowels ( َ ُ ِ ),
      • Three double diacritical marks which combine the single ones ( ً ٌ ٍ ),
      • A single diacritical mark to indicate the absence of vowelization ( ْ ),
      • A single diacritical mark to indicate the duplicate occurrence of a consonant ( ّ ).
According to the extent to which diacritics are used, Arabic text may be classified into
three different categories: undiacriticized, partially diacriticized, and fully
diacriticized text. The first category represents text without diacritics such as typed or
printed text and newspapers. The second category represents partially diacriticized text
where diacritical marks are added to eliminate the ambiguities of some words. The last
category represents fully diacriticized Arabic text, in which every consonant is followed by
a diacritical mark. Such a format is used for writing the Holy Koran, classic Arabic
literature and children's educational books.
The third challenge is that not all the words in Arabic text are generated from a root. For
example, some words such as the tools (function words) and foreign words cannot be broken
down into a root and pattern. As the number of tools is limited, a table of these predefined
tools can be used to check whether a word is a tool or not before sending it to the
analyzer. Meanwhile, the 'loan' or foreign words are listed in the lexicon and need not
undergo morphological analysis.
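
The three categories of text described in this section can be told apart mechanically. The following Python sketch is an illustration under two stated assumptions: the eight diacritical marks are taken to be the Unicode code points U+064B to U+0652, and every other character is treated as a consonant, following the paper's own convention. The function name is hypothetical.

```python
# Assumption: the eight diacritical marks are the Unicode range U+064B..U+0652.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def diacritization_level(word: str) -> str:
    """Classify a word as undiacriticized, partially or fully diacriticized."""
    consonants = 0
    marked = 0  # consonants immediately followed by a diacritical mark
    for i, ch in enumerate(word):
        if ch in DIACRITICS:
            continue
        consonants += 1
        if i + 1 < len(word) and word[i + 1] in DIACRITICS:
            marked += 1
    if marked == 0:
        return "undiacriticized"
    if marked == consonants:
        return "fully diacriticized"
    return "partially diacriticized"


bare = "\u0643\u062a\u0628"                    # K T B with no marks
partial = "\u0643\u064f\u062a\u0628"           # a mark on the first letter only
full = "\u0643\u064e\u062a\u064e\u0628\u064e"  # a mark after every letter

for w in (bare, partial, full):
    print(diacritization_level(w))
```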
have the same sequence of consonants, but


  The word (              ) is generated by the root play (         ) according
  to the pattern (                 ). This pattern indicates that the word is a
  noun, its gender is masculine, and it is plural.
  The final meaning will be players: (play: noun; plural; masculine)

   Figure 2. Role of the Morphological Pattern of an Arabic Word in Fixing its
                                    Meaning




4. The Morphological Analyzer
The Morphological Analyzer of Arabic words (MAAW) processes each word of the input text in
order to determine its root and pattern. The results of the morphological analyzer can be
used for further analysis. Figure 3 presents these transformations schematically.
The identification of the morphological structure of a word depends on a rule-based system
that can find the morphological pattern for diacriticized or undiacriticized words. To
achieve this, we assume that a diacritic follows each letter of the word. If a diacritic is
omitted, it is replaced by a special character (EXTRA-SECOUN) that we introduce to stand for
the absent diacritic. This diacritic (EXTRA-SECOUN) is denoted by a dot in the examples of
this paper.
A procedure 'Check_Diacritics' takes the list of characters forming the word and checks for
the presence of a diacritic after each consonant. Wherever a consonant is not followed by a
diacritical mark, it inserts the mark EXTRA-SECOUN. A word is then represented by a list of
characters L in the following format:

        [C1 V1 C2 V2 . . . Cn Vn]

where Ci is a consonant and Vi is a diacritical mark. Each of the classical patterns is also
represented by a list of the same structure, where the slots of the root's letters are
marked by the character '*'. Figure 4 shows an example of a classical pattern
representation.
To deal with the three possible situations of Arabic text (fully diacriticized, partially
diacriticized and undiacriticized text), the list L is further divided into two new lists.
The first list LC contains the sequence of consonants [C1, C2, . . . Cn] and the second list
LV contains the diacritical characters [V1, V2, . . . Vn]. Table 1 shows examples of the
segmentation of words into consonants and diacritics. The three examples given in Table 1
share the same list of consonants LC.
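
The splitting of a word into the lists LC and LV, with a dot standing in for every missing diacritic, can be sketched iteratively as follows. This is an assumption-laden illustration rather than the paper's code: the Unicode range used for the diacritical marks and the names decompose and EXTRA_SECOUN are choices made for this example.

```python
# Assumption: the eight diacritical marks are the Unicode range U+064B..U+0652.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}
EXTRA_SECOUN = "."  # placeholder for an absent diacritic


def decompose(word):
    """Split a word into (LC, LV): its consonants and its diacritics."""
    lc, lv = [], []
    awaiting = False  # True while the last consonant has no diacritic yet
    for ch in word:
        if ch in DIACRITICS:
            lv.append(ch)
            awaiting = False
        else:
            if awaiting:  # two consecutive consonants
                lv.append(EXTRA_SECOUN)
            lc.append(ch)
            awaiting = True
    if awaiting:  # the last consonant is not followed by a diacritic
        lv.append(EXTRA_SECOUN)
    return lc, lv


K, T, B, DAMMA = "\u0643", "\u062a", "\u0628", "\u064f"
lc_partial, lv_partial = decompose(K + DAMMA + T + B)  # partially diacriticized
lc_bare, lv_bare = decompose(K + T + B)                # undiacriticized
```

Both calls return the same list of consonants, as in Table 1; only the list of diacritics differs, with dots filling the positions that carry no mark.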



                                          Original Text


                              Morphological Analyzer : MAAW


                                  Morphological Features


                          Further Analysis (NLP Applications)


                            Figure 3. Morphological Analyzer


               The pattern:                “               “

               Its corresponding list : [                            ].


                             Figure 4. Pattern Representation




      Word          Word Class                      List of               List of Diacritics
                                                    Consonants
                    Fully diacriticized             [          ]          [            ]
                    Partially diacriticized         [          ]          [ . . . ]
                    Undiacriticized                 [          ]          [ . . . . . .]

     Table 1. Decomposition of Words into a List of Consonants and a List of
                                  Diacritics

The list of consonants (LC) represents the letters of the word's root, and the suffixes,
infixes and prefixes used to form the word according to a given pattern. In order to extract
the root of a word, the list LC can be represented by the following general description:

[X1[X2[X3]]] R1 [Y1] R2 [Y2] R3 [[Y3] R4 [[Y4] R5]] [Z1[Z2[Z3]]]

where the components X1X2X3 represent a prefix of at most three letters, the components
Z1Z2Z3 represent a postfix of at most three letters, and the components Y1Y2Y3Y4 represent
the possible infixes, at most four letters in total. The slots R1, R2, R3, R4, and R5
represent the letters of the root used to generate the word. The characters [ ] are used
here to indicate that the enclosed component is optional. This representation allows us to
handle all kinds of roots (3-letter, 4-letter and 5-letter roots). Table 2 gives examples of
the above representation. The first two words are generated from two different three-letter
roots according to the same morphological pattern; they share the same additive parts
(prefix, infix and postfix). The last three words are generated from the same root according
to different patterns.
The morphological patterns will also be segmented into two lists: LC and LV. For example,
the pattern presented above in Figure 4 can be broken down into two lists: a list of
consonants (LC) and a list of diacritical marks (LV) [Figure 5]. The separation of
consonants and diacritics significantly reduces the number of patterns to be tested.
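
The general description of a consonant list (a prefix of up to three letters, three to five root slots separated by at most one infix letter each, and a postfix of up to three letters) can be checked mechanically. The sketch below is illustrative only; the function name split_pattern and the two sample patterns are assumptions made for this example.

```python
ALEF, MEEM, TEH = "\u0627", "\u0645", "\u062a"


def split_pattern(plc):
    """Split a pattern consonant list into (prefix, infixes, postfix).

    '*' marks a root slot. Raises ValueError when the list does not fit the
    general description: 3 to 5 slots, prefix and postfix of at most three
    letters, and at most one infix letter between two consecutive slots.
    """
    slots = [i for i, ch in enumerate(plc) if ch == "*"]
    if not 3 <= len(slots) <= 5:
        raise ValueError("a pattern needs 3 to 5 root slots")
    prefix = plc[:slots[0]]
    postfix = plc[slots[-1] + 1:]
    infixes = [plc[a + 1:b] for a, b in zip(slots, slots[1:])]
    if len(prefix) > 3 or len(postfix) > 3 or any(len(x) > 1 for x in infixes):
        raise ValueError("affix lengths exceed the general description")
    return prefix, infixes, postfix


# Hypothetical sample patterns: one infix only, then a prefix and a postfix.
infix_only = split_pattern(["*", ALEF, "*", "*"])
prefix_postfix = split_pattern([MEEM, "*", "*", "*", ALEF, TEH])
```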

    Input                List of              Root            Prefix         Infix       Postfix
    Word               Consonants           R1R2R3           X1X2X3          Y1 Y2       Z1Z2Z3
                  [                     ]   [       ]         [      ]        [ ]         [     ]
                   [                   ]    [      ]          [      ]        [ ]         [     ]
                       [           ]        [      ]             [ ]          [ ]         [     ]
                       [           ]        [      ]            [ ]          [ ]            [ ]

                   Table 2. Decomposition of the List of Consonants


  The pattern:                “                 “

  List of consonant (including the slots for root) : [                                    ].

  List of Diacritical marks:                                   [                          ].


   Figure 5. Decomposition of Patterns Into a List of Consonants and a List of
                                   Diacritics



The LC list of a given pattern will be represented by the following structure:

[X1[X2[X3]]] * [Y1] * [Y2] * [[Y3] * [[Y4] *]] [Z1[Z2[Z3]]]

The characters '*' represent slots where consonants can be inserted to form a real word.
Table 3 shows examples of the representation of morphological patterns using this schema. A
comparison of Tables 2 and 3 shows that the words of Table 2 share the same prefix, infix
and postfix parts with the patterns of Table 3, which means that the words of Table 2 are
generated according to the corresponding patterns of Table 3.
Morphological patterns can be grouped into classes according to their list of consonants.
Patterns of the same class share the same list of consonants and differ from one another in
their list of diacritical marks. Table 4 shows an example of three different patterns of the
same class; these patterns have the same list LC and different lists LV. The set of patterns
will be represented by the set of consonant lists LC, where we associate with each entry all
the possible and correct combinations of diacritical marks LV. The pair LC and LV determines
the morphology of the word.

5. Components of the Morphological Analyzer
The main components of the proposed Morphological Analyzer (MAAW) are shown in Figure 8. It
has three analytical components: the 'rules' component, the 'lexical' component, and the
'patterns' component.
First, the rules component consists of an engine containing the rules used to extract
diacritics, and the rules used to extract the patterns and roots. Second, the pattern
component lists the standard patterns, where we associate with each entry all the possible
and acceptable configurations of diacritical marks; the number of configurations is limited
to a maximum of 5. Third, the lexicon has a classical form and lists the roots of the Arabic
language, and for each root the possible patterns that can be applied to generate words from
the root. The lexicon is used to verify the correctness of the analysis performed by the
other components of MAAW. If the correctness of the word is verified, the extracted root,
pattern and list of diacritics will be used by the lexicon to identify its possible
meanings.
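
The grouping of patterns into classes, where each list of consonants is associated with its acceptable lists of diacritical marks, maps naturally onto a dictionary keyed by the consonant list. The following Python sketch is an illustration only: the function name and the four sample patterns are assumptions, and the diacritic list given for the fourth pattern is a simplified placeholder.

```python
from collections import defaultdict

FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
ALEF = "\u0627"


def group_by_consonants(patterns):
    """Map each consonant list LC to the list of its diacritic lists LV."""
    classes = defaultdict(list)
    for lc, lv in patterns:
        classes[tuple(lc)].append(tuple(lv))
    return dict(classes)


# Hypothetical sample patterns, given as (LC, LV) pairs.
PATTERNS = [
    (["*", "*", "*"], [FATHA, FATHA, FATHA]),
    (["*", "*", "*"], [DAMMA, KASRA, FATHA]),
    (["*", "*", "*"], [FATHA, KASRA, FATHA]),
    (["*", "*", ALEF, "*"], [KASRA, FATHA, ".", "."]),  # simplified placeholder LV
]

classes = group_by_consonants(PATTERNS)
for lc, lvs in classes.items():
    print(len(lc), "consonant positions,", len(lvs), "diacritic configurations")
```

With this layout a word's consonant list selects one class in a single lookup, and only the few diacritic configurations of that class remain to be compared.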


       Pattern             The List PLC             Prefix          Infix         Postfix
                                                   X1X2X3           Y1 Y2         Z1Z2Z3
                          [                ]        [      ]         [ ]           [     ]
                           [              ]            [ ]           [ ]           [     ]
                            [            ]            [ ]           [ ]              [ ]

                     Table 3. Examples of Pattern Representation

        Pattern                 List of Consonants             List of Diacritical Marks
                                        LC                                LV
                                   [          ]                       [           ]
                                   [          ]                       [            ]
                                   [          ]                       [            ]

        Table 4. Grouping Patterns According to Their List of Consonants




The recursive rule 'Decompose' performs the decomposition of the word into two lists: one
for consonants, LC, and the second for diacritics, LV. Decompose calls another rule, 'In',
which returns TRUE only if the character H is one of the diacritical marks of the Arabic
language. Decompose marks the absence of a diacritic by adding the EXTRA-SECOUN mark (a
dot). The recursive rule Decompose can be described by the following Prolog-style code
[Figure 6].
The identification of the root and pattern is carried out by a recursive procedure 'Match'
that takes the list LC of the input word and returns the list PLC of consonants of the
pattern and the root ROOT. The Prolog-style description of the recursive rule Match is
presented in [Figure 7].
Applying the rules relating the pattern to the slots of the root letters identifies the root
of the word. The rule 'FindSlots' returns a list of integers giving the positions of the
letters of the root in the given pattern. It is necessary to use two separate rules to
identify the root, to accommodate cases where one of the letters of a root is dropped or
changed.
The lexical component of the system receives the results of the preceding analysis: the
root, the list of consonants of the pattern PLC and the list of diacritics of the word LV.
The lists LV and PLC determine the pattern of the analyzed word. The lexical component then
verifies the correctness of the word. An Arabic word is lexically correct if its root is an
entry of the lexicon and its pattern is among the acceptable patterns of this root. If the
word is correct, its meaning or meanings are provided by the lexicon.
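
The lexical check described above (the root must be an entry of the lexicon, and the pattern must be acceptable for that root) can be sketched with a nested dictionary. The lexicon layout, the function name lookup and the two toy entries are assumptions made for illustration; the paper does not specify how its lexicon is stored.

```python
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
KTB = ("\u0643", "\u062a", "\u0628")  # the root K T B

# Toy lexicon (hypothetical): root -> {(PLC, LV): meaning}.
LEXICON = {
    KTB: {
        (("*", "*", "*"), (FATHA, FATHA, FATHA)): "(he) wrote",
        (("*", "*", "*"), (DAMMA, KASRA, FATHA)): "(it is) written",
    },
}


def lookup(lexicon, root, plc, lv):
    """Return the meaning if the word is lexically correct, else None."""
    patterns = lexicon.get(tuple(root))
    if patterns is None:  # the root is not an entry of the lexicon
        return None
    return patterns.get((tuple(plc), tuple(lv)))  # None if pattern not accepted


meaning = lookup(LEXICON, KTB, ["*", "*", "*"], [DAMMA, KASRA, FATHA])
rejected = lookup(LEXICON, KTB, ["*", "*", "*"], [KASRA, KASRA, KASRA])
unknown_root = lookup(LEXICON, ("x", "y", "z"), ["*", "*", "*"], [FATHA] * 3)
```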

% Predicate names are written in lower case, as Prolog requires:
% decompose is the rule Decompose and in is the rule In.
% decompose(+Word, -LC, -LV, +Flag): Flag is true while the last consonant read
% has not yet received a diacritic. The first call uses Flag = false.
   decompose([], [], [], false).            % Base case: the decomposition is finished
   decompose([], [], ['.'], true).          % The last consonant is not followed by a diacritic
   decompose([H|T], LC, [H|LV], _) :-       % Detection of a diacritic
       in(H),
       decompose(T, LC, LV, false).
   decompose([H|T], [H|LC], LV, false) :-   % Detection of a consonant at the first call
       \+ in(H),                            % of the rule or after detection of a diacritic
       decompose(T, LC, LV, true).
   decompose([H|T], LC, ['.'|LV], true) :-  % Detection of 2 consecutive consonants
       \+ in(H),
       decompose([H|T], LC, LV, false).
% in(+H): H is a diacritical mark (assumption: Unicode code points U+064B to U+0652).
   in(H) :- char_code(H, Code), Code >= 0x064B, Code =< 0x0652.

                        Figure 6. The Recursive Rule Decompose

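% Predicate names are written in lower case, as Prolog requires
% (match is the rule Match, find_slots is FindSlots, and so on).
match(LC, PLC, Root) :-
    find_pattern(LC, PLC),
    find_slots(PLC, Slots),
    find_root(LC, Slots, Root, 1).
% find_pattern(+LC, ?PLC): PLC agrees with LC everywhere except at the root slots '*'.
find_pattern([], []).
find_pattern([_|Tail1], ['*'|Tail2]) :- find_pattern(Tail1, Tail2).
find_pattern([Head|Tail1], [Head|Tail2]) :- find_pattern(Tail1, Tail2).
% find_slots(+PLC, -Slots): positions (counted from 1) of the slots '*' in PLC.
% Not given in the original figure; sketched here from its description in the text.
find_slots(PLC, Slots) :- findall(I, nth1(I, PLC, '*'), Slots).
% find_root(+LC, +Slots, -Root, +Pos): collect the letters of LC found at the slot positions.
find_root(_, [], [], _).
find_root([Head|Tail1], [Pos|T], [Head|Tail2], Pos) :-
    Next is Pos + 1,
    find_root(Tail1, T, Tail2, Next).
find_root([_|Tail1], [S|T], Root, Pos) :-
    S =\= Pos,
    Next is Pos + 1,
    find_root(Tail1, [S|T], Root, Next).

                          Figure 7. The Recursive Rule Match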
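
For readers less familiar with Prolog, the matching step can also be sketched in Python. This is an illustrative reading of the rule, not the paper's implementation: it tries each candidate pattern in turn, assumes the candidate list is supplied by the patterns component, and does not handle roots whose letters were dropped or changed. All names and the three sample patterns are hypothetical.

```python
K, T, B = "\u0643", "\u062a", "\u0628"
ALEF, MEEM = "\u0627", "\u0645"


def find_root(lc, plc):
    """Return the root letters of lc if it fits pattern plc, else None."""
    if len(lc) != len(plc):
        return None
    root = []
    for letter, slot in zip(lc, plc):
        if slot == "*":
            root.append(letter)  # a slot takes whatever letter is there
        elif slot != letter:
            return None          # an additive letter must match exactly
    return root


def match(lc, patterns):
    """Return every (pattern, root) pair whose pattern fits lc."""
    results = []
    for plc in patterns:
        root = find_root(lc, plc)
        if root is not None:
            results.append((plc, root))
    return results


# Hypothetical candidate patterns.
CANDIDATES = [
    ["*", ALEF, "*", "*"],
    ["*", "*", ALEF, "*"],
    [MEEM, "*", "*", "*"],
]
found = match([K, ALEF, T, B], CANDIDATES)
```

When several patterns fit the same list of consonants, all of them are returned, and the lexicon check decides which analyses are acceptable.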



                        Input word

                    Decomposition Rules

                          List LC

                      Matching Rules

                          LEXICON

                          RESULTS
                     Valid Word? Yes
                     Meaning:
                     Root ( ع ل م ) → Teach
                     Pattern ( م * * * ت ) → (noun, plural, feminine, subject)
                     Final Meaning: Teachers (Feminine)



               Figure 8. Components of MAAW




6. Experiments
The Morphological Analyzer proposed in this paper can be applied in different ways. It can
be used as an independent system or as a part of an NLP application. The output of the
Analyzer determines whether the word is generated from a root or not. If it is generated
from a root according to a pattern, this information may be sent to a lexicon to extract the
meaning or meanings of the word. The richness of the lexicon determines the ultimate
performance of the complete system.
Figure 9 gives a complete example of the application of the rules of MAAW. The proposed
Analyzer gives accurate results for fully diacriticized Arabic text. When diacritical marks
are absent, it produces a list of consonants that may correspond to more than one
morphological pattern. In these cases, determining the definitive list of diacritics that
replaces the missing ones requires advanced syntactical analysis.

7. Conclusion
The Morphological Analyzer of Arabic words presented in this paper aims to prepare Arabic
text for natural language processing applications. It analyzes Arabic words in order to
extract their morphological structures. This task poses many problems related to the
specific features of the Arabic language, such as the presence of diacritics and the
elimination of some letters in the generation of words.
In order to solve these problems, we introduced a rule-based system that takes a word and
determines its morphology. The system has three components: the rules, the lexicon and the
morphological patterns of the language. The morphological analyzer proceeds by matching to
determine the root and the morphological pattern of the word. In the case of missing
diacritics, more than one pattern may be a candidate for the final output.
Future work will focus on extending the Morphological Analyzer to use syntactical
information in order to determine the definitive morphological pattern used to build the
word when diacritical marks are absent.
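
The remark that missing diacritics leave more than one candidate pattern can be made concrete with a small Python sketch. It treats the dot that stands for an absent diacritic as a wildcard when the word's list of diacritics is compared with those of the patterns. The function names and the two sample diacritic lists are assumptions made for illustration.

```python
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
MISSING = "."  # the mark that stands for an absent diacritic


def compatible(word_lv, pattern_lv):
    """True if every diacritic present in the word agrees with the pattern."""
    return len(word_lv) == len(pattern_lv) and all(
        w == MISSING or w == p for w, p in zip(word_lv, pattern_lv)
    )


def candidates(word_lv, pattern_lvs):
    """Keep the pattern diacritic lists that the word does not contradict."""
    return [lv for lv in pattern_lvs if compatible(word_lv, lv)]


# Two hypothetical diacritic configurations of the same consonant class.
ACTIVE = [FATHA, FATHA, FATHA]
PASSIVE = [DAMMA, KASRA, FATHA]

undiacriticized = candidates([MISSING, MISSING, MISSING], [ACTIVE, PASSIVE])
partially = candidates([DAMMA, MISSING, MISSING], [ACTIVE, PASSIVE])
```

An undiacriticized word keeps both configurations as candidates, while a single diacritic on the first letter is enough to leave only one. Choosing among the remaining candidates is the job the paper assigns to later syntactical analysis.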




           The input word:
           List of characters:                     [                              ]
           List of consonants:                     [                         ]

           List of diacritical marks:              [                        ]
           The pattern that will be detected: [                  ]
           Positions of the root letters:     [2 ,3,4 ]
           The root will be:                  [         ]
           The Lexicon output:
                          ROOT → Play
                          PATTERN → Verb, Present, Plural, Masculine



            Figure 9. Complete Example of a Word Analyzed by MAAW




References

[1] N. Ali, N. Hegazi and E. Abed, "A Morphology Based Data Compression Technique For Arabic
Text", Computer Communications–AFRICOM84, pp. 241-251, 1984.

[2] E. L. Antworth, "Morphological Parsing with a Unification-Based Word Grammar", North
Texas Natural Language Processing Workshop, University of Texas at Arlington,
http://www.sil.org/pckimmo/ntnlp94.html, May 1994.

[3] A. Awajan, "Low-Level NLP Technique for Arabic Text Processing", The Proceedings of the
ISCA 16th International Conference on Computers and Their Applications, Seattle, Washington
USA, pp. 287-289, March 2001.

[4] K. R. Beesley, "Arabic Finite-State Morphological Analysis And Generation", The 16th
International Conference on Computational Linguistics, Proceedings, Vol. 1, pp. 89-94,
August 1996.

[5] K. R. Beesley and L. Karttunen, "Finite-State Non-Concatenative Morphotactics",
SIGPHON-2000, Proceedings of the 5th Workshop of the ACL Special Interest Group in
Computational Phonology, Luxembourg, pp. 1-12, August 2000.

[6] L. Breidt and F. Segond, "IDAREX: Formal Description of German and French Multi-Word
Expression with Finite State Technology", MLTT-022, November 1995.

[7] D. Carter, "Rapid development of Morphological Descriptions for Full Language Processing
Systems", http://www.cam.sri.com/tr/crc047paper/paper.html, 1997.

[8] J. P. Chanod, "Finite State Composition of French Verb Morphology", MLTT Technical
reports, http://www.rxrc.xerox.com/publis/mltt/mltttech.html, November 1994.

[9] G. Grefenstette and P. Tapanainen, "What is a word, What is a sentence, Problems of
Tokenization", The 3rd Conference on Computational Lexicography and Text Research, Complex
'94, Budapest, July 1994.
Proceeding of the 5th Workshop of the



