Named Entity Recognition for Indian Languages


Abstract

    This paper presents a new approach to recognizing named entities
    for Indian languages. A phonetic matching technique is used to match
    strings of different languages on the basis of their similar sound.
    We have tested our system with a comparable corpus of English and
    Hindi language data. The approach is language independent and
    requires only a set of rules appropriate for a language.

1. Introduction:

Named Entity Recognition is a subtask of machine translation and
information retrieval. Named entities are words that belong to certain
categories such as persons, places, organizations, numerical quantities
and expressions of time. A large number of techniques have been
developed to recognize named entities for different languages. Some of
them are rule based and others are statistical. The rule based approach
uses the morphological and contextual evidence (Kim and Woodland, 2000)
of a natural language to determine the named entities. This eventually
leads to a set of language specific rules for identifying named
entities. The statistical techniques use large annotated data to train a
model (Malouf, 2002), such as a Hidden Markov Model, and subsequently
evaluate it on test data. Both methods require the efforts of a language
expert. An appropriately large set of annotated data is yet to be made
available for the Indian languages. Consequently, the application of
statistical techniques to Indian languages is not very feasible.

This paper deals with a new technique to recognize named entities of
different languages. Our approach does not use the previously mentioned
techniques. Instead, we use an approach that not only reduces the burden
of collecting and annotating data, but is language independent as well.
We use this method to build a multilingual named entity list that can be
used by the named entity recognizer. The second advantage of this
approach is that we avoid building transliteration rules for named
entities. Instead, our method finds the actual transliteration from an
untagged corpus. Our idea was to match the two representations of the
same named entity in two different languages using a phonetic matching
algorithm. This relies on the property of named entities that they sound
similar whether written in their native script or in any other script.
However, this cross-lingual matching is not a trivial task. First of
all, the two strings to be matched have to be represented in a common
script. So we face two choices here: either convert the two strings into
some common intermediate representation (e.g. a phonemic
representation), or transliterate the name written in the Indian
language to English and then look for phonetic equivalence. Our engine
has been tested for Hindi. After making transliteration rules for Hindi,
we used a variation of the Editex algorithm to match the transliterated
string against entries in the English named entity database. It is worth
mentioning that a certain class of named entities which do not sound
similar across languages (mostly phrases) cannot be extracted through
this cross-lingual matching, e.g. “United Nations”, “Government of
India” etc. Abbreviations which are spelled character by character in
both languages can however be extracted, e.g. BBC ( ), LTTE ( ) etc.
In the next section we give the system architecture; the logical flow
and overall description of the system are discussed there. Our own set
of transliteration rules for Hindi is given in the third section. In the
fourth section we define our baseline task. Our system has been tested
with a parallel corpus which consisted of both English and Hindi
language data. The results obtained using our system are described in
the fifth section together with an analysis. Conclusions are presented
in the last section together with directions for future improvements.

2. System Architecture: Logical Flow and overall description of the
System

The system architecture is shown in Figure 1. It consists of the
following modules:

        Figure 1: System Architecture

2.1 Crawler

The crawler is a web-bot or spider which browses the web in an automated
manner. It starts with a list of Uniform Resource Locators (URLs) that
it is to visit, called the seeds. As the crawler visits these URLs it
collects all the hyperlinks and adds them to a queue. URLs from the
queue are crawled further. Since the crawler collects the data from the
web, the data collection is fully automated. The crawler gathers data
for both English and other Indian languages. The data collected for
English is used to populate the English named entity database, which is
quite accurate. We have used the freely available Stanford Named Entity
Recognizer (Finkel, Grenager, and Manning, 2005) in our engine. The data
collected for Indian languages will be used to build a database of named
entities for the given language.

2.2 Parser

The crawler saves the content in HTML form onto the system. The parser
parses these HTML files. Additionally, the parser can also parse PDF as
well as RTF files. The output of the parser is passed to the
corresponding modules for the two different languages.

2.3 Phonetic Matcher

Phonetic matching is the task of matching two representations of the
same name. A name may have more than one representation even in its
native script. If the name is represented in a script other than its
native script, there may be a large number of potential variants for its
representation. Phonetic matching is a fuzzy string matching technique
in which we match strings on the basis of their similar sound and not
their identity. The most common phonetic matching techniques are Soundex
and Editex. These techniques are used to match two representations of
the same name in English. We survey the techniques in the following
subsections.

2.3.1 Soundex:

The Soundex algorithm was designed by Odell and Russell in the year 1918
to find spelling variations of names. It represents classes of sounds
which can be lumped together. The classes for the algorithm are shown in
Appendix A. These classes are applied for phonetic matching according to
the following algorithm:
    1. Replace all but the first letter of the string by its phonetic
       code.
    2. Eliminate any adjacent repetitions of codes.
    3. Eliminate all occurrences of code 0, i.e. eliminate all vowels.
    4. Return the first four characters of the resulting string.
    Examples: Dickson = d25, Dikson = d25.

Two names match if they have the same Soundex representation. This
method does not account for vowels and hence is not accurate for
cross-lingual matching.

2.3.2 Editex:

The Editex algorithm was designed by Zobel and Dart (Zobel and Dart,
1996). It is an enhancement of the Levenshtein (Levenshtein, 1966) edit
distance algorithm. The Levenshtein algorithm measures the edit distance
between two strings, where edit distance is defined as the minimum
number of basic operations required to transform one string into the
other, and the basic operations are insertion, deletion and
substitution. Insertion and deletion costs are 1, and the substitution
cost is given by a function subst_cost(Xi, Yj) which returns 0 if the
two characters Xi and Yj are the same and 1 if they are different. The
score dist[m, n] is returned as the edit distance between the two
strings. A score of zero implies a perfect match.

The algorithm has O(mn) time and space complexity, where m and n are the
lengths of the two strings respectively. The pseudo code for the
Levenshtein edit distance algorithm is given in Appendix B. Editex
groups similar sounding phonemes into equivalence classes. The
substitution cost is determined by a function S(Xi, Yj) that returns 0
if the two characters Xi and Yj are the same, 1 if they lie in the same
equivalence class and 2 otherwise. The insertion and deletion costs are
determined by a function D(Xi-1, Xi) which is almost the same as
S(Xi, Yj), except that it compares letters of the same string and it
returns 1 if Xi-1 is ‘h’ or ‘w’ and Xi-1 is not equal to Xi. The Editex
equivalence classes and the Editex pseudo code are given in Appendix C.

Editex performs better than the Soundex and Levenshtein edit distance
algorithms. However, further enhancements to Editex are also possible.
“Tapering” is one enhancement in which mismatches at the beginning of
the string are weighted more heavily than mismatches towards the end
(Zobel and Dart, 1996). Other enhancements, called phonometric methods,
map the input strings to their phonemic representation (Zobel and Dart,
1996).

3. Transliteration rules

To perform phonetic matching of two different representations of a named
entity, we need both of them in a common script. We choose to
transliterate the named entity in the Indian language to English.
Transliteration rules must be written for each such language. We have
written our own set of transliteration rules for Hindi. These can be
described briefly as follows.

1. The entity to be transliterated is scanned character by character
from left to right. Each character of Hindi is mapped to an equivalent
character or set of characters in English according to a mapping
function. The character set generated by the function is appended to a
string as per the rules. E.g. क =    is a single character
representation in Unicode (‘क’) and maps to ‘Ka’.

2. Start with an empty string. When a consonant or singleton vowel (not
a ‘matra’) is encountered, append the set of characters returned by the
mapping function.

3. When a consonant is followed by a vowel, the preceding ‘a’ should be
removed and the character set for the vowel should be appended. E.g. के
consists of two characters
क + े. Once we encounter क we append ‘ka’, and when े is encountered
next we remove the ‘a’ and append the mapping for े, i.e. ‘e’. This rule
applies in general to all the vowels.

4. If the transliterated string has ‘a’ as its last character while it
doesn’t have the vowel    as the last character of the Hindi string,
remove this occurrence of ‘a’. The last vowel in Hindi is very important,
as two altogether different words may differ only in the last vowel.
E.g. “कमल” and “कमला” are proper nouns of different genders. Their
English representations are “Kamal” and “Kamla” respectively.

The transliteration always maps one character in Hindi to one set of
characters in English. However, the English representation may use
different character sets for the same Hindi character in different
names. E.g. “कमल” is “Kamal” while “क्रिकेट” is “Cricket”. ‘क’ is often
represented by ‘K’ for Hindi names, by ‘C’ for English names and by ‘Q’
for Urdu names. The Editex algorithm groups these letters in the same
equivalence class.

4. Baseline Task

The Editex algorithm can be modified to take into account that there may
be more than three (0, 1, 2) levels of acceptability for substitutions,
due to the inherent properties of a particular language. For example,
“ckq” is one equivalence class in Editex, so ‘c’ and ‘k’ have a
substitution cost of 1. We may reduce this substitution cost to 0.5 for
a language in which it is highly probable that the same character maps
to ‘c’ and ‘k’ in the English representation of its names. Thus the
equivalence classes and the substitution costs in Editex can be modified
for cross-lingual phonetic matching. There can also be further language
specific enhancements. The following algorithm, along with some language
specific enhancements, was implemented for Hindi.

4.1 Abbreviation Check:

We first check whether the Hindi string is an abbreviation in which the
English characters are spelled individually. For each letter of the
English alphabet we have a unique Hindi representation. The function
performs accurately most of the time and extracts such named entities.
If we find that the string is an abbreviation, the corresponding English
representation can be returned by the function itself, so there is no
need for further matching. If the string is not an abbreviation, we
proceed to the actual matching algorithm.

4.2 First letter matching:

The first letters of the two strings must either be the same or belong
to the same equivalence class. The equivalence classes for first
character matching are:

    "ckq", "wbv", "iy", "jz", "aeiou"

The English named entity database must be indexed according to the first
letter of the named entity, so that we only search for matches in those
indexes which fall into the same equivalence class. This is very
important for the computational efficiency of the engine, as it reduces
the search space.

4.3 Preprocessing:

Often the phonetic inconsistencies in English lead to a low matching
score for two representations of the same name. To take this into
account, before matching the two strings, the named entity retrieved
from the English named entity database is preprocessed to form a new
string. We have used the well-known “Mark Twain’s plan for the
improvement of English spelling”
(http://grammar.ccc.commnet.edu/grammar/twain.htm), supplemented with
some more rules. This way we tackle the problem of more than one
possible character set for some vowels, since only one of them can be
chosen during transliteration. We also tackle
some other problems like silent letters and repeated letters, so that
the probability of generating a high matching score increases. The
following set of rules for preprocessing was used.

1. Change all occurrences of "oo" to "u" (both character sets are for
the vowel  ).

2. Change all occurrences of "ee" to "i" (both character sets are for
the vowel  ).

3. Change all occurrences of "f" to "ph".

4. Change all occurrences of "au" to "o".

5. If a word starts with "x", replace the "x" with a "z". Change all the
remaining "x"s to "ks"s.

6. If a "c" is directly followed by an "e" or "i", change the "c" to an
"s".

7. If a "c" is directly followed by a "k", remove the "c". Keep applying
this rule as necessary (Example: "cck" becomes "k").

8. If a word starts with "sch", change the "sch" to a "sk".

9. If a "ch" is directly followed by an "r", change the "ch" to a "k".

10. After applying the above rules, change all "c"s that are not
directly followed by an "h" to a "k". (This includes all "c"s that are
the last letter of a word.)

11. If a word starts with "kn", change the "kn" to an "n".

12. Change all double consonants of the same letter to a single
consonant. A consonant is any letter that is not one of "a, e, i, o, u"
(Example: "apple" becomes "aple"). Keep applying this rule as necessary
(Example: "zzz" becomes "z").

4.4 Editex score:

Now the transliterated string and the preprocessed string are compared
to generate an Editex score. The equivalence classes we used were
similar to those proposed in the original Editex algorithm, except for
some language specific changes for Hindi. The lengths of the two strings
have to be considered while deciding the threshold score for a match;
otherwise short strings would be penalized disproportionately for each
mismatch. So we normalize the Editex score as

    d = 1 - editex(X, Y) / (length(X) + length(Y))

The threshold chosen for a match was 0.86. A score above the threshold
is taken to indicate equivalence of the two representations. Table 1
shows some Hindi named entities, their transliteration output and
normalized Editex scores.

Hindi NE      English NE    Transliteration Output   Editex Score
हिन्दी          Hindi         Hindi                    1.0
फ़लस्तीनी        Philistini    Phalastini               0.9
बांगलादेश        Bangladesh    Bangladesh               1.0
झारखण्ड         Jharkhand     Jharakhand               0.894
पश्चिम          Pashchim      Pashchim                 1.0
बंगाल          Bengal        Bangal                   0.916
भारत          Bharat        Bharat                   1.0
क्रिकेट          Cricket       Kriket                   0.923
ग्रेग           Greg          Greg                     1.0
चैपल          Chappel       Chaipal                  0.857
महेंद्र          Mahendra      Mahendr                  0.933
राहुल          Rahul         Rahul                    1.0
द्रविड          Dravid        Dravid                   1.0
छत्तीसगढ        Chattisgarh   Chattisagadh             0.866

Table 1: Hindi named entities with transliteration output and normalized
Editex scores
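
The preprocessing rules of Section 4.3 can be sketched in Python as
below. This is only an illustration of the twelve listed rules applied
in order to a single word; the function name preprocess is hypothetical,
and the additional rules the authors mention but do not list are not
included.

```python
import re

def preprocess(word: str) -> str:
    """Apply the twelve listed preprocessing rules, in order, to one word.

    Hypothetical helper: the name and the lowercasing step are assumptions.
    """
    w = word.lower()
    w = w.replace("oo", "u")                  # rule 1
    w = w.replace("ee", "i")                  # rule 2
    w = w.replace("f", "ph")                  # rule 3
    w = w.replace("au", "o")                  # rule 4
    if w.startswith("x"):                     # rule 5: leading x -> z
        w = "z" + w[1:]
    w = w.replace("x", "ks")                  # rule 5: remaining x -> ks
    w = re.sub(r"c(?=[ei])", "s", w)          # rule 6
    w = re.sub(r"c+(?=k)", "", w)             # rule 7: "cck" -> "k"
    if w.startswith("sch"):                   # rule 8
        w = "sk" + w[3:]
    w = w.replace("chr", "kr")                # rule 9
    w = re.sub(r"c(?!h)", "k", w)             # rule 10
    if w.startswith("kn"):                    # rule 11
        w = w[1:]
    # rule 12: collapse runs of the same consonant to a single letter
    w = re.sub(r"([b-df-hj-np-tv-z])\1+", r"\1", w)
    return w

if __name__ == "__main__":
    for name in ("Cricket", "Chappel", "Chattisgarh"):
        print(name, "->", preprocess(name))
```

Applying the rules in the listed order matters: rule 10 is stated to run
after rules 6 to 9, so that a "c" already handled by an earlier rule is
not turned into a "k".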
5. Results and Analysis:

We have tested our system with a parallel corpus which consisted of both
English and Hindi language data. Further, we used the web crawler to
populate our NE lists for both languages, thus incorporating the concept
of a comparable corpus. The results for English obtained using the
parallel corpus are:

Precision: 81.40%

Recall: 81.39%

This corpus carried named entities from the domain of travel, tourism
and culture. For the purpose of classifying the results we used the
definition of named entities given by Chinchor (Chinchor, 1997), namely
entity names (organizations, OE), person names (PE) and location names
(LE). Numeric expressions (monetary values and percentages) and temporal
expressions (dates and times) were not considered in the results,
because it is a trivial task to build grammar rules for such entities,
which appear quite regularly.

We have focused on OE, PE and LE named entities for Hindi so that we can
analyze the performance on new and hitherto undiscovered entities which
come into existence with the passage of time. This premise provides the
real basis for challenging the performance of any named entity
recognition technique for Indian languages.

Testing on the corpus of around 1000 sentences gave the following
results for Hindi:

Precision for all named entities (PE + OE + LE): 80.2%

Recall for PE (person entity names): 47.4%

Recall for OE (organization entity names): 42.9%

Recall for LE (location entity names): 74.6%

It is important to observe here that the engine shows good recall for
location entity names (LE), which were more abundant in the corpus.
Besides this, the corpus had a heterogeneous mix of named entities, with
tourism-related information not only from India but also from the
continents of South America and Antarctica. The results for location
entity names are encouraging, as the named entities from South America
and Antarctica did not have phonetic similarity with the native entities
available from tourism information from India. Causes for the
comparatively lower recall among person entity names and organization
entity names are under further investigation.

6. Conclusions:

We have used the phonetic matching technique to match strings of
different languages on the basis of their similar sound. As the Phonetic
Matcher module is tested on more data, more generic rules can be made to
improve its accuracy. The engine should be improved so that it can
recognize phrasal named entities and abbreviations. The engine will work
for any language if the phonetic matching rules are written for that
language. We can also develop a crawler focused on a certain domain of
interest. Focused crawlers are very important for generating resources
for natural language processing. A focused crawler application is an
intelligent agent that crawls the web for content related to a specific
domain. This kind of crawler could be used in the future for data
collection for a particular domain.

Acknowledgements:

The authors gratefully acknowledge financial assistance from TDIL, MCIT
(Govt. of India).
References:

Chinchor, N. 1997. MUC-7 Named entity task definition. In Proceedings of
the 7th Message Understanding Conference (MUC-7).

Finkel, Jenny Rose, Grenager, Trond and Manning, Christopher. 2005.
Incorporating Non-local Information into Information Extraction Systems
by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL 2005), pp. 363-370.

Kim, J. and Woodland, P.C. 2000. Rule Based Named Entity Recognition.
Technical Report CUED/F-INFENG/TR.385, Cambridge University Engineering
Department.

Malouf, Robert. 2002. Markov models for language-independent named
entity recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan,
pp. 591-599.

Levenshtein, V.I. 1966. Binary codes capable of correcting deletions,
insertions, and reversals. Soviet Physics Doklady 10: 707-710.

Zobel, Justin and Dart, Philip. 1996. Phonetic string matching: Lessons
from information retrieval. In Proceedings of the Eighteenth ACM SIGIR
International Conference on Research and Development in Information
Retrieval, Zurich, Switzerland, August 1996, pp. 166-173.

http://grammar.ccc.commnet.edu/grammar/twain.htm

Appendix A:

Soundex classes

Code            Letters
0               aeiouyhw
1               bpfv
2               cgjkqsxz
3               dt
4               l
5               mn
6               r

Appendix B:

Pseudo code for Levenshtein edit distance:

Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i, 0] ← i
for j = 0 to n do
    dist[0, j] ← j
for i = 1 to m do
    for j = 1 to n do
        dist[i, j] ← min { dist[i-1, j] + inser_cost,
                           dist[i-1, j-1] + subst_cost(Xi, Yj),
                           dist[i, j-1] + delet_cost
                         }
    end
end
return dist[m, n]
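
The Levenshtein pseudo code of Appendix B can be written as a short
Python function. This is a minimal sketch assuming unit insertion and
deletion costs and a 0/1 substitution cost, as described in Section
2.3.2; the function name levenshtein is chosen here for illustration.

```python
def levenshtein(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn x into y (unit costs, as in Appendix B)."""
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between x[:i] and y[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst_cost = 0 if x[i - 1] == y[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,               # drop x[i-1]
                dist[i - 1][j - 1] + subst_cost,  # match or substitute
                dist[i][j - 1] + 1,               # add y[j-1]
            )
    return dist[m][n]

if __name__ == "__main__":
    print(levenshtein("bengal", "bangal"))
    print(levenshtein("jharkhand", "jharakhand"))
```

The table has (m + 1) by (n + 1) cells, which gives the O(mn) time and
space complexity stated in Section 2.3.2.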

Appendix C:

Editex Equivalence Classes:

    aeiouy
    bp
    ckq
    dt
    lr
    mn
    gj
    fpv
    sxz
    csz



Pseudo code for Editex Algorithm:

Input: Two strings, X and Y
Output: The Editex distance between X and Y

m = length(X)
n = length(Y)
editex_dist[0, 0] = 0
for i = 1 to m do
    editex_dist[i, 0] = editex_dist[i-1, 0] + D(Xi-1, Xi)
for j = 1 to n do
    editex_dist[0, j] = editex_dist[0, j-1] + D(Yj-1, Yj)
for i = 1 to m do
    for j = 1 to n do
        editex_dist[i, j] =
            min { editex_dist[i-1, j] + D(Xi-1, Xi),
                  editex_dist[i-1, j-1] + S(Xi, Yj),
                  editex_dist[i, j-1] + D(Yj-1, Yj)
                }
    end
end
return editex_dist[m, n]
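
The Editex pseudo code above, together with the normalization of Section
4.4, can be sketched in Python as follows. This is an illustrative
sketch, not the authors' implementation: it uses only the standard
equivalence classes listed in this appendix (the Hindi-specific changes
are not given in the paper), and it assumes a leading space as a
sentinel for the undefined character X0. The names s_cost, d_cost,
editex and normalized_score are chosen here for illustration.

```python
# Equivalence classes from Appendix C (standard Editex, no Hindi changes).
GROUPS = ("aeiouy", "bp", "ckq", "dt", "lr", "mn", "gj", "fpv", "sxz", "csz")

def s_cost(a: str, b: str) -> int:
    """Substitution cost S: 0 if equal, 1 if in a common class, else 2."""
    if a == b:
        return 0
    if any(a in g and b in g for g in GROUPS):
        return 1
    return 2

def d_cost(prev: str, cur: str) -> int:
    """Insertion/deletion cost D: 1 if prev is 'h' or 'w' and differs
    from cur, otherwise the same as S(prev, cur)."""
    if prev in ("h", "w") and prev != cur:
        return 1
    return s_cost(prev, cur)

def editex(x: str, y: str) -> int:
    # Assumption: pad with a space so that X0 and Y0 exist.
    x, y = " " + x.lower(), " " + y.lower()
    m, n = len(x) - 1, len(y) - 1
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + d_cost(x[i - 1], x[i])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + d_cost(y[j - 1], y[j])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + d_cost(x[i - 1], x[i]),
                dist[i - 1][j - 1] + s_cost(x[i], y[j]),
                dist[i][j - 1] + d_cost(y[j - 1], y[j]),
            )
    return dist[m][n]

def normalized_score(x: str, y: str) -> float:
    """d = 1 - editex(X, Y) / (length(X) + length(Y)), as in Section 4.4."""
    total = len(x) + len(y)
    return 1.0 if total == 0 else 1 - editex(x, y) / total

if __name__ == "__main__":
    for a, b in (("Bengal", "Bangal"), ("Jharkhand", "Jharakhand")):
        print(a, b, round(normalized_score(a, b), 3))
```

Under these assumptions the pair Bengal/Bangal has an Editex distance of
1 and a normalized score of 1 - 1/12, about 0.917, in line with Table 1.
Other rows of Table 1 need not be reproduced, since the authors'
Hindi-specific class changes are not listed.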

				