Automatic Diacritic Restoration
Name: Dhiraj Lohiya
E-mail address: lohiya.dhiraj@gmail.com
Other information that may be useful for contact:
IRC nick: Dj
Mobile no.: +91 97850 27222
Gtalk Id: lohiya.dhiraj@gmail.com
College email Id: f2007097@bits-pilani.ac.in

Why is it you are interested in machine translation?
I will try to answer this question through some of my previous work and a few
use cases.

My interest in machine translation developed last year when I started working
on my college project, a "Self-improving Phonetic Matching Algorithm", for the
technical festival APOGEE [1]. It was an open-ended problem, and once I gained
deeper insight into the work done in the field by reading quite a few research
papers, it became fun to work on. It was the first time that reading research
papers really generated interest in me in the field: I was able to analyse the
proposed research and, more importantly, to question the proposals and the way
things were done, and to suggest and discuss alternative approaches, whether or
not they were implementable. In either case the satisfaction was immense, since
my understanding was always boosted.

A brief description of the "Self-improving Phonetic Matching Algorithm" project:
In this project, we modified Soundex to reduce the number of false positives
when matching phonetically similar words, and modularized the algorithm so that
it can work for any language once a rule set of equivalence classes for that
language is supplied. Also, instead of considering one character at a time as
Soundex does, we formed substrings of words by dividing them into
vowel-consonant pairs. Moreover, unlike the static rule sets of Soundex and
Double Metaphone, the algorithm was designed to evolve its rule set over time
based on usage and user input, establishing dynamic relationships between
entities on the fly to be appended to the results based on that analysis.

India is a multilingual country, with about 800 different languages and 2000
dialects commonly spoken in day-to-day life [2]. In fact, in my hostel wing (a
series of 12 rooms) alone, I have about six friends, each with a different
mother tongue in which they are conversant. Now, if they write blogs or do any
similar work in their languages, I would love to read that. Or, in other cases,
it would be a great convenience to have a gadget that could translate a voice
conversation on the fly (which is only a matter of time before it is
developed).

With the exponential growth of information on the web, driven by collaborative
and social networking sites and blogs all around, people will be contributing
huge amounts of data in the coming days, and it will not always be in a
language that a particular person understands. At the same time, language
should not restrict the sharing of knowledge and information. Machine
translation provides an answer to this and comes in handy with a good degree of
efficiency, in spite of the difficulties involved in understanding a language
based on rules.

Why is it that you are interested in the Apertium project?
Apertium is actively involved with research in the field, practically
experimenting and validating ideas. It is really great to be part of a
research-oriented open-source community where, without starting from scratch, I
can learn and understand the work done so far, make some contribution to the
field, and research and share with others. The best thing I like about
open-source communities is that we share, we collaborate, we help each other,
and it's always "we". The Apertium community has been really helpful,
accessible, cooperative and energetic.

Now, at Apertium, I get a community of people totally dedicated to machine
translation and computational linguistics, which is the field I am interested
in. I can informally talk to them, question them, and get feedback and
suggestions. Also, people frankly admit when they aren't aware of some concept
or technology and redirect me to an appropriate person. That is important, and
I get to gain knowledge from the experience of much more knowledgeable people.
What more could I ask for? Of course, I would do the required homework before
asking anything, since spoon-feeding won't benefit me in the long run, and that
is not something I would want anyway.

I plan to pursue post-graduation in computational linguistics, and working with
Apertium will surely be a great learning experience for me. I have observed
members from previous GSoCs who are still actively involved with Apertium,
making brilliant contributions and pursuing research to take Apertium to new
heights. I sincerely hope that, down the timeline, I can form good bonds with
quite a few members of the community and contribute my bit back. I will thus
strive to prove myself a person worth having on the Apertium team.

In the future, I would also love to add a new language pair to Apertium among
Hindi, Marathi and English, and thus dive into the machine translation side as
well, which is definitely going to be a great learning experience dealing with
the nitty-gritty of machine translation.

Which of the published tasks are you interested in?
Automated diacritic restoration.
Related to this, Kevin Scannell's research paper on "Statistical Unicodification of
African Languages" is of particular interest to me. [3]

What do you plan to do?
I plan to create an optional module to automatically restore diacritics and
accents on input text, and to integrate it into the Apertium pipeline. As part
of this, I will port Kevin Scannell's Perl implementation of charlifter [4] to
C++ and optimize the smoothing of the statistical models on a
language-by-language basis.

Title:
Automatic Diacritic Restoration.
Reasons why Google and Apertium should sponsor it:
A major chunk of the work done in this direction relies on the availability of
high-quality corpora, and hence applying the same approach to minority
languages has been an issue, on account of the lack of such corpora. Our
approach, however, promises to work even for somewhat noisy corpora, which is
the need of the day. Presently, Apertium expects the input text to carry
correct diacritics and will not give correct results otherwise. Now, there is a
lot of data on the web that is somewhat incorrect with respect to diacritics,
and when Apertium is used to translate it, incorrect diacritics are one of the
big issues that can lead to incorrect results and hence user dissatisfaction.
The same applies in real-time scenarios such as chats on IRC, instant
messengers, etc. Integrating this feature will be a convenience to a huge user
base across most of the supported languages that use diacritics. Considering
the number of users this affects, it is necessary that this task be given high
priority.

A quote from a user on a blog, which illustrates the scenario [5]:
"Paŭlpro:
I did try the Spanish-Esperanto version. It is rather good. Far from perfect, but
you can understand the meaning. There is only one minor point. Spanish
speaking people often "forget" the diacritics, e.g. they write "invitacion" instead
of "invitación". The Apertium translator does not understand the words without
diacritics. So if the Spanish members want to be understood by the "aliens",
they should write more carefully ;-)
¡Ánimo y adelante, amigos españoles!"

This problem could be solved by the addition of a diacritic restoration
feature. And since we are an open-source community, this approach will have an
effect not only on machine translation but also on other language technologies.

Google itself has access to massive corpora; it could use our approach to make
its search, translation, transliteration, autosuggest, autocorrect, Gtalk
instant messenger and a bunch of other features more intelligent. Moreover, if
a user of a service posts feedback, a complaint or a suggestion in a native
language on a forum or by mail, and it is diacritically incorrect, automatic
diacritic restoration, by facilitating machine translation, could also help the
customer care team understand it, since it is practically impossible to have a
team that knows every language in the world.

A description of how and who it will benefit:
It is possible to create correctly written documents (inclusive of diacritics,
accents, etc.) for most languages using Unicode [6]. But many users bypass
this, whether because of keyboard layout, the difficulties involved, or because
only ASCII characters are allowed in certain places. Our approach would help
improve the present situation in a lot of ways.

Nowadays, practically all languages have a good amount of web data published by
users, be it through blogs, social networking sites, forums, etc., and much of
it suffers from this problem of missing diacritics, which sometimes changes the
meaning of the terms. Hence this approach would benefit users of all languages
in which diacritics are used.
The automatic diacritic restoration module will pave the way for machine
translation for minor languages, which in turn would facilitate the
following [7]:
      • Increase their “normality”
      • Increase their literacy levels
      • Have an effect on their standardization
      • Increase their “visibility”
      • Increase language expertise and resources
      • Increase independence

A relevant scenario:
The lack of diacritics means that students new to English must learn proper
pronunciation through trial and error. Consider the two English
words cut and put. Each contains the vowel u, but they are pronounced very
differently. The words in their written form offer no clue as to how each word
should be pronounced. By contrast, in Spanish, words are spelled just like they
sound. For example, the words esta and está are pronounced differently and
have different meanings. The reader's clue is the accent mark. Since English
offers no such help, many new students of the language find it more difficult to
learn than other European languages. [8]

A detailed work plan including, if possible, a schedule with milestones
and deliverables:
My experience has taught me that early preparation and planning are vital to
success. Even while juggling mid-semester exams, I have taken several steps to
begin familiarizing myself with the main concepts behind automatic diacritic
restoration and the relevant smoothing algorithms, along with charlifter and
Apertium. The following is a detailed sketch of the entire technique and how I
plan to proceed with it:

1. We take data for about 100 Latin-script languages as input, one language at
a time. These web corpora have already been gathered by Kevin using a web
crawler (hence they are not of the best quality, but the algorithm takes that
into consideration).

2. For each language, we apply the following five smoothing algorithms (with
different parameters) to the training data; a minimal sketch of the simplest of
them follows the list:
   • Modified Kneser-Ney
   • Backoff smoothing
   • Good-Turing
   • Witten-Bell
   • Add-lambda
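
To make the last of these concrete, here is a minimal sketch of add-lambda
(additive) smoothing over n-gram counts in C++; the class and member names are
illustrative choices of mine, not taken from charlifter:

    // Minimal sketch of add-lambda (additive) smoothing for n-gram counts.
    // All names here are illustrative, not from charlifter.
    #include <cstddef>
    #include <string>
    #include <unordered_map>

    class AddLambdaModel {
    public:
        AddLambdaModel(double lambda, std::size_t vocab_size)
            : lambda_(lambda), vocab_size_(vocab_size) {}

        void observe(const std::string& ngram) { ++counts_[ngram]; ++total_; }

        // P(ngram) = (count + lambda) / (total + lambda * |V|)
        double prob(const std::string& ngram) const {
            auto it = counts_.find(ngram);
            double count = (it == counts_.end()) ? 0.0 : it->second;
            return (count + lambda_) / (total_ + lambda_ * vocab_size_);
        }

    private:
        double lambda_;
        std::size_t vocab_size_;
        std::unordered_map<std::string, double> counts_;
        double total_ = 0.0;
    };

Tuning then amounts to picking the lambda (and the analogous parameters of the
other four methods) that scores best on held-out data, as described in the next
two steps.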

3. To "tune" the smoothing for different languages to maximize performance,
some test data would be set aside for a given language, and after training
using a bunch of different smoothing parameters, the performance will be
computed on the test data for each parameter setting. The one which gives the
best result will be selected.

4. For the selection of smoothing parameters, a “Tuning Algorithm” will need to
be designed, based on the following approach: an optimized setting can be
reached by a gradient-style search over the superposition of the individual
performance plots for the different smoothing parameters, always moving in the
direction of increasing performance.
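
As a sketch of what that search might look like for a single parameter, here is
a simple hill climb in C++, assuming a hypothetical evaluate function that
trains a model with the given parameter and returns its accuracy on held-out
data (all names are mine):

    // Sketch: 1-D hill climb over one smoothing parameter (hypothetical names).
    #include <functional>

    double tune_parameter(const std::function<double(double)>& evaluate,
                          double param, double step, int max_iters) {
        double best = evaluate(param);
        for (int i = 0; i < max_iters && step > 1e-6; ++i) {
            double up = evaluate(param + step);
            double down = evaluate(param - step);
            if (up > best)        { param += step; best = up; }    // climb upwards
            else if (down > best) { param -= step; best = down; }  // climb downwards
            else                  { step /= 2; }                   // narrow the search
        }
        return param;
    }

A real tuner would search the parameters of each smoothing method jointly, but
the movement-towards-increasing-performance idea is the same.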

5. Ten-fold cross-validation will be used to decide the training and test
corpora for each language: the data is split into 10 equal sets, of which 9 are
used for training and the remaining one for testing. This is repeated so that
each of the 10 sets serves as the test set exactly once, and finally a mean
accuracy is derived. [9]
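
A minimal sketch of the fold assignment (the function name is mine; charlifter
itself may organize this differently):

    // Sketch: split a corpus into k folds for cross-validation.
    #include <cstddef>
    #include <string>
    #include <vector>

    std::vector<std::vector<std::string>>
    make_folds(const std::vector<std::string>& sentences, std::size_t k = 10) {
        std::vector<std::vector<std::string>> folds(k);
        for (std::size_t i = 0; i < sentences.size(); ++i)
            folds[i % k].push_back(sentences[i]);  // round-robin assignment
        return folds;
    }
    // For each fold f: train on the other 9 folds, test on f,
    // then report the mean accuracy over the 10 runs.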

6. A “Diacritic Error Algorithm” will randomly introduce diacritic errors into
the correct test set and keep track of the inserted errors. Some words may
receive a single diacritic error while others receive several; this will be
decided dynamically at run time based on the size of the data, the number of
words with diacritics, the distribution of diacritic'd letters per word, etc.
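
A minimal sketch of the core corruption step, randomly stripping diacritics
from a UTF-8 word; the strip table is a tiny illustrative sample (a real one
would be generated per language), error tracking is omitted, and all names are
mine:

    // Sketch: randomly strip diacritics from a UTF-8 word (illustrative names).
    #include <cstddef>
    #include <random>
    #include <string>
    #include <utility>
    #include <vector>

    // Tiny illustrative table; a real one would be built per language.
    static const std::vector<std::pair<std::string, std::string>> kStrip = {
        {"á", "a"}, {"é", "e"}, {"í", "i"}, {"ó", "o"}, {"ú", "u"}, {"ñ", "n"},
    };

    std::string corrupt(std::string word, double error_rate, std::mt19937& rng) {
        std::bernoulli_distribution flip(error_rate);
        for (const auto& entry : kStrip) {
            std::size_t pos = 0;
            while ((pos = word.find(entry.first, pos)) != std::string::npos) {
                if (flip(rng)) {   // strip with probability error_rate
                    word.replace(pos, entry.first.size(), entry.second);
                    pos += entry.second.size();
                } else {
                    pos += entry.first.size();
                }
            }
        }
        return word;
    }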

7. We then check which of the above combinations of algorithm and smoothing
parameters performs best by measuring the percentage of words correctly
restored in each language. A “Performance Algorithm” will be designed, giving
appropriate weight to the following performance measures (a sketch of the
baseline accuracy measure follows this list):
   1. Number of correct classifications
         a. Number of words which were already correct and remain unchanged
            after the process.
         b. Number of words which initially had errors but were completely or
            partially corrected by the procedure (with different weights for
            complete and partial correctness).
   2. Accuracy of probability estimates.
   3. Costs assigned to different types of errors (like how many diacritic'd
      letters were replaced in a word containing diacritics, false positives,
      false negatives, etc.).
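
The weighted scheme above would build on a basic word-level accuracy; here is a
minimal sketch of that baseline measure (names are mine):

    // Sketch: fraction of words restored to exactly the gold form.
    #include <cstddef>
    #include <string>
    #include <vector>

    double word_accuracy(const std::vector<std::string>& gold,
                         const std::vector<std::string>& restored) {
        // Sizes must match; a real version would align tokens first.
        if (gold.empty() || gold.size() != restored.size()) return 0.0;
        std::size_t correct = 0;
        for (std::size_t i = 0; i < gold.size(); ++i)
            if (gold[i] == restored[i]) ++correct;  // exact match only
        return static_cast<double>(correct) / gold.size();
    }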

8. [Optional] A time vs. performance graph will be plotted for all algorithms,
to see how much time is being sacrificed to get the best performance, and
whether slightly lower performance would be acceptable if the time saved is
good enough. Since this concerns the offline training phase, it may not matter
much at the run time of the algorithm, and latency in the training phase is
acceptable. We could probably do this for a few languages for the statistics.

9. Experimenting with how this affects overall machine translation performance.
For this, we could compare the performance of Apertium with and without
diacritic restoration, and again plot a time vs. performance graph to showcase
the statistics. A lot of factors will be considered (including testing on
almost all languages supported by Apertium, even those in pre-alpha stage). We
will compare machine translation before and after diacritic restoration by
calculating the WER and PWER (a minimal WER sketch follows the list below).
The following are the approaches that would be taken:
  1. Asciification (stripping all diacritics) – This would serve as the baseline.
  2. Text taken from blogs, Apertium's IRC logs, etc., which would probably
     have partial diacritics. This would serve as a good real-world scenario.
  3. Introducing random diacritic errors – The previously designed Diacritic
     Error Algorithm would come in handy here.
  4. Evaluation with Wikipedia [18] – Expect a performance deterioration here,
     assuming Wikipedia will have correct diacritics in most cases.
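
For reference, WER here is the word-level edit distance between the reference
and the hypothesis divided by the reference length (PWER is computed
analogously but position-independently). A minimal sketch, with names of my own
choosing:

    // Sketch: word error rate = word-level edit distance / reference length.
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    double wer(const std::vector<std::string>& ref,
               const std::vector<std::string>& hyp) {
        const std::size_t m = ref.size(), n = hyp.size();
        // d[i][j] = edit distance between ref[0..i) and hyp[0..j)
        std::vector<std::vector<std::size_t>> d(
            m + 1, std::vector<std::size_t>(n + 1));
        for (std::size_t i = 0; i <= m; ++i) d[i][0] = i;
        for (std::size_t j = 0; j <= n; ++j) d[0][j] = j;
        for (std::size_t i = 1; i <= m; ++i)
            for (std::size_t j = 1; j <= n; ++j)
                d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                    d[i][j - 1] + 1,          // insertion
                                    d[i - 1][j - 1] +
                                        (ref[i - 1] != hyp[j - 1])});  // substitution
        return m ? static_cast<double>(d[m][n]) / m : 0.0;
    }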

Initially, we will test our technique on the following languages to establish a
benchmark, since there are published numbers for some of these languages using
other approaches [12]:
     1. Czech
     2. Dutch
     3. French
     4. German
     5. Chinese Pinyin (especially for assessing performance on tonal diacritics)
     6. Romanian

Then we will target all languages supported by Apertium (including pre-alpha
releases), except a few where diacritics are either not used or are used in a
different context. This comes to around 30 languages.
      ISO 639 codes: (ca, cy, ro, af, es, pl, gl, it, pt, fr, nn, nb, sv, da,
      is, br, se, gd, ht, ga, oc, eo, pt_BR, ast)

We could have diacritic restoration for Indian scripts as well, for which a
similar approach would work. Moreover, since I have already studied them, I
would be quite comfortable with this. (And yes, diacritic restoration is a
problem on which I did lose some marks in exams in my childhood :P ) Again,
this would target a wider user base and would be very useful to Apertium in the
near future as more and more Indian languages are released. These would include
many Indian languages such as Hindi, Marathi, Bengali, Konkani and Sanskrit
(and a lot of others which use a similar script).

Regarding the integration of “Automatic Diacritic Restoration” into the
Apertium pipeline, a separate module could be added before the “Morphological
Analyser” module.
Accordingly, the new pipeline will be as follows:

   Source-language text → de-formatter → Diacritic Restoration →
   Morphological analyser → to PoS tagger

Proposed modification to the Apertium pipeline for integrating the Diacritic
Restoration module.
This approach will be firmed up during the community bonding period, after
discussion with Apertium's core developer team, and modifications will be made
accordingly.
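
Since Apertium pipeline stages are Unix-style text filters, the integrated
module would, at its core, be a small program of the following shape;
restore_line is a hypothetical stand-in for the ported charlifter logic:

    // Sketch: the module as a stdin-to-stdout filter, like other pipeline stages.
    #include <iostream>
    #include <string>

    // Hypothetical stand-in: the real body would apply the trained model.
    std::string restore_line(const std::string& line) { return line; }

    int main() {
        std::string line;
        while (std::getline(std::cin, line))
            std::cout << restore_line(line) << '\n';  // read, restore, emit
        return 0;
    }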

Timeline:
Community bonding period:
  1. Doing an in-depth study of the relevant research papers and published
     statistics [3],[10],[11],[12],[13],[14].
  2. To get the high-quality corpora of the big languages already listed, I
     will write to the authors of publication [12] to request the same training
     data, so that the comparison is on the same footing.
  3. Completely understand the Perl code of charlifter.
  4. Design algorithms for the following, deciding which parameters need to be
     taken into consideration:
         a. Performance Algorithm.
         b. Tuning Algorithm.
         c. Diacritic Error Algorithm.
  5. Collecting published statistics which could be used for comparison later.
  6. Work out the context of diacritic restoration in Indian-script languages
     (like Hindi, Marathi, Bengali, Punjabi, Sanskrit, etc.) and make the
     changes required for the technique to deal with Indian scripts.
  7. A project blog will be set up and updated weekly with the progress made in
     the project!
Milestone #0

Week 1:
Port the Perl implementation of charlifter to C++.
Debug and document the code on the wiki.
Milestone #1
Deliverable #1: Charlifter.pl ported to C++.

Week 2:
Implement the three algorithms designed earlier:
  1. Performance Algorithm
  2. Tuning Algorithm
  3. Diacritic Error Algorithm

The overall automation of the technique will also be implemented.
Debug and document the algorithms and code on the wiki.
Milestone #2

Week 3:
  1. Run the entire technique on all the languages for which numbers using a
     different methodology have been published [12]:
        a. Czech
        b. Dutch
        c. French
        d. German
        e. Chinese Pinyin
        f. Romanian
  2. Each of these high-quality languages will take approximately 15-20 hours
     for ten-fold cross-validation, probably more depending on the amount of
     data received for them.
  3. Manually look through the execution of a couple of languages based on
     the inputs and results.
  4. Design a template for documenting the results for all languages that
     depicts them in a manner that is easy to compare and read, and document
     and publish the results for the above six languages on the wiki.
Milestone #3

Week 4:
  1. Work on 10 languages supported by Apertium. Each language will take
     around 15 hours for the ten-fold cross validation.
  2. Manually look through the execution of 1 language based on the inputs
     and results.
  3. #Buffer time for any backlog.
  4. Documentation of results on wiki.
Deliverable #2: The entire technique implemented in C++.

Week 5:
Work on the next 10 languages supported by Apertium.
Work on the technique for unicodification of Indian scripts.
Debug and document the code and publish results on wiki.

Week 6:
Work on the next 10 languages supported by Apertium.
Implement the technique for unicodification of Indian scripts.
Debug and document the code and publish results on wiki.

Week 7:
Work on Indian script languages including Hindi, Marathi, Bengali, Sanskrit,
Punjabi etc.
Integrate charlifter in the Apertium pipeline.
Debug and document the code and publish results on wiki.

Week 8:
#Buffer time for unicodification of Indian language scripts and more Indian
languages to be tested upon.
Test the implementation in the Apertium pipeline.
Milestone #4
Deliverable #3: Charlifter integrated in Apertium.

Week 9:
Work on the set of languages that might be supported by Apertium in the near
future (those under development).
Provide test cases with documentation.
Suggestions/feedback from the community will be requested.
Debug and document the code and publish results on wiki.

Week 10:
Work on the next 10 languages from the collection of web corpora that might
have a good amount of diacritic usage.
Debug and document the code and publish results on the wiki.
Integrate suggestions, put the finishing touches on charlifter in Apertium,
update, clean up.
Test across various platforms to ensure that the changes work as expected on
all platforms.

Week 11:
Evaluation and release.
Finishing touches to the online wiki and tutorials. Final report.
The suggested GSoC pencils-down date is 9th August.
#Buffer time for any possible time lag for any reason.

Week 12:
#Buffer time reserved for any unforeseen circumstance that might occur (like
interruption of a run during one of the 15-hour training stretches, etc.).

Future scope [Not a part of the summer project proposal but will look
over if time permits during and after summer]:
   1. The challenge of diacritic restoration in Cyrillic script could be worked
      on, where most diacritic'd letters cannot be represented even using
      Unicode. I will try to work on this once the approach for Indian scripts
      is finalized, but I am not including it in the proposed deliverables, on
      account of my lack of familiarity with Cyrillic and the corresponding
      issues.
   2. Working on the rest of the languages for which we have web corpora
      available.
   3. Once the technique is ready and we have proven its performance and set a
      benchmark, it is all a matter of imagination where it could be used: a
      plugin for OpenOffice that provides suggestions for correct diacritics on
      the fly; forums, blogs and websites showing "The following content has
      some wrong grammar/diacritics, would you like to see the corrected
      version?"; autocorrect/autosuggest features based on this; and so on.
   4. A game like Google Image Labeler [15]: from a set of sentences in our web
      corpora, letting users choose which diacritic is correct based on the
      context. This could help us improve the quality of the web corpora we
      obtain through crawling.
   5. Looking to the future, we expect the quality of web corpora themselves
      to improve, as integration of unicodification into authoring tools helps
      overcome both the lack of proper keyboards and any unfamiliarity with
      proper orthography that may exist in some language communities.
   6. A research paper on unicodification of Indian scripts :) [and Cyrillic
      script?]
Note: Since my task will involve quite a lot of time in training and testing,
and I have kept a few buffers, I plan to work on the aforementioned tasks in
that time and will in any case continue after the summer.

In the proposal, list your skills and give evidence of your
qualifications.
I am a 3rd-year student at Birla Institute of Technology and Science, Pilani,
India [16], presently pursuing a B.E. (Hons.) in Computer Science. By the end
of this semester I will have completed all the structured Computer Science
courses, including Data Structures and Algorithms, Programming Languages and
Compiler Construction, Theory of Computation, Object-Oriented Programming
(OOP), and Computer Programming 1 & 2 (CP1 & CP2). I have been a professional
assistant for the OOP and CP2 courses, where my responsibility has been to
design lab questions, conduct labs, and help students with issues during the
lab. I have also achieved a score of 95% in the Sun Certified Java Programmer
(SCJP 5) examination. [I can send a validation report to an email address at
any time if need be.]
                      Automatic Diacritic Restoration
My formal education as an engineer has provided me with the technical and
analytic skills required for this project. I carried out quite a lot of
independent study while doing the "Self-improving Phonetic Matching Algorithm"
project, and I am always eager to read anything new in this field. I plan to
publish a paper on that work soon.

I have been using open-source software exclusively since my 2nd year. I have
often tweaked the source code of open-source software (be it simply to put
custom text on buttons, labels, etc., which I do a lot) but have not yet made
any big code contributions.

I am the head of the Open Source Club at my institute, a group of 30 students
who organize open-source activities, lectures, events, workshops, etc. to
promote the culture of open-source software among the students and to share
knowledge, which was unfortunately not very widespread among the students until
last year. We recently organized OSScamp [17], India's largest unconference on
open source, at BITS Pilani from 12th to 14th March during APOGEE, the
international technical festival of BITS Pilani. I was also the coordinator of
the Computer Science Association (a group of 56 students) of BITS Pilani for
APOGEE 2010, and led it to huge success, with the maximum number of project
prizes among all the discipline associations and highly successful events.

Working with the Apertium team has been a great experience so far, and I am
really happy to have a chance to take part in this project, which promises to
be a great learning experience for me. Given the opportunity, I will take
responsibility for this project, make sure that all the deadlines are met, and
make sure that the output is “awesome”.

Please list any non-Summer-of-Code plans you have for the summer:

I will be completely dedicated to this full time until 1st August, which does
not overlap with anything else. From 1st August my class work will restart, but
since it is my final semester and I will be taking only 2 courses, I will still
be able to manage 50 hours a week for that period.

References:
[1]: http://www.bits-apogee.org/
[2]: http://www.search.com/reference/Languages_of_India
[3]: K. Scannell (2010) "Statistical Unicodification of African Languages".
[4]: http://sourceforge.net/projects/lingala/ &
http://logipam.org/charlifter/index.php
[5]: http://www.ipernity.com/blog/team/69143%7CR0E;off%3D0?r[off]=100&
[6]:
http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacrit
ics
[7]: Mikel L. Forcada, "Open-source machine translation: an opportunity for
minor languages", Proceedings of the 5th SALTMIL workshop, LREC 2006, Genoa,
Italy.
[8]: http://www.rennert.com/translations/resources/diacritics.htm
[9]: http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation
[10]: Simard, Michel (1998). "Automatic Insertion of Accents in French
Texts". Proceedings of EMNLP-3. Granada, Spain.
[11]: Rada F. Mihalcea. (2002). "Diacritics Restoration: Learning from Letters
versus Learning from Words". Lecture Notes in Computer Science 2276/2002
pp. 96--113
[12]: G. De Pauw, P. W. Wagacha; G.M. de Schryver (2007) "Automatic diacritic
restoration for resource-scarce languages". Proceedings of Text, Speech and
Dialogue, Tenth International Conference. pp. 170--179
[13]: P.W. Wagacha; G. De Pauw; P.W. Githinji (2006) "A grapheme-based
approach to accent restoration in Gĩkũyũ". Proceedings of the Fifth
International Conference on Language Resources and Evaluation.
[14]: D. Yarowsky (1994) "A Comparison Of Corpus-Based Techniques For
Restoring Accents In Spanish And French Text". Proceedings, 2nd Annual
Workshop on Very Large Corpora. pp. 19--32
[15]: http://images.google.com/imagelabeler/
[16]: http://www.bits-pilani.ac.in/
[17]: http://osscamp.in/
[18]: http://wiki.apertium.org/wiki/Evaluating_with_Wikipedia

				