www.RDI-eg.com





Automatic Full Phonetic Transcription of Arabic Script with and without Language Factorization



Based on research conducted by RDI’s NLP group (2003-2009)

http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm

Mohsen Rashwan, Mohamed Al-Badrashiny, and Mohamed Attia



Presented by

Mohamed Attia





Talk hosted by

Group of Computational Linguistics - Dept. of Computer Science

University of Toronto – Toronto - Canada

Oct. 7th, 2009

Automatic Full Phonetic Transcription of Arabic Script, with and without Language Factorization (Oct. 2009)





The Problem of Ambiguity in NLP

• Many non-trivial NLP tasks that are handled by rule-based (i.e. language-factorizing) methods typically end up with multiple possible solutions/analyses; e.g. morphological analysis, PoS tagging, syntax analysis, lexical semantic analysis, etc.

• This residual ambiguity arises from our incomplete knowledge of the underlying dynamics of the linguistic phenomenon, and possibly also from the lack of higher language-processing layers constraining that phenomenon; e.g. the absence of a semantic analysis layer constraining morphological and syntactic analysis.

• Statistical methods are well known to be among the most effective, feasible, and widely adopted approaches (if not the most) for automatically resolving that ambiguity.









2/39 CL group - Dept. of CS – U of T – Toronto - Canada










Statistical disambiguation of factorized sequences of language entities



3/39 CL group - Dept. of CS – U of T – Toronto - Canada






Intermediate Ambiguous NLP Tasks

• Sometimes such ambiguous NLP tasks are not sought for the sake of their own outputs, but as an intermediate step towards inferring another, final output.

• An example is the problem of automatically obtaining the phonetic transcription of a given crude (undiacritized) Arabic text w1 … wn. The transcription can be inferred directly, as a one-to-one mapping, from the diacritics on the characters of the input words. But these diacritics are typically absent from MSA script!

• The NLP solution to this TTS problem is to infer the diacritics d1 … dn indirectly, by factorizing the crude input words via morphological analysis, PoS tagging, and Arabic phonetic grammar. Slides 13 to 26 review these language factorization models.

• However, these language factorization processes are themselves highly ambiguous!



4/39 CL group - Dept. of CS – U of T – Toronto - Canada










Arabic morphological analysis as an intermediate, ambiguous language factorization towards the target output: the diacritics of the input words





5/39 CL group - Dept. of CS – U of T – Toronto - Canada






Why Not Go without Language Factorization Altogether?

• Some researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely sequence of outputs, why not go fully statistical; i.e. work un-factorized from the very beginning and give up the burden of rule-based methods?

• For our example, this means that the statistical disambiguation (as well as the statistical language models) is built from manually diacritized text corpora in which both the spelling characters and their full diacritics are supplied for each word.
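As a minimal sketch of how such an un-factorized resource might be collected from a manually diacritized corpus, the snippet below maps each undiacritized spelling to the diacritized forms observed for it. The function names and the toy corpus are illustrative assumptions, not the authors' implementation; the range U+064B to U+0652 covers the tanween, short-vowel, shadda, and sukun marks.

```python
from collections import defaultdict

# Arabic diacritic code points: tanween, short vowels, shadda, sukun.
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def strip_diacritics(word: str) -> str:
    """Return the undiacritized (crude) spelling of a diacritized word."""
    return "".join(ch for ch in word if ch not in ARABIC_DIACRITICS)


def build_dictionary(diacritized_words):
    """Map each crude spelling to the set of diacritized forms seen for it."""
    table = defaultdict(set)
    for word in diacritized_words:
        table[strip_diacritics(word)].add(word)
    return dict(table)


# Toy corpus (hypothetical): two diacritizations of the same three letters.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kaf+damma, ta+damma, ba
dictionary = build_dictionary([KATABA, KUTUB, KATABA])
```

Each dictionary entry is exactly the set of candidates that a statistical disambiguator would later have to choose among for that spelling.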





6/39 CL group - Dept. of CS – U of T – Toronto - Canada






Cannot Cover, but How Accurate and How Fast?

• The obvious answer in many such cases (including our example) is that language factorization is needed to overcome the problem of poor coverage that arises when the input language entities are produced by a highly generative linguistic process; e.g. Arabic morphology.

• However, that sound question may be modified to ask about the performance (accuracy and speed) of statistically disambiguating un-factorized language entities (at least the frequent ones that can be covered without factorization) as compared to statistically disambiguating factorized language entities.

• The rest of this presentation discusses four issues in this regard:

1- The statistical disambiguation methodology deployed in both cases.

2- The related Arabic NLP factorization models and the architecture of the factorizing system.

3- The architecture of the hybrid (factorizing/un-factorizing) Arabic phonetic transcription system.

4- Results analysis: the factorizing system vs. the hybrid system, and the hybrid system vs. other groups' systems.





7/39 CL group - Dept. of CS – U of T – Toronto - Canada




1- Statistical Disambiguation Methodology

Noisy Channel Model for Statistical Disambiguation









With the maximum a posteriori probability (MAP) criterion:

• For our example, O is the sequence of crude Arabic input text words.

- In the case of the factorizing system, I is any valid sequence of factorizations, e.g. Arabic morphological analyses (quadruples), and the ^ denotes the most likely one.

- In the case of the un-factorizing system, I is any valid sequence of diacritics, and the ^ denotes the most likely one.
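The MAP criterion named on this slide can be written out as follows. This is the standard textbook form, offered as a sketch consistent with the terms P(O|I) and P(I) discussed on the following slides rather than as a copy of the slide's own equation; \hat{I} is the quantity the text marks with ^.

```latex
\hat{I} = \arg\max_{I} P(I \mid O)
        = \arg\max_{I} \frac{P(O \mid I)\, P(I)}{P(O)}
        = \arg\max_{I} P(O \mid I)\, P(I)
```

The denominator P(O) is dropped in the last step because it does not depend on I.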





8/39 CL group - Dept. of CS – U of T – Toronto - Canada




1- Statistical Disambiguation Methodology

Likelihood Probability

In other pattern recognition problems, e.g. OCR and ASR, the term P(O|I), referred to as the likelihood probability, is modeled via probability distributions; e.g. HMMs.

Our language factorization models enable us to do better by viewing the availability of possible structures for a given input string, in terms of probabilities, as a binary decision of whether or not the observed string complies with the formal rules of the factorization models. This simplifies the MAP formula into a maximization of P(I) over R(O) alone, where R(O) is the part of the space of the factorization model corresponding to the observed input string.

• In the case of the factorizing system, I is now restricted to only those factorized sequences that can generate (via synthesis) the input sequence, and the ^ denotes the most likely one.

• In the case of the un-factorizing system, I is any possible sequence of diacritics matching the input sequence, and the ^ denotes the most likely one.
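The restricted search can be illustrated with a small sketch. It assumes that R(O) is given as one list of candidates per input word and that P(I) is supplied as a log-probability function; the candidate labels and unigram scores are invented for illustration, and the exhaustive enumeration merely stands in for the A* search that the slides go on to describe.

```python
import math
from itertools import product


def most_likely_sequence(candidates_per_word, log_prob):
    """Score every sequence in R(O) and return (best_sequence, best_log_prob)."""
    best_seq, best_score = None, -math.inf
    for seq in product(*candidates_per_word):  # Cartesian product of columns
        score = log_prob(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score


# Hypothetical unigram language model over candidate labels.
TOY_P = {"a1": 0.6, "a2": 0.4, "b1": 0.1, "b2": 0.9}


def unigram_log_prob(seq):
    """log P(I) under the toy model: sum of per-entity log-probabilities."""
    return sum(math.log(TOY_P[q]) for q in seq)


restricted_space = [["a1", "a2"], ["b1", "b2"]]  # R(O) for a two-word input
best, score = most_likely_sequence(restricted_space, unigram_log_prob)
```

Because the likelihood term is treated as binary, it never appears in the scoring function: it only decides which candidates are listed in `restricted_space`.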







9/39 CL group - Dept. of CS – U of T – Toronto - Canada




1- Statistical Disambiguation Methodology

Statistical Language Models, and Search Space

The term P(I) is conventionally called the (statistical) language model (SLM).

Let us replace the conventional symbol I by Q, which is more convenient for our specific set of problems.

With the aid of the first figure in this presentation (slide 3), the problem is now reduced to searching for the most likely sequence of q_{i,f(i)}; 1 ≤ i ≤ L, i.e. the one with the highest marginal probability, through the lattice of candidates.

This creates a Cartesian search space, whose size is the product of the numbers of candidates at each position.

The A* search algorithm is guaranteed to exit with the most likely path by combining two tree-search strategies:









10/39 CL group - Dept. of CS – U of T – Toronto - Canada




1- Statistical Disambiguation Methodology

Lattice Search, and n-Gram Probabilities

1- Heuristic estimation of the probability of the rest of the path to be expanded next. This is called the h* function.

combined with

2- Best-first tree expansion of the path with the highest sum of the start-to-expansion probability (the g function) plus the h* function.
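The two strategies above can be sketched as a best-first search over a word lattice. The sketch works with log-probabilities (so that g and h* add), assumes a bigram model for simplicity, and uses as h* the sum of the best single transition into each remaining column, which never underestimates the achievable log-probability. The function names and toy probabilities are assumptions for illustration, not the authors' implementation.

```python
import heapq
import math


def a_star_lattice(lattice, bigram_logp, start="<s>"):
    """Return (best_path, log_prob) through a lattice under a bigram model."""
    n = len(lattice)
    # best_step[i]: highest log-probability of any single transition into column i.
    best_step = []
    for i, column in enumerate(lattice):
        prevs = [start] if i == 0 else lattice[i - 1]
        best_step.append(max(bigram_logp(p, q) for p in prevs for q in column))
    # h_star[i]: optimistic log-probability of completing a path of length i.
    h_star = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h_star[i] = h_star[i + 1] + best_step[i]
    # Min-heap keyed on -(g + h*), so the most promising partial path pops first.
    frontier = [(-h_star[0], 0.0, ())]
    while frontier:
        _, g, path = heapq.heappop(frontier)
        i = len(path)
        if i == n:  # first complete path popped is the most likely one
            return list(path), g
        prev = path[-1] if path else start
        for q in lattice[i]:
            g_next = g + bigram_logp(prev, q)
            heapq.heappush(frontier, (-(g_next + h_star[i + 1]), g_next, path + (q,)))
    return None, -math.inf


# Hypothetical bigram probabilities for a two-word lattice.
TOY_BIGRAMS = {
    ("<s>", "a1"): 0.5, ("<s>", "a2"): 0.5,
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.1,
    ("a2", "b1"): 0.2, ("a2", "b2"): 0.8,
}


def toy_bigram_logp(prev, cur):
    """Log-probability of cur given prev, with a small floor for unseen pairs."""
    return math.log(TOY_BIGRAMS.get((prev, cur), 1e-6))


path, logp = a_star_lattice([["a1", "a2"], ["b1", "b2"]], toy_bigram_logp)
```

A tighter h* prunes more of the lattice; the only requirement for the optimality guarantee is that it never falls below the true best completion score.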



It is then required to estimate the marginal probability of any whole or partial path in the lattice. Via the chain rule and the attenuating correlation assumption, this probability is approximated by a product of conditional n-gram probabilities, where h+1 is the maximum affordable length of n-grams in the SLM.
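Written out, the approximation described here takes the usual n-gram form. The following is a sketch under the assumption that each entity is conditioned on at most its h predecessors (so that the longest n-gram has length h+1); the second lattice index f(i) used earlier is dropped for readability.

```latex
P(q_1, \dots, q_L) = \prod_{i=1}^{L} P(q_i \mid q_1, \dots, q_{i-1})
\approx \prod_{i=1}^{L} P\bigl(q_i \mid q_{\max(1,\,i-h)}, \dots, q_{i-1}\bigr)
```

The equality is the chain rule; the approximation is the attenuating correlation assumption, which truncates each history to the h most recent entities.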









11/39 CL group - Dept. of CS – U of T – Toronto - Canada




1- Statistical Disambiguation Methodology

Computing Probabilities of n-Grams with Zipfian Sparseness

• These conditional probabilities are primarily calculated via Bayes' formula. Owing to Zipfian sparseness, Good-Turing discounting and Katz's back-off are also deployed to obtain smooth distributions and reliable estimates of rare and unseen events, respectively.

• While the database of elementary n-gram probabilities P(q1…qn); (1≤n≤h) is built during the training phase, the task of statistical disambiguation at runtime reduces to finding the lattice path with the highest approximated probability.
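To make the discounting step concrete, here is a minimal sketch of the basic Good-Turing count adjustment r* = (r + 1) · N_{r+1} / N_r, where N_r is the number of distinct n-grams seen exactly r times. It is an illustration only: the fallback to the raw count when N_{r+1} is zero is a simplifying assumption, Katz's back-off itself is not implemented, and the toy counts are invented.

```python
from collections import Counter


def good_turing_adjusted_counts(ngram_counts):
    """Return {ngram: r*} using r* = (r + 1) * N_{r+1} / N_r where defined."""
    freq_of_freq = Counter(ngram_counts.values())  # N_r for each observed r
    adjusted = {}
    for ngram, r in ngram_counts.items():
        n_r = freq_of_freq[r]
        n_r_plus_1 = freq_of_freq.get(r + 1, 0)
        # Assumption: keep the raw count when no n-gram was seen r + 1 times.
        adjusted[ngram] = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 else float(r)
    return adjusted


def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen events: N_1 / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for r in ngram_counts.values() if r == 1)
    return singletons / total if total else 0.0


# Hypothetical counts: three singletons, one doubleton, one tripleton.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}
adjusted = good_turing_adjusted_counts(toy_counts)
reserved = unseen_mass(toy_counts)
```

In a back-off scheme, the mass returned by `unseen_mass` is what gets redistributed to unseen n-grams according to their lower-order estimates.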









12/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Arabic Phonetic Transcription: Problem Definition









Although Arabic is an intensively diacritized language, Modern Standard Arabic (MSA) is typically written by contemporary native writers without diacritics!

So it is the task of the NLP system to accurately infer all the missing diacritics of all the words in the input Arabic text, and also to amend those diacritics to account for the mutual phonetic effects among adjacent words in continuous pronunciation.



13/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Challenges of Arabic Phonetic Transcription

• Modern Standard Arabic (MSA) is typically written without diacritics.

• MSA script typically contains many common spelling mistakes.

• The extremely derivative and inflective nature of Arabic necessitates treating it as a morpheme-based rather than a vocabulary-based language. The size of the generable Arabic vocabulary is in the order of billions!

• In about 65% of the words in Arabic text, one or more diacritics depend on the syntactic case ending of the word.

• Lexical and syntactic grammars alone produce a high average number of possible solutions at each word of the text (high ambiguity).

• About 7.5% of the words in open-domain Arabic text are transliterated words, which lack any constraining Arabic model. Moreover, many of these words can be confusingly analyzed as normal Arabic words!

14/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

The Ladder of NLP Layers; Undiscovered Levels

• Theoretically speaking, NLP problems should be tackled combinatorially at all the NLP layers, which is still far beyond the reach of the current state of the art.

• Moreover, NLP researchers have not yet developed firm knowledge at all the NLP layers.









15/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Language Factorizations Deployed for Solving the Problem

• Arabic morphological analysis (with statistical disambiguation) is deployed to retrieve the syntax-independent lexical phonetic information of each input Arabic word from its constituent morphemes.

• Arabic PoS tagging (along with morphological analysis) is deployed to statistically infer the most likely syntax-dependent (case-ending) phonetic information of the input Arabic words.

• For transliterated (foreign) words, the intra-word Arabic phonetic grammar is deployed to constrain the statistical search for the most likely diacritization that matches the spelling of each input transliterated word.

• The inter-word Arabic phonetic grammar is deployed (synthetically) to phonetically concatenate fully diacritized adjacent words of all kinds.









16/39 CL group - Dept. of CS – U of T – Toronto - Canada










The Architecture of the Factorizing Arabic Phonetic Transcription System









17/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Arabic Morphological Structure: Morphemes

• Arabic is a highly derivative and inflective language whose words can be decomposed into a relatively compact set of morphemes.

• Our Arabic morphological model acknowledges the following morphemes:

P: 260 prefixes.
Rd: 4,600 derivative roots.
Frd: 1,000 regular derivative patterns.
Fid: 300 irregularly derived words.
Rf: 260 roots of fixed words.
Ff: 300 fixed words.
Ra: 240 roots of Arabized words.
Fa: 290 Arabized words.
S: 550 suffixes.

Word structure:
Word = P + Body + S
Body: Derivative (Rd, Frd, Fid) or Non-derivative
Non-derivative: Fixed (Rf, Ff) or Arabized (Ra, Fa)
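A quick back-of-the-envelope calculation shows why a vocabulary list cannot enumerate the output of this morpheme inventory. The sketch below multiplies the inventory sizes quoted above for regular derivative words only. It is an unconstrained upper bound under the assumption that every prefix, root, pattern, and suffix could combine freely; the lexicon's compatibility rules admit only a fraction of these combinations, so the figure is not a count of valid words.

```python
# Morpheme inventory sizes quoted on the slide (regular derivative words only).
INVENTORY = {"P": 260, "Rd": 4_600, "Frd": 1_000, "S": 550}


def derivative_upper_bound(inventory):
    """Unconstrained number of prefix x root x pattern x suffix combinations."""
    return inventory["P"] * inventory["Rd"] * inventory["Frd"] * inventory["S"]


bound = derivative_upper_bound(INVENTORY)
print(f"{bound:,}")  # prints 657,800,000,000
```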





18/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Arabic Morphological Structure: Lexicon

A comprehensive Arabic lexicon has been built as the repository of the linguistic (orthographic, phonological, morphological, and syntactic) description of each Arabic morpheme. All of each morpheme's possible interactions with other morphemes are registered as extensively as possible in a compact structured format.

This lexicon is the core of all our language factorizations.









19/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Canonical Structure of Arabic Morphology



w → q = (t : p, r, f, s)
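The canonical form above can be pictured as a small record type. In the sketch below, the reading of the fields (t as the structure type, and p, r, f, s as prefix, root, form, and suffix identifiers) is an assumption inferred from the morpheme classes listed earlier, and the identifiers in the example are hypothetical placeholders rather than entries from the actual lexicon.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One candidate factorization q of an input word w."""
    t: str  # structure type (assumed: "derivative", "fixed", or "arabized")
    p: str  # prefix identifier
    r: str  # root identifier
    f: str  # form (pattern or word) identifier
    s: str  # suffix identifier


# Hypothetical candidate analyses for a single undiacritized input word.
candidates = {
    "w": [
        Analysis("derivative", "P0", "Rd17", "Frd3", "S0"),
        Analysis("fixed", "P0", "Rf5", "Ff9", "S2"),
    ]
}


def ambiguity(word, table):
    """Number of competing analyses the disambiguator must choose among."""
    return len(table.get(word, []))
```

Because the tuples are immutable and hashable, they can serve directly as the nodes of a disambiguation lattice and as keys in n-gram count tables.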









20/39 CL group - Dept. of CS – U of T – Toronto - Canada








The Multiplicity of Possible Arabic Lexical Analyses









21/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

The Arabic Lexical Disambiguation Lattice









After this process we obtain the diacritics of each Arabic word, except for the case-ending ones.





22/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

The Arabic Case Endings Disambiguation Lattice









After this process we obtain the case-ending diacritics of each Arabic word.









23/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Inferring the Diacritization of Transliterated Words

• Foreign names and terminology frequently appear as transliterated Arabic strings in real-life Arabic text, at a rate of 7.5% (approx. 1/14).

• These words are not constrained by Arabic morphological or syntactic models.

• A look-up-table-based approach is not a viable solution due to:
- Its incompleteness and poor coverage.
- Its intolerance of spelling variation.
- Its inability to attach Arabic infixes.
- Its lack of any guarantee of compliance with Arabic phonology.
and above all:
- The time-varying nature of this problem.

• Our approach was then to go statistical at the phoneme level; however, this would generate too wide a search space, and too high a perplexity, to get good results.

• To limit the search space, we constrain the search with another NLP model at the phonology layer: the Intra-Word Arabic Phonetic Grammar.









24/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Disambiguation Lattice of Transliterated Words









After this process we obtain the diacritics of each transliterated word.





25/39 CL group - Dept. of CS – U of T – Toronto - Canada




2- Arabic NLP Factorization Models

Intra Word Arabic Phonetic Grammar









26/39 CL group - Dept. of CS – U of T – Toronto - Canada




3- The Hybrid Factorizing/Un-factorizing Transcriptor

Adding the Un-factorizing Phonetic Transcriptor

• The un-factorizing diacritizer simply tests the spelling of each input word against a dictionary of final-form words; i.e. a vocabulary list.

• For each sequence of consecutive input words that are all covered by that dictionary (henceforth called a "segment"), the possible diacritizations of each word are retrieved directly, without any language factorization. The resulting lattice of diacritizations for each segment is then statistically disambiguated.

• Uncovered segments (along with the disambiguated diacritizations of the covered segments) are then sent to the factorizing transcriptor, which infers the most likely diacritization of the uncovered segments and phonetically concatenates the words in all segments.
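The routing described above can be sketched in a few lines. The sketch assumes the dictionary maps each crude spelling to its known diacritizations, and it uses placeholder Latin tokens instead of Arabic words; the function names and the labels "un-factorizing" and "factorizing" are illustrative and are not taken from the authors' system.

```python
def split_into_segments(words, dictionary):
    """Group consecutive words into (covered, [words]) segments."""
    segments = []
    for word in words:
        covered = word in dictionary
        if segments and segments[-1][0] == covered:
            segments[-1][1].append(word)  # extend the current segment
        else:
            segments.append((covered, [word]))  # start a new segment
    return segments


def route(words, dictionary):
    """Attach to each segment the component that would handle it."""
    plan = []
    for covered, segment in split_into_segments(words, dictionary):
        if covered:
            # Covered segment: a lattice of dictionary diacritizations,
            # to be disambiguated statistically without factorization.
            lattice = [sorted(dictionary[w]) for w in segment]
            plan.append(("un-factorizing", lattice))
        else:
            # Uncovered segment: handed to the factorizing transcriptor.
            plan.append(("factorizing", segment))
    return plan


# Hypothetical dictionary and input: "x" is out of vocabulary.
toy_dictionary = {"w1": {"w1a", "w1b"}, "w2": {"w2a"}}
plan = route(["w1", "w2", "x", "w1"], toy_dictionary)
```

Here the first two words form one covered segment with a 2 × 1 lattice, "x" forms an uncovered segment, and the final "w1" starts a new covered segment.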

27/39 CL group - Dept. of CS – U of T – Toronto - Canada




3- The Hybrid Factorizing/Un-factorizing Transcriptor

The Architecture of the Hybrid Transcriptor









28/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Experimental Evaluation of both Architectures

Two sets of experiments and result analyses were performed to evaluate our Arabic phonetic transcription work:

• Experiments comparing the performance of the purely factorizing architecture with that of the hybrid factorizing/un-factorizing one.

• Experiments comparing the performance of the better of our two architectures with the best reported systems produced by rival R&D groups.

The first set of experiments shows the hybrid architecture to outperform the purely factorizing one, while the second set shows our hybrid system to be superior to those of the rival groups.









29/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Comparing with Best Rivals; Experimental Setup

The two best rival systems reported in the published literature on the problem of full automatic Arabic phonetic transcription are:

• The N. Habash & O. Rambow group at Columbia Univ., whose architecture is a language-factorizing one, using the Support Vector Machine Tool (SVMTool) for statistical modeling/disambiguation. They also build an open-vocabulary SLM with Kneser-Ney smoothing using the SRILM toolkit. (2007)

• The I. Zitouni, J. S. Sorensen, R. Sarikaya group at IBM's WRC, whose architecture is also a language-factorizing one, using a Maximum Entropy framework for statistical modeling/disambiguation. (2006)

Both groups evaluated their performance by training and testing their systems on LDC's Arabic Treebank of diacritized news stories (LDC2004T11; text–part 3, v1.0), published in 2004.

This Arabic text corpus, which includes a total of 600 documents ≈ 340K words from the AnNahar (Lebanese) newspaper, is split into training data ≈ 288K words and test data ≈ 52K words.





30/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Comparing with Best Rivals; Experimental Results

To obtain a fair comparison with the work of Habash & Rambow's group and of Zitouni et al.'s group:

• We used the same aforementioned training and test corpus from LDC's Treebank.

• We adopted their same metrics for counting errors while evaluating our hybrid system against theirs.

As each of the other two groups deploys more sophisticated statistical tools than ours, one can attribute the superior performance of our system to hybridizing the un-factorizing transcriptor with the factorizing one in our system architecture.



31/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Comparing the Factorizing to the Hybrid Architecture;

Experimental Setup



It is insightful to know not only how much better the hybrid transcriptor is than the purely factorizing one, but also how the error margin evolves in both cases as the size of the annotated training text corpora increases.

To this end, domain-balanced annotated Arabic training text corpora with a total size of 3,250K words have been developed (over years), with manually supervised full Arabic morphological analysis and diacritization applied to every word.

Another domain-balanced (tough) test set of 11K words has also been prepared in both annotated and un-annotated formats.

At approximately log-scale steps of the training corpora size, the statistical models (with the same equivalent h) were built, and the following metrics were measured for each of the two architectures:

• Error margin.

• Average execution time per query.

• Average size of the SLMs.



32/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Comparing the Factorizing to the Hybrid Architecture;

Experimental Results

• Both systems asymptote to the same irreducible error margin.
Justification: Despite being put in two different formats, the SLMs of both systems are built from the same data and hence have the same information content.

• The hybrid system has a faster learning curve than the purely factorizing one.
Justification: The un-factorizing component suggests fewer candidate diacritizations (by looking them up in the dictionary) than the factorizing component (which generates all the possibilities), which in turn leads to less ambiguity. Owing to the Zipfian distribution of natural language, a small dictionary (built from small training data) can quickly capture the frequent words.
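The Zipfian argument can be made concrete with a small calculation. The sketch below assumes an idealized Zipf distribution in which the word of rank k has frequency proportional to 1/k^s; the exponent s = 1 and the vocabulary sizes in the usage line are illustrative assumptions, not measurements from the corpora described here.

```python
def zipf_coverage(top_k, vocab_size, s=1.0):
    """Fraction of running text covered by the top_k most frequent word types."""
    weights = [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


# Share of tokens covered by the 1,000 most frequent of 100,000 word types.
share = zipf_coverage(1_000, 100_000)
```

Under these idealized assumptions, the most frequent 1% of word types already accounts for well over half of the running text, which is the effect the justification above relies on.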

33/39 CL group - Dept. of CS – U of T – Toronto - Canada




4- Results Analysis

Comparing the Factorizing to the Hybrid Architecture;

Experimental Results (cont’d)

• The hybrid system has been found to be approximately twice as fast as the purely factorizing one in terms of average execution time per transcription query.
Justification: The purely factorizing system spends extra time on language factorizations, while the hybrid system's slimmer lattice requires less A* search time.

• The storage needed for the SLMs of the un-factorizing system has been found to be 8 times smaller (on average) than that of their counterparts in the purely factorizing one.
N.B. The storage needed for the SLMs of the hybrid system is the sum of those needed for the factorizing and un-factorizing components.
Justification: Extra space is needed to store many more lower-order n-grams in the factorizing system than in the un-factorizing one.







34/39 CL group - Dept. of CS – U of T – Toronto - Canada




Relevant Publications by: I- Competing Groups



(Columbia Univ. group)
- N. Habash, O. Rambow, Arabic Diacritization through Full Morphological Tagging, Proceedings of the 8th Meeting of the North American Chapter of the Association for Computational Linguistics (ACL); Human Language Technologies Conference (HLT-NAACL), 2007.

(IBM group)
- I. Zitouni, J. S. Sorensen, R. Sarikaya, Maximum Entropy Based Restoration of Arabic Diacritics, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL); Workshop on Computational Approaches to Semitic Languages; Sydney - Australia, July 2006; http://www.ACLweb.org/anthology/P/P06/P06-1073.









35/39 CL group - Dept. of CS – U of T – Toronto - Canada




Relevant Publications by: II- Our Group (RDI’s)



1- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Un-factorized Textual Features, IEEE Transactions on Audio, Speech, and Language Processing (TASLP) http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP. (Accepted but not published yet)

2- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Hybrid Diacritizer, 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09); http://caai.cn:8080/nlpke09/, Dalian-China, Sept. 2009.

3- Al-Badrashiny, M., Automatic Diacritization for Arabic Texts, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, June 2009: http://www.rdi-eg.com/rdi/Downloads/ArabicNLP/Mohamed-Badashiny_MSc-Thesis_June2009.pdf.









36/39 CL group - Dept. of CS – U of T – Toronto - Canada




Relevant Publications by: II- Our Group (RDI’s) “Cont’d”



4- Attia, M., Rashwan, M., Al-Badrashiny, M., Fassieh; a Semi-Automatic Visual Interactive Tool for the Morphological, PoS-Tags, Phonetic, and Semantic Annotation of the Arabic Text, IEEE Transactions on Audio, Speech, and Language Processing (TASLP) http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP: Special Issue on Processing Morphologically Rich Languages, Vol. 17, Issue 5, pp. 916-925, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=5067414&arnumber=5075778&count=21&index=6, July 2009.

5- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., A Hybrid System for Automatic Arabic Diacritization, The Proceedings of the 2nd International Conference on Arabic Language Resources and Tools, Cairo - Egypt, http://www.MEDAR.info/Conference_All/2009/index.php, Apr. 2009.

6- Attia, M., Theory and Implementation of a Large-Scale Arabic Phonetic Transcriptor, and Applications, PhD thesis, Dept. of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Sept. 2005.

7- Attia, M., A Large-Scale Computational Processor of the Arabic Morphology, and Applications, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Jan. 2000.



37/39 CL group - Dept. of CS – U of T – Toronto - Canada




Conclusions



I- A given statistical disambiguation technique operating on either factorized or un-factorized sequences of linguistic entities asymptotes to the same disambiguation accuracy as the size of the annotated training corpora grows towards infinity.

II- Disambiguating un-factorized sequences is easier to develop, computationally faster, and seems to have a faster “accuracy vs. training corpora size” learning curve.

III- With highly generative linguistic phenomena (e.g. Arabic morphology), language factorization is necessary to handle the problem of coverage.

IV- On the other hand, language factorization costs much R&D effort and is also more computationally expensive.

V- In such cases, optimal systems can be built as a hybrid of the two approaches, so that the factorizing mode is resorted to only if some un-factorized entities in the input sequence are OOV.



38/39 CL group - Dept. of CS – U of T – Toronto - Canada










Thank you for your attention.



To probe further, please visit:

http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm



You may also contact:

- Prof. Mohsen Rashwan: Mohsen_Rashwan@RDI-eg.com

- Dr. Mohamed Attia: m_Atteya@RDI-eg.com









39/39 CL group - Dept. of CS – U of T – Toronto - Canada

