A PROPOSAL FOR THE
CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIAN LANGUAGES (LDC-IL)
Focus: linguistic data

What is ‘Linguistic Data’?
• Printed words – in different scripts, fonts, platforms & environments
• Domain-specific texts (e.g. the 90-odd ones in current Indian language corpora)
• Samples of spoken corpus – telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language
• Hand-written samples
• Ritualistic use of languages – scriptures, chanting, etc.
• Language of performance – reading, recitations, enactment

But this data is of use only if it comes with linguistic annotation, because it must be tagged and aligned to be of use. THAT’S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE.
How the Idea of an Indian LDC Came
• The Brown University text corpus was adopted to build statistical language models.
• The TI-46 & TI DIGITS databases of Texas Instruments (early 80’s) were distributed by the LDC.
• The LDC at U-Penn was established in 1992.
• CIIL houses 45 million word corpora in 15 Indian languages, built with Indian DoE-TDIL support. CIIL has been distributing them to R&D groups the world over.
• Now converted into Unicode jointly with the University of Lancaster, and with another 45 million word corpora from five Indian languages coming in under the EMILLE project, it was released in early 2004.
• CIIL is now working with the University of Uppsala on corpora of lesser-known languages of India; see www.ciiluppsala-…
• SO WHAT made us PROPOSE an LDC-IL?
• The giant strides in IT that India has made.
• Because demands were made by several software and telecom giants – Reliance, IBM, HP Labs, Modular Systems & Infosys.
• Due to suggestions of the Hindi Committee.
• As decided in the 1st ILPC meeting, 2004.
RECOLLECTING THE PROPOSAL
The proposal evolved through discussions held with many institutions in India and abroad.
August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, and the FA, AS, J.S. (L), Director (L) and experts from C-DAC and IIT-Kanpur attending.
August 17 and 18, 2003: An International Workshop on LDC was held at the CIIL, Mysore in collaboration with IIIT-Hyderabad and HP Labs, India. It was inaugurated by Smt. Kumud Bansal (the then AS & now Secretary, Elementary Ed), and attended by the J.S. (L). Those who created the LDC in the USA also participated.
August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.
• The Project Drafting Committee had top NLP specialists and linguists, with the Director, CIIL as the Coordinator.
• Five experts from IIT-B, IIT-M, IISc, IIIT-Hyd & CIIL, with inputs from the industry.
• All changes were made through email exchanges, and after four teleconferences during Sept–Oct, 2003.
• Nov 18, ’03: Modified proposal submitted.
• Dec 19, 2003: During the 2nd ICON, representatives of the lead institutes met in Mysore to discuss the draft sent to them.
• The importance of creating a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government’s plan for corpora development in Indian languages.
• Indian languages often pose a difficult challenge for specialists in AI/NLP.
• The technology developers building mass-application tools/products have long been calling for availability of linguistic data on a large scale.
• However, the data should be collected, organized and stored in a manner that suits different groups of technology developers.
• These issues require us to involve a number of disciplines like linguistics, statistics, & CS.
• Further, this data must be of high quality.
• Resources must be shared, so that all R&D groups benefit.
• All these are possible with a data consortium.
Spoken language data & importance of phoneticians
• Numerous Indian languages, each with many sound patterns identified/studied by phoneticians for centuries.
• The inventory of the IPA is invaluable for a spoken language corpus, but identifying those sounds in speech data requires trained phoneticians.
• For speech technology, we have to create both phonetic and acoustic models of languages.
• Even though this is now aided and eased by visual phonetics technology, as available in CIIL or TIFR labs, what we need in addition is trained phoneticians.
THE MODEL
• An ideal model of a consortium can be seen in the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania.
• LDC (USA) is an open consortium of universities, companies & government R&D labs that creates, collects and distributes speech and text databases, lexicons, and other resources for R&D.
• This ‘LDC’ has 100-plus agencies as its active users and members. It includes some non-western languages: Arabic, Chinese, Korean.
• Its core operations became self-supporting after ten years.
• The activities include maintaining the data archives, producing and distributing CD-ROMs, and arranging networked data distribution, etc.
• All these have provided a great impetus to R&D in the field of language technology for English and other European languages.
• It is proposed to adopt a similar approach in the Indian context.
Who funded LDC in the US? (1. Govt, 2. Industry, 3. University)
• LDC was supported initially by US Govt grant IRI-9528587 from the Information and Intelligent Systems division.
• Also by grant 9982201 from the Human Computer Interaction Program of the National Science Foundation.
• Powered in part by Academic Equipment Grant 7826-990237-US from Sun Microsystems.
• No member institution could afford to produce this data on its own.
Who will set up LDC-IL in India?
What will it do actually?
• The Ministry of HRD through the Central Institute of Indian
Languages (CIIL), Mysore along with other institutions
working on Indian Languages technology like Indian
Institute of Science, Bangalore, Indian Institutes of
Technology at Mumbai and Chennai, as well as the
International Institute of Information Technology, Hyderabad
propose to set up this LDC-IL.
• It is proposed that they will be the Lead Institutions in this
initiative, with CIIL as the coordinating body.
•LDC-IL will be an archive plus.
•Besides data, tools and standards of data representation
and analysis must be developed.
•It will create, analyze, segment, tag, align, and upload
different kinds of linguistic resources.
•It will accept electronic resources from authors,
newspapers, publishers, film, TV, radio & process them for
use of the community.
Potential Participants /
Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
• All Indian Institutes of Technology;
• IIITs at Hyderabad and elsewhere;
• Universities like U of Hyderabad; DU; JNU; NEHU;
• HP Labs India;
• IBM; Infosys; Reliance Infocom;
• Language institutions like CIEFL, KHS, NCPUL & RSKS.
Major areas of Linguistic Resource
Development as proposed
• Speech Recognition
• Character Recognition
• Creation of different
kinds of Corpora
• By-products: word finders, lexicons of different kinds, thesauri, usage compilations etc.
Other possible applications
• Collocational restrictions
• TTS: statistical models
• Building a speech recognition system
• Developing tree-bank tools
• Skeletal parses
• Will form a basis for MAT or MT
IN A WAY, ALL THESE WILL BE COMPLEMENTARY
TO WHAT IS BEING PLANNED / ENCOURAGED BY THE TDIL
PROGRAMME OF MCIT
Funding & Management
• The core funding will come from the Government of India, and will span two plan periods.
• All activities will be in a project mode and through
CIIL’s PL account.
• All staff will be on contract.
• All receipts and payments through internet
gateways, or through conventional means, will go
to this special bank account.
• Will attempt to leverage expertise already available
to cut avoidable cost and delay.
• As the nodal agency, CIIL will further distribute the
relevant funding for specific sub-components of the
scheme to other academic institutions.
• An annual progress report will be submitted to the MHRD.
A. LDC-IL : Open to institutions, Research Organizations, and
Corporate sector from all over the world.
B. Will encourage members to contribute databases and
share revenues from sale of the data they contribute.
C. The databases will be available for R&D purposes to all
members and non-members on payment of the
appropriate fee, with a license for use only.
D. General membership will entitle all to a large chunk of tagged/aligned data for free; however, for specialized parts, depending on the data contributors, they will have to pay additional amounts.
E. The organization will be asked to sign a License Agreement
that the databases will not be distributed by it to others
either free or for a fee.
F. The IP and the copyright of any product developed as a
result of such an R&D activity shall lie with the
organization that has created the product.
PAC of LDC-IL
1. The LDC–IL will have a Project Advisory Committee (PAC).
2. Permanent members: Directors or nominees of the lead institutions.
3. The PAC may be expanded later.
4. The set of lead institutions may be expanded, with major enterprises joining by putting in a major corpus grant.
5. It is to be understood that even if institutions from abroad
join this Consortium the administration/governance of it will
remain with Indian members only.
6. An official of the Language Bureau nominated by the MHRD and a nominee of the MCIT will be members of the PAC. The FA of MHRD will also be a member.
7. The Director, Central Institute of Indian Languages will be
the Head of the LDC-IL. He will be assisted by a Project
Director nominated/ appointed for the purpose.
8. One expert in IPR matters, normally drawn from institutions like the National Law School University, Bangalore, will also be a member.
Differential rate of annual fee

India:
• 1. Individual researchers: Rs. 2,000/- per annum
• 2. Educational institutions: Rs. 20,000/- per annum
• 3. Software and related industry: Rs. 2,00,000/- per annum

Other countries:
• 1. Individual researchers: $2,000 per annum
• 2. Educational institutions: $20,000 per annum
• 3. Software and related industry: $50,000 per annum
GOES WITHOUT SAYING THAT THIS WOULD
REQUIRE CONSTANT UPDATION AND
UPGRADATION AS WELL AS EXPANSION OF OUR
DATA / TOOLS / PRODUCTS
• It is estimated that by the third year, LDC-IL will have 50 institutional members from India, and 200 Indian scholars as individual members, contributing Rs. 12 lakh annually.
• In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing $40,000, or about Rs. 20 lakhs.
• The attempt will be to secure
industrial support from the IT
sector internationally to raise at
least 10 institutional
memberships initially, creating a
corpus of $ 200,000 annually
by/during the third year. Should
that happen, it will generate a
substantial amount for LDC-IL.
Budget: A broad indication*
Rs. 221.60 lakhs per year. Total: Rupees
1772.8 lakhs for the next 8 years.
• 1. Human Resources: 69,84,000
• 2. Tasks: 64,76,000
• 3. Events (Meetings, workshops,
seminars & Training programs) : 50,00,000
• 4. Equipments & maintenance: 27,00,000
• 5. IPR costs & publications: 10,00,000
Total: Rs. 2,21,60,000
• NB: The Director, CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated here, without exceeding the overall budget.
• In case persons serving in the Government or in Autonomous Institutions in a substantive capacity are selected, their service and salary will be protected.
Sl.No.  Head                                                          Amount
1.      Human Resources                                               69,84,000
        (a) Project Director (1): Rs. 30,000 (variable) x 12 m        3,60,000
        (b) Scientist A (3): Rs. 29,000 x 3 x 12 m                    10,44,000
        (c) Scientist B (4): Rs. 21,000 x 8 x 12 m                    20,16,000
        (d) Scientist C (5): Rs. 14,000 x 6 x 12 m                    10,08,000
        (e) Scientist D (8): Rs. 11,000 x 8 x 12 m                    10,56,000
        (f) Project technicians: Rs. 5,000 x 20 x 12 m                12,00,000
        (g) Maint personnel – Accounts: Rs. 11,000 x 1 x 12 m         1,32,000
        (h) Maint personnel – Sales & Promo: Rs. 7,000 x 1 x 12 m     84,000
        (i) Maint personnel – General: Rs. 7,000 x 1 x 12 m           84,000
2.      Tasks at various participating institutions (as in …)         64,76,000
3.      Academic meetings in diff. Instt x 2                          2,00,000
4.      LDC-IL PAC meetings at CIIL x 2                               2,00,000
5.      Seminars & events in diff. Instt – 7 every year               15,00,000
        (a) Seminars (National) in diff. Instt x 2                    4,00,000
        (b) Seminars (Regional) in diff. Instt x 4                    6,00,000
        (c) Seminars (Int’l), rotating in diff. participating
            Instt x 1 per year                                        5,00,000
6.      (Prod) Workshops for production (6)                           6,00,000
7.      Training programmes x 4 per year                              2,00,000
8.      Travel & incidentals                                          8,00,000
        Equipments & maintenance                                      27,00,000
9.      Hardware                                                      20,00,000
10.     Software/tools                                                4,00,000
11.     Equipment maintenance                                         Variable (from …)
12.     Maintenance of LDC-IL                                         3,00,000
13.     IPR/copyright payments (variable)                             5,00,000
14.     Publications, incl. e-pub (10 a year)                         5,00,000
        TOTAL                                                         Rs. 2,21,60,000
Resource Generation – Details
• The first 2 years of the project are incubation years. It would take time to set up and test-run tools and deliverables, & to advertise.
• It is estimated that from the third year onwards, the annual revenue may be 8% to 10% of the annual investment, i.e. Rs. 17.73 lakhs to Rs. 22.16 lakhs, contributing to the corpus fund.
• From the 6th year on, it will be around 25% to 30% of the amount invested, i.e. Rs. 55.4 lakhs to Rs. 66.48 lakhs.
• At the end of eight years, there will be at least Rs. 201.66 lakhs to Rs. 243.76 lakhs plus interest in corpus funds.
• Hopefully, there will be new lead institutions to contribute to the corpus fund further, once LDC-IL works in full swing.
Core Operations to be Self-Supporting
• Beyond eight years, Govt may support only events (Rs. 50 lakhs from CIIL’s OC-Plan), tasks of software development (Rs. 64.76 lakhs from our OE-Plan), and maintenance of equipment (Rs. 15.24 lakhs from OE-Non-Plan), i.e. Rs. 130 lakhs a year.
• The services of the personnel and the IPR costs will be paid from 6% interest on the corpus funds (Rs. 14.63 lakhs) plus the anticipated annual income of Rs. 66.48 lakhs, i.e. Rs. 81.11 lakhs generated annually. With Rs. 130 lakhs as above, the total comes to Rs. 211.11 lakhs.
Speech Recognition and Synthesis: Objectives
• 1. Primarily to build speech recognition and synthesis systems.
• 2. Although there are ASR & TTS systems for many western languages, commercially viable speech systems for Indian languages are unavailable.
• 3. Voice User Interfaces for IT applications and services are useful especially in semi-urban and rural India.
• 4. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc.
• 5. However, for this a computer has to be able to accept speech input in the user’s language and provide natural speech output.
• 6. Speech technology will be all the more useful in India if coupled with translation systems between the various Indian languages.
• 7. The main obstacle to customizing this technology for the various Indian languages is the lack of appropriate annotated speech databases.
• 8. Focus: (i) to collect data that can be used for building speech-enabled systems in Indian languages and (ii) to develop tools that facilitate collection of high quality speech data.
Goals – long & short term
Long Term Goal:
The grand vision of this project is to collect data to provide speech-to-speech translation from each and every language
to each and every other language spoken in India (including Indian English). Such a system would include unlimited
vocabulary speech synthesis and recognition systems for every Indian language coupled with machine translation systems
between those languages. The basic architecture of such a system: Speech in Language A → Speech Recognition → Recognized Text in Language A → Machine Translation (Language A to B) → Translated Text in Language B → Text-to-Speech conversion in Language B → Speech in Language B.
Short Term Goal:
To create databases for building (a) bi-directional speech to speech translation system of read speech for a pair of
Indian languages, namely, Hindi-Telugu, (b) a speech recognition system for Indian English. Further, it is desired to collect
large vocabulary isolated data for the 22 Scheduled Indian languages.
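The speech-to-speech architecture described in the long-term goal can be sketched as plain function composition. The functions below are stubs standing in for real ASR, MT, and TTS components; the names and return values are illustrative assumptions, not actual system interfaces.

```python
# Sketch of the speech-to-speech pipeline as function composition.
# Each stage is a placeholder for a real component.

def speech_recognition(audio_a):
    """Speech in language A -> recognized text in language A."""
    return "text_in_A"

def machine_translation(text_a):
    """Text in language A -> translated text in language B."""
    return "text_in_B"

def text_to_speech(text_b):
    """Text in language B -> synthesized speech in language B."""
    return "speech_in_B"

def speech_to_speech(audio_a):
    # The three stages run strictly in sequence, so the quality of each
    # stage bounds the quality of the whole pipeline.
    return text_to_speech(machine_translation(speech_recognition(audio_a)))
```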
Data collection Effort for Automatic Speech Recognition (ASR)
Data required: Read speech corpora for two Indian languages and Indian English.
Channels: 1. Close talking microphone, on a desktop or laptop.
2. Telephone, both landline and mobile.
Annotation: The data will be annotated at phoneme, syllable, word and sentence levels.
Data Collection for Isolated Speech Recognition
Channels: 1. Close talking microphone, on a desktop or laptop
2. Telephone, both landline and mobile
Demography: 10,000 words from 300 speakers (150 male, 150 female)
Data Collection for Text to Speech Synthesis
Data Required: Data will be collected in the form of read-out phonetically balanced text which will ensure
coverage of all speech sounds of the language concerned in different prosodic and phonological contexts. The
phonetically balanced text will be extracted from a huge text corpus.
Channels: Speech Synthesis requires high quality recording in an anechoic chamber using high quality
microphones and recording equipment.
Demography: 6 speakers: 3 males and 3 females per language.
Annotation: Data to be annotated at phone, phoneme, syllable, word, and phrase level.
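As a rough illustration of what multi-level (phone/syllable/word) annotation means in practice, here is a toy time-aligned annotation with a consistency check. The `Segment` class, the Hindi example word, and all time values are invented for illustration; they do not follow any particular annotation standard.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # a phoneme, syllable, or word
    start: float  # start time in seconds
    end: float    # end time in seconds

# Hypothetical three-tier annotation of the Hindi word "namak" (salt):
phonemes = [Segment("n", 0.00, 0.08), Segment("a", 0.08, 0.16),
            Segment("m", 0.16, 0.24), Segment("a", 0.24, 0.31),
            Segment("k", 0.31, 0.40)]
syllables = [Segment("na", 0.00, 0.16), Segment("mak", 0.16, 0.40)]
words = [Segment("namak", 0.00, 0.40)]

def tiers_consistent(fine, coarse, ndigits=6):
    """Every coarse-tier boundary must coincide with a fine-tier boundary."""
    boundaries = ({round(s.start, ndigits) for s in fine} |
                  {round(s.end, ndigits) for s in fine})
    return all(round(c.start, ndigits) in boundaries and
               round(c.end, ndigits) in boundaries for c in coarse)

assert tiers_consistent(phonemes, syllables)
assert tiers_consistent(syllables, words)
```

Validation checks of this kind are what make the levels usable together, e.g. for extracting all phone sequences belonging to one word.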
• Speech to Speech translation for a pair of Indian
languages, namely, Hindi and Telugu.
• Command and control applications.
• Multimodal interfaces to the computer in Indian languages.
• E-mail readers over the telephone.
• Readers for the visually disadvantaged.
• Speech enabled Office Suite.
The effort for both Speech Recognition and Speech Synthesis will be repeated
across all 22 Scheduled languages. For Speech Recognition, spontaneous speech
data will be collected along with read speech. For speech synthesis, data will be
collected from professional speakers, with very good voice quality. Additional
speech data will be collected to come out with models for prosody (intonation,
duration, etc.) to improve the naturalness of synthesized speech. A database
(lexicon) of proper names (of Indian origin) will be created, with the equivalent
phonetic representation for each of the names.
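A minimal sketch of what such a proper-name pronunciation lexicon might look like; the phone symbols below are illustrative assumptions, not a standard transcription scheme.

```python
# Toy proper-name lexicon: name -> phone sequence.
# The transcriptions are invented for illustration only.
name_lexicon = {
    "Mysore": ["m", "a", "i", "s", "uu", "r", "u"],
    "Ganga":  ["g", "a", "ng", "g", "aa"],
}

def phones_for(name, lexicon):
    """Return the stored phone sequence, or None for out-of-lexicon names."""
    return lexicon.get(name)

assert phones_for("Ganga", name_lexicon) == ["g", "a", "ng", "g", "aa"]
assert phones_for("Kaveri", name_lexicon) is None  # needs a new entry
```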
• Character Recognition refers to the conversion of printed or
handwritten characters to a machine-interpretable form.
• ”Online” handwriting recognition or Online HWR refers to the
interpretation of handwriting captured dynamically using a handheld
or tablet device. It allows the creation of more natural handwriting-
based alternatives to keyboards for data entry in Indian scripts, and
also for imparting of handwriting skills using computers.
• “Offline” handwriting recognition or Offline HWR refers to the
interpretation of handwriting captured statically as an image.
• Optical character recognition or OCR refers to the interpretation of printed text captured as an image. It can be used for conversion of printed or typewritten material such as books and documents into electronic form.
• These different areas of language technology require different
algorithms and linguistic resources.
• They are all hard research problems because of the variety of writing
styles and fonts encountered.
• Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering.
1. Handwriting Interface to Computers
Indian scripts are complex and not suitable for keyboard-based entry. Replacing the
keyboard with a simpler and more natural interface based on handwriting would make
computers much more accessible to the common man and to educators in particular. The
solution would also need to support numerals, punctuation, and editing gestures, and
functionally replace the keyboard.
2. Handwriting Tutor
3. Multilingual Digital Libraries for Education
A wealth of literature and other education material in Indian languages is trapped in
books, which require storage and are subject to physical decay. Online books may be
easily made available to students all over in their schools, homes or hostels.
The proposed solution will use a complete OCR pipeline for converting scanned images of book pages into electronic form, with search in the local language using either spoken (using Speech Recognition) or written (using Online HWR) queries.
4. Automatic Forms Processing/Educational Testing
With millions of application forms filled in every year in Indian languages especially in the
education sector, a solution for automatically reading handwriting from scanned images of
forms is valuable.
The proposed solution is a complete forms-processing system.
The interpreted results can be stored into a database (for applications) or compared with
correct responses (for educational testing).
Natural Language Processing
• Electronic dictionaries:
• Electronic dictionaries are a primary requisite for developing any software in NLP.
• ED 1. Monolingual/bilingual dictionaries
• 25,000 words per year (per language)
• ED 2. Transfer Lexicon and Grammar (TransLexGram) (per language)
• The Transfer Lexicon and Grammar involves developing a language resource which would contain:
• English headwords
• Their grammatical category
• Their various senses in Hindi
• The corresponding sense in the other Indian language
• An example sentence in English for each sense of a word
• The corresponding translation in the concerned Indian language
• In the case of verbs, parallel verb-frames from English to the Indian language.
• As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis associated with specific senses of a verb.
• The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
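A sketch of what a single TransLexGram entry might look like as a data structure. The field names, the sample glosses, and the frame notation are assumptions made for illustration; they are not the project's actual schema.

```python
# Hypothetical TransLexGram entry: English headword with Hindi/Telugu
# senses and parallel verb frames. All field names are illustrative.
entry = {
    "headword": "give",
    "category": "verb",
    "senses": [
        {
            "hindi": "denaa",
            "telugu": "ivvu",
            "example_en": "She gave him a book.",
            "example_hi": "usne use ek kitaab dii.",
            # Parallel verb frames capture how arguments (and their
            # vibhaktis) map between English and the Indian language.
            "frame_en": "NP-agent V NP-recipient NP-theme",
            "frame_hi": "NP-agent(ne) NP-recipient(ko) NP-theme V",
        },
    ],
}

def senses_for(lexicon_entry):
    """Collect the Hindi glosses of all senses of an entry."""
    return [s["hindi"] for s in lexicon_entry["senses"]]

assert senses_for(entry) == ["denaa"]
```

Because every sense row carries glosses in several Indian languages, the same structure doubles as a parallel lexicon across those languages, which is the point made in the last bullet above.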
Creation of Corpora
• Domain Specific Corpora:
• Apart from basic text corpora creation, an attempt will be made to create domain-specific corpora in the following areas:
• a. Newspaper corpora
• b. Child language corpus
• c. Pathological speech/language data
• d. Speech error Data
• e. Historical/inscriptional databases of Indian languages – important to trace both as living documents of Indian history and for the historical linguistics of Indian languages.
• f. Comparative/descriptive/reference grammars, to be considered as corpus databases.
• g. Morphological analyzers and morphological generators.
POS tagged corpora
• Part-of-speech (or POS) tagged corpora are collections of texts
in which part of speech category for each word is marked.
• To be developed in a bootstrapping manner.
• First, manual tagging will be done on some amount of text.
• Then, a POS tagger which uses learning techniques will be used
to learn from the tagged data.
• After the training, the tool will automatically tag another set of
the raw corpus.
• Automatically tagged corpus will then be manually validated
which will be used as additional training data for enhancing the
performance of the tool.
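The bootstrapping loop above can be sketched with a toy unigram tagger: train on the manually tagged seed, auto-tag new text, send unknowns for manual validation, and fold the validated output back into the training data. The Hindi sentences and tags below are invented for illustration, and a real LDC-IL tagger would use much richer learning techniques.

```python
from collections import Counter, defaultdict

# Toy manually tagged seed corpus (transliterated Hindi, illustrative tags).
seed = [[("raam", "NN"), ("ghar", "NN"), ("jaataa", "VB"), ("hai", "AUX")],
        [("siitaa", "NN"), ("khaanaa", "NN"), ("banaatii", "VB"), ("hai", "AUX")]]

def train(tagged_sents):
    """Learn each word's most frequent tag from the tagged data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sent, model, default="UNK"):
    """Auto-tag raw text; unseen words get UNK and go to manual validation."""
    return [(w, model.get(w, default)) for w in sent]

model = train(seed)
auto = tag(["raam", "khaanaa", "banaataa", "hai"], model)
# "banaataa" was never seen, so it is flagged for a human; once validated,
# the sentence joins the seed and the cycle repeats with a better model.
assert auto[0] == ("raam", "NN")
assert auto[2] == ("banaataa", "UNK")
```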
Other kinds of Corpora

Chunked corpora:
• The chunked corpora will also be prepared in a manner similar to the POS tagging. Here also the initial training set will be a complete manual effort. Thereafter, it will be a man-machine effort. That is why the target in the first year is less, and doubles in the successive years. Chunked corpora are a useful resource for various applications.

Semantically tagged corpora:
• The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R&D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpora".
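To make the chunking task concrete, here is a toy rule-based chunker over POS-tagged input: maximal runs of nominal tags form noun chunks. The tag set and the transliterated Hindi sentence are illustrative assumptions; a production chunker would use learned rules and handle many more phrase types.

```python
# Toy noun chunker: group maximal runs of nominal tags into chunks.
def chunk(tagged, nominal=frozenset({"JJ", "NN", "PRP"})):
    chunks, current = [], []
    for word, tag in tagged:
        if tag in nominal:
            current.append(word)       # extend the open chunk
        elif current:
            chunks.append(current)     # a non-nominal tag closes it
            current = []
    if current:
        chunks.append(current)
    return chunks

# "baRaa ghar meM raam hai" ~ "Ram is in the big house" (illustrative tags).
sent = [("baRaa", "JJ"), ("ghar", "NN"), ("meM", "PSP"),
        ("raam", "NN"), ("hai", "VM")]
assert chunk(sent) == [["baRaa", "ghar"], ["raam"]]
```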
• Syntactic tree bank:
• Preparation of this resource requires a higher level of linguistic expertise and needs more human effort. First, experts will manually tag the data for syntactic parsing.
• A crucial point related to this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions amongst scholars from varying fields such as Sanskritists, linguists and computer scientists. It will be achieved through workshops and meetings.

• Parallel aligned corpora:
• A text available in multiple languages through translation constitutes parallel corpora.
• NBT & Sahitya Akademi are some of the official agencies who develop parallel texts in different languages through translation.
• Such institutions have given permission to CIIL to use their works for creation of electronic versions of the same as parallel corpora.
• The literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora.
• Computer programmes have to be written for creating: [I] aligned texts; [II] aligned sentences; and [III] aligned chunks.
• 1. Tools for Transfer Lexicon Grammar (including creation of an interface for building Transfer Lexicon Grammar)
• 2. Spellchecker and corrector tools
• 3. Tools for POS tagging (trainable tagging tool + an interface for editing POS-tagged corpora)
• 4. Tools for chunking (rule-based language-independent chunkers)
• 5. Interface for chunking (an interface for editing and validating the chunked corpora)
• 6. Tools for syntactic tree bank, incl. an interface for developing the syntactic tree bank
• 7. Tools for semantic tagging, with the Indian-language WordNets as the basic resource: a browser with two windows, the text appearing in one and the senses (i.e., synsets) from the WordNet in the other, from which a manual selection of the sense can be done
• 8. (Semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB)
• 9. Tools for text alignment, including a text alignment tool, sentence alignment tool and chunk alignment tool, as well as an interface for aligning corpora
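As a rough sketch of what the sentence alignment tool might start from, here is a toy length-based aligner in the spirit of Gale & Church's classic method. It handles only 1-1 pairings and uses a crude length-ratio test; a real tool must also handle 1-2, 2-1 and 0-1 cases, and the sample sentence pairs are invented for illustration.

```python
# Toy length-based sentence aligner, 1-1 pairings only.
def align_1_1(src_sents, tgt_sents):
    """Pair sentences in order; flag pairs whose length ratio looks suspect."""
    pairs = []
    for s, t in zip(src_sents, tgt_sents):
        ratio = len(t) / max(len(s), 1)
        plausible = 0.5 <= ratio <= 2.0   # translations rarely differ more
        pairs.append((s, t, plausible))
    return pairs

en = ["The sky is blue.", "It rained all night."]
hi = ["aasmaan niilaa hai.", "raat bhar baarish hotii rahii."]
result = align_1_1(en, hi)
assert all(ok for _, _, ok in result)  # both pairs look plausible
```

Pairs flagged as implausible would be routed to the human-validation interface rather than silently accepted, matching the man-machine workflow used elsewhere in this proposal.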