XML-Unicode environment
for creating and accessing
   Indian language theses:
   Vidyanidhi experiences

           Shalini R. Urs
      Vidyanidhi Digital Library
    University of Mysore, Mysore, India
        shalini@vidyanidhi.org.in
     Indo-US Workshop, June 25, 2003
Vidyanidhi Digital Library
• Vidyanidhi began as a pilot project in
  2000
• Supported by NISSAT, DSIR, Government of India
• The objective was to demonstrate the
  feasibility of an Electronic Thesis and
  Dissertation (ETD) initiative in the
  Indian context
• It is now evolving into a national effort
• Supported by the Ford Foundation

        Vidyanidhi: Vision
To evolve into an information infrastructure
  to strengthen the research capacities of
  Indian universities by:
• Developing accessible digital libraries of
  theses and dissertations
• Sensitizing and training doctoral research
  students in scholarly writing,
  e-publishing and ETDs
• Developing appropriate policies
• Developing/making available requisite
  tools and resources
   Vidyanidhi: Strategies
• Policy Framework – through
  meetings, liaison, participation
• Education and Training
• Content building: full text and
  metadata
• Resources and tools
  (software, interfaces…)

Indian Academic Research Output
• Large system of higher education
• More than 300 universities: a reservoir of
  extensive doctoral research work
• Doctoral research output: around 30,000
  annually
• English is the predominant language
• Increasing vernacularisation: 20-25% in
  Indian languages
• This trend is growing, resulting in more
  and more research output in Indian
  languages
Language Interoperability
• The Vidyanidhi approach has been guided
  by the language interoperability
  factor
• Our choice of technology and tools
  has to be interoperable across
  languages


     Indian Languages: Diversity

• The rich diversity in Indian Languages and
  scripts is simply overwhelming.
• India is made up of a number of separate
  linguistic communities, each of which
  shares a common language and culture.
• The number of languages listed for India is 418
• 407 are living languages
• 11 are extinct
• Many languages have no script of their
  own
         Eighteen Indian languages

•   Assamese                 •   Bengali
•   Gujarati                 •   Hindi
•   Kashmiri                 •   Kannada
•   Malayalam                •   Konkani
•   Marathi                  •   Manipuri
•   Oriya                    •   Nepali
•   Punjabi                  •   Sanskrit
•   Sindhi                   •   Tamil
•   Telugu                   •   Urdu
 Language Families of Indian
        Languages
• Indo-European: North and Central
  India
• Dravidian: South India
• Mon-Khmer: Assam and some
  eastern parts of India
• Sino-Tibetan: Northern Himalayan
  and Burmese border area

            Indian Scripts
• Interestingly, though the languages belong
  to four different language groups, Indian
  scripts have a common root/origin
• Scripts of all Indian languages are derived
  from Brahmi
• Greater uniformity in the arrangement of
  Alphabets


    Indian Alphabet: Characteristics
• Consonants
     – Five vargs (groups)
     – Non-varg
     – Have an implicit vowel
• Anuswar (a nasal consonant)
• Chandrabindu (a nasalisation sign)
• Visarg
• Vowels and vowel signs
• Vowel omission sign (Halant)
• Conjuncts
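
As an illustration of how these elements appear in Unicode, the sketch below lists the code points behind one Devanagari conjunct and the three signs named above. The choice of the conjunct "ksha" and the helper name `describe` are assumptions made for this example; the character names come from Python's standard `unicodedata` module.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The Devanagari conjunct "ksha" is stored as consonant + halant + consonant.
KSHA = "\u0915\u094D\u0937"

# Anuswar, chandrabindu and visarg are separate signs, listed here in that order.
SIGNS = "\u0902\u0901\u0903"

for code, name in describe(KSHA + SIGNS):
    print(code, name)
```

Unicode calls the halant "VIRAMA" in its character names. The conjunct is not a single code point: the rendering engine builds it from the three stored characters.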
 Indian Languages and scripts

• Indic scripts are syllable-oriented and
  phonetically based, with imprecise
  character sets
• The different scripts look different
  (different shapes) but have vastly
  similar yet subtly different alphabet
  base and script grammar


        Indian Languages and
            scripts: Issues
• Indic characters consist of
  consonants, vowels, dependent
  vowels (called ‘matras’), or a
  combination of any or all of them,
  called conjuncts.
• Collation (sorting) is a contentious
  issue as the script is phonetic based
  and not alphabet based
        Handling Indian
  Languages: Possible approaches
• Transliteration: glyph-based
  approach
  – Indic characters are encoded in either
    ASCII or any other proprietary
    encoding
  – Use glyph technologies to display and
    print Indic scripts
  – Currently the most popular approach
    for desktop publishing.
       Handling Indian
 Languages: Possible approaches
• Develop an encoding system for all the
  possible characters/combinations,
  running into nearly 13,000 characters in
  each language, with the possibility of a
  new combination leading to a new
  character: an approach developed and
  adopted by the IIT Madras development
  team
• Adopt the ISCII/Unicode encoding
   ISCII: Indian Script Code for
     Information Interchange
• ISCII-91: BIS standard IS 13194:1991
• An outcome of the efforts of Govt. of
  India, DOE, MIT, C-DAC and many
  other institutions
• Is an 8 bit code
• Is an extension of the 7 bit ASCII code
• Top 128 characters cater to the 10 Indian
  Scripts

                    Unicode
• The Unicode consortium has
  encoded all of the world’s scripts
• Unicode represents a carefully
  thought-out, technically
  impressive and full-featured
  attempt at encoding Indic scripts
• Unicode has unique code points
  for all of the Indic scripts

Script       Unicode Range       Major Languages
Devanagari   U+0900 to U+097F    Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF    Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F    Punjabi
Gujarati     U+0A80 to U+0AFF    Gujarati
Oriya        U+0B00 to U+0B7F    Oriya
Tamil        U+0B80 to U+0BFF    Tamil
Telugu       U+0C00 to U+0C7F    Telugu
Kannada      U+0C80 to U+0CFF    Kannada
Malayalam    U+0D00 to U+0D7F    Malayalam
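
Because each script has its own block, the script of a character can be read directly from its code point. The sketch below encodes the ranges from the table, using the Unicode block names; the function names `script_of` and `scripts_in` are assumptions made for this example.

```python
# Code point ranges copied from the table above.
SCRIPT_RANGES = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]


def script_of(ch):
    """Return the Indic script whose block contains ch, or None."""
    cp = ord(ch)
    for name, low, high in SCRIPT_RANGES:
        if low <= cp <= high:
            return name
    return None


def scripts_in(text):
    """Return the Indic scripts used in text, in order of first appearance."""
    found = []
    for ch in text:
        name = script_of(ch)
        if name is not None and name not in found:
            found.append(name)
    return found


# A Kannada word followed by an English word: only Kannada is reported.
print(scripts_in("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1 thesis"))
```

A check of this kind could, for instance, route a metadata record to the table for its script.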
      Unicode implementation for
             Indic scripts

• Despite its robustness, technical soundness
  and practical viability, Unicode
  implementation for Indic scripts is almost
  non-existent
• Our search of the major databases (LISA,
  INSPEC, WOS) did not turn up any initiative
  in this direction
• Vidyanidhi is an example of a successful
  implementation of Unicode for Indic scripts
      Vidyanidhi approaches

• Taking Indian language
  theses to the Web
  – Full Text
  – Metadata



                      MS Word to XML
• Template for thesis in MS Word
• Student submits thesis in Word
• Convert to XML using the RTF to XML Converter
• Take them to the Web
            Full Text
• Vidyanidhi provides tools for the creation
  of theses in Indian Languages
• Our approach is to:
  – provide a style sheet/template online
  – when the thesis is submitted, convert
    it into XML encoded in Unicode
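
The last step, XML encoded in Unicode, might look like the sketch below. It does not reproduce the RTF to XML converter; it only shows Indic text being written out as UTF-8 XML with Python's standard library. The element names (`thesis`, `title`, `body`, `para`), the `lang` attribute and the sample text are all assumptions made for this example, not the Vidyanidhi schema.

```python
import xml.etree.ElementTree as ET


def thesis_to_xml(title, language, paragraphs):
    """Serialise a thesis as UTF-8 encoded XML (hypothetical element names)."""
    root = ET.Element("thesis", {"lang": language})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "para").text = paragraph
    # encoding="utf-8" returns bytes; Indic characters are written as UTF-8.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Made-up sample: the Kannada word for "Kannada" as title and paragraph text.
SAMPLE = thesis_to_xml("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1", "kn", ["\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"])
print(SAMPLE.decode("utf-8"))
```

Because the output is plain UTF-8, the same file can hold Roman, Devanagari and Kannada text side by side without any font-specific encoding.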



   Vidyanidhi database: approach
• Each script/language will have one
  table. Currently there are three separate
  tables for the three scripts: one each for
  Roman, Hindi (Devanagari) and Kannada
• Theses in Indic languages will have
  two records: one in the Roman script
  (transliterated) and the other in the
  vernacular. Theses in
  English, however, will have only one record (in
  English)
       Vidyanidhi database:
            approach
• The two records are linked by the
  ThesisID number, a unique ID for
  the record
• The bibliographic description of
  Vidyanidhi follows the ThesisMS
  Dublin Core standard adopted by
  the NDLTD and OCLC
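
The one-table-per-script layout with records linked by ThesisID can be sketched as follows. This uses Python's built-in `sqlite3` as a stand-in for the actual database; the table names, column names and sample titles are assumptions made for this example.

```python
import sqlite3

# Made-up sample title in Kannada script ("Kannada sahitya").
KANNADA_TITLE = "ಕನ್ನಡ ಸಾಹಿತ್ಯ"


def build_demo_db():
    """Create an in-memory database with a Roman and a Kannada table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE roman (
            thesis_id INTEGER PRIMARY KEY,
            title     TEXT NOT NULL,
            language  TEXT NOT NULL
        );
        CREATE TABLE kannada (
            thesis_id INTEGER PRIMARY KEY REFERENCES roman(thesis_id),
            title     TEXT NOT NULL
        );
        """
    )
    # A Kannada thesis has two records that share one thesis_id.
    conn.execute("INSERT INTO roman VALUES (1, 'Kannada sahitya', 'Kannada')")
    conn.execute("INSERT INTO kannada VALUES (1, ?)", (KANNADA_TITLE,))
    # An English thesis has only the Roman record.
    conn.execute("INSERT INTO roman VALUES (2, 'A study of digital libraries', 'English')")
    return conn


def records_for(conn, thesis_id):
    """Return (roman_title, vernacular_title or None) for one thesis."""
    return conn.execute(
        "SELECT r.title, k.title FROM roman AS r "
        "LEFT JOIN kannada AS k ON k.thesis_id = r.thesis_id "
        "WHERE r.thesis_id = ?",
        (thesis_id,),
    ).fetchone()


conn = build_demo_db()
print(records_for(conn, 1))
print(records_for(conn, 2))
```

The left join reflects the rule stated above: every thesis has a Roman-script record, and only theses in Indic languages have a second, vernacular one.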

      Vidyanidhi - Platform

• Microsoft
• Windows XP supports all the 10 Indic
  scripts
• Using Windows glyph processing:
  – OpenType font format
  – Uniscribe (Unicode Script Processor)
  – OpenType Layout Services library


     Vidyanidhi - Platform

– MS SQL 2000
    • A truly multilingual-capable SQL
    • Achieves satisfactory collation
– Front end: ASP
– JavaScript



   Vidyanidhi: Accessing and
          Searching
• One can search the Vidyanidhi
  database in either of two ways:
  – In English (Roman script): the integrated (master) database has
    metadata records for theses in all
    languages
  – In the vernacular: each vernacular database has records of
    the specific language only

        Two approaches:
          differences
• One affords search in the English
  language and the other in the vernacular.
• The first approach also provides for
  viewing records in Roman script for all
  theses in the search output that satisfy the
  conditions of the query, with an
  option for viewing records in vernacular
  script for theses in the vernacular

• The second approach enables
  one to search only the vernacular
  database and thus is limited to
  records in that language.
• However, this approach enables
  the search to be in the vernacular
  language and script
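
The difference between the two search approaches can be sketched in a few lines. The in-memory records, the field names and the functions `search_master` and `search_vernacular` are assumptions made for this example; they stand in for queries against the master and vernacular databases.

```python
# Made-up records: the master list is in Roman script and covers all languages.
MASTER = [
    {"thesis_id": 1, "title": "Kannada sahitya", "language": "Kannada"},
    {"thesis_id": 2, "title": "A study of digital libraries", "language": "English"},
]

# One vernacular collection per language, keyed by thesis_id.
VERNACULAR = {"Kannada": {1: "ಕನ್ನಡ ಸಾಹಿತ್ಯ"}}


def search_master(term):
    """First approach: search in Roman script across all languages."""
    hits = []
    for record in MASTER:
        if term.lower() in record["title"].lower():
            hit = dict(record)
            # Offer the vernacular record when the thesis has one.
            hit["vernacular_title"] = VERNACULAR.get(record["language"], {}).get(
                record["thesis_id"]
            )
            hits.append(hit)
    return hits


def search_vernacular(language, term):
    """Second approach: search one vernacular collection in its own script."""
    collection = VERNACULAR.get(language, {})
    return [thesis_id for thesis_id, title in collection.items() if term in title]


print(search_master("sahitya"))
print(search_vernacular("Kannada", VERNACULAR["Kannada"][1].split()[0]))
```

The first function reaches every thesis but only through Roman-script terms; the second accepts a query in the vernacular script but sees one language at a time.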

  Unicode and Indic Scripts
• The Vidyanidhi implementation dispels
  certain misconceptions and
  misconstructions about Unicode
• Supposed problems:
  – Data Input
  – Display and printing
  – Collation

     Data input/Keyboard layout

Our test bed and comparison with other
  methods:
• The Unicode layout is as easy as the others in
  terms of speed
• In terms of number of keystrokes: no
  difference, and sometimes the Unicode method
  involves fewer keystrokes
• Data input was almost comparable to
  English records in terms of productivity

        Display and Printing

• It is fairly satisfactory except for
  a few problem areas:
   – Handling of certain conjuncts
   – Inability to display a non-terminating
     pure consonant
   – Limited choice of font types
• Unicode can handle conjunct
  clusters of four consonants
      Collation issues: some
          observations
• Consensus on collation of Indic
  scripts is hard to come by
• Differences of opinion are not
  uncommon, as Indic scripts are a
  cross between syllabic and phonemic
  writing systems
• Collation according to phonetic order
  would be different from alphabetic
  order
        Collation Issues
• A few of the problems stem from
  the common script base and
  order for all Indic scripts
• Differences between Indic
  scripts in the number and
  arrangement of consonants and
  vowels, despite strong similarity

      Collation by Unicode

• Given the above collation
  problems, the collation achieved
  by Unicode is fairly satisfactory
  and compares very well with
  that of Nudi, a more popular font-based
  software package
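
As a baseline for what ordering Unicode text gives by default, the sketch below sorts four Devanagari strings by raw code point and then with a caller-supplied tailoring. It is not how MS SQL or Nudi collates. The sample strings, the function names and the rule in `HYPOTHETICAL_RANK` are assumptions made for this example; the rule is purely illustrative and is not a recommended order for any language.

```python
# कल, कं, क्ष, अब written as escapes so the stored code points are explicit.
WORDS = [
    "\u0915\u0932",        # ka + la
    "\u0915\u0902",        # ka + anuswar
    "\u0915\u094D\u0937",  # ka + halant + ssa (the conjunct ksha)
    "\u0905\u092C",        # a + ba
]


def codepoint_order(words):
    """Sort by raw code point, which is how Python compares strings."""
    return sorted(words)


def tailored_order(words, rank):
    """Sort with per-character weights; unlisted characters keep chart order."""
    def key(word):
        # Listed characters take their custom weight; all others are offset
        # above the custom weights and ordered by code point.
        return [rank.get(ch, 0x10000 + ord(ch)) for ch in word]

    return sorted(words, key=key)


# Illustrative only: weight the halant below every letter and sign.
HYPOTHETICAL_RANK = {"\u094D": 0}

print(codepoint_order(WORDS))
print(tailored_order(WORDS, HYPOTHETICAL_RANK))
```

In code point order the anuswar form comes first among the strings starting with "ka" and the conjunct comes last, simply because of where the signs sit in the Devanagari block. A language-specific tailoring of this kind is one way to express a different preferred order.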


              Conclusion
Unicode is able to handle
 admirably the challenges of a
 multilingual, multi-script
 database implementation despite
 the complexity and the minutiae
 of a family of Indian languages
 and scripts with strong
 commonalities and faint
 distinctions among themselves
    Contact
shalini@vidyanidhi.org.in