Unicode for Under Resourced Languages
Daniel Yacob
Ge’ez Frontier Foundation
SALTMIL 5: Genoa, Italy 2006
Overview
• What is “Unicode”?
  – More than Just Encoded Letters!
• Working with Unicode
  – How Unicode can help you.
  – Resources and how to apply them.
• Working for Unicode
  – How you can help Unicode.
  – How Unicode can help your U-RL.
My Background
• Started Ethiopic software work in 1993
  – transliterator, keyboard, fonts
• Amharic Computational Linguistics in 1994
• “Extended Ethiopic” Unicode
  Standardization 1995-2004
• Corpus Collection 1997 – Present
• Began Using Unicode in 1995 for Ethiopic
  – but Ethiopic was not part of the Unicode standard until 2000!
My Background
• Few or no Unicode-based resources in
  1993-1997
  – Today there is almost always an open source
    project that you can start with and extend.
  – Minimize the time and labour you put into
    developing basic resources.
  – Avoid the maintenance trap.
• We will assume the worst case scenario
  – You work on a language, using a script, with
    no pre-existing software resources at all.
What Unicode is
Unicode …
  – is a consortium
  – is a process
  – is a community
  – is a conference
  – is a database
  – is a standard
  – is a collection of standards
What Unicode is not
Unicode …
  – is not a font
  – is not a keyboard system
  – is not a transliteration system
  – is not the ISO
  – is not perfect
  – is not complete
Over 80 Scripts not Encoded!
India, Nepal, Bangladesh:
• Chakma
• Methei / Manipuri
• Newari
• Sorang Sompeng
• Varang Kshiti
Southeast Asia (excluding China):
• Batak
• Cham
• Javanese
• Pahawh Hmong
• Viet Thai
China:
• Lanna
• Naxi Geba
• Naxi Tomba
• Pollard
Africa:
• Bamum
• Bassa
• Mende
           Courtesy of Michael Everson: http://evertype.com
Over 80 Scripts not Encoded!
• Ahom • Alpine • Aramaic • Avestan • Aztec Pictograms
• Balti • Brahmi • Büthakukye • Byblos
• Chalukya • Chola • Cypro-Minoan
• Egyptian Hieroglyphs • Elbasan • Elymaic
• Grantha • Hatran • Iberian • Indus Valley • Jurchin
• Kaithi • Kawi • Khotanese • Kitan Large Script • Kitan Small Script
• Landa • Linear A • Luwian
• Mandaic • Manichaean • Mayan Hieroglyphs • Meroitic • Modi
• Nabataean • North Arabic • Numidian
• Old Hungarian • Old Permic • Orkhon
• Pahlavi • Palmyrene • Proto-Elamite • Pyu
• Rongorongo
• Samaritan • Satavahana • Sharada • Siddham • South Arabian • Soyombo
• Takri • Tangut Ideograms
• Uighur • Vedic accents
           Courtesy of Michael Everson: http://evertype.com
Current State of the Unicode Standard: New Script Additions
For Unicode 5.0 (2006):
• N’Ko (West Africa)
• Balinese (Indonesia)
• Phags-pa (historical)
• Phoenician (historical)
• Cuneiform (historical)
For Unicode 5.1 (2008):
• Lepcha (India)
• Ol Chiki (India)
• Vai (Liberia)
• Saurashtra (India)
• Myanmar minorities (Myanmar)
• Kayah Li (Myanmar)
• Rejang (Indonesia)
• Sundanese (Indonesia)
• Carian, Lycian, Lydian (historical)
           Courtesy of Michael Everson: http://evertype.com
Working with Unicode
Unicode is all About Text
• Most applicable to problems where
  language is represented by text.
• Unicode covers some vocabulary, but only
  within the scope of localization (CLDR).
• May not be the solution if you are not
  working with text represented in written
  form
  – Although, Unicode can be used for symbol
    processing
Working with Unicode
Operating Systems
• Almost anything from this millennium.
• Apple MacOS Version ≥ 9.2
• Microsoft Windows CE, NT, XP, 2000
• Solaris ≥ 2.8
• Any GNU/Linux (for console use)
  – GNOME 2.0 or KDE 2.0 and Later
Working with Unicode
The International Phonetic Alphabet (IPA)

• SIL Charis, Doulos, Gentium
  – free and most complete
  – matches “Times New Roman” style
  – http://scripts.sil.org/IPAhome
Working with Unicode
If you need more letters…
• Create Your own Fonts!
• Use the Unicode Private Use Area (PUA)
  – this is Unicode’s extension mechanism.
  – does not break compatibility with Unicode
    software.
  – you must send your fonts with your work.
  – encode non-letter symbols, no need for fonts.
Working with Unicode
The PUA
• 6,400 code points in the range E000-F8FF
• 131,068 additional code points available in “planes” 15 & 16
• Work in Plane 0 first (0000 – FFFF)
• Intended for company logos, ligatures
  used by typesetting software, etc.
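A minimal Python sketch of the three Private Use ranges, useful for auditing a corpus for PUA characters before you share it (so you know which fonts must travel with the text). The function names are illustrative, not part of any library.

```python
# Private Use ranges defined by the Unicode Standard (inclusive bounds).
PUA_RANGES = {
    "BMP (plane 0)": (0xE000, 0xF8FF),
    "Plane 15": (0xF0000, 0xFFFFD),
    "Plane 16": (0x100000, 0x10FFFD),
}


def pua_size(name):
    """Number of code points in the named Private Use range."""
    first, last = PUA_RANGES[name]
    return last - first + 1


def is_private_use(ch):
    """True if the character falls in any Private Use range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in PUA_RANGES.values())


def find_private_use(text):
    """Return (index, 'U+XXXX') for every Private Use character in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_private_use(ch)]
```

For example, `find_private_use("a\uE000b")` reports one PUA character at index 1.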
Working with Unicode
Creating Your Own Fonts
• Bitmap (BDF)
  – Faster to create
  – One size per font, not so scalable
  – Works best with X-Windows (Unix)
• Outline (TrueType, PostScript, OpenType)
  – Takes more time
  – Scalable
  – MS Windows, Mac, Modern Unixes
Working with Unicode
Bitmap Editors
• Each letter is a matrix of pixels, like tiles
• You toggle them on or off to shape your
  letters
• GBDFED for recent GNOME/Linux
• XMBDFED for general Unix
• Or search for “BDF Editor”
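The "matrix of pixels" idea can be sketched in a few lines of Python. Here a glyph is a list of strings, one per pixel row, and each row is encoded the way a BDF BITMAP section stores it: as hexadecimal, padded on the right to a whole number of bytes. The glyph and the helper names are made up for illustration; a real BDF file also needs header and metrics lines that this sketch omits.

```python
def rows_to_bdf_hex(rows, on="#"):
    """Encode non-empty pixel rows as BDF-style hex strings, padded to whole bytes."""
    hex_rows = []
    for row in rows:
        padded_width = (len(row) + 7) // 8 * 8
        bits = "".join("1" if px == on else "0" for px in row)
        bits = bits.ljust(padded_width, "0")
        hex_digits = padded_width // 4
        hex_rows.append(format(int(bits, 2), f"0{hex_digits}X"))
    return hex_rows


def toggle(rows, x, y, on="#", off="."):
    """Return a copy of the glyph with the pixel at column x, row y flipped."""
    row = rows[y]
    flipped = off if row[x] == on else on
    return rows[:y] + [row[:x] + flipped + row[x + 1:]] + rows[y + 1:]


# A 5x5 letter "H" drawn as tiles; '#' is an "on" pixel.
GLYPH = [
    "#...#",
    "#...#",
    "#####",
    "#...#",
    "#...#",
]
```

Calling `rows_to_bdf_hex(GLYPH)` gives one hex string per row, which is what a bitmap editor writes out for you.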
Working with Unicode
Bitmap Editors
[Figure: zoom view within the edit window]
Working with Unicode
Outline Editors
• Create Bezier
  curves to outline
  scalable shapes
• Here traced
  around a scanned
  image
• FontForge
 http://fontforge.sf.net
Working with Unicode
Creating Your Own Keyboards
• No standard formats
• Different on every operating system
• May require some painful programming
  – transliteration may be a better alternative.
• For small amounts of typing try:
     Ctrl+Shift+X1X2X3X4 (X1 to X4 are the four hex digits of the code point)
     e.g. Ctrl+Shift+1234
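If you do not know the hex digits for a letter, a couple of lines of Python will convert in both directions. This is only a lookup aid with illustrative function names; it does not install or emulate any keyboard.

```python
def char_from_hex(digits):
    """Return the character whose code point is the given hex string."""
    return chr(int(digits, 16))


def hex_from_char(ch):
    """Return the code point of a character as at least four hex digits."""
    return f"{ord(ch):04X}"
```

`hex_from_char("ñ")` gives the digits to type, and `char_from_hex("1234")` shows which character those digits produce.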
Working with Unicode
Creating Your Own Keyboards
Linux
• Migration Toward Smart Common Input
  Method (SCIM)
  – simple table based
  – more complex as needed
  – http://scim.sf.net
  – or Yudit, Emacs for older Unixes, but you can
    only type in these applications.
Working with Unicode
Creating Your Own Keyboards
Windows
• Keyman, most mature & robust
• Keyboards created with Keyman Developer
  – $59 academic and developing world license
  – worth every cent
  – compiled keyboards also run under Linux with
    a SCIM module
  – http://tavultesoft.com
Working with Unicode
Text Processing
• International Components for
   Unicode (ICU)
  – http://icu.sf.net
  – Java, C/C++
  – Bindings in: Python, Ruby, C#,
    Perl 6 (some Perl 5)
  – started by IBM, is open source
  – managed by the Unicode Consortium’s president
  – check with ICU before
• 700+ Encoding Conversions
  – convert legacy systems to and from Unicode
  – migrate corpora to Unicode
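As a rough illustration of corpus migration, the sketch below uses Python's built-in codecs rather than ICU's converters, so it covers far fewer encodings. The second helper handles the common case where a "legacy encoding" is really a font that reuses Latin key codes; its mapping table is hypothetical and would have to be written for your own font.

```python
def legacy_to_utf8(data, legacy_encoding):
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = data.decode(legacy_encoding)  # legacy bytes -> Unicode text
    return text.encode("utf-8")


def utf8_to_legacy(data, legacy_encoding):
    """Convert UTF-8 bytes back to a legacy encoding (fails on unmappable text)."""
    return data.decode("utf-8").encode(legacy_encoding)


def remap_font_encoding(text, table):
    """Replace font-hack characters with real Unicode letters using a mapping table."""
    return text.translate(table)


# Hypothetical table for an imaginary font that draws Ethiopic letters
# on the Latin keys "h" and "l". A real table has one entry per glyph.
HYPOTHETICAL_FONT_TABLE = {ord("h"): "\u1200", ord("l"): "\u1208"}
```

Keeping the original files and converting copies makes it easy to rerun the migration once the mapping table is corrected.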
Working with Unicode
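Text Processing
ICU: Normalization
• Equate letters and diacritical symbols
  – n (006E) + ˜ (0303) = ñ (00F1)
  – u (0075) + ¨ (0308) = ü (00FC)
  – A (0041) + ° (030A) = Å (00C5; the Angstrom sign 212B is equivalent)
  – Three sequences, all equal to ệ (1EC7):
     • e (0065) + ^ (0302) + . (0323)
     • e (0065) + . (0323) + ^ (0302)
     • ê (00EA) + . (0323)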
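The same equivalences can be checked with Python's standard `unicodedata` module, used here as a stand-in for ICU's normalizer. The helper names are illustrative.

```python
import unicodedata

# Three different code point sequences for the same Vietnamese letter.
SEQUENCES = [
    "e\u0302\u0323",  # e + circumflex + dot below
    "e\u0323\u0302",  # e + dot below + circumflex
    "\u00EA\u0323",   # e-circumflex + dot below
]


def nfc(text):
    """Compose text into Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def equivalent(a, b):
    """True if two strings are canonically equivalent."""
    return nfc(a) == nfc(b)
```

Normalizing both the corpus and the search term before comparing them means a word is found no matter how its diacritics were typed.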
Working with Unicode
Text Processing
ICU: Regular Expressions
• Applies the Unicode Character Database
• Categorize every character as one of
   –   Letter
   –   Number
   –   Separator
   –   Punctuation
   –   Marks
   –   Symbols
   –   Others
• Subcategories within each. Examples
   – Letter: uppercase, lowercase, other, …
   – Symbol: math, currency, modifier, …
   – Mark: spacing, non-spacing, enclosing
• Defines 80 character property types
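The General Category values from the Unicode Character Database are also exposed by Python's `unicodedata` module. The sketch below maps the two-letter codes to the seven major classes listed above; the `describe` helper is illustrative.

```python
import unicodedata

# First letter of the General Category code -> major class.
MAJOR_CLASS = {
    "L": "Letter",
    "N": "Number",
    "Z": "Separator",
    "P": "Punctuation",
    "M": "Mark",
    "S": "Symbol",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category, major class) for one character."""
    category = unicodedata.category(ch)
    return category, MAJOR_CLASS[category[0]]
```

For instance, `describe("$")` reports a currency symbol and `describe("\u0303")` a non-spacing mark.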
Working with Unicode
Text Processing
ICU: Regular Expressions
Set Operations
• [^\p{Letter}]                     Negation
• [\p{Letter}\p{Number}]            Union
• [\p{Letter}&\p{script=Cyrillic}]  Intersection
• [\p{Letter}-\p{Latin}]            Difference


• Important for a character set the size of Unicode.
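The four operations map directly onto ordinary set algebra. Python's standard `re` module does not support `\p{…}`, so this sketch builds the sets by hand from a sample string. Testing the script by the prefix of the character name is an assumption made for illustration; it only approximates the real Script property.

```python
import unicodedata


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def is_number(ch):
    return unicodedata.category(ch).startswith("N")


def in_script(ch, script):
    """Approximate the Script property by the character name prefix."""
    return unicodedata.name(ch, "").startswith(script.upper())


def chars(text, predicate):
    """Set of characters in text that satisfy the predicate."""
    return {ch for ch in text if predicate(ch)}


SAMPLE = "abc где 123 \u1200\u1208!"

letters = chars(SAMPLE, is_letter)
numbers = chars(SAMPLE, is_number)
cyrillic = chars(SAMPLE, lambda ch: in_script(ch, "Cyrillic"))
latin = chars(SAMPLE, lambda ch: in_script(ch, "Latin"))

negation = set(SAMPLE) - letters   # [^\p{Letter}]
union = letters | numbers          # [\p{Letter}\p{Number}]
intersection = letters & cyrillic  # letters that are Cyrillic
difference = letters - latin       # letters that are not Latin
```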
Working with Unicode
Text Processing
ICU: Regular Expressions
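• Enhanced Word Boundaries:
  – Text: Hello There. G’day 123.456
  – Classic RE: breaks inside G’day (at the apostrophe) and inside 123.456 (at the period)
  – Unicode Word Boundaries: G’day and 123.456 are each kept as one word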
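The classic behaviour is easy to reproduce with Python's `re` module. The second pattern is only a rough approximation of the Unicode rules, written for this one example: it lets an apostrophe or period join two runs of word characters. It is not an implementation of the Unicode word boundary algorithm that ICU provides.

```python
import re

TEXT = "Hello There. G\u2019day 123.456"

# Classic regex words: every non-word character is a boundary.
classic_words = re.findall(r"\w+", TEXT)

# Rough approximation: allow ', the curly apostrophe, or . between word characters.
approx_unicode_words = re.findall(r"\w+(?:[\u2019'.]\w+)*", TEXT)
```

The classic pattern yields six fragments, while the approximation yields the four words a reader would expect.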
Working with Unicode
Text Processing
ICU: Regular Expressions
• Equivalence Classes
  – [=e=] matches all “e” [eèéêëēĕėęě]
  – not yet implemented
  – use Perl instead
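Where equivalence classes are not available, one workaround is to compare base letters after canonical decomposition. The sketch below does this in Python with `unicodedata`; it is an approximation with illustrative names, and it only groups characters that have a canonical decomposition starting with the base letter.

```python
import unicodedata


def base_letter(ch):
    """First code point of the canonical decomposition of a character."""
    return unicodedata.normalize("NFD", ch)[0]


def equivalence_class(base, candidates):
    """Characters among candidates whose base letter is `base`."""
    return [ch for ch in candidates if base_letter(ch) == base]


def find_equivalent(base, text):
    """Indexes in text where a character with the given base letter occurs."""
    return [i for i, ch in enumerate(text) if base_letter(ch) == base]
```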
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic

Simple Plurals:
     [#7#]ች

vs
     [ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic
• /[#3#]ያ/
  – አንባቢያን
  – ሚያዚያ
  – ኢትዮጵያዊያን

• /[#3,6#]ያ/
  – አንባቢያን       አንባብያን
  – ሚያዚያ          ሚያዝያ
  – ኢትዮጵያዊያን ኢትዮጵያውያን
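The idea behind the `[#3,6#]` shorthand can be imitated in Python. This sketch is not how Regexp::Ethiopic is implemented. It assumes that, in the main Ethiopic block, the vowel order of a syllable is given by its position within its row of eight code points. That holds for the regular rows but only roughly for the labiovelar rows, so the generated classes are somewhat broader than the hand-written list above.

```python
import re
import unicodedata


def ethiopic_order(order):
    """Syllables of the given order (1-8) in U+1200..U+1357, by row position."""
    found = []
    for cp in range(0x1200, 0x1358):
        name = unicodedata.name(chr(cp), "")
        if name.startswith("ETHIOPIC SYLLABLE") and cp % 8 == order - 1:
            found.append(chr(cp))
    return "".join(found)


def expand_orders(pattern):
    """Rewrite [#3,6#]-style classes into explicit character classes."""
    def replace(match):
        orders = [int(n) for n in match.group(1).split(",")]
        return "[" + "".join(ethiopic_order(n) for n in orders) + "]"
    return re.sub(r"\[#([\d,]+)#\]", replace, pattern)
```

With this helper, `re.search(expand_orders("[#3,6#]ያ"), word)` accepts both spellings in each pair above, while `[#3#]ያ` accepts only the third-order spelling.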
Working with Unicode
Text Processing
ICU: Transliteration
• Defined by “transform rules”
  – One to one mappings:
     • α <> a;
     • β <> b;
  – Context Rules:
     • β } [aeiou] > b;
     • β } [^aeiou] > v;
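To see what the two context rules do, here is a loose Python imitation using regular expression lookahead in place of ICU's `}` context marker. It applies the rules in two passes rather than with ICU's single-pass rule engine, so it is an illustration of the idea only.

```python
import re


def beta_to_latin(text):
    """Apply the two context rules for β shown above."""
    # β } [aeiou] > b;   (β followed by a vowel becomes b)
    text = re.sub(r"β(?=[aeiou])", "b", text)
    # β } [^aeiou] > v;  (β followed by any other character becomes v)
    text = re.sub(r"β(?=[^aeiou])", "v", text)
    return text
```

A β at the very end of the text has no following character, so neither rule applies and it is left unchanged.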
Working with Unicode
Text Processing
ICU: Transliteration
• Defined by “transform rules”
  – Applying UCD Properties
     • Θ } [:LowercaseLetter:] <> Th;
     • Θ <> TH;
  – Reverse Transliteration Context Rules
     • σ < [:^Letter:] { s } [:^Letter:] ;
     • ς < s } [:^Letter:] ;
     • σ < s;
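The three reverse rules for sigma can be mimicked in Python by looking at the neighbours of each "s". This is a sketch with an illustrative function name, and it makes one assumption the rules do not state: the start and end of the text are treated as non-letters.

```python
def latin_s_to_sigma(text):
    """Choose σ or final ς for each Latin 's' from its neighbours."""
    out = []
    for i, ch in enumerate(text):
        if ch != "s":
            out.append(ch)
            continue
        letter_before = i > 0 and text[i - 1].isalpha()
        letter_after = i + 1 < len(text) and text[i + 1].isalpha()
        if letter_before and not letter_after:
            out.append("ς")  # word-final s
        else:
            out.append("σ")  # isolated s, or s inside a word
    return "".join(out)
```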
Working with Unicode
Text Processing
• ICU: Transliteration
  – Gets much more sophisticated
• See also Perl’s Text::Transliterate
Working for Unicode
Taking Your Work a Step Further
• You’ve helped create an orthography
  – now make it official.
• You’ve worked with a pre-existing un-encoded
  script using the PUA – now formalize it.
• You’ve created a transliteration system
  – make it an ISO standard.
• You’ve identified a dialect – encode it in ISO 639.
• You’ve developed a keyboard
  – make it a national standard.
• etc.
Working for Unicode
Why go the extra mile (kilometer)?
• Ethnic pride and identity are promoted.
• Literacy efforts can be encouraged.
• The study of historic scripts is kept alive.
• Communication between and amongst members
  of the community is promoted.
• Government communication in times of
  emergency (disease, war, natural disaster).
• Leads to localization, greater access to ICT.
• …and you become the expert!
Working for Unicode
What to Consider
• The work will be more social than technical.
• The work will take years (at least two).
• Review Encoding History
   – Has this been attempted before and failed? Why?
   – Are there any non-Unicode encodings?
• Determine the Stakeholders
   – The Government: will they support you, oppose you, jail you?
   – Political Parties, Religious, Education, Cultural Groups
      • does anyone have something to lose by the encoding?
• Communicate, Communicate, Communicate…
   – and be transparent.
   – the perception of being closed breeds suspicion and opposition.
      • …even 11 years after the fact, trust me on this.
Working for Unicode
New Keyboard?
• No international standardization working
  groups
• Contribute Keyboard back to main project
• Contact Local ICT Professionals
  Organization
• Contact Local University CS Department
• Contact Local Standards Body
Working for Unicode
New Language or Dialect?
• Contact the ISO/DIS 639-3 Registration
  Authority
  – http://sil.org/iso639-3/
  – iso639-3@sil.org
• Contact Language or Cultural Authority
• Contact Local University Linguistics
  Department
Working for Unicode
New Orthography? Or Un-encoded?
• Contact the ISO 15924 Registration Authority
  – http://unicode.org/iso15924/
• Contact Language or Cultural Authority
• Contact Local ICT Professionals Organization
• Contact Local University CS Department
• Contact Local University Linguistics Department
• Contact Local Standards Body
• Contact the Script Encoding Initiative
Working for Unicode
The Script Encoding Initiative
• http://linguistics.berkeley.edu/sei
• Works with users on script proposals.
• Helps raise money for script proposals to be
  written and free fonts to be created.
• Works collaboratively with other groups (e.g.
  SIL) to avoid duplication of effort.
• Helps seek experts to review proposals.
• Participates at standards meetings on behalf of
  minority groups and scholars.
                            ~fini~
• Conclusion
  – Use Unicode Now!
  – You can do it!
  – Yes you can do it!
  – There are no excuses anymore…
  – …it’s 2006 already, I’m telling you, you can do this!
  – and when you do (remember I have faith in you!) consider
    feeding back into the system via standardization.
  – Be a good citizen of earth, always ☺.

                       Thank You for Listening.
                      Are There Any Questions?

               This presentation: http://yacob.org/papers/

				