					                   An Experience
                  of the Language
               Observatory Project
                            Yoshiki Mikami
       Leader, Language Observatory Project
        Japan Science & Technology Agency

                               Workshop on
“Recent Experiences on Measuring Languages
                         on the Cyberspace”
          UNESCO, Paris, February 22, 2007
Outline
1.    Global Digital Divide
2.    Language Observatory: How It Functions?
3.    Major Findings
     3.1   Survey Snapshots, Asia and Africa
     3.2   Technical aspect of the Divide
     3.3   Social aspect of the Divide
     3.4   Several non-linguistic aspects
4.    Future Agenda
     Regarding Measurement
     From Measurement to Empowerment

                        LANGUAGE OBSERVATORY    2
1. Global Digital Divide
Income, telephony
[Figure: two panels, 2004 (top) and 1999 (bottom). Horizontal axis: per capita GDP bracket in US$ (100<, 215<, 464<, 1,000<, 2,154<, 4,642<, 10,000<, 21,544<). Left axis: telephone, mobile and Internet, in millions (0 to 800 for 2004, 0 to 400 for 1999). Right axis: population, in millions (0 to 2000).]
 Source: ITU Statistics
The Degree of Inequality
Telephony<Income<Internet
[Figure: Lorenz curves. Horizontal axis: accumulated population (0% to 100%). Vertical axis: accumulated numbers (0% to 100%) of GDP, fixed lines, cellular subscribers and internet domains.]
Gini-coefficient: Telephony 0.51 < GDP 0.73 < Internet 0.91
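
A Gini coefficient summarizes a Lorenz curve as one number: one minus twice the area under the curve. The following is a minimal sketch of that calculation, assuming per-country population and resource totals are available as two lists of equal length with positive populations. The function name and the sample figures in the usage line are illustrative and are not the project's data.

```python
def gini(population, resource):
    """Population-weighted Gini coefficient from a piecewise-linear Lorenz curve."""
    # Order countries from the lowest to the highest per-capita value.
    pairs = sorted(zip(population, resource), key=lambda pr: pr[1] / pr[0])
    total_pop = sum(p for p, _ in pairs)
    total_res = sum(r for _, r in pairs)

    cum_pop = 0.0   # accumulated population share
    cum_res = 0.0   # accumulated resource share
    area = 0.0      # area under the Lorenz curve (trapezoid rule)
    for p, r in pairs:
        next_pop = cum_pop + p / total_pop
        next_res = cum_res + r / total_res
        area += (next_pop - cum_pop) * (cum_res + next_res) / 2
        cum_pop, cum_res = next_pop, next_res
    return 1 - 2 * area


# Hypothetical example: two equally sized countries, one holding every domain.
print(gini([1, 1], [0, 10]))
```

A value of 0 means the resource is spread exactly in proportion to population; values close to 1 mean it is concentrated in a small share of the population.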
UNESCO Recommendation

Recommendation concerning the Promotion and Use
  of Multilingualism and Universal Access to
  Cyberspace, October 2003

[PREAMBLE]
 Noting that linguistic diversity in the global information
  networks and universal access to information in
  cyberspace are at the core of contemporary debates and
  can be a determining factor in the development of a
  knowledge-based society,


Linguistic activities
moving onto the Web
                 << Real world >>                     << Cyberspace >>
                 Oral/vocal          Recorded         Web media
speak & listen   conversation/chat   -----            chat room
                 telephone           -----            email, SMS, SNS
                 conference          proceedings      web forum
listen           songs               music CD         audio files
                 radio/TV            DVD              web radio/TV
                 movie film                           [subtitled]
read             -----               advertisement    web-ads
                                     magazines        online magazine
                                     newspaper        online news
                                     book/textbook    e-books, etc.
write            -----               letter           email, SMS, SNS
                                     diary            weblogs
                                     articles         online journals
2. Language Observatory
How It Functions?
http://gii.nagaokaut.ac.jp/gii/papers.php

[Diagram: Internet -> Crawler [ UbiCrawler ] -> pages -> Language Identifier [ LI ], supported by Language Resources -> Analysis on Digital Language Divide]

Sample page. The markup is used for Tag Analysis and the text for Content Analysis:
<HTML><HEAD>
<TITLE>Language Observatory</TITLE>
<META http-equiv=Content-Type
   content="text/html; charset=UTF-8">
</HEAD>
<BODY>
<A href = "http://www.language-observatory.org"><IMG height=137
   alt="logo" src = "LO.files/logo.gif"
   width=155></A>
<H2>About us</H2>
<P>Astronomical observatory catches the
   light from stars, likewise.................
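
Tag analysis of a fetched page can be sketched with Python's standard `html.parser` module. The class below is an illustration only and is not the project's code: the names `TagAnalyzer` and `analyze_tags` and the choice of fields are assumptions. It reads the charset declared in the META tag, the page title and the link targets.

```python
from html.parser import HTMLParser


class TagAnalyzer(HTMLParser):
    """Collects the declared charset, the title and the link targets of a page."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if (attributes.get("http-equiv") or "").lower() == "content-type":
                self.charset = self._charset_from(attributes.get("content") or "")
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    @staticmethod
    def _charset_from(content):
        # Parses values such as "text/html; charset=UTF-8".
        for part in content.split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset":
                return value.strip()
        return None


def analyze_tags(html_text):
    """Hypothetical helper: parse one page and return the filled analyzer."""
    parser = TagAnalyzer()
    parser.feed(html_text)
    parser.close()
    return parser
```

The declared charset is only a hint. Pages in local proprietary encodings often declare nothing useful, which is why the text content is analyzed as well.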
Unit of Identification = LSE
Language+Script+Encoding
Language   Script       Encoding   Note
Dari       Arabic       UTF-8      difference of language
Farsi      Arabic       UTF-8      difference of language
Hindi      Devanagari   UTF-8      difference of encoding
Hindi      Devanagari   Arjun      difference of encoding
Hindi      Devanagari   Shusha     difference of encoding
Hindi      Devanagari   Shivaji    difference of encoding
Azeri      Latin        Latin-1    difference of script
Azeri      Cyrillic     KOИ-R      difference of script
Azeri      Arabic       ASMO       difference of script
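
One way to hold the language, script and encoding triple is as an immutable record that can serve as a counting key. This is a sketch under that assumption; the class and function names are hypothetical, and the sample values in the usage lines come from the table of LSE examples, not from survey results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LSE:
    """Unit of identification: language + script + encoding."""
    language: str
    script: str
    encoding: str


def count_by_lse(identified_pages):
    """Count pages per LSE triple; input is one LSE value per page."""
    return Counter(identified_pages)


def pages_per_language(lse_counts):
    """Merge triples that differ only in script or encoding."""
    totals = Counter()
    for lse, pages in lse_counts.items():
        totals[lse.language] += pages
    return totals


# Hypothetical usage with three identified pages.
counts = count_by_lse([
    LSE("Hindi", "Devanagari", "UTF-8"),
    LSE("Hindi", "Devanagari", "Shusha"),
    LSE("Azeri", "Latin", "Latin-1"),
])
print(pages_per_language(counts))
```

Keeping the triple intact until the last step lets the same counts answer questions about languages, scripts and encodings separately.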
The First Workshop
on the IMLD, 2004




                                 UNESCO reported the launch of the project
http://portal.unesco.org/ci/en/ev.php-URL_ID=14480&URL_DO=DO_TOPIC&URL_SECTION=201.html

Milestones, 2003 to 2007
Oct. 2003   UNESCO adopted the “Cyberspace Recommendation”
Oct. 2003   Project started with the support of the Japan Science and
            Technology Agency (JST)
Feb. 2004   The First Language Observatory Workshop
Jun. 2004   Started to collect web data with “UbiCrawler”
Aug. 2005   The first version of the Language Identification Module (LIM)
Nov. 2005   The WSIS Tunis meeting inspired the collaboration with
            ACALAN
Feb. 2006   The first meeting of the World Network for Linguistic
            Diversity
Jun. 2006   Workshop in Bamako, Mali, on the African Survey
Expert Collaboration
Case of African Survey
       June 26-28, 2006 at Bamako, Mali   ACALAN
                                          Mali
                                          Algeria
                                          Burkina Faso
                                          Ethiopia
                                          Kenya
                                          Malawi
                                          Nigeria
                                          Tunisia
                                          CNRS, France

Researchers Network
Over 35 countries




Experts’ contributions are essential for collecting text in local
encodings and seed URLs, and for verifying LI results
3.1 Survey Snapshot
Languages on the net, Asia
[Figure: four panels of stacked bars (0% to 100%), one bar per country domain, showing the shares of %Local, %English, %Arabic, %Russian and %Others.
 Panel 1: Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam, Singapore.
 Panel 2: Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran, Afghanistan.
 Panel 3: Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal, Bangladesh.
 Panel 4: Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan, Mongolia.]
as of June 2006
3.1 Survey Snapshot (cont.)
Languages on the net, Africa
[Figure: stacked bars (0% to 100%) showing the shares of English, French, Arabic, Other Languages and African Languages for four groups of domains: All African domains, Commonwealth, Francophonie, League of Arab States.]

as of October 2006
3.2 Technical Aspect
Localization Problem




                         “Language Localization” has
                          been the key obstacle to the
                          use of new information
                          technologies since the age of
                          movable-type printing.
A Jesuit Friar’s letter, 1608
Six hundred versus 24
                         "Before I end this letter I wish to bring
                         before Your Paternity's mind the fact
                         that for many years I very strongly
                         desired to see in this Province some
                         books printed in the language and
                         alphabet of the land, as there are in
                         Malabar with great benefit for that
                         Christian community. And this could not
                         be achieved for two reasons; the first
                         because it looked impossible to cast so
                         many moulds amounting to six hundred,
                         whilst as our twenty-four in Europe."

 Doctrina Christam       source: Priolkar, The Printing Press in India,
 in Tamil, 1578          Bombay, 1958
Doctrina in Tagalog, 1593
The script was finally lost




                                                              Philippine
                                                              postage stamp
                                                              issued in 1995



 “Doctrina Christiana”, bilingual version, printed in Tagalog in Tagalog script /
 in Tagalog in Latin script / in Spanish in Latin script.
Encoding Chaos leads to
delay of localization
Language     Standard encoding and its share   Examples of other encodings found [note]
Turkish      ISO 8859 (99.5%)
Hebrew       ISO 8859 (87.7%)
Vietnamese   UTF-8 (96.4%)                     TCVN, VIQR, VPS
Thai         TIS 620 (97.3%)
Mongolian    UTF-8 (95.5%)                     Latin-Cyrillic
Sinhala      UTF-8 (44.5%)                     Metta, Kaputa, etc.
Telugu       UTF-8 (16.6%)                     Shree, TLH, etc.
Tamil        UTF-8 (14.9%)                     Amudham, Kumudam, Shree, Vikatan, etc.
Burmese      UTF-8 (0.7%)                      WinResearcher, etc.
note: Local proprietary encodings are shown in this table by the names of their fonts (font families).
as of June 2006
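
The share of a standard encoding for one language is the percentage of that language's identified pages that use it. The sketch below assumes the identified encoding of each page is available as a list of strings; the function name and the sample list are hypothetical.

```python
from collections import Counter


def standard_encoding_share(page_encodings, standard):
    """Percentage of pages whose identified encoding is the standard one."""
    counts = Counter(page_encodings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pages were identified for this language")
    return 100.0 * counts[standard] / total


# Hypothetical sample: four pages in one language, three of them in UTF-8.
print(standard_encoding_share(["UTF-8", "UTF-8", "UTF-8", "Shree"], "UTF-8"))
```

A low share signals that most content sits in font-specific encodings, which generic search and language tools cannot read without a converter.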
Unavailability of search
engines: another problem
Region \ Script   Latin              Cyrillic      Arabic    hanzi               Indic        Others
Europe            Major European     Russian       ---       ---                 ---          Greek
                  languages (17)
Asia,             Indonesian         ---           Arabic    Chinese/Japanese/   ---          Hebrew
Africa                                                       Korean
                                                   Google
                  African            Bulgarian,    Farsi,                        Indic,       Ethiopic,
                  languages,         Ukrainian,    Urdu,                         Thai, Lao,   Georgian,
                  Tagalog,           Belarusian,   Pashtu,                       Khmer,       Armenian,
                  etc.               Central       etc.                          Myanmar,     Divehi
                                     Asian                                       Tibetan,
                                                                                 etc.

As of June 2006
Technical Aspect of the
Digital Language Divide
[Diagram: actors around the users and the factors that link them to the central problems.
 Central problems: encoding chaos, delay in localization, non-availability of search engines (SEs).
 Local media and local IT firms: differentiation strategy to enclose customers.
 Global IT firms: less attention from IT vendors.
 Gov.: lack of leadership in standardization.
 Int’l standard bodies: difficulty in access to standardization process.
 Overseas communities: various localization.
 Also shown: lack of standard in typewriter keyboard.]
3.3 Social Aspect: languages
in multilingual society
Personal          Public            Occupational       Educational
domain            domain            domain             domain
Conversation,     Official          Business           Textbook,
mail, phone,      documents,        letter, invoice,   academic
blog,             laws and          manual,            journal,
magazines,        regulations,      contract,          dictionary,
newspaper,        traffic signs,    name card,         scientific
novel, songs,     contract,         packaging,         communication,
etc.              legal, etc.       etc.               etc.

Based on the Council of Europe’s “Common European Framework of Reference for Languages”
(2004)
Language plays a different
role in multilingual society
Secondary level domain    Socio-economic domain
ac.xx                     educational
com.xx                    occupational
gov.xx                    public
others                    personal
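
The mapping from secondary level domain to socio-economic domain can be sketched as a small lookup on the hostname. This is an illustration, not the project's code. Two choices are assumptions: the function name, and the treatment of `co` as equivalent to `com`.

```python
# Secondary level label -> socio-economic domain.
SLD_TO_DOMAIN = {
    "ac": "educational",
    "com": "occupational",
    "co": "occupational",   # assumption: "co" treated like "com"
    "gov": "public",
}


def socio_economic_domain(hostname):
    """Classify a hostname under a country-code domain by its secondary label."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 3:
        # No secondary level label, e.g. "example.xx": counted as "others".
        return "personal"
    return SLD_TO_DOMAIN.get(labels[-2], "personal")


# Hypothetical hostnames.
for host in ["www.example.ac.cy", "shop.example.com.tr", "example.ir"]:
    print(host, socio_economic_domain(host))
```

Combining this label with the identified language of each page gives the language mix per socio-economic domain for a country.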
Specialization of Language
Secondary domain analysis
[Figure: four panels (Cyprus, Turkey, Kazakhstan, Iran) of bars showing the language mix per secondary level domain (ac, com or co, gov, others).
 Cyprus legend: English, Greek, Others, Turkish.
 Turkey legend: English, Others, Turkish, Tatar.
 Kazakhstan legend: English, Russian, Others, Kazakh.
 Iran legend: English, Arabic, Others, Farsi.]
Social Aspect of the Digital
Language Divide
[Diagram: users at the center, surrounded by actors and the problems that link them.
 Actors: local business, global IT firms, e-business, local media, media and press, overseas community, gov.
 Education levels shown: primary and secondary education, higher education.
 Problems: non-availability of SEs, restricted social activities, low literacy, absence of mother language.]
3.4 Non-linguistic Aspects
a. Network and Server
[Figure: server locations for three country domains. Markers: ○ rw: Rwanda, △ ml: Mali, □ mz: Mozambique. White: servers installed in the country. Colored: servers installed overseas.]
80% of servers under African domains are located outside of the country. 60% of servers in Asian domains are also “offshore”.

as of December 2005
Complaint against access
A letter from Namibia
 I am the web master of the XXXXXXX Database. We are
 being severely hit by your Language Observatory's web
 crawler - already 37000 page hits this month. In
 December 2005 you hit us 34000 times. We are on limited
 bandwidth, and this puts unacceptable strain on our
 server. I notice that you consider one HTTP request
 every 5 seconds 'polite' and 'modest'. This may be true
 in Japan, but not in Africa - our connections are very
 slow and very narrow.
 I would appreciate it if you could prevent your crawlers
 from visiting our URL again. In return, I will be happy
 to provide you directly with whatever statistics about
 our site you need for your research.
 Sincerely

We carefully control data collection speed using a set of parameters, such as revisiting interval, depth, maximum pages per server, and a prohibited URL list.
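
The collection parameters named here (revisiting interval, depth, maximum pages per server, prohibited URL list) can be sketched as a policy object with a single check. This is not UbiCrawler's configuration format or API. All names and default values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlPolicy:
    """Hypothetical container for the collection-speed parameters."""
    revisit_interval_days: int = 30
    max_depth: int = 5
    max_pages_per_server: int = 1000
    prohibited_prefixes: list = field(default_factory=list)


def may_fetch(url, depth, pages_from_server, days_since_last_visit, policy):
    """Return True only if fetching the URL respects every parameter."""
    if any(url.startswith(prefix) for prefix in policy.prohibited_prefixes):
        return False                      # site asked not to be crawled
    if depth > policy.max_depth:
        return False                      # too far from the seed URL
    if pages_from_server >= policy.max_pages_per_server:
        return False                      # per-server quota used up
    if (days_since_last_visit is not None
            and days_since_last_visit < policy.revisit_interval_days):
        return False                      # visited too recently
    return True


# Hypothetical usage: a site that complained is put on the prohibited list.
policy = CrawlPolicy(prohibited_prefixes=["http://example.org/db/"])
print(may_fetch("http://example.org/db/record1", 1, 10, None, policy))
```

Putting a complaining site on the prohibited list takes effect on the next fetch decision, without restarting the crawl.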
b. Domain Governance
[Figure: log-log scatter plot of African country-code domains. Vertical axis: pages (1.E+00 to 1.E+08). Horizontal axis: population in thousands (1.E+00 to 1.E+06). Points are labeled by country code; those spelled out on the chart are AC: Ascension, SH: St. Helena, ST: Sao Tome & Principe, DJ: Djibouti, SC: Seychelles, IO: Indian Ocean Territory, and ZA: South Africa.]
Management of small islands’ domains is often re-delegated to overseas web-hosting operators, who tend to admit spam, porn, etc.
as of December 2005
c. Access regulations
by the government
[Figure: scatter plot of country-code domains. Vertical axis: Press Linkage Ratio (0 to 2.5). Horizontal axis: TV receivers per population (0.00 to 0.70). Points are labeled by country code, e.g. np, tm, tj, af, kg.]
Countries where only state-controlled TV stations are available show a higher percentage of links going to global news sites abroad.
4. Future Agenda
       Regarding Measurement
        Improvement of accuracy and coverage
        Multi-stakeholder Collaboration
        Global Observatories Network


       From Measurement to Empowerment
         A Goals/Targets/Indicators system that helps and
         guides stakeholders in empowering languages
World Network for
Linguistic Diversity




“Language Empowerment”
Mother language for creation
[Diagram: “mother language for creation” at the center, surrounded by actors and goals.
 Actors: language community, OSS developers, IT firms, media and press, higher education, gov, users.
 Goals: local language search engines; language portal; localization of application SW based on standard; promotion of NLP (OCR, TTS, MT, e-dictionary, etc.); creation of local contents; electronic delivery of public services; literacy; mother language use in higher education.]
Millennium Development
Goals: Structure




Thanks for your attention




                Jehan Rictus Square, Paris
                photo: courtesy of Wunna Ko Ko, June 2005