An Experience
of the Language
Observatory Project
Yoshiki Mikami
Leader, Language Observatory Project
Japan Science & Technology Agency
Workshop on
“Recent Experiences on Measuring Languages
on the Cyberspace”
UNESCO, Paris, February 22, 2007
Outline
1. Global Digital Divide
2. Language Observatory: How It Functions?
3. Major Findings
3.1 Survey Snapshots, Asia and Africa
3.2 Technical aspect of the Divide
3.3 Social aspect of the Divide
3.4 Several non-linguistic aspects
4. Future Agenda
Regarding Measurement
From Measurement to Empowerment
LANGUAGE OBSERVATORY 2
1. Global Digital Divide
Income, telephony
[Figure: two charts, for 1999 and 2004, showing numbers of telephone, mobile, and Internet users (millions, left axis) and population (millions, right axis) by per capita GDP bracket in US$ (100<, 215<, 464<, 1,000<, 2,154<, 4,642<, 10,000<, 21,544<). Source: ITU Statistics]
The Degree of Inequality
Telephony<Income<Internet
[Figure: Lorenz curves of accumulated GDP, number of fixed lines, number of cellular subscribers, and number of internet domains (0% to 100%) against accumulated population (0% to 100%)]
Gini-coefficient: Telephony 0.51 < GDP 0.73 < Internet 0.91
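A Gini coefficient like those quoted above measures how far a Lorenz curve (accumulated share of a quantity against accumulated population) departs from the diagonal. The following is a minimal Python sketch of that calculation for grouped data, assuming one positive population figure and one quantity per country. The function name and the sample figures are illustrative assumptions, not the project's data or code.

```python
def gini(population, quantity):
    """Gini coefficient from grouped data via the Lorenz curve (trapezoid rule).

    Assumes every population value is positive.
    """
    total_pop = sum(population)
    total_qty = sum(quantity)
    # Order groups from lowest to highest quantity per head.
    groups = sorted(zip(population, quantity), key=lambda g: g[1] / g[0])
    area = 0.0      # area under the Lorenz curve
    cum_qty = 0.0   # accumulated share of the quantity so far
    for pop, qty in groups:
        next_cum = cum_qty + qty / total_qty
        area += (pop / total_pop) * (cum_qty + next_cum) / 2
        cum_qty = next_cum
    return 1 - 2 * area


# Hypothetical figures: three countries of equal population.
equal = gini([1, 1, 1], [5, 5, 5])      # everything shared evenly
unequal = gini([1, 1, 1], [0, 0, 9])    # one country holds everything
print(equal, unequal)
```

A value near 0 means the quantity is spread in proportion to population; a value near 1 means it is concentrated in a small part of the population.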
UNESCO Recommendation
Recommendation concerning the Promotion and Use
of Multilingualism and Universal Access to
Cyberspace, October 2003
[PREAMBLE]
Noting that linguistic diversity in the global information
networks and universal access to information in
cyberspace are at the core of contemporary debates and
can be a determining factor in the development of a
knowledge-based society,
Linguistic activities moving onto the Web
                 << Real world >>                      << Cyberspace >>
                 Oral/vocal          Recorded          Web media
speak & listen   conversation/chat   -----             chat room
                 telephone           -----             email, SMS, SNS
                 conference          proceedings       web forum
listen           songs               music CD          audio files
                 radio/TV            DVD               web radio/TV
                 movie               film              [subtitled]
read             -----               advertisement     web-ads
                                     magazines         online magazine
                                     newspaper         online news
                                     book/textbook     e-books, etc.
write            -----               letter            email, SMS, SNS
                                     diary             weblogs
                                     articles          online journals
2. Language Observatory
How It Functions?
http://gii.nagaokaut.ac.jp/gii/papers.php
[Diagram labels: Internet; Crawler [UbiCrawler]; pages; Language Identifier [LI]; Tag Analysis; Content Analysis; Analysis on Language Resources; Digital Language Divide]
Sample page shown in the diagram:
<HTML><HEAD>
<TITLE>Language Observatory</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
</HEAD>
<BODY>
<A href="http://www.language-observatory.org"><IMG height=137 alt="logo" src="LO.files/logo.gif" width=155></A>
<H2>About us</H2>
<P>Astronomical observatory catches the light from stars, likewise.................
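One thing a "tag analysis" step could do with a page like the sample above is read the character encoding declared in its META tag. The sketch below uses Python's standard html.parser for this. It is an illustration under that assumption, not the project's Language Identifier, and the names CharsetFinder and declared_charset are hypothetical.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Remembers the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # Short form: <meta charset="...">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Long form: <meta http-equiv=Content-Type content="...; charset=...">
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def declared_charset(html_text):
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = ('<HTML><HEAD><TITLE>Language Observatory</TITLE>'
        '<META http-equiv=Content-Type content="text/html; charset=UTF-8">'
        '</HEAD><BODY><H2>About us</H2></BODY></HTML>')
print(declared_charset(page))
```

A declared charset is only a hint: pages that rely on a proprietary font encoding often declare nothing useful, which is why the text content itself also has to be analyzed.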
Unit of Identification = LSE
Language+Script+Encoding
Language   Script       Encoding   Note
Dari       Arabic       UTF-8      Difference of language
Farsi      Arabic       UTF-8      Difference of language
Hindi      Devanagari   UTF-8      Difference of encoding
Hindi      Devanagari   Arjun      Difference of encoding
Hindi      Devanagari   Shusha     Difference of encoding
Hindi      Devanagari   Shivaji    Difference of encoding
Azeri      Latin        Latin-1    Difference of script
Azeri      Cyrillic     KOИ-R      Difference of script
Azeri      Arabic       ASMO       Difference of script
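Because the unit of identification is the triple rather than the language alone, an LSE can be modeled as a small immutable record. The Python sketch below uses a subset of the rows above; the class and function names are illustrative assumptions, not the project's data model.

```python
from collections import Counter
from typing import NamedTuple


class LSE(NamedTuple):
    language: str
    script: str
    encoding: str


# A subset of the rows listed above.
observed = [
    LSE("Dari", "Arabic", "UTF-8"),
    LSE("Farsi", "Arabic", "UTF-8"),
    LSE("Hindi", "Devanagari", "UTF-8"),
    LSE("Hindi", "Devanagari", "Arjun"),
    LSE("Hindi", "Devanagari", "Shusha"),
    LSE("Hindi", "Devanagari", "Shivaji"),
    LSE("Azeri", "Latin", "Latin-1"),
    LSE("Azeri", "Arabic", "ASMO"),
]


def variants_per_language(lses):
    """Count how many distinct LSE units exist for each language."""
    return Counter(lse.language for lse in set(lses))


counts = variants_per_language(observed)
print(counts["Hindi"], counts["Azeri"], counts["Dari"])
```

Two records with the same script and encoding but different languages (Dari and Farsi) remain distinct units, as do records that differ only in encoding (the four Hindi rows).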
The First Workshop
on the IMLD, 2004
UNESCO reported the launch of the project
http://portal.unesco.org/ci/en/ev.php-URL_ID=14480&URL_DO=DO_TOPIC&URL_SECTION=201.html
Milestones, 2003 to 2007
Oct. 2003 UNESCO adopted the “Cyberspace Recommendation”
Oct. 2003 Project started with the support of the Japan Science and Technology Agency (JST)
Feb. 2004 The First Language Observatory Workshop
Jun. 2004 Started to collect web data with “UbiCrawler”
Aug. 2005 The first version of the Language Identification Module (LIM)
Nov. 2005 WSIS Tunis meeting inspired the collaboration with ACALAN
Feb. 2006 The first meeting of the World Network for Linguistic Diversity
Jun. 2006 Workshop at Bamako, Mali on the African Survey
Expert Collaboration
Case of African Survey
June 26-28, 2006 at Bamako, Mali ACALAN
Mali
Algeria
Burkina Faso
Ethiopia
Kenya
Malawi
Nigeria
Tunisia
CNRS, France
Researchers Network
Over 35 countries
Experts’ contributions are essential for collecting text in local encodings, providing seed URLs, and verifying LI results
3.1 Survey Snapshot
Languages on the net, Asia
[Figure: charts of language shares (%Local, %English, %Arabic, %Russian, %Others; 0% to 100%) for each country domain, as of June 2006. Domains shown: Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam, Singapore; Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran, Afghanistan; Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal, Bangladesh; Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan, Mongolia]
3.1 Survey Snapshot (cont.)
Languages on the net, Africa
[Figure: shares (0% to 100%) of English, French, Arabic, Other Languages, and African Languages for four groups of domains: All African domains, Commonwealth, Francophonie, League of Arab States; as of October 2006]
3.2 Technical Aspect
Localization Problem
“Language Localization” has been the key obstacle to the use of new information technologies since the age of type printing.
A Jesuit Friar’s letter, 1608
Six hundred versus 24
"Before I end this letter I wish to bring
before Your Paternity's mind the fact
that for many years I very strongly
desired to see in this Province some
books printed in the language and
alphabet of the land, as there are in
Malabar with great benefit for that
Christian community. And this could not
be achieved for two reasons; the first
because it looked impossible to cast so
many moulds amounting to six hundred,
whilst as our twenty-four in Europe."
[Image: Doctrina Christam in Tamil, 1578]
source: Priolkar, The Printing Press in India, Bombay, 1958
Doctrina in Tagalog, 1593
The script was eventually lost
[Image: Philippine postage stamp issued in 1995]
“Doctrina Christiana”, bilingual version, printed in Tagalog in Tagalog script, in Tagalog in Latin script, and in Spanish in Latin script.
Encoding Chaos leads to
delay of localization
Language     Standard encoding and its share   Examples of other encodings found [note]
Turkish      ISO 8859 (99.5%)
Hebrew       ISO 8859 (87.7%)
Vietnamese   UTF-8 (96.4%)                     TCVN, VIQR, VPS
Thai         TIS 620 (97.3%)
Mongolian    UTF-8 (95.5%)                     Latin-Cyrillic
Sinhala      UTF-8 (44.5%)                     Metta, Kaputa, etc.
Telugu       UTF-8 (16.6%)                     Shree, TLH, etc.
Tamil        UTF-8 (14.9%)                     Amudham, Kumudam, Shree, Vikatan, etc.
Burmese      UTF-8 (0.7%)                      WinResearcher, etc.
Note: Local proprietary encodings are shown in this table by the names of fonts (font families). As of June 2006.
Unavailability of search engines: another problem
Script:         Latin | Cyrillic | Arabic | hanzi | Indic | Others
Region
Europe:         Major European languages (17) | Russian | --- | --- | --- | Greek
Asia, Africa:   Indonesian | --- | Arabic | Chinese/Japanese/Korean | --- | Hebrew
[The label "Google" separates the two rows above from the row below.]
(no region):    African languages, Tagalog, etc. | Bulgarian, Ukrainian, Belarusian, Central Asian | Farsi, Urdu, Pashtu, etc. | (none listed) | Indic, Thai, Lao, Khmer, Myanmar, Tibetan, etc. | Ethiopic, Georgian, Armenian, Divehi
As of June 2006
Technical Aspect of the
Digital Language Divide
[Diagram relating actors (local media, local IT firms, global IT firms, gov., users, Int’l standard bodies, overseas communities) to the following factors: lack of standard in typewriter keyboard; differentiation strategy to enclose customers; less attention from IT vendors; encoding chaos; delay in localization; non-availability of search engines (SEs); lack of leadership in standardization process; difficulty in access to standardization; various localization by overseas communities]
3.3 Social Aspect: languages
in multilingual society
Personal domain: Conversation, mail, phone, blog, magazines, newspaper, novel, songs, etc.
Public domain: Official documents, laws and regulations, traffic signs, contract, legal, etc.
Occupational domain: Business letter, invoice, manual, contract, name card, packaging, etc.
Educational domain: Textbook, academic journal, dictionary, scientific communication, etc.
Based on EU’s “Common European Framework of Reference for Languages” (2004)
Language plays a different role in multilingual society
Secondary level domain -> Socio-economic domain
ac.xx  -> educational
com.xx -> occupational
gov.xx -> public
others -> personal
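The mapping on this slide (ac.xx to educational, com.xx to occupational, gov.xx to public, everything else to personal) can be sketched as a lookup on the label just left of the country code. This is a minimal Python illustration with hypothetical hostnames. Treating "co" like "com" is an assumption based on the "co" rows in the later secondary domain analysis, and the function name is not from the project.

```python
# Mapping taken from the slide; the "co" entry is an assumption (see lead-in).
SLD_TO_DOMAIN = {
    "ac": "educational",
    "com": "occupational",
    "co": "occupational",
    "gov": "public",
}


def socio_economic_domain(hostname):
    """Classify a hostname under a country-code domain by its secondary level label."""
    labels = hostname.lower().rstrip(".").split(".")
    # The secondary level label sits immediately left of the country code.
    # Hosts registered directly under the country code have none.
    sld = labels[-2] if len(labels) >= 3 else ""
    return SLD_TO_DOMAIN.get(sld, "personal")


# Hypothetical hostnames for illustration only.
for host in ["www.example.ac.ir", "shop.example.co.ir",
             "www.example.gov.cy", "example.tr"]:
    print(host, "->", socio_economic_domain(host))
```

Any label not in the table, including a site's own name when it is registered directly under the country code, falls into the slide's "others" bucket.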
Specialization of Language
Secondary domain analysis
[Figure: language shares by secondary level domain (ac, com or co, gov, others) for four country domains. Cyprus: English, Greek, Others, Turkish. Turkey: English, Others, Turkish, Tatar. Kazakhstan: English, Russian, Others, Kazakh. Iran: English, Arabic, Others, Farsi.]
Social Aspect of the Digital
Language Divide
[Diagram labels: restricted social activities; non availability of SEs; local business; IT firms; global e-business; local users; users; media, press; overseas community; primary and secondary education; higher education; gov.; low literacy; absence of mother language]
3.4 Non-linguistic Aspects
a. Network and Server
[Figure legend: ○ rw: Rwanda; △ ml: Mali; □ mz: Mozambique. White: servers installed in the country. Colored: servers installed overseas.]
80% of servers under African domains are located outside the country. 60% of servers in Asian domains are also “offshore”.
as of December 2005
Complaint against access
A letter from Namibia
I am the web master of the XXXXXXX Database. We are
being severely hit by your Language Observatory's web
crawler - already 37000 page hits this month. In
December 2005 you hit us 34000 times. We are on limited
bandwidth, and this puts unacceptable strain on our
server. I notice that you consider one HTTP request
every 5 seconds 'polite' and 'modest'. This may be true
in Japan, but not in Africa - our connections are very
slow and very narrow.
I would appreciate it if you could prevent your crawlers
from visiting our URL again. In return, I will be happy
to provide you directly with whatever statistics about
our site you need for your research.
Sincerely

We carefully control data collection speed using a set of parameters, such as revisiting interval, depth, maximum pages per server, and a prohibition URL list.
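The parameters named above (revisiting interval, depth, maximum pages per server, prohibition URL list) can be pictured as a small policy object that a crawler consults before each request. The Python sketch below is illustrative only: the class, its field names, and all default values are hypothetical, and it does not describe how UbiCrawler is actually configured.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    # All default values below are hypothetical placeholders.
    revisit_interval_days: int = 30
    max_depth: int = 5
    max_pages_per_server: int = 1000
    prohibited_prefixes: tuple = ()
    pages_fetched: dict = field(default_factory=dict)  # host -> pages so far

    def may_fetch(self, url, depth, days_since_last_visit=None):
        """Return True only if every limit allows fetching this URL now."""
        if any(url.startswith(prefix) for prefix in self.prohibited_prefixes):
            return False
        if depth > self.max_depth:
            return False
        if (days_since_last_visit is not None
                and days_since_last_visit < self.revisit_interval_days):
            return False
        host = urlparse(url).netloc.lower()
        return self.pages_fetched.get(host, 0) < self.max_pages_per_server

    def record_fetch(self, url):
        """Count one fetched page against the URL's server."""
        host = urlparse(url).netloc.lower()
        self.pages_fetched[host] = self.pages_fetched.get(host, 0) + 1


policy = CrawlPolicy(max_pages_per_server=2,
                     prohibited_prefixes=("http://blocked.example/",))
print(policy.may_fetch("http://blocked.example/db", depth=1))
print(policy.may_fetch("http://site.example/a", depth=1))
```

A webmaster's request like the one in the letter would be honored here by adding the site's URL prefix to the prohibition list, while low-bandwidth servers in general can be protected by lowering the per-server page limit.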
b. Domain Governance
[Figure: log-log scatter plot of the number of pages (1.E+00 to 1.E+08) against population in thousands (1.E+00 to 1.E+06) for African country-code domains, as of December 2005. Domains spelled out in the plot: ZA: South Africa; ST: Sao Tome & Principe; DJ: Djibouti; AC: Ascension; SH: St. Helena; SC: Seychelles; IO: Indian Ocean Territory]
Management of small islands’ domains is often re-delegated to overseas web-hosting operators, who tend to admit spam, porn, etc.
c. Access regulations
by the government
[Figure: scatter plot of Press Linkage Ratio (0 to 2.5) against TV receivers per population (0.00 to 0.70), one point per country-code domain]
Countries where only state-controlled TV stations are available show a higher percentage of links going to global news sites abroad.
4. Future Agenda
Regarding Measurement
Improvement of accuracy and coverage
Multi-stakeholder Collaboration
Global Observatories Network
From Measurement to Empowerment
A Goals/Targets/Indicators system that helps and guides stakeholders in empowering languages
World Network for
Linguistic Diversity
“Language Empowerment”
[Diagram labels: mother language for creation; language community; localization of application SW based on standard; local language search engines; OSS; language developers; portal; mother-language information processing technology; promotion of NLP; IT firms; OCR, TTS, MT; e-dictionary, etc.; media, press; mother language higher education; rich mother-language content; creation of local contents; electronic delivery of public services; gov; users; mother language literacy; mother language use in higher education]
Millennium Development
Goals: Structure
Thanks for your attention
Jehan Rictus Square, Paris
photo: courtesy by Wunna Ko Ko, June 2005