CLARIN Overview - CLARIN-ES
CLARIN:
The common language
resources and technology
infrastructure
Steven Krauwer
CLARIN Coordinator
Utrecht institute of Linguistics UiL-OTS (NL)
Overview
• Problem & Mission
• Some why-questions
• Some who-questions
• Overall plan
• What CLARIN is NOT about
• How we work
• Funding
• Structure
• Where we stand
• Some dreams
• To conclude
Steven Krauwer CLARIN - Barcelona 06-02-2009 2
The problem
• Much of the data in digital archives is language-based
• Existence often only known to insiders
• Archives mostly unconnected, even at the national
level
• Every archive has its own standards for storage and
access
• Normally only simple retrieval of files (text, audio or
video documents)
• Other tools exist but are hard to use for non-specialists
• Social sciences and humanities researchers are not
language or speech technologists
• They are often not aware of the potential benefits of
using language and speech technology
The CLARIN Mission
What:
• Create an infrastructure that makes language
resources and technology (LRT)
available to scholars of all disciplines, especially
social sciences and humanities (SSH)
How:
• Unite existing digital archives into a European
federation of archives with unified web access
• Provide existing language and speech technology
tools as web services operating on language data
in archives
Towards strong and
persistent centers
• need to add a persistent infrastructure layer on top of the existing
landscape which is formed by accidental and temporary collaborations
• should be easily accessible for everyone
• should offer high availability (always on-line) so that people can rely on it
• there will be different types of centers, depending on the service offered
• need strong national support for many years
Why a European
infrastructure?
• too much fragmentation
• lack of coordination across countries
• lack of visibility
• lack of interoperability
• lack of sustainability
• expertise exists but not in all countries
• language independent tools can be shared
• language dependent tools can often be ported
• most countries not able to bear the cost
Why now?
• Exponential growth of digital data
• Increasing maturity of language and speech
technology:
– high speed
– large volumes
– new research questions
• Growing interest at EU level in Research
Infrastructures (RI), also for soft sciences
• RI Roadmap published in 2006 by ESFRI
• includes 35 accepted proposals for RIs
• CLARIN is one of them and has EC funding for a
1-3 year preparatory phase
Who we are and where we
come from
• The CLARIN consortium now has 32 partners from
22 EU and associated countries (with more on the
waiting list)
• The CLARIN community has 148 members in 32
countries (Feb 2009)
• CLARIN is based on 4 earlier broad European
initiatives with many participants:
– LangWeb
– EARL
– TELRI
– (and later) DAM-LR
Who else do we need?
• Both our membership and our consortium are
quite unbalanced:
– Written language technology over-represented
– Speech & multimodality under-represented
– Humanities other than linguistics under-represented
– Social sciences under-represented
– Some countries and languages (national and regional)
still missing
• There is no money to extend the consortium but
we have to fill these gaps to ensure balanced
coverage
Overall plan for CLARIN
• Preparatory phase (2008-2010): Put everything
in place
• Construction phase (2011-2015): Build and
populate with tools and resources
• Exploitation phase (2016-….): CLARIN in full
service
• Budget Prep phase
• 4.1 M€ from EC
• ??? from countries (process still ongoing)
• Estimated budget until 2020: ca 200 M€
• mostly from national and regional funding agencies
• max 20% from EC (not yet formally decided)
4-dimensional approach
in the preparatory phase
First 3 years dedicated to the design:
• The technical dimension
• The language dimension
• The user dimension
• The governance and legal dimension
Technical
• Technical specification of the infrastructure
• Construction of a prototype
• Validation on rich variety of
– languages (>20)
– resources
– services
• Federation of existing archives
• Based on existing resources, tools
• Strong focus on interoperability standards
• Conversion of existing resources
• Encapsulation of existing tools
Languages
• Cover all languages spoken or studied in
participating countries, including regional
languages
• Representational and descriptive standards
should be adequate and validated for all
languages
• Same minimal coverage of basic resources and
tools for all (living) languages
• BLARK (Basic Language Resource Kit) to
be defined and implemented (funds from other
sources needed)
Language technology
activities
Activities during preparatory phase
– survey of resources and tools, including:
• encoding and annotation data
• quality indicators
– developing taxonomies and ontologies
– agreeing on common standards
Focus on
– integration of tools
– interoperability
– usage scenarios
– creating missing essential resources
– validating specifications and prototype
User
• Users are SSH scholars (including linguists,
translation experts)
• Do WE know what they need?
• Do THEY know what they need?
• Actions:
– analyze past and ongoing SSH projects
– user consultation
– launch typical example projects to show potential (see
Call for Humanities Projects)
– expertise centers
– awareness actions
Legal and ethical
IPR and ethical issues
• aim at open source, but IPR for existing and
future non-open resources must be
accommodated
• federation of archives requires authentication,
authorization and trust between archives
• aim at limited number of template license
agreements for most common cases
• respect national legislation
• address ethical issues
Governance and
Funding
Agree on e.g.:
• Who is going to pay for the construction and
exploitation of the infrastructure
• How will it be managed
• How will it be coordinated with national policies
Actions:
• Analyse best practice in funding and
management of transnational projects
• Prepare agreement between (now) 22 countries
about long term joint funding of CLARIN
What CLARIN is NOT
(yet) about
• building the infrastructure – during this phase we
are just preparing it
• creating new resources – at this stage we want to
use what is there and adapt it if necessary
• creating new applications – except maybe some
essential tools or demonstrators
• focusing on the big languages – we find all
languages equally important
• strengthening European industry – our target
audience is SSH researchers, but we don’t want
to exclude anyone
How we work (1)
Work packages:
• WP1: Management and coordination
• WP2: Designing the infrastructure and building a
prototype
• WP3: Humanities overview
• WP5: Language resources and technology
overview
• WP6: Dissemination
• WP7: IPR and business models
• WP8: Construction and exploitation agreement
How we work (2)
[Diagram: interactions, numbered 1 to 8, between the work packages
WP2 (Infrastructure Prototype), WP3 (Humanities Projects),
WP5 (LRT Exploration), WP7 (IPR, A&A, licensing) and
WP8 (Org & Legal Framework)]
How we work (3)
• Most tasks executed in Working Groups (WGs)
• WGs consist of project partners & other experts
(CLARIN is open!)
• Some WGs do work (e.g. build prototype),
others collect data or create consensus
• Participation by others essential as e.g.
standards cannot be imposed
by a small group
• Unfortunately no EC funding available for WG
participation – only reward is influence!
Funding &
what to use it for
• From EC: 4.1 M€, used for generic, language independent
tasks
• From countries: ??? M€, to be used for preparing CLARIN
at the national or regional level in every country:
– build and organize local national CLARIN communities
– support for participation in working groups (e.g. travel)
– validation tasks for own language(s)
– creation or adaptation of essential resources
– pilots and demonstrators & humanities projects
– (co-)organisation of local or international events
– preparing for future role (expertise centers,
repositories)
Structure
• Executive Board, consisting of the 7 WP leaders
plus a special representative to liaise with the
humanities community (among others through the DARIAH
sister project)
• Boards:
– Scientific Board
– Strategic Coordination Board
– International Advisory Board
• Meetings (virtual or face to face):
– Consortium meetings
– Member meetings
– Working group meetings
Where we stand
• We have just finished the 1st year (still 2 to go)
• Various working groups have been set up and are
already active – but you can still join:
http://www.clarin.eu/join-a-working-group
• We have regular workshops on various topics:
see http://www.clarin.eu/all_events
• Public documents are published on
http://www.clarin.eu/documents
• We have just launched a Call for Humanities
Projects:
http://www.clarin.eu/wp3/wp3-documents/call_final-version
Our dreams
An example:
– Ethnologists have a recording of a dance with singing,
and a transcription; they want to search for certain
textual patterns and then return to the corresponding
recorded dance fragments
– For a 3-minute recording this is no problem
– 30 minutes might just be doable
– … but what about 3, 30 or 300 hours of video?
– To do this and to save time they would need to align
media and transcriptions
– There are “aligner tools”
– But who is able to use them and will they work on the
transcription format?
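
To make the example above concrete, here is a minimal sketch of what becomes possible once a transcription has been aligned with its recording. It assumes the aligner has already produced timed segments; the `Segment` type, the `find_fragments` function and the sample data are hypothetical illustrations, not part of any CLARIN tool.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned stretch of transcription (hypothetical format)."""
    start: float  # start time in the recording, in seconds
    end: float    # end time in the recording, in seconds
    text: str     # transcribed text for this stretch


def find_fragments(segments, pattern):
    """Return (start, end, text) for every segment whose text matches pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(s.start, s.end, s.text) for s in segments if rx.search(s.text)]


# Invented sample data standing in for an aligned dance recording.
segments = [
    Segment(0.0, 12.5, "The dancers enter singing the opening verse"),
    Segment(12.5, 30.0, "Drums only, no singing"),
    Segment(30.0, 47.0, "The refrain returns and the dancers turn"),
]

hits = find_fragments(segments, r"\bdancers\b")
for start, end, text in hits:
    print(f"{start:7.1f}s - {end:7.1f}s  {text}")
```

The search itself is trivial; the hard part, as the slide notes, is producing the aligned segments and getting tools to accept the transcription format in the first place.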
… more dreams …
Another example:
• Historians want to access all material from physics, politics
and sociology to understand the reasons for the maritime
dominance of the Serene Republic of Venice
• to do this they need to search for concepts in all material,
extract summaries, relate fragments, add and exchange
comments etc
• they need to do this collaboratively
• currently this involves a huge amount of manual work to
overcome institutional, linguistic (morphological
normalization, translation) and semantic boundaries
• but who is able to carry out such work, and who can operate the
tools?
… and more
One day any SSH scholar should be able to
ask without any difficulty:
• “List all uses of enthusiasm in 19th century
English novels written by women”
• “Find all video clips of Prince Charles
talking about architecture in 2007”
• “Summarize the inaugural speech of
Obama - in Catalan”
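
As an illustration of the first request, the sketch below shows how such a question decomposes into a metadata filter plus a text search. It is only a toy under stated assumptions: the `Document` fields, the `list_uses` function and the sample records are all hypothetical, and a real federation would need agreed metadata standards to answer the query across archives.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    """A text with the metadata the query depends on (hypothetical schema)."""
    title: str
    year: int
    language: str
    genre: str
    author_gender: str
    text: str


def list_uses(docs, word, *, language, genre, century, author_gender):
    """Return (title, context) for each use of word in the matching documents."""
    first_year = (century - 1) * 100 + 1   # e.g. 1801 for the 19th century
    last_year = century * 100              # e.g. 1900 for the 19th century
    rx = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    results = []
    for d in docs:
        if (d.language == language and d.genre == genre
                and d.author_gender == author_gender
                and first_year <= d.year <= last_year):
            for m in rx.finditer(d.text):
                lo = max(0, m.start() - 30)
                hi = min(len(d.text), m.end() + 30)
                results.append((d.title, d.text[lo:hi]))
    return results


# Invented placeholder records, not real novels.
docs = [
    Document("Hypothetical Novel A", 1847, "en", "novel", "female",
             "She spoke with great enthusiasm about the journey."),
    Document("Hypothetical Novel B", 1925, "en", "novel", "female",
             "His enthusiasm faded."),
    Document("Hypothetical Novel C", 1860, "en", "novel", "male",
             "Enthusiasm filled the room."),
]

uses = list_uses(docs, "enthusiasm", language="en", genre="novel",
                 century=19, author_gender="female")
for title, context in uses:
    print(f"{title}: ...{context}...")
```

The other two requests additionally need speech or video indexing and summarization plus translation, which is exactly the kind of technology the infrastructure is meant to put within reach of non-specialists.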
To conclude (1)
• CLARIN is a long term endeavour with lots of
challenges of very different types
• For the medium and longer term I see the
following main challenges (where we could really
fail):
– Agreeing on standards and actually using them
– Persuading users to formulate requirements and to use
the infrastructure
– Making the CLARIN infrastructure resistant to
technological developments
– Securing long term funding
• In CLARIN there is room for all languages
• If it succeeds it will give a boost to SSH research
To conclude (2)
More information:
• CLARIN Website: http://www.clarin.eu
• CLARIN Office: clarin@clarin.eu
• CLARIN Newsletter (issue 4 just out):
http://www.clarin.eu/newsletter
• CLARIN Members & how to join:
http://www.clarin.eu/members
Thanks!