KNOW2: Language understanding technologies for multilingual domain-oriented information access
KNOW2: Tecnologías de comprensión del lenguaje para el acceso multilingüe a la información orientada a dominios

Eneko Agirre, German Rigau (EHU, IxA taldea)
Irene Castellón (UB, GRIAL)
Salvador Climent (UOC, GRIAL)
Jordi Turmo, Lluis Padró (UPC, TALP)

Resumen: El objetivo de KNOW2 es avanzar en el desarrollo de un entorno integrado que permita la implantación a bajo coste de portales verticales de acceso a la información para dominios concretos. El proyecto tiene una duración de tres años y acaba de comenzar en enero del 2010.
Palabras clave: Procesamiento del Lenguaje Natural, Análisis Sintáctico, Interpretación Semántica, Adquisición de Conocimiento, Extracción de Información, Recuperación de Información
Abstract: The goal of the project is to explore integrated environments allowing the cost-effective deployment of vertical information access portals for specific domains. The project started in January 2010, and will last three years.
Keywords: Natural Language Processing, Syntactic Analysis, Semantic Interpretation, Knowledge Acquisition, Information Extraction, Information Retrieval

1. General description

New forms of (multilingual) information access (MLIA, IA) based on Natural Language Processing (NLP, especially featuring semantic information) are being adopted by major companies such as Google, Microsoft or Yahoo: Question Answering has been deployed (PowerSet, now part of Microsoft; Yahoo Answers; Google), IA centered on entities is being explored (Spock, Yahoo, Silobreaker) alongside new navigation strategies (MMexplorer), and cross-lingual IA has been deployed by major search engines (Google).

Our project is based on the idea that automatic text processing, especially in the semantic layer, is already enabling a new generation of MLIA systems. In order to acquire the required knowledge and process free-running text accurately, our strategy has three interconnected threads: (1) We need to focus on specific domains, and thus apply text mining and domain adaptation techniques to improve NLP tools and resources, including inference and reasoning capabilities. (2) The users and domain experts need to be included in the loop, via collaborative interfaces to the acquired knowledge. (3) The acquired knowledge should make it possible to build vertical IA portals for specific domains cost-effectively.

2. Relation to other projects

KNOW2 builds on the results of KYOTO and KNOW. KYOTO is a three-year European project which proposes a system that allows people in communities to define the meaning of their words and terms in a shared Wiki platform, so that this meaning becomes anchored across languages and cultures, and also so that a computer can use this knowledge to detect knowledge and facts in text. We plan to use and further develop the software and expertise gathered in KYOTO.

KNOW is the predecessor of KNOW2, and it already enhanced Cross-Lingual IA and Question Answering technology with improved NLP technologies for the open domain. With respect to KNOW, KNOW2 aims to obtain better performance by using two main strategies: (i) moving from general to specific domains and (ii) incorporating text mining and collaborative interfaces.
3. Project coordination

The ambitious goals of the project can only be achieved by gathering a critical mass of researchers. For this reason KNOW2 has been designed as a coordinated project integrating the research and the multilingual abilities of three groups, which are structured in three subprojects with well-defined goals:

Subproject 1 (EHU) focuses on management and design, development of collaborative interfaces, reasoning and inference layers, linguistic processors for Basque, question answering, extraction of multilingual lexical knowledge, adaptation of linguistic processors to the domain, integration of the knowledge gathered in the rest of the subprojects, and evaluation.

Subproject 2 (UPC) focuses on the study, evaluation and comparison of advanced text mining techniques to support the building of domain ontologies; this goal involves enhancement of machine learning techniques and improvements in syntactic-semantic processors and knowledge acquisition for text classification, information extraction, question answering and textual entailment.

Subproject 3 (UOC-UB) focuses on linguistic research for developing semantic processors and on building lexical-semantic knowledge bases (WordNets) for Spanish and Catalan using Machine Translation and Computer-Assisted Translation techniques.

4. Specific objectives

The main objective is to improve current MLIA systems with research that enables the construction of an integrated environment allowing the cost-effective deployment of vertical IA portals for domains, which comes down to the following specific objectives:
- Adoption of current standards for the representation of linguistic annotations, both of documents and of semantic resources. This will enable easier interoperability and an easier adoption of KNOW2 technology by industry. In addition, KNOW2 will support free software licenses for all developed tools and resources.
- Development of robust linguistic processors, including semantic processing, for Basque, Catalan and Spanish; procedures to adapt those processors, and English ones, to the target domain; analysis of discourse structure.
- Development of knowledge mining techniques, which will mine domain texts and enrich (and adapt to the domain) current multilingual knowledge bases with concepts, relations and factual events. The acquisition will be driven by automatically captured document collections.
- Development of a collaborative interface to the domain knowledge. This wiki-style interface will allow the user community to manage the whole process, including the editing of the acquired concepts, domain ontologies, and the extraction rules.
- Integration of all acquired knowledge in a single Multilingual Central Repository. Development of a semantic engine which will include new techniques for automatic reasoning and inference, and which will be adapted to the domain.
- Development of prototypes for monolingual and multilingual IA to the documents and the factual information extracted from them. These will include Information Retrieval, Cross-Lingual IA and Question Answering demonstrators.
- Resources, tools, and applications will be evaluated in international benchmarks and competitions whenever possible.

5. Defining use cases in real scenarios

KNOW2 will produce demonstrators and prototypes for different use cases in real scenarios related to specific domains, such as environment, European parliament, geographic text and/or popular science and technology (including public portals like [...] and BasqueResearch, part of AlphaGalileo). We are currently working on the definition of this set of use cases together with collaborating companies (EPOs). In this sense, we are open to any kind of suggestions from interested companies.

KNOW2 wants to apply state-of-the-art research to real scenarios. The adoption of recent representation standards and free software licenses should facilitate technology transfer to industrial environments.
