
                          Language Resources
                    (lexicons, corpora, ontologies, …)
               Language standards and technologies (cont.)


                                … and Projects

                                 Nicoletta Calzolari
                   Istituto di Linguistica Computazionale - CNR - Pisa
                                 glottolo@ilc.cnr.it
      With many others at ILC


N. Calzolari                       PhD programme, Pisa, May 2009
                            SIMPLE Model for a BioLexicon
                Design a representational model for a BioLexicon, a
                 comprehensive lexical resource
                         able to integrate terminological, lexical and ontological info
                         compatible with HLT international standards (i.e. ISO)
                         able to meet the domain-specific requirements


               Implement a BioLexicon database, a container with lexical
                objects to be filled with data provided by “populators” (EBI,
                UoM & CNR-ILC)
                  – able to be automatically incremented with new terms and linguistic info
                    extracted from texts



from Valeria Quochi
                               BioLexicon Building cycle

                          Term Repository                                Bio-Lexicon Population
                            Gather terms EBI                            variants; synt info of terms UoM




                                                Bio-Lexicon
                                               Conceptual model
                                                and physical DB
                                                     ILC




                      Bio-events                                      Terminology to Ontology
               extraction of bio-events ILC                                   Jena/Rennes/EBI




from Valeria Quochi
                            The BioLexicon: where from
                                                     Incremental population process
  Existing repositories
                            chemical compounds, species names, disease, enzymes

                             genes/proteins      Subclustering of
                                                  term variants

                            new genes/proteins names
    MEDLINE
                               Named Entity                    Term Mapping by         BioLexicon
                               Recognition                      Normalisation

                                                  Verbs, nouns, adjs, advs (variants,
                               Manual curation
                                                  inflected forms, derivative relations, ...)

                                                                 Subcat extraction
                            Linguistic pre-processing
                                                                                        Syn-sem
                                                                                        mapping
                              Manual annotation of a
                                                                Bio-event extraction
                                bio-event corpus
from Simonetta Montemagni
   BioLexicon Model: High-level lexical
   objects, Data Categories
                                                                      e.g.
                                                                      <feat att="POS" val="VVZ"/>
                                                                      <feat att="ConfScore" val="0.9"/>
                                                                      <feat att="source" val="UNIPROT"/>
                                                                      ……
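
The attribute/value encoding of Data Categories shown above can be read back into a plain mapping. The following is a minimal sketch, not the BioLexicon tooling: the wrapper element `object` and the function name `read_feats` are assumptions made only for this illustration.

```python
import xml.etree.ElementTree as ET

# The three features from the slide, written as well-formed XML.
FEATS = """
<feat att="POS" val="VVZ"/>
<feat att="ConfScore" val="0.9"/>
<feat att="source" val="UNIPROT"/>
"""


def read_feats(xml_fragment):
    """Return the Data Categories of one lexical object as {att: val}."""
    # A fragment has no single root, so wrap it in a hypothetical <object>.
    root = ET.fromstring(f"<object>{xml_fragment}</object>")
    return {feat.get("att"): feat.get("val") for feat in root.iter("feat")}


features = read_feats(FEATS)
print(features["POS"], features["source"])
```

All values stay strings here; converting `ConfScore` to a number would be a further design decision.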




                              Syntax




                  Semantics

from Valeria Quochi
                      GeneRegOnto – BioLex
                      Concepts to Predicates




from Valeria Quochi
                                                          NF-AT positively regulates IL2


                                                                                                     regulate
                                                                                                                           regulation

                      Regulation
                                                                                                PredRegulate



                                                                                 Arg0Regulate                Arg1Regulate
      PositiveProtein                     NegativeProtein
        Regulation                          Regulation                               regulator               regulatee




    Transcription
                           regulates                  Protein                      NF-AT                                           IL2
       Factor
                                                                                                        regulates
                             isregulatedby


                                                                                        bio semantic entry
               bio event concept
                                                                                        predicative argument structure
               bio entity concept
                                                                                        bio semantic roles
                       bio relations
                                                                                                Bio-specific qualia relations
from Valeria Quochi
                                                                                            Activity
               SynBehaviour                                           Sense
                 Lesion1                                            Lesion1




               SubcatFrame                                          Predicate
                  pp-of                                             LESION




                          SynArg                              SemArg            Protein
                          Arg0                                 Arg0
                          pp-of                                 Pat


  The pattern “lesion of PROTEIN” is not stored in the lexicon, but it can be computed by accessing information scattered over
  various lexical objects (i.e. the syntactic unit lesion heads a pp-of, which corresponds to the patient argument and is
  restricted by the ontological node PROTEIN).
  All lexical items labelled as PROTEIN are candidates to fill this argument slot: lesion of OmpC,
  lesion of OmpR, etc. are all admitted instances of this “predicate”/pattern.
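
How such a pattern could be computed from the separate lexical objects can be sketched as follows. This is an illustration under stated assumptions, not the BioLexicon implementation: the class names, the `ONTOLOGY_CLASS` table (including the `mouse` entry) and the convention that a realisation such as `pp-of` names its preposition after the hyphen are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynArg:
    label: str        # e.g. "Arg0"
    realisation: str  # e.g. "pp-of" (assumed form: "pp-<preposition>")


@dataclass(frozen=True)
class SemArg:
    label: str        # e.g. "Arg0"
    role: str         # e.g. "Pat"
    restriction: str  # ontological node, e.g. "Protein"


@dataclass(frozen=True)
class Entry:
    lemma: str
    syn_args: tuple
    sem_args: tuple


LESION = Entry("lesion",
               (SynArg("Arg0", "pp-of"),),
               (SemArg("Arg0", "Pat", "Protein"),))

# Hypothetical ontological labelling of lexical items.
ONTOLOGY_CLASS = {"OmpC": "Protein", "OmpR": "Protein", "mouse": "Species"}


def linked_args(entry):
    """Pair each syntactic argument with the semantic argument of the same label."""
    sem_by_label = {arg.label: arg for arg in entry.sem_args}
    return [(syn, sem_by_label[syn.label])
            for syn in entry.syn_args if syn.label in sem_by_label]


def patterns(entry):
    """Abstract patterns such as 'lesion of PROTEIN'."""
    return [f"{entry.lemma} {syn.realisation.split('-', 1)[1]} {sem.restriction.upper()}"
            for syn, sem in linked_args(entry)]


def instances(entry, ontology_class):
    """Admitted instances: every item labelled with the restricting node."""
    return [f"{entry.lemma} {syn.realisation.split('-', 1)[1]} {item}"
            for syn, sem in linked_args(entry)
            for item, node in ontology_class.items() if node == sem.restriction]


print(patterns(LESION))
print(instances(LESION, ONTOLOGY_CLASS))
```

Nothing here stores the pattern itself; it is derived on demand from the syntactic frame, the predicate arguments and the ontology labels.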
                     Good mapping of Relations
        OBO Relations        Relations from Extended Qualia Structure

        isA                  is_a
        partOf               is_a_part_of
        hasPart              has_as_part
        GrainOf              …
        hasGrain             …
        componentOf          …
        hasComponent         …
        properPartOf         …
        hasProperPart        …
        locatedIn            …
        locationOf           …
        containedIn          …
        contains             contains
        adjacentTo           ?
        derivesFrom          derived_from
        precededBy           ?
        participatesIn       ?
        hasParticipant       ?
        agentOf              …
        hasAgent             ?
        functionOf           is_the_activity_of
        hasFunction          …
        instanceOf           …

        Qualia roles labelled in the figure: Formal, Constitutive, Agentive, Telic
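
A mapping of this kind can be held in a simple lookup table. The sketch below is illustrative only: it lists just the pairs stated explicitly on the slide, marks the relations shown with "?" as unmapped (`None`), and leaves out the rows elided with "…". The function name and the sample triples are assumptions made for the example.

```python
# OBO relation -> Extended Qualia relation, as far as the slide states it.
OBO_TO_QUALIA = {
    "isA": "is_a",
    "partOf": "is_a_part_of",
    "hasPart": "has_as_part",
    "contains": "contains",
    "derivesFrom": "derived_from",
    "functionOf": "is_the_activity_of",
    # Relations marked "?" on the slide: no qualia counterpart identified.
    "adjacentTo": None,
    "precededBy": None,
    "participatesIn": None,
    "hasParticipant": None,
    "hasAgent": None,
}


def translate(triples):
    """Split (source, obo_relation, target) triples into mapped and unmapped."""
    mapped, unmapped = [], []
    for source, relation, target in triples:
        qualia = OBO_TO_QUALIA.get(relation)
        if qualia is None:
            unmapped.append((source, relation, target))
        else:
            mapped.append((source, qualia, target))
    return mapped, unmapped


# Hypothetical sample triples, for illustration only.
sample = [("nucleus", "partOf", "cell"), ("nucleus", "adjacentTo", "cytoplasm")]
mapped, unmapped = translate(sample)
print(mapped)
print(unmapped)
```

Keeping the unmapped relations in a separate list makes the open questions of the mapping visible instead of silently dropping them.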

               Enhancing Semantic Relations

                       Source_Sense          Rel Type                    Target_Sense
                       Phosphoglycolate      BelongsToSpecies            Mouse




                                            BelongsToSpecies

                      phosphoglycolate                                     mouse




from Valeria Quochi
           How to link Bio-Ontology and Bio-Lexicon
              Place(s) of Semantics in BootStrep
              Bio-Ontology holds domain specific as well as general semantics
               (in terms of classes and relations between classes)
              Lexicon model comes with semantic layer based on linguistic ontology
               (SIMPLE-CLIPS Ontology)

      Questions:
       What is the relation between the bio-ontology and the linguistic ontology?
       Do they overlap? What is the overlap/intersection? What is the difference?
       Is a mapping possible? What could a mapping look like?



                                                                                ?
      Aim:
       Bringing lexical semantics and ontological semantics together


               the BioLexicon Model & Standards

                The Bio-Lexicon is based on the MILE metamodel
                and the more recent ISO proposal of a Lexical
                Markup Framework (LMF)
                Data Categories are drawn as far as possible from
                already existing repositories and standards (e.g.
                morphosyntactic data categories)
                There is a need, however, to define a set of Data
                Categories specific to the biology domain (e.g.
                semantic roles and relations)




                                                   ISO
                          Meta-model & Data Categories

               An ISO standard for NLP lexica
                   Definition of the Lexical Markup Framework, a general &
                    abstract meta-model & a set of structural nodes relevant for
                    linguistic description




               Objectives
                  Design of the abstract lexical meta-model
                  Definition of the common set of related Data Categories



  from Monica Monachini
                                          ISO - LMF
              Specifically designed to accommodate as many models of lexical
               representation as possible
              Its pros:
                   Meta-model: a high-level specification, ISO 24613
                   Data Category Registry: low-level specifications, ISO 12620
              Not a monolithic model, but rather a modular framework
                   The LMF library provides the hierarchy of lexical objects (with structural relations
                    among them)
                   The Data Category Registry provides a library of descriptors to encode linguistic
                    information associated with lexical objects (N.B. Data Categories can also be
                    user-defined)




                                ISO LMF –        Builds also on
                                                 EAGLES/ISLE
                        Lexical Markup Framework
   Structural
   skeleton, with the
                                                            Core Package
   basic hierarchy of
   information in a                                                                          Constraint Expression
   lexical entry
                                     Morphology



   + various                                                               NLP Syntax
   extensions;                                                                                       NLP Semantic


                            NLP Paradigm class


                                                              MRD
  LMF specs comply with UML modelling                                                   NLP Multilingual notations
  principles; an XML DTD
  allows implementation
                                     NLP MWE pattern




    LIRICS                                                    NICT Language-                           ICT
                                    NEDO                        Grid Service                      KYOTO
                                    Asian                        Ontology
                                    Lang.
               LMF: NLP Extension for Semantics




                                            Lexical Entry

               Diagram objects:
                   Lexical Entry: LE_protein
                   SyntacticBehaviour: SB_protein
                   Lemma: L_protein
                   Representation Frame: RF_protein (DC: writtenForm = protein)

               <LexicalEntry rdf:ID="LEprotein">
                 <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/>
                 <hasLemma>
                   <Lemma rdf:ID="L_protein">
                     <hasRepresentationFrame>
                       <RepresentationFrame rdf:ID="RF_protein"/>
                     </hasRepresentationFrame>
                   </Lemma>
                 </hasLemma>
               </LexicalEntry>




                             Event Representation
                          through SemanticPredicate


                                  SemanticPredicate
                                  SP_regulate




               SemanticArgument                        SemanticArgument
               SP_TF_protein                           SP_Target Gene
               DC: role=agent                          DC: role=patient




                                             Sense Representation
               Diagram objects:
                   Synset: activate
                   Sense: activate_2
                   PredicativeRepresentation
                   SemanticFeature: SF_chemistry, SF_process
                   Collocation
                   SemanticRelation: is_a: [SenseID]; Typical_of: [SenseID]; S_protein

               <Sense rdf:ID="activate_2">
                 <belongsToSynset rdf:resource="#activate"/>
                 <hasSemanticRelation rdf:resource="#is_a_1"/>
                 <hasSemanticRelation rdf:resource="#has_as_part_1"/>
                 <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
                 <hasSemanticFeature rdf:resource="#SF_chemistry"/>
                 <hasSemanticFeature rdf:resource="#SF_process"/>
               </Sense>


               Example of Semantic Relation
               Diagram objects:
                   Sense: S_cox15
                   SemanticRelation: Is_in
                   Sense: S_chromosome19

               <SemanticRelation rdf:ID="is_in">
                 <hasSourceSense>
                   <Sense rdf:ID="S_cox15">
                     <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_cox15</id>
                   </Sense>
                 </hasSourceSense>
                 <hasTargetSense>
                   <Sense rdf:ID="S_chromosome19">
                     <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_chromosome19</id>
                   </Sense>
                 </hasTargetSense>
                 <relationName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">is_in</relationName>
               </SemanticRelation>
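
A relation encoded this way can be read back as a (source sense, relation name, target sense) triple. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the enclosing `rdf:RDF` element and the default namespace `http://example.org/biolexicon#` are assumptions added so that the snippet parses on its own, since the slide does not show the namespace declarations.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LEX_NS = "http://example.org/biolexicon#"  # hypothetical vocabulary namespace
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

SNIPPET = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns="{LEX_NS}">
  <SemanticRelation rdf:ID="is_in">
    <hasSourceSense>
      <Sense rdf:ID="S_cox15">
        <id rdf:datatype="{XSD_STRING}">S_cox15</id>
      </Sense>
    </hasSourceSense>
    <hasTargetSense>
      <Sense rdf:ID="S_chromosome19">
        <id rdf:datatype="{XSD_STRING}">S_chromosome19</id>
      </Sense>
    </hasTargetSense>
    <relationName rdf:datatype="{XSD_STRING}">is_in</relationName>
  </SemanticRelation>
</rdf:RDF>
"""


def relation_triples(xml_text):
    """Return (source sense ID, relation name, target sense ID) triples."""
    ns = {"lex": LEX_NS}
    rdf_id = f"{{{RDF_NS}}}ID"  # ElementTree spells rdf:ID as {namespace}ID
    triples = []
    for relation in ET.fromstring(xml_text).findall("lex:SemanticRelation", ns):
        source = relation.find("lex:hasSourceSense/lex:Sense", ns)
        target = relation.find("lex:hasTargetSense/lex:Sense", ns)
        name = relation.findtext("lex:relationName", namespaces=ns)
        triples.append((source.get(rdf_id), name, target.get(rdf_id)))
    return triples


print(relation_triples(SNIPPET))
```

A dedicated RDF library would be the more robust choice for real data; plain XML parsing is used here only to keep the sketch dependency-free.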




    Example: How to encode Wordnet type of Info in LMF

                    : Lemma                 : Lexical Entry                           : Lemma                : Lexical Entry
               wordForm = oak tree        partOfSpeech = noun                  wordForm = oak             partOfSpeech = noun

                                                 : Sense                                   : Sense                : Sense
                                               id = oak_tree0                              id = oak0              id = oak2

                                                 : Synset                        : Synset Relation                       : Synset
                                               id = 12100067                  label = substanceHolonym                id = 12100739

                            : Semantic Definition
                 text = a deciduous tree of the genus Quercus

                                                                                                                         : Semantic Definition
                 : Statement                                  : Statement
                                                                                                                text = the hard durable wood of any oak
     text = has acorns and lobed leaves        text = great oaks grow from little acorns

                                                                                                                          : Statement
                                                                                                         text = used especially for furniture and flooring


               XML based Abstract Lexicon Interchange Format
                                          Mapping exercise

            Entries from existing lexicons have been mapped to LMF to prove that the
            model is able to represent many best practices and achieve unification

            Major best practices:
                        OLIF
                        PAROLE/SIMPLE
                        LC-Star
                        WordNet - EuroWordNet
                        FrameNet
                        BDef formal database of lexicographic definitions derived from Explanatory
                        Dictionary of Contemporary French
                        …
                        …others on the way…
from Monica Monachini
                            Lexical WEB &
                Content Interoperability  “Standards”
       As a critical step for semantic mark-up in the SemWeb


               NomLex                                              WordNets
                                               WordNets

           ComLex                                          WordNets                  with
                                                                                  intelligent
                                                                                    agents
           SIMPLE                          LMF
                                                                          Lex_x
                        FrameNet

                                                                  Lex_y
    Standards for
   Interoperability                        Enough?
                                              ?
                         Need for tools to make this vision
                             operational & concrete
      New prototype “LeXFlow”:

              web-based collaborative environment for semi-automatic
               management/integration of lexical resources
                   enabling interoperability of distributed lexical resources
                   accessed by different types of agents
              addressing semi-automatic integration of computational lexicons, with focus
               on linking and cross-lingual enrichment of distributed LRs
                   Case-study: cross-fertilization between Italian and Chinese WordNets

              From Language Resources
                                                                      To Language Services

                           Our WN case study

                  ItalWordNet (Roventini et al., 2003)
                  Academia Sinica Bilingual Ontological WordNet
                   (Sinica BOW, Huang et al., 2004)
                  Both connected to Princeton WordNet (although to
                   different versions)
                  Same set of semantic relations (EWN ones)




          Architecture for cooperative integration of lexicons
                                                     Agent Role3

                         Agent Role1                                                Agent Role4

                                                     Agent Role2

                                                                                                  Coordination

                                              Web service Interface



                   Simple-Wordnet
                                                                    MultiWordnet
                                                                                                  Application
                  Relation Calculator
                                                                  Relation Calculator


                                              Web service Interface


               Italian              Italian         ILI       Relation        Chinese              Data
               Simple              Wordnet        Mapper      Mapper          Wordnet


                   Basic assumptions behind MWN …

                  Interlingual level:
                       An interlingua provides an indirect linkage between different
                        WordNets: the Interlingual Index (ILI), an unstructured
                        version of WordNet used in EuroWordNet
                       Each synset in a wordnet WN_A is linked to at least one record of the
                        ILI by means of a set of relations (eq_synonym, eq_near_synonym,
                        …)
                  Synset correspondence:
                       If a synset S_A (in WN_A) and a synset S_B (in WN_B) point to the same ILI record, they
                        correspond to each other
                  Relation correspondence:
                       If a relation holds between two synsets in WN_A,
                        the same relation holds between the corresponding synsets in WN_B
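
The two correspondence assumptions can be sketched in a few lines. This is an illustration, not the actual ILI Mapper or Relation Mapper services: the function names, the synset and ILI identifiers, and the relation label `has_meronym` are hypothetical toy values.

```python
from collections import defaultdict


def corresponding_synsets(ili_links_a, ili_links_b):
    """Synset correspondence: pairs (S_A, S_B) that point to the same ILI record.

    Both arguments map a synset ID to the set of ILI records it is linked to.
    """
    synsets_b_by_ili = defaultdict(set)
    for synset_b, records in ili_links_b.items():
        for record in records:
            synsets_b_by_ili[record].add(synset_b)
    return {(synset_a, synset_b)
            for synset_a, records in ili_links_a.items()
            for record in records
            for synset_b in synsets_b_by_ili.get(record, ())}


def project_relations(relations_a, pairs):
    """Relation correspondence: carry WN_A relations over to WN_B.

    The results are candidate relations; they may confirm existing WN_B
    relations or propose new ones.
    """
    counterparts = defaultdict(set)
    for synset_a, synset_b in pairs:
        counterparts[synset_a].add(synset_b)
    return {(source_b, relation, target_b)
            for source_a, relation, target_a in relations_a
            for source_b in counterparts.get(source_a, ())
            for target_b in counterparts.get(target_a, ())}


# Toy data with hypothetical identifiers.
wn_a = {"it:strada": {"ili:road"}, "it:carreggiata": {"ili:roadway"}}
wn_b = {"zh:dao_lu": {"ili:road"}, "zh:che_dao": {"ili:roadway"}}
relations_a = {("it:strada", "has_meronym", "it:carreggiata")}

pairs = corresponding_synsets(wn_a, wn_b)
print(sorted(pairs))
print(project_relations(relations_a, pairs))
```

Because a synset may link to several ILI records, one synset in WN_A can have several counterparts in WN_B, so every projected relation is best treated as a candidate to be validated.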


                              hypernymy/HYP
                                                                        parte, tratto
                                                                              N#12348
                                                                                          A new proposed mero relation
     passaggio,
     strada,via                    meronymy/MPT                                                                curvatura,
          N#1290
                                                                                                              svolta,curva
                                                                                                                     N#20944
                                       hyponymy/HPO
           Synonym




                                                                                        carreggiata                               Derived
                                                                                             N#21225




      ILI1.5-3001757-n                  ILI1.5-5691718-n           ILI1.5-2857000-n       ILI1.5-3002522-n   ILI1.5-8488101-n
      road,route                        stretch                    passage                roadway            bend,crook,turn
      ILI1.6-3243979-n                  ILI1.6-???                 ILI1.6-3092396-n       ILI1.6-3245327-n   ILI1.6-9992072-n
           Synonym




                                                                   tong_dao                                                     Reinforcement
                                                                    (通道 )                                                       & validity
                                                                    N#03092396
                               上位(泛稱)詞_為
                                                                                          che_dao
                               /HYP
    dao_lu,dao,lu                                                                         (車道 )
                                                                                            N#3245327
  (道路,道,路)                    下位(特指)詞_為
        N#03243979                       /HPO
                                                                                                                    wan
                                        部件_部份詞_為                                                                   (彎 )
                                                                                                                  N#9992072
                                        /MPT
                                             00403772-v
                         HYP
    00001533-v
        吸
                             HPO                00407124-v


                                                                            eq_syn

                        HYP
                                                                                       CAU
    eq_syn

                                                    eq_syn
                                                                                                       00462055-a
     00406975-v                                                                00403772-v
                                  01513366-v                 00407124-v                             Respective, several,
  Absorb, assimilate                                                        acquire_knowledge
                                 receive, have                 imbibe                                    various
   Ingest, take_in                                                             00335115-v
                                  01260836-v                 00338343-v                                00364361-a
     00338206-v
                               eq_near_syn                                  eq_near_syn
    eq_syn
                                                        has_hyponym
                                                                                 V#32925
                                   V#39802                                studiare_3, imparare_1,         eq_syn
                                   prendere_3
                                                                               apprendere_2

                             has_hyperonym            has_hyperonym

                                                                                                        AG#42011
         V#32080
                                                                                                        relativo_4
assimilare_5, assorbire_3,                                      causes
 accettare_2, recepire_1
                                                                                   Derived
                        For a Global WordNet Grid
          This architecture for making distributed wordnets interoperable lends itself to
           different applications in LR processing:
                  Enrichment of existing lexical resources
                  Creation of new resources
                  Validation of existing resources

          Can provide a platform for cooperative & collective creation & management of
           LRs, by providing a web-based environment for the collaboration & interaction of
           distributed agents and resources

      Can be seen as the
          Prototype of a web application supporting the Global WordNet Grid initiative, i.e. a
           shared multilingual knowledge base for cross-lingual processing based on distributed
           resources over the Grid
                                                                                 New project:
                                                                                  KYOTO
      Distributed, diverse & dynamic data                     Environmental organizations

                                                          1

                                                                                                                  Citizens
                                                                                  4                               Governments
                                                                          maintain                                Companies
                                                                          terms & concepts
                                                              Wikyoto

 Capture text:                                       Wordnets          Ontology
 "Sudden increase of
 CO2 emissions in
                         2
 2008 in Europe"                                                                                       Top
                                                                   Abstract Physical
                   Tybot: term yielding robot
                                                                    Process            Substance
                             3
                                 CO2 emission                                                          Middle
                                                                                      H2O   CO2

                                                                H2O       CO2     Greenhouse
                                                                                                       Domain
                                                               Pollution Emission     Gas

                   Kybot: knowledge yielding robot
                                 Index facts:
                             5   Process:    Emission                                              6         Semantic
                                 Involves:   CO2                       Text & Fact Index
                                 Property:   increase, sudden
                                                                                                              Search
from Piek Vossen                 When:       2008
                                 Where:      Europe
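The fact index and the semantic search step in the picture above can be illustrated with a minimal Python sketch. This is not the KYOTO implementation: the `PARENT` table, the `ancestors` and `search` helpers, and the dictionary layout of a fact are hypothetical, and the child-to-parent links are an assumed reading of the ontology drawing (for instance, that CO2 is a Greenhouse Gas and that a Greenhouse Gas is a Substance).

```python
# Assumed child -> parent links, read loosely off the ontology sketch on the slide.
PARENT = {
    "CO2": "Greenhouse Gas",
    "Greenhouse Gas": "Substance",
    "H2O": "Substance",
    "Emission": "Process",
    "Pollution": "Process",
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors in PARENT."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# The fact indexed on the slide for the text
# "Sudden increase of CO2 emissions in 2008 in Europe".
FACTS = [
    {
        "process": "Emission",
        "involves": "CO2",
        "property": ["increase", "sudden"],
        "when": "2008",
        "where": "Europe",
    }
]


def search(concept, facts):
    """Return facts whose process or participant is the concept or a descendant of it."""
    return [
        fact
        for fact in facts
        if concept in ancestors(fact["process"])
        or concept in ancestors(fact["involves"])
    ]


hits = search("Greenhouse Gas", FACTS)
print(len(hits), hits[0]["where"])
```

A query for the broader concept "Greenhouse Gas" retrieves the CO2 fact because the lookup walks up the concept hierarchy, which is the point of indexing facts against an ontology rather than against surface strings.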
          Wordnet                 ontology
                                                      TEXT


                                                      Linear            Discourse
   Domain Wordnet          domain ontology
                                                       DAF              Annotation
          LMF API               OWL API
                                                      Linear            Morphological
                                                       MAF               Annotation     Language
                    Domain Terms                                                        Specific
                                                     Linear              Syntactic
                                                     SYNAF              Annotation
                      Generic
                       TMF
                                                     Linear             Semantic
                       Term                         SEMAF               Annotation
Language
                     Extraction
Neutral
                      (Tybot)                        Linear               Fact          Language
                                                    Generic             Extraction      Neutral &
                                                    FACTAF               (Kybot)        Specific

from Piek Vossen
                           System components
              Wikyoto = wiki environment for a social group:
                  to model the terms and concepts of a domain and agree on their meaning,
                   within group, across languages and cultures
                  to define the types of knowledge and facts of interest
              Tybots = Term yielding robots, extract term data from a text
               corpus
              Kybots = Knowledge yielding robots, extract facts from a text
               corpus
              Linguistic processors:
                  tokenizers, segmentizers, taggers, grammars
                  named entity recognition
                  word sense disambiguation
                  generate a layered text annotation in Kyoto Annotation Format (KAF)


from Piek Vossen
                              KYOTO SYSTEM
                                                  Linear
                                               SYNAF/SEMAF


                   Term extraction                                          Semantic annotation
                       (Tybot)            Generic           Linear
                                           TMF             SEMAF


                              Domain editing                             Fact extraction
                                (Wikyoto)                                   (Kybot)

 Concept                                                                                   Fact
                          LMF API           OWL API                          Linear
  User                                                                                     User
                                                                            Generic
                      Domain Wordnet    Domain ontology                     FACTAF

                          Wordnet           Ontology

from Piek Vossen
                               Fact mining by Kybots
                                                                                        Morpho-syntactic analysis
                                                                                        [[the emission]NP
        Source                                   Linguistic
                                                                                         [of greenhouse gases]PP
       Documents                                Processors
                                                                                         [in agricultural areas]PP] NP

                           Ontology             Logical          Wordnets &
                                                Expressions Linguistic Expressions
                                                  Generic
                   Abstract     Physical
                                                                                 Fact analysis
                                            Patient
        Process                 Substance                                         [[the emission]NP ] Process: e1
                                                                                   [of greenhouse gases]PP Patient: s2
       Chemical               H2O     CO2                                          [in agricultural areas]PP] Location: a3
       Reaction
                                                      Domain
                                      Patient
 CO2           water
 emission      pollution


from Piek Vossen
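The step from morpho-syntactic analysis to fact analysis shown above can be sketched in Python. This is a toy illustration, not the actual Kybot machinery: the `Chunk` class, the `ROLE_RULES` table and the `analyse` function are hypothetical names, and the rules (the NP denotes the Process, an "of" PP its Patient, an "in" PP its Location) are an assumption generalised from the single example on the slide.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    label: str  # chunk category, "NP" or "PP"


# Hypothetical rules: (chunk label, first word or None for any, role).
ROLE_RULES = [
    ("NP", None, "Process"),
    ("PP", "of", "Patient"),
    ("PP", "in", "Location"),
]


def analyse(chunks):
    """Assign a role to each chunk that matches a rule; skip the others."""
    facts = []
    for chunk in chunks:
        first_word = chunk.text.split()[0].lower()
        for label, word, role in ROLE_RULES:
            if chunk.label == label and (word is None or word == first_word):
                facts.append((role, chunk.text))
                break
    return facts


chunks = [
    Chunk("the emission", "NP"),
    Chunk("of greenhouse gases", "PP"),
    Chunk("in agricultural areas", "PP"),
]
facts = analyse(chunks)
for role, text in facts:
    print(f"{role}: {text}")
```

In the slide the roles come from ontology and wordnet knowledge (emission is a Process, gases are a Substance); the fixed preposition rules here only stand in for that lookup.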
                      Contribution of KYOTO
            • Hundreds of thousands of sources in the environment domain
            • in many different languages
            • spread all over the world
            • changing every day

            • KYOTO learns terms and concepts from text documents,
              enables a Web 2.0 environment for community based control,
              delivers semantic search and fact extraction
            • Stored as structures that people and computers understand
              Software can partially understand language and exploit web data
              Connects people across language and cultures
            • Establish consensus and knowledge transition
              Understanding is helped by the terms and concepts defined for each language


                     html pdf

                        xls                                                               KYBOT
                                                                                        environment
                                                                                           facts



                                                                          WIKYOTO
                                                    Wordnet
                                          Wordnet environment
                                                                                      Wordnet
                                        environment terms
                                                                            Wordnet environment
                                TYBOT      terms              Ontology    environment terms
                                                            environment      terms
                                                              concepts


from Piek Vossen


       A common representation format:
       WordNet - LMF                                                                                     Data
                                                                                                       Categories
                                                                  LexicalResource

                                   1..1                    1..*                                                              0..1
                  GlobalInformation                  Lexicon                                                         SenseAxes

                           1..*                                                  0..*                                       1..*
                                          0..1
               LexicalEntry                         Meta                          Synset                              SenseAxis

           1..1                            0..*            0..1                          0..1             0..1                  0..1
       Lemma                         Sense                 Definition        SynsetRelations    Monolingual        Interlingual
                                                                                                ExternalRefs       ExternalRefs
                                            0..1                  0..*                  1..*                1..*              1..*
                                  Monolingual                                                      Monolingual      Interlingual
                                                           Statement         SynsetRelation
                                  ExternalRefs                                                     ExternalRef      ExternalRef
                                             1..*                                0..1                   0..1             0..1
                                  Monolingual
                                  ExternalRef                                Meta                    Meta              Meta
                                            0..1

                                      Meta


   from Monica Monachini
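Part of the WordNet-LMF structure above can be mirrored in a few Python dataclasses. This is a partial sketch under stated assumptions, not the WordNet-LMF specification: only LexicalEntry, Lemma, Sense, Synset and Definition are modelled, with the cardinalities as read from the diagram, while the `language` field, the `synset_id` link from a sense to its synset and the `lemmas_of` helper are hypothetical additions for illustration. The sample data is the IWN synset <fuoco_1, fiamma_1> 00001251-n shown on the cross-lingual slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sense:
    id: str
    synset_id: Optional[str] = None  # assumed link from a sense to its synset


@dataclass
class LexicalEntry:
    lemma: str                                          # Lemma, 1..1
    senses: List[Sense] = field(default_factory=list)   # Sense, 0..*


@dataclass
class Synset:
    id: str
    definition: Optional[str] = None                    # Definition, 0..1


@dataclass
class Lexicon:
    language: str                                       # assumed attribute
    entries: List[LexicalEntry]                         # LexicalEntry, 1..*
    synsets: List[Synset] = field(default_factory=list) # Synset, 0..*

    def __post_init__(self):
        if not self.entries:
            raise ValueError("a Lexicon requires at least one LexicalEntry")

    def lemmas_of(self, synset_id):
        """Return the lemmas of all entries with a sense in the given synset."""
        return [
            entry.lemma
            for entry in self.entries
            for sense in entry.senses
            if sense.synset_id == synset_id
        ]


italian = Lexicon(
    language="it",
    entries=[
        LexicalEntry("fuoco", [Sense("fuoco_1", "00001251-n")]),
        LexicalEntry("fiamma", [Sense("fiamma_1", "00001251-n")]),
    ],
    synsets=[Synset("00001251-n")],
)
print(italian.lemmas_of("00001251-n"))
```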
                        Centralized WordNet DC Registry

    A list of 85 semantic relations as a result
     of a mapping of the KYOTO WordNet grid
                                        Intra-WN          Inter-WN




from Monica Monachini
                                        WordNet-LMF multilingual level - Cross-lingual synset relations
   <!ELEMENT SenseAxes (SenseAxis+)>
   <!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
   <!ATTLIST SenseAxis
       id ID #REQUIRED
       relType CDATA #REQUIRED>
   <!ELEMENT Target EMPTY>
   <!ATTLIST Target
       ID CDATA #REQUIRED>
   <!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
   <!ELEMENT InterlingualExternalRef (Meta?)>
   <!ATTLIST InterlingualExternalRef
       externalSystem CDATA #REQUIRED
       externalReference CDATA #REQUIRED
       relType (at|plus|equal) #IMPLIED>

   SenseAxis groups monolingual synsets corresponding to each other
    and sharing the same relations to English
   relType specifies the type of correspondence
   InterlingualExternalRefs: link to ontology/(ies)

   IWN      <fuoco_1, fiamma_1>             00001251-n
   SWN      <fuego_3, llama_1>              09686541-n
   WN3.0    <fire_1 flame_1 flaming_1>      13480848-n
from Monica Monachini
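A SenseAxis instance following the DTD fragment above can be produced with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions, not output of the KYOTO tools: the `build_sense_axis` helper, the axis identifier `sa_1` and the `eq_syn` value for `relType` are hypothetical (the DTD only requires CDATA there), and the two targets are the IWN and SWN synset identifiers from the slide, used as they are printed.

```python
import xml.etree.ElementTree as ET

# Values enumerated for relType on InterlingualExternalRef in the DTD fragment.
EXTERNAL_REL_TYPES = {"at", "plus", "equal"}


def build_sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build one SenseAxis: Target+ followed by optional InterlingualExternalRefs."""
    if not target_ids:
        raise ValueError("SenseAxis needs at least one Target (Target+)")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ext_rel_type in external_refs:
            if ext_rel_type not in EXTERNAL_REL_TYPES:
                raise ValueError("relType must be one of at, plus, equal")
            ET.SubElement(refs, "InterlingualExternalRef", {
                "externalSystem": system,
                "externalReference": reference,
                "relType": ext_rel_type,
            })
    return axis


sense_axes = ET.Element("SenseAxes")
sense_axes.append(build_sense_axis(
    axis_id="sa_1",                            # hypothetical identifier
    rel_type="eq_syn",                         # assumed value
    target_ids=["00001251-n", "09686541-n"],   # IWN and SWN synsets from the slide
))
print(ET.tostring(sense_axes, encoding="unicode"))
```

The helper always writes `relType` on an external reference, although the DTD declares it `#IMPLIED` (optional); that is a simplification of this sketch.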
                                     Ultimate goal
                  Global standardization and anchoring of meaning such that:
                       Machines can start to approach text understanding -> semantic web
                        connects to the current web
                       Communities can dynamically maintain knowledge, concepts and
                        their terms in an easy to use system
                       Cross-linguistic and cross-cultural sharing and communication of
                        knowledge is enabled


                  Comparable to a formalization of Wikipedia for humans AND
                   machines across languages




from Piek Vossen
               Some steps for a “new generation” of LRs
              From huge efforts in building static, large-scale, general-
               purpose LRs
               To non-static LRs rapidly built on-demand, tailored to specific
               user needs
              From closed, locally developed and centralized resources
               To LRs residing over distributed places, accessible on the
               web, choreographed by agents acting over them

                     From Language Resources
                      To Language Services
                   Distributed Language Services
               A long-term scenario implying
                       content interoperability standards,
                       supra-national cooperation and
                       development of architectures enabling accessibility


                   Create new resources on the basis of existing ones
                   Exchange and integrate information across repositories       Language Grid
                   Compose new services on demand


                   Collaborative & collective/social development and
                   validation, cross-resource integration and exchange of       Wiki
                   information


                                             In the “Semantic Web”
                                                            vision ...
   …need to tackle the twofold challenge of
       content availability &
       multilinguality


    Natural convergence with HLT:
           •multilingual semantic processing
           •ontologies
           •semantic-syntactic computational lexicons
               Language Tech … & …
                Knowledge, Content
                                                              Ready??
                                                                 ?

                              Knowledge Markup


                                       How to                  LT & LRs
               Semantic Web          cooperate??




                        Content Interoperable LRs & LT

       LR and the future of LT or Content Tech
               The need for ever-growing and richer LRs for effective multilingual content processing
               requires a change of paradigm & the design of a new generation of LRs,
               based on open content interoperability standards
               The Semantic Web notion may be used to shape the LRs of the future, in the vision
               of an open space of sharable knowledge available on the Web for processing
               Making available millions of “richly annotated words” for dozens of
               languages is not affordable by any single group
               This objective can only be achieved by creating integrated Open and Distributed
               Linguistic Infrastructures
               Participation need not be limited to linguistic experts: designers, developers,
               users of content encoding practices, etc. may also take part, in wiki mode


        Is the LR/LT field mature enough to broaden and open itself to the concept of
                     a cooperative effort of different sets of communities?
        Could a sort of large “Language Genome” initiative be effective? Storing lots of
                                     (annotated) facts
               Today, many signs of vitality & success… for LRs
         In Spoken, Written, Multimodal areas … … in new emerging areas
         Statistical approaches…
         Different dimensions & layers: Content (Ontologies), Emotion, Time, …
         For Evaluation
         For Training
         …

                  LREC (> 900 submissions); many LRs at COLING and even at ACL!!
                  ELRA (self-sustaining) & LDC
                  LRE (new Journal: N. Ide & NC)
                  ISO-TC37-SC4/WG4 (International Standards for LRs)
                  AFNLP…
                  FLaReNet
                  ESFRI - CLARIN (also political & strategic role)
                  New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …

                              BUT … an important point

       In the ’90s
              There was a global vision of the field & its main components:
                    Standards
                    Creation of LRs
                    Distribution (ELRA, LDC)
       Then:
                    Automatic acquisition
              … towards the Infrastructure of LRs & LT


    While today:
                   There is an ever-increasing set of initiatives for new LRs, basic robust
                    technologies, models??, algorithms, …
        We have an LR community culture
        BUT it is rather scattered and opportunistic, without much coherence

                                      Today …
      The wealth of data & of basic technologies is such that:
               We should reflect again on the field as a whole & ask if

                   Standards                 ->  Content interoperability
                   Creation of LRs           ->  Collaborative creation & management
                   Automatic acquisition     ->  Dynamic LRs
                   Distribution              ->  Sharing

      are still “the” important components,
      or how they have changed/must change


                                               … Which new challenges towards a
                                   new & more mature infrastructure of LRs & LTs??
                         These dimensions

           Content interoperability

           Collaborative creation & Manag.                   Need more


           Dynamic LRs        Technology exists


         Sharing
             +
         Distributed architectures/infrastr

                                                    could be at the basis of a
                                                 new Paradigm for LRs & LT
                                                  & of a new Infrastructure ??
    Many dimensions around the notion of language

    We need to put together                                                finally
     technical,
     organisational,
                                                             Two new European Infrastructural &
     strategic,                                              Networking Initiatives
     economic,
     political issues of LRs



                   Political issues
        e.g. a commonly agreed list of minimal
      requirements for “national” LRs: BLARK

      Economic,                Cultural issues
      social issues                   Language … and cultural identity
               Applications          Language … and the Humanities
               Services                                                   Technical issues
  Technologies exist, but the infrastructure that puts them
  together and sustains them is still missing
                                                                                      for
    Which Communities?                                               Humanities
                                        core                         Social Sciences
                            Language Resources                      Digital Libraries
                                                                     Cultural Heritage
                   Enabling Language                                …
                   infrastr Technologies
                            Standardisation                                     CLARIN
                                                                                 ResInfra
                                                                             FLaReNet
                       on                                                     Network
              Grid
              Semantic Web                                     Focus on cooperation
              Ontologists
              ICT                          Many application domains
              …
                                          for
                                              (eculture, egovernment, ehealth, …)
  ESFRI Research Infrastructures

                                      CLARIN
     Common Language Resources and Technologies Infrastructure
              for the Humanities & Social Sciences
    Large-scale pan-European collaborative effort (31+ countries)
         Make LRs & LTs available & readily usable to scholars of humanities &
          social sciences (& all disciplines)
         Need to overcome the present fragmented situation by harmonising
          structural and terminological differences
         Basis is a Grid-type infrastructure and Semantic Web technology
         The benefits of computer enhanced language processing become available only
          when a critical mass of coordinated effort is invested in building an
          enabling infrastructure, which can provide services in the form of provision
          of tools & resources as well as training & counseling across a wide span of
          domains
         The infrastructure will be based on a number of resource, service and expertise
          centres
                                      CLARIN Mission


              Create a comprehensive and free to use distributed archive of LRs &
               LTs covering not only the languages of all member states, but also other
               languages studied and used in Europe

              By making tools & resources interoperable across languages &
               domains, contribute to preserving and supporting the
               multilingual & multicultural European heritage

              An operational open infrastructure of web services will introduce a
               new paradigm of distributed collaborative development

              Allow many contributors to add all kinds of new services based on
               existing ones, thus ensuring reusability and allowing scaling up to suit
               individual needs


               How can we tackle these challenges?

          J. Taylor
          “eScience is about global collaboration in
          key areas of science and the next generation
          of infrastructures that will enable it”

          Need to build new types of platforms
              to allow researchers to combine existing
               resources easily into new ones to tackle the
               big challenges
              to increase the productivity of all
               interested researchers, since currently
               too much time is wasted by preparatory
               work

from P. Wittenburg
   eScience Vision

    CLARIN establishes such a new generation of extended infrastructure

    Thus CLARIN is not about creating and building new language resources
       and technology, but
     making them available and accessible
     as services
     in a stable and persistent infrastructure
    to allow tackling the great challenges

        CLARIN:              http://www.clarin.eu
        Grid Project:        http://www.mpi.nl/dam-lr
        ISO TC37/SC4:        http://www.tc37sc4.org
        Standards Project:   http://lirics.loria.fr/


from P. Wittenburg
         We still have a long path ahead …
                                         & also a “new project”

    in an e-Contentplus Call for a:
         “Thematic Network on Language Resources”:


                                    FLaReNet
               To provide common recommendations (to the EC)
               for future actions
               To give priorities
               Need for “visions”

               In a global context, in cooperation with CLARIN
                         & also with non-EU members
  LRs & LTs exist, but a global vision, policy and strategy
  is still missing
  Which Communities?
      core: Language Resources, Language Technologies, Standardisation,
       Ontologists, Content
      for: Humanities, Social Sciences, Digital Libraries, Cultural Heritage, …
      for: EC, Funding agencies, …
      EU Forum; CLARIN ResInf; FLaReNet Network
      Focus on cooperation
      Many application domains (eculture, egovernment, ehealth, intelligence,
       domotics, content industry, …)
               Fostering Language Resources Network
                          (e-Contentplus)

                      http://www.flarenet.eu

     A new European Network for Language Resources

                      Nicoletta Calzolari (coord.)
                      glottolo@ilc.cnr.it
                             FLaReNet
               Fostering Language Resources Network
                                                     http://www.flarenet.eu/
A European forum
        to facilitate interaction among LR stakeholders
The Network structure considers that LRs present various dimensions and
    must be approached from many perspectives:
        technical, but also
        organisational
        economic
        legal
        political
Also addresses
        multicultural and multilingual aspects, essential for
        accessing and using digital content in today’s Europe



               Organised in Thematic Working Groups

A layered structure, with leading experts & groups (national and European institutions,
     SMEs, large companies) for all relevant LR areas (about 40 partners)
               in collaboration with CLARIN
               to ensure coherence of LR-related efforts in Europe
FLaReNet will
       consolidate existing knowledge, presenting it analytically and visibly
       contribute to structuring the area of LRs of the future by discussing new
       strategies to:
               convert existing and experimental technologies related to LRs into
               useful economic and societal benefits
               integrate solutions that are so far only partial into broader infrastructures
               consolidate areas mature enough for recommendation of best
               practices
               anticipate the needs of new types of LRs

                             Thematic Areas
              The Chart for the area of LRs in its different dimensions
              Methods and models for LR building, reuse, interlinking and
               maintenance
              Harmonisation of formats and standards
              Definition of evaluation protocols and evaluation procedures
              Methods for the automatic construction and processing of LRs



       To build together:

              Evolving RoadMap
              Blueprint of actions and infrastructures


                     Objectives & expected results
The largest Network of LR and HLT players, with diverse approaches, efforts and
    technologies
               Enable progress toward community consensus
               Give an extended picture of LRs & recast their definition in the light of recent
               scientific, methodological, technological and social developments
               Consolidate methods & approaches, common practices, frameworks and
               architectures
               A “roadmap” identifying areas where consensus has been achieved or is emerging
               vs. areas where additional discussion and testing are required, together with an
               indication of priorities
               Recommendations in the form of a plan of coherent actions for the EU and
               national organizations
               A European model for the LRs of the coming years


                                                                                  Ambitious!
                       Outcomes of FLaReNet
  The outcomes will be of a directive nature
              to help the EC and national funding agencies identify priority
               areas of LRs of major public interest that need public
               funding to develop or improve


  A blueprint of actions will constitute input to policy development
     both at EU and national level
              for identifying new language policies that support linguistic diversity
               in Europe
              in combination with strengthening the language product market,
               e.g. for new products & innovative services, especially for less
               technologically advanced languages


                 These Initiatives, … together


              call for international cooperation, also outside
               Europe


          and will be relevant for
              setting up a worldwide Forum of
               Language Resources and Language
               Technologies

