      Language Technology and the Semantic Web
 Thierry Declerck & Paul Buitelaar
 (Saarland University & DFKI GmbH)
 We present collaborative research work on the combination of
 language technology (LT) and technologies for encoding
 (domain) knowledge in ontologies, supporting the emergence of
 the Semantic Web (SW), or maybe more appropriate: Semantic
    MUMIS (dealing with multimedia content indexing and
 searching in the soccer domain, finished in December 2002)
    MuchMore (dealing with cross-lingual information retrieval
 in the medical domain, finished in May 2003)
    Esperonto (developing a Semantic Annotation Service for
 upgrading the current Web to the Semantic Web, Sept. 2002 -
 May 2005)
T. Declerck, P. Buitelaar                                     2
 Semantic Web Applications of LT

      Supporting accurate ontology-based semantic
    annotation of multilingual web documents
    (Knowledge Markup)
      Supporting Ontology Learning/Construction
    from linguistically/semantically annotated
    multilingual text (Knowledge Extraction)
    See also the Special Interest Group (SIG-5) OntoWeb-lt on
    Language Technology in Ontology Development and Use:

Knowledge Markup and Knowledge Extraction

[Figure: processing layers for knowledge markup and knowledge extraction]
  Text/Speech Mining: Concepts, Relations, Events
  Linguistic Analysis: Morpho-Syntactic Analysis and Tagging,
    Semantic Class Tagging, Term/NE Recognition,
    Grammatical Function Tagging, Dependency Structure Analysis

Knowledge Markup and Knowledge Extraction (2)

[Figure: processing layers, extended to media]
  Text/Speech/Media Mining: Concepts, Relations, Events
  Linguistic and Media Analysis

    Integration of Language Technology
          and Domain Knowledge

     Linguistic Analysis
Language technology tools are needed to support the upgrade of
the current Web to the Semantic Web (SW) by providing an
automatic analysis of the linguistic structure of textual
documents. Free text documents undergoing linguistic analysis
become available as semi-structured documents, from which
meaningful units can be extracted automatically (information
extraction) and organized through clustering or classification
(text mining). Here we focus on the following linguistic analysis
steps that underlie the extraction tasks: morphological analysis,
part-of-speech tagging, chunking, dependency structure
analysis, semantic tagging.

     Morphological Analysis

Morphological analysis is concerned with the inflectional,
derivational, and compounding processes in word formation in
order to determine properties such as stem and inflectional
information. Together with part-of-speech (PoS) information this
process delivers the morpho-syntactic properties of a word.

While processing the German word Häusern (houses) the
following morphological information should be analysed:


     Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of determining the
correct syntactic class (a part-of-speech, e.g. noun, verb, etc.) for
a particular word given its current context. The word “works” in
the following sentences will be either a verb or a noun:

  He works [N,V] the whole day for nothing.
  His works [N,V] have all been sold abroad.

PoS tagging involves disambiguation between multiple part-of-
speech tags, next to guessing of the correct part-of-speech tag for
unknown words on the basis of context information.
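
The disambiguation of "works" in the two sentences above can be illustrated with a deliberately tiny rule-based sketch. The word lists, the two context rules and the function name `tag_works` are assumptions made for illustration only; real taggers rely on statistical models over much richer context.

```python
# Toy contextual disambiguation for the ambiguous word "works" (N or V).
# The two rules below are illustrative assumptions, not a real tagger.
SUBJECT_PRONOUNS = {"he", "she", "it"}
POSSESSIVES = {"his", "her", "its", "their", "my", "your", "our"}


def tag_works(tokens):
    """Return (token, tag) pairs; only the word 'works' is disambiguated."""
    tagged = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].lower() if i > 0 else ""
        if tok.lower() != "works":
            tagged.append((tok, "?"))       # other words are left untagged
        elif prev in SUBJECT_PRONOUNS:
            tagged.append((tok, "V"))       # "He works ..." -> verb
        elif prev in POSSESSIVES:
            tagged.append((tok, "N"))       # "His works ..." -> noun
        else:
            tagged.append((tok, "N,V"))     # context gives no decision
    return tagged


print(tag_works("He works the whole day for nothing .".split())[1])
print(tag_works("His works have all been sold abroad .".split())[1])
```

When no rule applies, the sketch keeps both candidate tags rather than guessing.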

      Chunking
Following Abney, chunks are defined as the non-recursive parts of core
phrases, such as nominal, prepositional, adjectival and adverbial
phrases and verb groups.

Chunk parsing is an important step towards making natural
language processing robust, since the goal of chunk parsing is
not to deliver a full analysis of sentences, but to extract just the
linguistic fragments that can be reliably identified. Even if this
strategy fails to produce an analysis for the whole sentence, the
partial linguistic information gained will still be useful for many
applications, such as information extraction and text mining.

      Named Entity Detection
Related to chunking is the recognition of so-called named
entities (names of institutions and companies, date expressions,
etc.). The extraction of named entities is mostly based on a
strategy that combines look up in gazetteers (lists of companies,
cities, etc.) with the definition of regular expression patterns.
Named entity recognition can be included as part of the
linguistic chunking procedure and the following sentence
 “…the secretary-general of the United Nations, Kofi Annan,…”
will be annotated as a nominal phrase, including two named
entities: United Nations with named entity class: organization,
and Kofi Annan with named entity class: person
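
The combination of gazetteer lookup and regular expression patterns described above can be sketched as follows. The two-entry gazetteer, the date pattern and the function name `find_named_entities` are assumptions chosen for illustration; the example sentence is extended with a date that is not in the original.

```python
import re

# Hypothetical mini-gazetteer; real systems use lists with many entries.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}
# Illustrative pattern for simple date expressions such as "12 May 2003".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def find_named_entities(text):
    """Return (start, end, surface string, class) tuples sorted by position."""
    entities = []
    # Step 1: gazetteer lookup for known names.
    for name, ne_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.end(), name, ne_class))
    # Step 2: pattern matching for open-ended expressions such as dates.
    for match in DATE_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "date"))
    return sorted(entities)


sentence = ("the secretary-general of the United Nations, Kofi Annan, "
            "spoke on 12 May 2003")
for start, end, surface, ne_class in find_named_entities(sentence):
    print(surface, "->", ne_class)
```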

     Dependency Structure Analysis

A dependency structure consists of two or more linguistic units
that immediately dominate each other in a syntax tree. The
detection of such structures is generally not provided by
chunking but builds on top of it.
There are two main types of dependencies that are relevant for
our purposes: On the one hand, the internal dependency
structure of phrasal units or chunks and on the other hand the so-
called grammatical functions (like subject and direct object).

      Internal Dependency Structure
For the internal structure of a phrase we use the terms head,
complements and modifiers, where the head is the dominating
node in the syntax tree of a phrase (chunk), complements are
necessary qualifiers thereof, and modifiers are optional
qualifiers. Consider the following example:
“The shot by Christian Ziege goes over the goal.”

The prepositional phrase “by Christian Ziege” (containing the named
entity Christian Ziege) depends on (and modifies) the head noun “shot”.

     Grammatical Functions
Grammatical functions determine the role (function) of each of
the linguistic chunks in the sentence and make it possible to
identify the actors involved in certain events. For example, in
the following sentence the syntactic (and also the semantic)
subject is the NP constituent “The shot by Christian Ziege”:

“The shot by Christian Ziege goes over the goal.”

This nominal phrase depends on (and complements) the verb
“goes”, whereas the noun “shot” is the head of the NP (it is
the shot that goes over the goal, not Christian Ziege!).

     Semantic Tagging
Automatic semantic annotation has developed within language
technology in recent years in connection with more integrated
tasks like information extraction, which require a certain level of
semantic analysis. Semantic tagging consists in the annotation of
each content word in a document with a semantic category.
Semantic categories are assigned on the basis of a semantic
resource like WordNet for English or EuroWordNet, which
links words between many European languages through a
common inter-lingua of concepts.

      Semantic Resources
Semantic resources are captured in dictionaries, thesauri, and semantic
networks, all of which express, either implicitly or explicitly, an ontology of
the world in general or of more specific domains, such as medicine.
They can be roughly distinguished into the following three groups:

    Thesauri: Semantic resources that group together similar words or terms
according to a standard set of relations, including broader term, narrower
term, sibling, etc. (like Roget)
    Semantic Lexicons: Semantic resources that group together words (or
more complex lexical items) according to lexical semantic relations like
synonymy, hyponymy, meronymy, and antonymy (like WordNet)
    Semantic Networks: Semantic resources that group together objects
denoted by natural language expressions (terms) according to a set of
relations that originate in the nature of the domain of application (like UMLS
in the medical domain)

  The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus for indexing articles
and books in the medical domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each term a number of
term variants that refer to the same concept. It currently includes a
vocabulary of over 250,000 terms. The following is a sample entry for
the term gene library (MH is the term itself, ENTRY are term variants):
  MH    = Gene Library
  ENTRY = Bank, Gene
  ENTRY = Banks, Gene
  ENTRY = DNA Libraries
  ENTRY = Gene Bank
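
How term variants support searching can be sketched by mapping every variant to its heading. This is a minimal illustration that assumes the entry is available as plain "FIELD = value" lines; the function name `build_variant_index` is hypothetical and not part of any MeSH tooling.

```python
# The sample entry above, stored as plain text in "FIELD = value" form.
RECORD = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def build_variant_index(record):
    """Map the heading and every term variant (lower-cased) to the heading."""
    heading = None
    variants = []
    for line in record.splitlines():
        field, _, value = line.partition("=")
        field, value = field.strip(), value.strip()
        if field == "MH":
            heading = value
        elif field == "ENTRY":
            variants.append(value)
    return {term.lower(): heading for term in [heading, *variants]}


index = build_variant_index(RECORD)
print(index["dna libraries"])
```

A query term is then normalized by a single dictionary lookup, so "DNA Libraries" and "Gene Bank" both resolve to the same indexed concept.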

  The WordNet Semantic Lexicon

WordNet has primarily been designed as a computational
account of the human capacity of linguistic categorization
and covers an extensive set of semantic classes (called
synsets). Synsets are collections of synonyms, grouping
together lexical items according to meaning similarity.
Synsets are actually not made up of lexical items, but rather
of lexical meanings (i.e. senses).

   WordNet: An example
The word 'tree' has two meanings that roughly correspond to the classes
of plants and that of diagrams, each with their own hierarchy of classes
that are included in more general super-classes:

09396070 tree 0
 09395329 woody_plant 0 ligneous_plant 0
  09378438 vascular_plant 0 tracheophyte 0
   00008864 plant 0 flora 0 plant_life 0
    00002086 life_form 0 organism 0 being 0 living_thing 0
     00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
 09987563 plane_figure 0 two-dimensional_figure 0
  09987377 figure 0
   00015185 shape 0 form 0
    00018604 attribute 0
     00013018 abstraction 0
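
The two hierarchies can be read as chains of hypernym links between synsets. The sketch below stores the listing in a plain dictionary and walks each chain; the data layout and the function names `senses` and `hypernym_chain` are assumptions for illustration and do not reflect the actual WordNet file format or API.

```python
# Each synset offset maps to (synonyms, offset of the direct hypernym).
# Offsets and members are copied from the listing above.
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}


def senses(word):
    """Return the offsets of all synsets that contain the word."""
    return sorted(off for off, (members, _) in SYNSETS.items() if word in members)


def hypernym_chain(offset):
    """Follow hypernym links upwards, returning the first member of each synset."""
    chain = []
    while offset is not None:
        members, offset = SYNSETS[offset]
        chain.append(members[0])
    return chain


for sense in senses("tree"):
    print(sense, " > ".join(hypernym_chain(sense)))
```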

   CYC: A Semantic Network
CYC is a semantic network of over 1,000,000 manually defined rules that
cover a large part of common sense knowledge about the world. For
example, CYC knows that trees are usually outdoors, or that people who have
died stop buying things. Each concept in this semantic network is defined as a
constant, which can represent a collection (e.g. the set of all people), an
individual object (e.g. a particular person), a word (e.g. the English word
person), a quantifier (e.g. there exist), or a relation (e.g. a predicate, function,
slot, attribute). The entry for the predicate #$mother:
   #$mother :
         (#$mother ANIM FEM)
              isa: #$FamilyRelationSlot #$BinaryPredicate

This says that the predicate #$mother takes two arguments, the first of which
must be an element of the collection #$Animal, and the second of which
must be an element of the collection #$FemaleAnimal.

   Word Sense Disambiguation

Words mostly have more than one interpretation, or sense. If
natural language were completely unambiguous, there would be
a one-to-one relationship between words and senses. In fact,
things are much more complicated, because for most words not
even a fixed number of senses can be given. Therefore, only in
certain circumstances, and depending on what exactly we mean
by sense, can we give restricted solutions to the problem of
Word Sense Disambiguation (WSD).

                            A simplified Example
                            of a Domain Ontology

  Ontology_1: Movies                           Ontology_1: Movies
   Title: String                                Title: Lord of the Rings
   Date: mm/dd/yyyy                             Date:
   Duration: minutes                            Duration:
   Type: (action, drama,..)        Instances    Type:
   Director: String                             Director: Peter Jackson
   Main Actors:                                 Main Actors:
      Name_1:       Role:                          Name_1:        Role:
      Name_2:       Role:                          Name_2:        Role:
      Name_3:       Role:                          Name_3:        Role:
      …                                            …
  …                                            …
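
The relation between the ontology schema (left) and a partially filled instance (right) can be sketched with data classes. The class names, field names and the helper `unfilled_slots` are assumptions made for illustration; only the title and director values come from the slide.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Actor:
    name: str
    role: Optional[str] = None


@dataclass
class Movie:
    # Slots of the simplified Movies ontology; None marks an unfilled slot.
    title: str
    director: Optional[str] = None
    date: Optional[str] = None              # mm/dd/yyyy in the schema
    duration_minutes: Optional[int] = None
    type: Optional[str] = None              # e.g. "action", "drama"
    main_actors: list = field(default_factory=list)

    def unfilled_slots(self):
        """Names of slots that still have to be filled, e.g. by extraction."""
        return [name for name, value in vars(self).items() if value in (None, [])]


instance = Movie(title="Lord of the Rings", director="Peter Jackson")
print(instance.unfilled_slots())
```

The list of unfilled slots is what an instance-filling step would try to complete from text.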

                     Example of RDF Schema for
                        the Movie Ontology
<rdf:Description rdf:about=''>
  <rdf:type rdf:resource=''/>
  <rdfs:comment>Details of company that created special effects in this movie</rdfs:comment>
  <rdfs:subClassOf rdf:resource=''/>
</rdf:Description>
<rdf:Description rdf:about=''>
  <rdf:type rdf:resource=''/>
  <rdfs:comment>Films that deal solely with police activity</rdfs:comment>
  <rdfs:subClassOf rdf:resource=''/>
</rdf:Description>


  Integration of Ontology and
  Semantic Lexicon
  Example of Semantic Lexicon is WordNet (sometimes also
  referred to as a ‘Linguistic Ontology’)
  Ontologies are domain specific models, usually lacking
  linguistic information (PoS, Morphology, Syntax etc.)
  To be Integrated in One Resource or Kept/Accessed Separately?


  Format                   Web-based Standards for Lexical Semantic
                            Representation will Increase their Uptake
                            (Easy Plug-and-Play, Remote Access, etc.)
  Content                  Widely Used (Lexical) Semantic Resources will
                            lead to (Further) Semantic Standardization

Defining a Linguistic Ontology for the Art
World (Tentative)

   xmlns:rdf=""
   xmlns:rdfs=""
   xmlns:xsd=""
   xmlns:daml=""
   xmlns:art=""

  <daml:Ontology rdf:about="Concepts in the Art World">

Defining Art World Concepts (Classes, “Synsets”)
  <daml:Class rdf:ID="art-world.01"/>

  <art-world.01 rdf:ID="work"/>
  <art-world.01 rdf:ID="painting"/>

  <daml:Class rdf:ID="art-world.02"/>
  <art-world.02 rdf:ID="beautiful"/>
  <art-world.02 rdf:ID="colourful"/>

  <daml:Class rdf:ID="art-world.03"/>
  <art-world.03 rdf:ID="paper"/>
  <art-world.03 rdf:ID="canvas"/>

Defining Properties (“Selection Restrictions”)
 <daml:ObjectProperty rdf:ID="manner">
  <rdfs:range rdf:resource="#art-world.02"/>
  <rdfs:domain rdf:resource="#art-world.01"/>
 </daml:ObjectProperty >

 <daml:ObjectProperty rdf:ID="medium">
  <rdfs:range rdf:resource="#art-world.03"/>
  <rdfs:domain rdf:resource="#art-world.01"/>
 </daml:ObjectProperty >
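
Read as selection restrictions, the two properties constrain which words may be combined. The sketch below mirrors the classes and the domain/range declarations in plain Python dictionaries; the function name `satisfies` and the dictionary layout are assumptions for illustration, not a DAML or RDF API.

```python
# Classes ("synsets") and their members, as declared for the art world.
CLASSES = {
    "art-world.01": {"work", "painting"},
    "art-world.02": {"beautiful", "colourful"},
    "art-world.03": {"paper", "canvas"},
}
# Property name -> (domain class, range class).
PROPERTIES = {
    "manner": ("art-world.01", "art-world.02"),
    "medium": ("art-world.01", "art-world.03"),
}


def satisfies(prop, subject, value):
    """True if subject is in the property's domain and value is in its range."""
    domain, range_ = PROPERTIES[prop]
    return subject in CLASSES[domain] and value in CLASSES[range_]


print(satisfies("medium", "painting", "canvas"))
print(satisfies("manner", "painting", "canvas"))
```

The first call succeeds because "canvas" belongs to the range of medium; the second fails because "canvas" is not in the range of manner.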


(Semantic) Lexicons will be…

        …an Important Part of the Semantic Web

        …Represented Using Markup Languages (RDF)

        …Accessible in a Remote, Distributed Fashion

        …Central to Further Semantic Standardization

          Multilingual terminological lexicon,
                     attached to a
             domain ontology (MUMIS)
<lex-element id="ID" concept="Shot-on-goal">
   <... lang="DE" type="main">Torschuss</term>
   <... lang="EN" type="main">shot on goal</term>
   <... lang="NL" type="main">schot op doel</term>
   <definition>ein Angriffsspieler kickt den Ball zu den
          gegnerischen Tor</definition>
   <... lang="DE" type="synonym">Distanzschuss</term>
   <... lang="DE" type="synonym">Nachschuss</term>
   <... lang="DE" type="synonym">Schuss</term>
   <... lang="DE" type="synonym">abzieh</term>
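
Because every term is attached to a concept, the lexicon can be used to map a term in one language to the main term in another. The following sketch holds the entry above in a Python dictionary; the data layout and the function names `concept_for` and `translate` are assumptions for illustration and are not part of the MUMIS tooling.

```python
# Concept -> list of (language, type, term), copied from the entry above.
LEXICON = {
    "Shot-on-goal": [
        ("DE", "main", "Torschuss"),
        ("EN", "main", "shot on goal"),
        ("NL", "main", "schot op doel"),
        ("DE", "synonym", "Distanzschuss"),
        ("DE", "synonym", "Nachschuss"),
        ("DE", "synonym", "Schuss"),
        ("DE", "synonym", "abzieh"),
    ],
}


def concept_for(term, lang):
    """Return the concept a term (main term or synonym) is attached to."""
    for concept, entries in LEXICON.items():
        if any(l == lang and t == term for l, _, t in entries):
            return concept
    return None


def translate(term, source_lang, target_lang):
    """Map a term to the main term of the same concept in another language."""
    concept = concept_for(term, source_lang)
    if concept is None:
        return None
    for lang, kind, target in LEXICON[concept]:
        if lang == target_lang and kind == "main":
            return target
    return None


print(concept_for("Nachschuss", "DE"))
print(translate("Schuss", "DE", "EN"))
```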

 Extension and Formalization of the multilingual
         terminological lexicon, including
 syncategorematic information. Supporting WSD.

<lex-element id="ID" concept="Shot-on-goal">
   <... lang="DE" type="main" pos="N" mod={"von"
   concept="Player" | concept="player" gender="gen" |
   pos="posspron"}>Torschuss</term>
   <... lang="DE" type="synonym" pos="V" comp=
   {"SUBJ" concept="Player"}>abzieh</term>
   <definition>URL: DFB home page/glossary</definition>

              Integrating Syntactic and
                 Domain Knowledge

                  Including Syntactic Analysis for
                  a more accurate tagging of domain
                  specific semantic annotation

                     Abstraction over Syntactic
             Head                 Ontology_2: PP
             Comp                 Head: Prep
             Mod                  Type: {LocPP,DatePP, etc.}
             Spec                 Comp: NP

 Ontology_1: NP                           Ontology_4:
 Head:N                                   Grammatical Functions
 Mod: {Adj*,PP?,GenNP}                    Subject, Object, Ind. Object
 Spec: {Det? PossPron?}                   NP Adjunct, PP Adjunct,
 Type: {RefNP, ProNP,                     etc..

            Merging of Syntactic and Domain
  Example of a possible rule for conceptual annotation:

  If (Head of Subj_NP of Verb[type=soccer::shot-on-goal] is a
  person) =>        { annotate head of NP with semantic class

  Example of a rule for Instance Filling:

  If (term annotated with concept “soccer::player”) =>
  { try to find information about relations “Team”, “Age” etc.
   (Template Filling in Information Extraction).

       NLP-based knowledge markup

  MuchMore: DTD for Annotation
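[Figure: element tree of the MuchMore annotation DTD; recoverable names]
  document > sentence
    umlsterm (id, from, to, cui, code, pref, tui, msh)
    xrceterms > xrceterm (id, from, to, cui, code, pref, tui, msh)
    semrels > semrel (term1)
    ewnterms > ewnterm > sense (offset)
    gramrels > gramrel
    chunks > chunk (to)
    text > token (id)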

    MuchMore: Linguistic Annotation
    (Lemmatization, POS, Basic Chunking)
Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of
spatial and object-based attention, disturbed spatial perception and representation,
and optic ataxia resulting from bilateral parieto-occipital lesions.
  <token    id="w1"    pos="NN">Balint</token>
  <token    id="w2"    pos="NN">syndrom</token>
  <token    id="w3"    pos="VBZ" lemma="be">is</token>
  <token    id="w4"    pos="DT" lemma="a">a</token>
  <token    id="w5"    pos="NN" lemma="combination">combination</token>
  <token    id="w6"    pos="IN" lemma="of">of</token>
  <token    id="w7"    pos="NNS" lemma="symptom">symptoms</token>
  <token    id="w20"     pos="JJ"   lemma="spatial">spatial</token>
  <token    id="w21"     pos="NN"   lemma="perception">perception</token>
  <token    id="w22"     pos="CC"   lemma="and">and</token>
  <token    id="w23"     pos="NN"   lemma="representation">representation</token>
<chunk id="c1" from="w1" to="w2" type="NP"/>
<chunk id="c7" from="w20" to="w23" type="NP"/>
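
Since chunks refer to tokens only through their from and to attributes, the text of a chunk has to be recovered from the token sequence. The sketch below does this for chunk c7 with Python's standard XML parser; the enclosing `<sentence>` element and the function name `chunk_texts` are assumptions made to obtain a well-formed, self-contained example.

```python
import xml.etree.ElementTree as ET

# Excerpt of the annotation above, wrapped in an assumed <sentence> element.
ANNOTATION = """
<sentence>
  <token id="w20" pos="JJ" lemma="spatial">spatial</token>
  <token id="w21" pos="NN" lemma="perception">perception</token>
  <token id="w22" pos="CC" lemma="and">and</token>
  <token id="w23" pos="NN" lemma="representation">representation</token>
  <chunk id="c7" from="w20" to="w23" type="NP"/>
</sentence>
"""


def chunk_texts(xml_string):
    """Map each chunk id to (chunk type, text spanned by the chunk)."""
    root = ET.fromstring(xml_string.strip())
    tokens = root.findall(".//token")
    # Position of every token in document order, keyed by its id.
    position = {tok.get("id"): i for i, tok in enumerate(tokens)}
    result = {}
    for chunk in root.findall(".//chunk"):
        first = position[chunk.get("from")]
        last = position[chunk.get("to")]
        words = [tok.text for tok in tokens[first:last + 1]]
        result[chunk.get("id")] = (chunk.get("type"), " ".join(words))
    return result


print(chunk_texts(ANNOTATION))
```

Keeping the annotation layers stand-off in this way lets further layers (terms, semantic relations) point at the same token ids without touching the text.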

  MuchMore: Semantic Annotation
  (UMLS, EuroWordNet)
Balint syndrom is a combination of symptoms including simultanagnosia, a disorder of
spatial and object-based attention, disturbed spatial perception and representation,
and optic ataxia resulting from bilateral parieto-occipital lesions.
<umlsterm id="t7" from="w20" to="w21">
         <concept id="t7.1" cui="C0037744" preferred="Space Perception" tui="T041">
            <msh code="F2.463.593.778"/>
            <msh code="F2.463.593.932.869"/>
         </concept>
</umlsterm>
<umlsterm id="t8" from="w26" to="w26">
         <concept id="t8.1" cui="C0029144" preferred="Optics" tui="T090">
            <msh code="H1.671.606"/>
         </concept>
</umlsterm>
<semrel id="r7" term1="t7.1" term2="t8.1" reltype="issue_in"/>

<ewnterm id="e2" from="w21" to="w21">
         <sense offset="0487490"/>
         <sense offset="3955418"/>
         <sense offset="4002483"/>
</ewnterm>

MUMIS: DTD for Linguistic Annotation
[Figures: DTD diagrams spanning four slides; the recoverable element
names are Document, Paragraph, Sentence, VG, SENT_STRING, W and VG_STRG]

 MUMIS: Linguistic Annotation
 (Lemmatization … Dependency Structure)

Industrie, Handel und Dienstleistungen werden in der ersten Liste aufgeführt, wobei
die in Klammern gesetzten Zahlen auf die Mutterfirmen hinweisen.

(Industry, trade and services are mentioned in the first list, in which numbers within
brackets point to parent companies.)

 <chunk id="c1"     from="w1" to="w5" type="NP" head="w1,w3,w5"/>
 <chunk id="c2"     from="w6" to="w6" type="VG"/>
 <chunk id="c3"     from="w7" to="w10" type="PP" head="w7" complement="w8,w9,w10"/>
 <chunk id="c4"     from="w11" to="w11" type="VG"/>

 <clause id="cl1" from="c1" to="c4" pred_struct="c2 c4" GF_Subj="c1"/>
 <clause id="cl2" from="c6" to="c9" pred_struct="c9" GF_Subj="c6"/>

                       MUMIS: Semantic Annotation (Events)
7. Ein Freistoss von Christian Ziege aus 25 Metern geht über das Tor.

      <chunk id="c1" from="w1" to="w5" type="NP" head="w2" pp_modifier="w3 w4 w5"/>
      <chunk id="c2" from="w6" to="w8" type="PP" head="w6" complement="w7 w8"/>
      <chunk id="c3" from="w9" to="w9" type="VG"/>
      <chunk id="c4" from="w10" to="w12" type="PP" head="w10" complement="w11 w12"/>

       <clause id="cls1" from="c1" to="c4" pred_struct="c3" GF_Subj="c1"/>

      <event id="e1" clause="cls1" event-name="free-kick">
               <argument id="arg1" name="player" value="w4, w5"/>
               <argument id="arg2" name="location" value="25-meter"/>
               <argument id="arg3" name="time" value="07:00"/>
      </event>
      <event id="e2" clause="cls1" event-name="goal-scene-fail">
               <argument id="arg1" name="player" value="w4, w5"/>
               <argument id="arg2" name="location" value="25-meter"/>
               <argument id="arg3" name="time" value="07:00"/>
      </event>

Conceptual Annotations for Multimedia Indexing
and Retrieval: A multilingual cross-document and
incremental IE approach (MUMIS)

      Technology development to automatically index (with
   formal annotations) lengthy multimedia recordings (off-line
   process): Find and annotate relevant entities, relations and
   events

      Technology development to exploit indexed multimedia
   archives (on-line process): Search for interesting scenes
   and play them via Internet

   Test Domain: Soccer Games / UEFA Tournament 2000

      Indexing by...
                            Off-line Task
           Automatic Speech Recognition (Radio/TV Broadcasts)
                  Automatically transforms the speech signals into texts
                   (for 3 languages — Dutch, English and German)
           Natural Language Processing (Information Extraction)
                Analyse all available textual documents (newspapers,
                   speech transcripts, tickers, formal texts ...), identify
                   and extract interesting entities, relations and events
           Merging all the annotations produced so far
           Create a database with formal annotations
           Use video processing to adjust time marks

               Information Extraction

      Information Extraction (IE) is the task of
    identifying, collecting and normalizing relevant
    information for a specific application or user.
      The relevant information is typically
    represented in the form of predefined “templates”,
    which are filled by means of Natural Language
    (NL) analysis.
      IE combines pattern matching mechanisms,
    (shallow) NLP and domain knowledge
    (terminology and ontology).

         Information Extraction (2)

  IE is generally subdivided into the following tasks:
      - Named Entity task (NE)
      - Template Element task (TE)
      - Template Relation task (TR)
      - Scenario Template task (ST)
      - Co-reference task (CO)

                            Subtasks of IE

      Named Entity task (NE): Mark in the text
    each string that represents a person,
    organization, or location name, or a date or
    time, or a currency or percentage figure.
    Template Element task (TE): Extract basic
    information related to organization, person, and
    artifact entities, drawing evidence from
    everywhere in the text.

                            Subtasks of IE (2)
    Template Relation task (TR): Extract relational information
    on employee_of, manufacture_of, location_of relations etc.
    (TR expresses domain-independent relationships).
    Scenario Template task (ST): Extract pre-specified event
    information and relate the event information to particular
    organization, person, or artifact entities (ST identifies
    domain and task specific entities and relations).
    Co-reference task (CO): Capture information on co-
    referring expressions, i.e. all mentions of a given entity,
    including those marked in NE and TE.

                            IE applied to soccer

Terms as descriptors for the NE task
Team: Titelverteidiger Brasilien, den respektlosen Außenseiter Schottland
Player: Superstar Ronaldo, von Bewacher Calderwood noch von
   Abwehrchef Hendry, von Jackson als drittem Stürmer, Torschütze
   Cesar, von Roberto Carlos (16.),
Referee: vom spanischen Schiedsrichter Garcia Aranda
Trainer: Schottlands Trainer Brown, Kapitän Hendry seinen Keeper
Location: im Stade de France von St. Denis (more fine-grained location
   detection would be: Stadion: im Stade de France and City: von St.
   Denis )
Attendance: Vor 80000 Zuschauern

   IE applied to soccer (2)
Terms for NE Task
Time: in der 73. Minute, nach gerade einmal 3:50 Minuten,
  von Roberto Carlos (16.), nach einer knappen halben
  Stunde, scheiterte Rivaldo (49./52.) jeweils nur knapp, das
  vor der Pause Versäumte versuchten die Brasilianer nach
  Wiederbeginn, ...
Date: am Mittwoch, der Turnierstart (?), im WM-
  Eröffnungsspiel (?)
Score/Result: Brasilien besiegt Schottland 2:1, einen 2:1
  (1:1)-Sieg, der zwischenzeitliche Ausgleich, in der 4.
  Minute in Führung gebracht, köpfte zum 1:0 ein

   IE applied to soccer (3)
Relations for TR Task:
Opponents: Brasilien besiegt Schottland, feierte der Top-
    Favorit ... einen glücklichen 2:1 (1:1)-Sieg über den
    respektlosen Außenseiter Schottland,
Player_of: hatte Cesar Sampaio den vierfachen Weltmeister
    ... in Führung gebracht, Collins gelang ... der
    zwischenzeitliche Ausgleich für die Schotten, der Keeper
    des FC Aberdeen, Brasiliens Keeper Taffarel
Trainer_of: Schottlands Trainer Brown

   IE applied to soccer (4)
Events for ST task
Goal: in der 4. Minute in Führung gebracht, das schnellste
  Tor ... markiert, Cesar Sampaio köpfte zum 1:0 ein, Collins
  (38.) verwandelte den Strafstoß, hätte Kapitän Hendry
  seinen Keeper Leighton um ein Haar zum zweiten Mal
  bezwungen, von dem der Ball ins Tor prallte
Foul: als er den durchlaufenden Gallacher im Strafraum allzu
  energisch am Trikot zog
Substitution: und mußte in der 59. Minute für Crespo Platz

      NL Processing and Knowledge Markup of
    (German) soccer texts with the SCHUG system
A multilingual ontological lexicon

•    Formal Text1
•    Formal Text2
•    XML Soccer Annotation for Text1
•    XML Soccer Annotation for Text2
•    Merging of Annotations for Formal Texts
•    Semi-Formal Text
•    Semi-Formal Text annotated with Soccer Information (XML)

                      Multilingual Extension
                       Spanish (Esperonto)

  <lex-element id="ID" concept="Second-half">
     <... lang="DE" type="main">zweite Halbzeit</term>
     <... lang="EN" type="main">second half</term>
     <... lang="ES" type="main">reanudacion</term>

  Processing with the SCHUG system:

Conceptual Annotations for Multimedia
Indexing and Retrieval: MUMIS

[Figure: MUMIS architecture. Panel labels: Automatic Speech Recognition (Signals, ASR, Transcripts), Domain Modeling (Texts, DM, Ontology, Domain Lexicon), Information Extraction (Formal Text, IE, Annotations), Merging (Merged Annotations, Annotated formal text), User Interface. Legend: EN, NL, DE.]

          The first user interface of MUMIS

Intelligent Software Components (Coord): Semantic Web, Annotation
UPM: ontology development and evaluation.
University of Innsbruck: Semantic Web languages.
Saarland University: multilingual annotation services, using
Information Extraction
UNILIV: semantic indexing of Semantic Web content. Routing
solutions. Visualization and navigation to make content presentation
Residencia de Estudiantes: content provider. Cultural tour test case.
CIDEM: content provider. Fund finder test case. Evaluation.
BioVista: content provider. Scientific Discovery test case.

   Application Service Provision of Semantic Annotation, Aggregation,
   Indexing and Routing of Textual, Multimedia, and Multilingual Web Content

The project aims at bridging the gap between the current World
Wide Web and the Semantic Web by providing a service to
"upgrade" existing content to Semantic Web content.
Ontologies play a key role in this effort, together with
multilingual Natural Language Analysis of textual documents
currently on the web as free text or HTML-encoded text.

                                Main Goals
 To bridge the gap between the current web and the Semantic Web: SemASP
   Ontology-based annotation
   Sources:
       Static pages
       Pages dynamically generated from DB
       Textual and multimedia information
       Web services
 Added value knowledge-based services on top of the constructed semantic
   Routing based on P2P communication
   Semantic aggregation
   Meaning negotiation
 Support Multilinguality on ontology construction, ...

[Figure: SemASP architecture in three layers. Semantic Web layer: Applications, Portal, Multilingual, Provider, Routers, Semantic indices, Concept instances. SemASP layer: Certificate, Workbench, Maintenance, Ontology Repository Service, Multilinguality, Reengineering, Mapping, Tagger/Wrapper components; XML, DAML, OIL, RDF(S). World Wide Web layer: Web Server, Dynamic and Static Information Providers, Multimedia Data.]

              Ontology-based Annotation

         Accurately annotate documents with concepts
     and terms described in various semantic resources:
     EuroWordNet, UMLS, Soccer ontology, etc.
         Annotate documents with relations defined in
     the ontology.
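
The concept-annotation step can be sketched in Python as a dictionary lookup over the text. This is an illustration under stated assumptions, not the annotation service itself: `RESOURCE` is a toy stand-in for a resource such as EuroWordNet or UMLS, its entries and concept labels are invented, and matching is done on surface strings rather than on linguistically analysed text.

```python
import re

# Toy stand-in for a semantic resource: surface term -> concept label.
# All entries are illustrative assumptions.
RESOURCE = {
    "rheumatoid arthritis": "Disease",
    "joint stiffness": "Symptom",
    "pain": "Symptom",
}

def annotate(text, resource):
    """Return (start, end, term, concept) for each resource term found in text.

    Longer terms are tried first so that a multi-word term is not shadowed
    by a shorter term that overlaps with it.
    """
    annotations, taken = [], set()
    for term in sorted(resource, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            span = set(range(m.start(), m.end()))
            if span & taken:
                continue  # position already covered by a longer term
            taken |= span
            annotations.append((m.start(), m.end(), m.group(0), resource[term]))
    return sorted(annotations)

EXAMPLE = "For symptoms of rheumatoid arthritis (pain, joint stiffness)"
ANNOTATIONS = annotate(EXAMPLE, RESOURCE)
```

Annotating relations, the second point of the slide, would need the linguistic structure of the sentence in addition to the term matches.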

              Ontology construction from Text

     Various methodologies are under investigation for
     extracting/learning knowledge from text and encoding
     it in an ontology (see Ontology Learning Overview -
     OntoWeb D1.5). Many are based on Machine Learning
     techniques.
     We discuss here the possibility of a rule-based
     approach for partial and shallow ontology
     construction from text, based on various levels of
     syntactic patterns annotated in the documents.

                Ontology construction from Text:
                A starting experiment: Medicine
     Document Set: 65 sample phrases that link
     symptoms with Rheumatoid Arthritis (RA).

          Ontology construction from Text:
          Apposition and Parenthesis (1)
     “The effects of rheumatoid arthritis on bone include structural
     joint damage (erosions) and osteoporosis.”
     Linguistic Structure:
     [[The effects of rheumatoid arthritis] [on bone]] [include] [[structural
     joint damage ( erosions )] [ and] [osteoporosis]]
     => The apposition (2 syntactic heads, “joint” and “erosions”, in one
     NP), which includes a parenthesis construction, suggests a synonymy
     relation or a definition. Heuristic: establish semantic relations on
     top of linguistic “head-modifier” constructions.
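
A minimal Python sketch of this heuristic, assuming the NP chunks have already been produced by a chunker and are passed in as plain strings (a hypothetical format; the slides do not specify one). The relation label `synonym-or-definition` is likewise only an illustrative name for the hypothesis.

```python
import re

# Matches one NP chunk of the form "NP ( X )".
PAREN_NP = re.compile(r"^(?P<np>[^()]+?)\s*\(\s*(?P<paren>[^()]+?)\s*\)$")

def apposition_candidates(np_chunks):
    """Return (np, parenthetical, relation) for chunks of the form 'NP ( X )'."""
    candidates = []
    for chunk in np_chunks:
        m = PAREN_NP.match(chunk.strip())
        if m:
            candidates.append(
                (m.group("np"), m.group("paren"), "synonym-or-definition")
            )
    return candidates

# Chunks taken from the bracketed structure above.
CHUNKS = [
    "The effects of rheumatoid arthritis",
    "structural joint damage ( erosions )",
    "osteoporosis",
]
CANDIDATES = apposition_candidates(CHUNKS)
```

Each result is only a candidate relation; it still has to be checked against the rest of the text or by a domain expert.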

                Ontology construction from Text:
                Apposition with Parenthesis (2)
“For symptoms of rheumatoid arthritis (pain, joint stiffness), the
reference treatment is a nonsteroidal antiinflammatory drug (NSAID)
such as diclofenac or ibuprofen.”
Linguistic Structure
 [For symptoms of rheumatoid arthritis ( pain , joint stiffness )] , [the
reference treatment] [is] [a nonsteroidal antiinflammatory drug ( NSAID)]
Suggests a semantic relation between “pain” and “joint stiffness”.
Classify “pain” and “joint stiffness” as symptoms of RA. The word
“symptoms” is linguistically annotated as the head of the Compl-NP of the PP
starting with “For”.
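
The classification step can be sketched as follows. This is an approximation under assumptions: a regular expression stands in for the parsed PP/NP structure, whereas a real system would read the head noun from the linguistic annotation, and the relation name (`symptom_of`) is derived from the head noun purely for illustration.

```python
import re

# Pattern: "<class noun>(s) of <entity> ( item , item , ... )".
LIST_PATTERN = re.compile(
    r"(?P<head>\w+?)s? of (?P<entity>[\w ]+?)\s*\(\s*(?P<items>[^()]+?)\s*\)"
)

def classify_list(sentence):
    """Return (item, relation, entity) triples for parenthesised lists."""
    triples = []
    for m in LIST_PATTERN.finditer(sentence):
        relation = m.group("head").lower() + "_of"
        for item in m.group("items").split(","):
            triples.append((item.strip(), relation, m.group("entity")))
    return triples

SENTENCE = (
    "For symptoms of rheumatoid arthritis (pain, joint stiffness), the "
    "reference treatment is a nonsteroidal antiinflammatory drug (NSAID) "
    "such as diclofenac or ibuprofen."
)
TRIPLES = classify_list(SENTENCE)
```

The second parenthesis, “(NSAID)”, is not picked up by this pattern because it does not follow an “of” phrase; it is a different case that needs its own rule.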

                Ontology construction from Text:
                Apposition with Parenthesis (3)

       But there is a need to constrain the hypothesis:
       “In patients with rheumatoid arthritis (RA)” => RA is
       an abbreviation of rheumatoid arthritis.
       And in the sentence:
       “Fourteen consecutive elbows have been treated for
       rheumatoid arthritis (9 elbows) and for post-traumatic
       osteoarthrosis (5 elbows) by total elbow replacement with
       the GSB III implant.”,
       the parentheses (9 elbows) and (5 elbows) have no semantic
       relation to the preceding head nouns!
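
Two such constraints can be sketched in Python. The tests used here (initial-letter matching for abbreviations, a leading digit for quantities) are simple assumptions chosen for illustration and are not taken from the slides. Abbreviations that are not built from word initials, such as NSAID, would need a richer test.

```python
import re

def is_abbreviation(short, long_form):
    """True if `short` is upper case and made of the initials of `long_form`."""
    initials = "".join(word[0] for word in long_form.split())
    return short.isupper() and short.lower() == initials.lower()

def is_quantity(text):
    """True if the parenthesis starts with a digit, as in '9 elbows'."""
    return re.match(r"\d", text.strip()) is not None

def classify_parenthesis(np, paren):
    """Label the relation between an NP and the parenthesis that follows it."""
    if is_quantity(paren):
        return "no-relation"        # a count, not a description of the NP
    if is_abbreviation(paren, np):
        return "abbreviation"
    return "synonym-or-definition"  # the default hypothesis

LABELS = [
    classify_parenthesis("rheumatoid arthritis", "RA"),
    classify_parenthesis("rheumatoid arthritis", "9 elbows"),
    classify_parenthesis("structural joint damage", "erosions"),
]
```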

                Ontology construction from Text:
                Apposition with commas
      “Etoricoxib, a selective COX2 inhibitor, has been shown to be
      as effective as non-selective non-steroidal anti-inflammatory
      drugs in the management of chronic pain in rheumatoid
      arthritis and osteoarthritis, …”
      Linguistic Structure:
      [Etoricoxib, a selective COX2 inhibitor,] [has been shown]…

      The same hypothesis as in the former examples: a semantic relation
      between “Etoricoxib” and “selective COX2 inhibitor”, probably an
      “isa” relation.
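
A minimal sketch of this rule in Python, assuming the apposition opens the sentence and is introduced by an article. The regular expression is a rough stand-in for an NP chunk with a comma-delimited apposition, and `isa_candidate` is a hypothetical helper name.

```python
import re

# "<Name>, a/an/the <description>," at the start of a sentence.
APPOSITION = re.compile(
    r"^(?P<name>[A-Z][\w-]*), (?:a|an|the) (?P<desc>[^,]+),"
)

def isa_candidate(sentence):
    """Return (name, 'isa', description) or None if the pattern is absent."""
    m = APPOSITION.match(sentence)
    return (m.group("name"), "isa", m.group("desc")) if m else None

SENTENCE = (
    "Etoricoxib, a selective COX2 inhibitor, has been shown to be as "
    "effective as non-selective non-steroidal anti-inflammatory drugs"
)
RELATION = isa_candidate(SENTENCE)
```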

                Ontology construction from Text:
                Compound Analysis

     “Joints destructions”, “joint damage”, “joint
     disease”, “joint stiffness” but “joint cartilage”.
     “Knee joints” vs. “tender joints”

     What can happen to joints, and where are joints
     located? Use of synsets to detect relations? “Joint
     cartilage” is not a disease.
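
As a first step towards such an analysis, the compounds can be grouped by the position of “joint” in them. The sketch below is an illustration only: the plural stripping is deliberately naive, and deciding that “cartilage” is a body part rather than a disease would need a lexical resource such as synsets, which is not modelled here.

```python
from collections import defaultdict

def group_compounds(compounds, noun="joint"):
    """Split compounds by whether `noun` is their modifier or their head."""
    groups = defaultdict(list)
    for compound in compounds:
        words = compound.lower().split()
        # Naive plural stripping, only used to compare against `noun`.
        stems = [w[:-1] if w.endswith("s") else w for w in words]
        if stems[-1] == noun:
            groups["noun_is_head"].append(" ".join(words[:-1]))
        elif noun in stems[:-1]:
            groups["noun_is_modifier"].append(words[-1])
    return dict(groups)

COMPOUNDS = [
    "joints destructions", "joint damage", "joint disease",
    "joint stiffness", "joint cartilage", "knee joints", "tender joints",
]
GROUPS = group_compounds(COMPOUNDS)
```

The heads collected under `noun_is_modifier` are candidates for “what can happen to joints”, and the modifiers under `noun_is_head` are candidates for locations or properties of joints.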

                Ontology construction from Text:
                PP post-modification

     “inflammation of joints”, “synovial lining of joints”

     Here: use synsets to group what can happen to
     joints?

                Ontology construction from Text:
                Phrase Internal Coordination
     “The effects of rheumatoid arthritis on bone include structural
     joint damage (erosions) and osteoporosis.”
     Linguistic Structure:
     [[The effects of rheumatoid arthritis] [on bone]] [include]
     [[structural joint damage ( erosions )] [ and] [osteoporosis]]

     RA causes structural joint damage AND osteoporosis
     (interpreting the head noun “effects” as a causation).
     Hypothesis: The two heads of an NP coordination are
     somehow related.

                Ontology construction from Text:
                Phrase Internal Coordination (2)
     “A study was conducted to determine the incidence of ulnar
     and peripheral neuropathy.”

     Linguistic Structure:
     … [The incidence of [[ulnar and peripheral] neuropathy]]

      The AP “ulnar and peripheral” modifies the head noun
     “neuropathy”. The AP is a coordinated one, with two
     adjectival heads.
      Hypothesis: They correspond to two types of neuropathy.
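
This hypothesis can be sketched in Python as an expansion of the coordinated AP. The sketch assumes that the linguistic annotation has already established that the two coordinated words are adjectives; the regular expression alone cannot tell an adjective coordination from a noun coordination.

```python
import re

# "A1 and A2 N" at the end of an NP.
COORD_AP = re.compile(r"(?P<a1>\w+) and (?P<a2>\w+) (?P<head>\w+)$")

def expand_coordination(np):
    """Expand 'A1 and A2 N' into ('A1 N', 'isa', 'N') and ('A2 N', 'isa', 'N')."""
    m = COORD_AP.search(np)
    if not m:
        return []
    head = m.group("head")
    return [
        (f"{adj} {head}", "isa", head)
        for adj in (m.group("a1"), m.group("a2"))
    ]

SUBTYPES = expand_coordination(
    "the incidence of ulnar and peripheral neuropathy"
)
```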

                Ontology construction from Text:
                Subject Verb Objects (Ind. Obj. etc.)
     [Rheumatoid arthritis is an immunologically mediated
     inflammation of joints of unknown aetiology] and [often
     leads to disability]
     => RA leads to disability (effect of ellipsis resolution: RA is
     detected as the subject of the verb “leads”, even though it is not
     realised in the text. Reference resolution is very important for
     knowledge extraction.)
     => Lexical semantic info: collect all objects of “RA leads to”
     => Suggests causality (verb “lead” + “to”)
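
The two steps, ellipsis resolution and mapping “lead to” onto a causal relation, can be sketched as follows. The clause dictionaries are a hypothetical stand-in for parser output (the actual SCHUG output format is not shown in the slides), and the relation name `causes` is an assumption.

```python
def resolve_ellipsis(clauses):
    """Copy the subject of the previous clause into clauses that lack one."""
    resolved, last_subject = [], None
    for clause in clauses:
        subject = clause.get("subject") or last_subject
        resolved.append({**clause, "subject": subject})
        last_subject = subject
    return resolved

def causal_triples(clauses):
    """Map 'X leads to Y' clauses to (X, 'causes', Y) triples."""
    return [
        (c["subject"], "causes", c["object"])
        for c in resolve_ellipsis(clauses)
        if c["verb"] == "lead" and c.get("prep") == "to" and c["subject"]
    ]

# Hypothetical clause-level analysis of the example sentence.
CLAUSES = [
    {"subject": "rheumatoid arthritis", "verb": "be",
     "object": "immunologically mediated inflammation of joints"},
    {"subject": None, "verb": "lead", "prep": "to", "object": "disability"},
]
TRIPLES = causal_triples(CLAUSES)
```

Collecting the objects of all such triples for one subject gives the list of “objects of RA leads to” mentioned above.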

                Ontology construction from Text:
                Subject Verb Objects (Ind. Obj etc.)
     “These changes constitute hallmarks of synovial
     cell activation and contribute to both chronic
     inflammation and hyperplasia”

     Online exercise!

                Future Work

       We still have to identify accurately the subset of
       linguistic tags describing syntactic/semantic
       patterns that are relevant for ontology extraction
       (or even ontology mark-up).

                First Conclusions

      Construction of partial and shallow ontologies
      from (complex) syntactic patterns seems feasible.
      It might seem “expensive” in the sense that
      documents first should be (automatically)
      linguistically annotated.
      But Machine Learning methods also need a lot of
      semi-automatically annotated data for training.
      There is a need to conduct a comparative evaluation
      taking into account as many parameters as possible.

  Practical Sessions (Adrian Raschip)
• Exercise 1 : Semi-Automatic Terminological extension:
  Romanian and other languages. On the basis of the TMX-
  encoded MUMIS multilingual terminology
• Exercise 2 : (Manual) linguistic annotation of English and
  Romanian Text on Soccer
• Exercise 3 : Define a soccer ontology in Protégé
• Exercise 4: Search for possible mapping rules between
    linguistic annotations and relations that might be relevant
    to be extracted
