On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning
Georgios Paliouras
Software & Knowledge Engineering Lab
Inst. of Informatics & Telecommunications
NCSR “Demokritos”
http://www.iit.demokritos.gr/~paliourg
Kassel, 22 July 2005
Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
Motivation
• Practical information extraction requires a
  conceptual description of the domain, e.g. an
  ontology, and a grammar.
• Manual creation and maintenance of these
  resources is expensive.
• Machine learning has been used to:
     – Learn ontologies based on extracted instances.
     – Learn extraction grammars, given the conceptual
       model.
• Study how the two processes interact, and
  whether they can be combined.
Information extraction
• Common approach: shallow parsing with
  regular grammars.
• Limited use of deep analysis to improve
  extraction accuracy (HPSGs, concept graphs).
• Linking of extraction patterns to ontologies (e.g.
  information extraction ontologies).
• Initial attempts to combine syntax and
  semantics (Systemic Functional Grammars).
• Learning simple extraction patterns (regular
  expressions, HMMs, tree-grammars, etc.)

Ontology learning
• Deductive approach to ontology modification:
  driven by linguistic rules.
• Inductive identification of new concepts/terms.
• Clustering, based on lexico-syntactic analysis of
  the text (subcat frames).
• Formal Concept Analysis for term clustering
  and concept identification.
• Clustering and merging of conceptual graphs
  (conceptual graph theory).
• Deductive learning of extraction grammars in
  parallel with the identification of concepts.

Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
SKEL - vision
Research objective: innovative knowledge technologies
for reducing the information overload on the Web.
Areas of research activity:
      – Information gathering (retrieval, crawling, spidering)
      – Information filtering (text and multimedia
        classification)
      – Information extraction (named entity recognition and
        classification, role identification, wrappers, grammar
        and lexicon learning)
      – Personalization (user stereotypes and communities)
      – Ontology learning and population

Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
CROSSMARC Objectives
Develop technology for Information
Integration that can:
• crawl the Web for interesting Web pages,
• extract information from pages of different sites
  without a standardized format (structured, semi-
  structured, free text),
• process Web pages written in several
  languages,
• be customized semi-automatically to new
  domains and languages,
• deliver integrated information according to
  personalized profiles.
CROSSMARC Architecture

[Architecture diagram; the central element is the domain Ontology.]
 CROSSMARC Ontology
Ontology fragment (laptop domain):

  …
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Processor</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Processor Name</description>
        <discrete_set type="open">
          <value id="OV-d0e1041">
            <description>Intel Pentium 3</description>
          </value>
          …

Lexicon fragment (English):

  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>

Greek Lexicon fragment:

  <node idref="OA-d0e7">
    <synonym>Όνομα Επεξεργαστή</synonym>
  </node>
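To make the ontology-lexicon link concrete, here is a minimal sketch (Python, standard library only) of how a lexicon fragment like the one above can be indexed; the enclosing <lexicon> root element is an assumption added so the fragment is well-formed XML.

import xml.etree.ElementTree as ET

# Lexicon fragment from the slide, wrapped in an assumed <lexicon> root element.
LEXICON_XML = """<lexicon>
  <node idref="OV-d0e1041">
    <synonym>Intel Pentium III</synonym>
    <synonym>Pentium III</synonym>
    <synonym>P3</synonym>
    <synonym>PIII</synonym>
  </node>
</lexicon>"""

def synonym_index(xml_text):
    """Map each surface form (synonym) to the ontology node it refers to."""
    root = ET.fromstring(xml_text)
    return {syn.text: node.get("idref")
            for node in root.findall("node")
            for syn in node.findall("synonym")}

# synonym_index(LEXICON_XML)["PIII"] -> "OV-d0e1041" (the 'Intel Pentium 3' value)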
Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
Meta-learning for Web IE
Motivation:
• There are many different learning
  methods, producing different types of
  extraction grammar.
• In CROSSMARC we had four different
  approaches, with significant differences in
  the extracted information.
Proposed approach:
• Use meta-learning to combine the
  strengths of individual learning methods.
Meta-learning for Web IE

Stacked generalization

Training: for each fold j, the base-level learners L1…LN are trained on D \ Dj and the resulting classifiers C1(j)…CN(j) are applied to Dj; their predictions form the meta-level dataset MDj. The meta-level learner LM is trained on the meta-level dataset MD to produce the classifier CM.

Run time: a new vector x is classified by C1…CN (trained by L1…LN on the whole base-level dataset D); their predictions form the meta-level vector, from which CM produces the class value y(x).
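As a concrete illustration of the scheme above, here is a minimal stacked-generalization sketch in Python; scikit-learn is assumed to be available, X and y are assumed to be NumPy arrays with numeric class labels, and the base and meta learners can be any classifiers with fit/predict. This is a generic classification sketch, not the CROSSMARC implementation.

import numpy as np
from sklearn.model_selection import KFold

def stack_train(X, y, base_factories, meta_factory, n_folds=5):
    """Build the meta-level dataset MD via J-fold predictions and train L_M."""
    meta_X = np.zeros((len(X), len(base_factories)))
    for train_idx, test_idx in KFold(n_splits=n_folds).split(X):
        for i, make in enumerate(base_factories):
            clf = make().fit(X[train_idx], y[train_idx])    # C_i^(j) learned on D \ D_j
            meta_X[test_idx, i] = clf.predict(X[test_idx])   # predictions on D_j -> MD_j
    base = [make().fit(X, y) for make in base_factories]     # C_1 ... C_N trained on all of D
    meta = meta_factory().fit(meta_X, y)                      # C_M = L_M(MD)
    return base, meta

def stack_predict(x, base, meta):
    """Run time: base predictions form the meta-level vector; C_M returns y(x)."""
    meta_vector = np.array([[clf.predict(x.reshape(1, -1))[0] for clf in base]])
    return meta.predict(meta_vector)[0]

# e.g. (with suitable sklearn imports):
# base, meta = stack_train(X, y, [DecisionTreeClassifier, GaussianNB], LogisticRegression)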
Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Each template is filled with instances <t(s,e), f>, where t(s,e) is the text between
token offsets s and e, and f is the field it fills.
 …TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br> Intel
 <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB
 SDRAM up to 1GB…

                                       Template T
                        t(s,e)                      s, e     Field f
                     Transport ZX                47, 49      Model
                         15”                     56, 58    screenSize
                         TFT                     59, 60    screenType
               Intel <b> Pentium III             63, 67    procName
                      600 MHz                    67, 69    procSpeed
                       256 MB                    76, 78       ram

  Meta-learning for Web IE
               Combining Information Extraction systems

      …TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br>
      Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256 MB
      SDRAM up to 1GB…


       T1 filled by the IE system E1                  T2 filled by the IE system E2
       t(s, e)           s, e          f              t(s, e)         s, e        f
   Transport ZX         47, 49     model           Transport ZX      47, 49     manuf
         15”            56, 58   screenSize            TFT           59, 60   screenType
        TFT             59, 60   screenType      Intel <b> Pentium   63, 66   procName
Intel <b> Pentium III   63, 67   procName            600 MHz         67, 69   procSpeed
     600 MHz            67, 69   procSpeed           256 MB          76, 78      ram
      256 MB            76, 78      ram                1 GB          81, 83   HDcapacity
        1 GB            81, 83      ram


Meta-learning for Web IE
                         Creating a stacked template
   …TransPort ZX <br> <font size="1"> <b> 15" XGA TFT Display </b> <br>
   Intel <b> Pentium III 600 MHZ </b> 256k Mobile processor <br> <b> 256
   MB SDRAM up to 1GB…

                                     Stacked template (ST)
     s, e                  t(s, e)           Field by E1     Field by E2   Correct field
   47, 49              Transport ZX               model        manuf          model
   56, 58                   15”              screenSize           -         screenSize
   59, 60                   TFT             screenType       screenType    screenType
   63, 66             Intel<b>Pentium               -        procName            -
   63, 67            Intel<b>Pentium III     procName             -         procName
   67, 69                600 MHz             procSpeed       procSpeed      procSpeed
   76, 78                 256 MB                  ram           ram            ram
   81, 83                  1 GB                   ram        HDcapacity          -


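A hedged sketch of how the stacked template above can be built: templates are represented here as dictionaries keyed on (s, e) spans (a representation assumed for illustration), and the base systems' fields are aligned per span. The data are those of the slide.

def stacked_template(base_templates):
    """base_templates: one {(s, e): field} dict per base IE system."""
    spans = sorted({span for t in base_templates for span in t})
    return {span: [t.get(span) for t in base_templates] for span in spans}

# Templates T1 and T2 from the slide (missing fields become None, shown as '-' above).
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}

ST = stacked_template([T1, T2])
# ST[(47, 49)] == ["model", "manuf"]; the meta-level classifier chooses the final field.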
Meta-learning for Web IE
          Training in the new stacking framework
D = set of documents, paired with hand-filled templates. For each fold j, the base-level learners L1…LN are trained on D \ Dj, giving the IE systems E1(j)…EN(j); applied to the documents of Dj, these fill templates that are merged into stacked templates ST1, ST2, …, whose rows form the meta-level feature vectors of MDj. The meta-level learner LM is trained on MD (the set of all meta-level feature vectors) to produce CM, while L1…LN are also trained on the whole of D to give the run-time systems E1…EN.
Meta-learning for Web IE

Stacking at run time: a new document d is processed by the extraction systems E1…EN, which fill the templates T1…TN; these are merged into a stacked template, each row of which is classified by CM, yielding the instances <t(s,e), f> of the final template T.
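Continuing the two sketches above, the run-time step could look roughly like this; encode_fields is a hypothetical helper that turns the per-system field predictions of a stacked-template row into the numeric meta-level vector expected by the meta-level classifier.

def final_template(stacked, encode_fields, meta):
    """Classify each stacked-template row with C_M; keep only rows given a field."""
    final = {}
    for span, fields in stacked.items():
        predicted = meta.predict([encode_fields(fields)])[0]
        if predicted is not None:            # rows decided as 'no field' are dropped
            final[span] = predicted
    return final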
Experimental results
F1-scores (combined recall and precision) on four
benchmark domains and one of the CROSSMARC
domains.

   Domain            Best base         Stacking
 Courses               65.73            71.93
 Projects               61.64            70.66
 Laptops                63.81            71.55
 Jobs                   83.22            85.94
 Seminars               86.23            90.03
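F1 here is the usual harmonic mean of precision P and recall R:

F_1 = \frac{2 \, P \, R}{P + R}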
Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
Learning CFGs
Motivation:
• Provide more complex extraction patterns for
  less structured text.
• Learn more compact and human-
  comprehensible grammars.
• Process large corpora containing only positive
  examples.
Proposed approach:
• Efficient learning of context free grammars from
  positive examples, guided by Minimum
  Description Length.

Learning CFGs

                     Introducing eg-GRIDS

• Infers context-free grammars.
• Learns from positive examples only.
• Overgeneralisation is controlled through a
  heuristic based on MDL.
• Two basic/three auxiliary learning operators.
• Two search strategies:
     – Beam search.
     – Genetic search.

Learning CFGs
Minimum Description Length (MDL)

Model Length (ML) = GDL + DDL
• Grammar Description Length (GDL): the bits required to encode the grammar G.
• Derivations Description Length (DDL): the bits required to encode all training examples, as encoded by the grammar G.
In the space of hypotheses, an overly general grammar has a small GDL but a large DDL, while an overly specific grammar has a large GDL and a small DDL; the heuristic favours grammars that minimise their sum.
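In symbols (a hedged formulation; the exact coding scheme used by eg-GRIDS may differ in detail), the heuristic scores a grammar G on training data D as

\mathrm{ML}(G, D) = \mathrm{GDL}(G) + \mathrm{DDL}(D \mid G), \qquad \hat{G} = \arg\min_{G} \mathrm{ML}(G, D)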
Learning CFGs
eg-GRIDS Architecture

The training examples are first converted into an overly specific grammar. The search organisation selects grammars from a beam of grammars and applies the learning operators: Merge NT, Create NT, Create Optional NT, Detect Center Embedding and Body Substitution; in the genetic-search mode, an evolutionary algorithm applies these operators as mutations. If any inferred grammar is better than those in the beam, the search continues; otherwise the best grammar in the beam is returned as the final grammar.
Experimental results
• The Dyck language with k=1:
       S → S S | ( S ) | ε
Errors of:
• Omission: failures to parse sentences
  generated from the “correct” grammar (longer
  test sentences than in the training set).
     – Overly specific grammar.
• Commission: failures of the “correct” grammar
  to parse sentences generated by the inferred
  grammar.
     – Overly general grammar.
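A minimal sketch (Python) of the target language and of how the two error types can be estimated by sampling; the expansion probability and length bound are arbitrary choices, and parse_learned / generate_learned stand for the inferred eg-GRIDS grammar, which is not reproduced here.

import random

def is_dyck(s):
    """Recognizer for the target grammar S -> S S | ( S ) | epsilon."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def generate_dyck(p=0.4, budget=40):
    """Sample a sentence from the target grammar (expansion probability p is arbitrary)."""
    s = ""
    while random.random() < p and len(s) + 2 <= budget:
        s += "(" + generate_dyck(p, budget - len(s) - 2) + ")"
    return s

# Errors of omission: fraction of sentences drawn with generate_dyck() that the
# *learned* grammar fails to parse (parse_learned).  Errors of commission:
# fraction of sentences generated by the learned grammar (generate_learned)
# that is_dyck() rejects.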


Experimental results
Probability of parsing a valid sentence (1-errors of omission)




Experimental results
Probability of generating a valid sentence (1-errors of commission)




Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
  Ontology Enrichment

• We concentrate on instances.
• Highly evolving domain (e.g. laptop descriptions):
  – New instances characterize new concepts,
    e.g. 'Pentium 2' is an instance that denotes a new concept if it
    doesn't exist in the ontology.
  – New surface appearances of an instance,
    e.g. 'PIII' is a different surface appearance of 'Intel Pentium 3'.

• The poor performance of many Information
  Integration systems is due to their inability to
  handle the evolving nature of the domain.

Ontology Enrichment

Annotating the corpus using the domain ontology: the multi-lingual domain ontology is used to annotate the corpus; machine learning over the annotated corpus trains information extraction, which proposes additional annotations; these are validated by the domain expert and fed into ontology enrichment / population, which updates the ontology for the next cycle.
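A toy, heavily simplified sketch of one pass through this loop (Python): the "ontology" is just a set of known instance strings, a hand-written pattern stands in for the learned extraction grammar, and a callback stands in for the domain expert's validation. All names are illustrative, not part of the actual CROSSMARC implementation.

import re

def enrichment_cycle(known_instances, corpus, validate):
    """One pass: extract candidate instances, validate them, enrich the 'ontology'."""
    pattern = re.compile(r"(?:Intel\s+)?Pentium\s+\w+", re.IGNORECASE)   # stand-in grammar
    candidates = {m.group(0) for text in corpus for m in pattern.finditer(text)}
    approved = {c for c in candidates if c not in known_instances and validate(c)}
    return known_instances | approved

# e.g. enrichment_cycle({"Intel Pentium 3"},
#                       ["... Intel Pentium III 600 MHz ..."],
#                       validate=lambda s: True)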
  Finding synonyms
• The number of instances for validation increases
  with the size of the corpus and the ontology.
• There is a need for supporting the enrichment of
  the 'synonymy' relationship.
• Discover automatically different surface
  appearances of an instance (CROSSMARC
  synonymy relationship).
• Issues to be handled:
   Synonym:         'Intel pentium 3' - 'Intel pIII'
   Orthographical:  'Intel p3' - 'intell p3'
   Lexicographical: 'Hewlett Packard' - 'HP'
   Combination:     'Intell Pentium 3' - 'P III'

  COCLU
• COCLU (COmpression-based CLUstering): a model-
  based algorithm that discovers typographic similarities
  between strings (sequences of elements, i.e. letters) over an
  alphabet (ASCII characters), employing a new score
  function, CCDiff.
• CCDiff is defined as the difference in the code length of a
  cluster (i.e., of its instances), when adding a candidate
  string. Huffman trees are used as models of the clusters.
• COCLU iteratively computes the CCDiff of each new
  string from each cluster, implementing a hill-climbing
  search. The new string is added to the closest cluster, or
  a new cluster is created (depending on a threshold on CCDiff).
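A rough sketch of the idea (Python): the original COCLU uses Huffman trees as cluster models; here the Huffman code length is approximated by the Shannon entropy bound, and the CCDiff threshold is an arbitrary illustrative value.

from collections import Counter
from math import log2

def code_length(strings):
    """Approximate number of bits needed to encode all strings of a cluster."""
    freq = Counter("".join(strings))
    total = sum(freq.values())
    return -sum(n * log2(n / total) for n in freq.values()) if total else 0.0

def ccdiff(cluster, candidate):
    """Increase in the cluster's code length caused by adding the candidate string."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold=25.0):
    """Assign each string to the closest cluster by CCDiff, or start a new cluster."""
    clusters = []
    for s in strings:
        scored = [(ccdiff(c, s), c) for c in clusters]
        best = min(scored, key=lambda t: t[0]) if scored else None
        if best is not None and best[0] <= threshold:
            best[1].append(s)
        else:
            clusters.append([s])
    return clusters

# e.g. coclu(["Intel Pentium III", "Pentium III", "P3", "PIII", "Hewlett Packard", "HP"])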
Experimental results

Discovering lexical synonyms: assign an instance to a group, while proportionally decreasing the number of instances initially available in each group.
[Plot: accuracy (%) against the percentage of instances removed (0-80%).]

Discovering new instances: hide part of the known instances, then evolve the ontology and grammars to recover them.

   Initial    2nd iter.
   15/58      48/58
   28/58      56/58
   40/58      57/58
Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – BOEMIE: Bootstrapping ontology evolution with
       multimedia information extraction.
• Open issues
    BOEMIE - motivation
•      Multimedia content grows at an increasing rate in
       public and proprietary webs.
•      Hard to provide semantic indexing of multimedia
       content.
•      Significant advances in automatic extraction of low-level
       features from visual content.
•      Little progress in the identification of high-level semantic
       features.
•      Little progress in the effective combination of semantic
       features from different modalities.
•      Great effort in producing ontologies for semantic webs.
•      Hard to build and maintain domain-specific multimedia
       ontologies.
BOEMIE - approach

Content collection (crawlers, spiders, etc.) supplies multimedia content. Semantics extraction is performed on visual content, on non-visual content and on fused content. Its results are passed to ontology evolution (population & enrichment, with a coordination component), which starts from the initial ontology, draws on other ontologies, maintains an intermediate ontology and ultimately delivers the evolved ontology used for further extraction.
The ontology evolution toolkit comprises an ontology management tool, learning tools, a reasoning engine and matching tools; the semantics extraction toolkit comprises visual, text and audio extraction tools, as well as information fusion tools.
Outline
• Motivation and state of the art
• SKEL research
     – Vision
     – Information integration in CROSSMARC.
     – Meta-learning for information extraction.
     – Context-free grammar learning.
     – Ontology enrichment.
     – Bootstrapping ontology evolution with multimedia
       information extraction.
• Open issues
KR issues
• Is there a common formalism to capture
  the necessary semantic + syntactic +
  lexical knowledge for IE?
• Is that better than having separate
  representations for different tasks?
• Do we need an intermediate formalism
  (e.g. grammar + CG + ontology)?
• Do we need to represent uncertainty (e.g.
  using probabilistic graphical models)?

ML issues
• What types and which aspects of
  grammars and conceptual structures can
  we learn?
• What training data do we need? Can we
  reduce the manual annotation effort?
• What background knowledge do we need
  and what is the role of deduction?
• What is the role of multi-strategy learning,
  especially if complex representations are
  used?

Content-type issues
• What is the role of semantically annotated
  content in learning, e.g. as training data?
• What is the role of hypertext as a graph?
• Can we extract information from
  multimedia content?
• How can ontologies and learning help
  improve extraction from multimedia?



SKEL Introduction


                     Acknowledgements

• This research is the work of many current and past
  members of SKEL.
• CROSSMARC is joint work of the project consortium
  (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma
  'Tor Vergata', Veltinet, Lingway).



