
Automatically Annotating the Semantic Web
        Prof. Fabio Ciravegna
     Web Intelligence SG Coordinator
    Natural Language Processing Group
           University of Sheffield
The Semantic Web
• Towards a Knowledge-Based Web
  – Not Information-based
• Access to knowledge
  – Not to words, pages or information




                  Fabio Ciravegna
                                         2
Today’s Web
• For human consumption:
  – brainwork done by humans




Tomorrow’s Web
• For Human and Automatic Consumption
  – (most of) brainwork automated




         Semantic Web Technologies




What Annotations (1)
• Contained information
  – Associated with objects in an ontology to
    allow ontology-driven processing
    • Services based on the ontology will be able to use
      the information
       – e.g. searching




What Annotations (2)
• Information added to the existing content
  – Braindump
  – Document enrichment
     • Connected to other documents
        – e.g. Automatic generation of hyperlinks




Added Annotations




Enriching Documents
[Figure: an enriched bibliography. Annotations explain why we used these references and why we did NOT use other references]
        The Vision
• Producing tools for defining automatic annotation
  services
       • Requiring Semantic Web knowledge only (not knowledge
         of the underlying technologies)
       • Efficiently
       • Effectively
• Annotation services
   – For a specific ontological component/application
       • Efficiently
       • Effectively
• Annotating: how
   – Helping users produce annotations
       • Trusted environments
           – e.g. Knowledge Management
   – Producing unsupervised annotations
       • Other cases
        How to approach the issue
• User point of view
  – User needs
  – Implementation actions
  – Evaluation / Validation
• Architectural point of view
  –   How to empower users to implement needs
  –   How to implement actions
  –   How to distribute tasks
  –   How to control tasks
  –   How to validate results
• Application point of view
  – How to present results

           User Needs
• From User needs to application definition
  – Task identification
     •   E.g. annotation of documents/creation of virtual pages
     •   Output:
         – Description of task
  – Domain definition
     •   Ontology building/selection
     •   Output:
         – Ontology
  – Identification of potential sources of knowledge
     •   Sites to be harvested for knowledge!
     •   Pre-existing resources (gazetteers, ontologies, KBs, Web services)
     •   Output:
         – Description of the sources of information
  – Other requirements (e.g. speed)
        User Implementation Action
• Describing resources
  – How resources can contribute to the task
    •   As if done by a person
  – Output: Catalogues for information sources
• User Architecture Design
  – Definition of:
    •   How to find and integrate information
    •   How to compose modules (logical)
    •   How to integrate modules (software)
    •   How to validate results
  – Output: Architecture design
    •   Information Flow + Validation Strategy
• Result Presentation (application dependent)
  – Output: a strategy for result presentation
     Annotation through Harvesting
• Harvesting information
  – To create a KB of instances
     • Including identification strategies
  – From trusted sources
• Annotating documents using
  – Identification strategies in the KB
• Method is learning-based
  – Incrementality

      Harvesting as a Formal Task
• Harvesting can be defined as:
  – The task of identifying instances for every object in a
    given ontology
• Since an ontology is a formal definition
  – Harvesting task(s) can be defined formally
     • As a conjunction of predicates
• Harvesting modules
  – Defined according to objects they work on
  – Formally defined in terms of the task(s) they perform


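The idea of a harvesting task as a conjunction of predicates over ontology objects can be sketched as below. All names here (the predicates, the candidate records, the site) are hypothetical illustrations, not the deck's actual formalism:

```python
# Sketch: a harvesting task as a conjunction of predicates over
# ontology objects. Predicate names and candidate data are invented
# for illustration only.

def is_person(candidate):
    # e.g. confirmed by a Named Entity Recognizer
    return candidate.get("type") == "person"

def works_at(candidate, site):
    # e.g. confirmed by a homepage found on the site
    return site in candidate.get("affiliations", [])

def harvesting_task(candidate, site):
    """Formally, the task is a conjunction of predicates."""
    return is_person(candidate) and works_at(candidate, site)

candidates = [
    {"name": "Fabio Ciravegna", "type": "person",
     "affiliations": ["dcs.shef.ac.uk"]},
    {"name": "Position Paper", "type": "other", "affiliations": []},
]

instances = [c["name"] for c in candidates
             if harvesting_task(c, "dcs.shef.ac.uk")]
print(instances)  # only the genuine researcher survives
```

Because the task is just a predicate conjunction, a harvesting module can be described formally by the subset of predicates it is able to establish.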
       Semantic Web Services
• Architecture based on SWS
  – Each service
    • Is associated with parts of the ontology
    • Works independently
    • Can use other services to perform sub-tasks
        – Discovery
    • Can be geographically distributed
    • Can work on an internal ontology with internal
      predicates
        – Needs mapping to the current ontology



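The mapping from a service's internal predicates to the current ontology can be sketched as a simple renaming table. All predicate and identifier names here are hypothetical stand-ins:

```python
# Sketch: a service exposes results as internal predicates; a mapping
# table translates them into the current (shared) ontology's
# predicates. Predicate names are invented for illustration.

PREDICATE_MAP = {
    "staff_member": "akt:works-at",      # internal -> current ontology
    "home_page":    "akt:has-homepage",
}

def translate(internal_triples, mapping):
    """Rewrite (subject, predicate, object) triples, dropping any
    predicate the current ontology cannot express."""
    return [(s, mapping[p], o)
            for s, p, o in internal_triples if p in mapping]

internal = [
    ("Fabio Ciravegna", "staff_member", "University of Sheffield"),
    ("Fabio Ciravegna", "internal_id", "fc-001"),  # no mapping exists
]
triples = translate(internal, PREDICATE_MAP)
print(triples)  # only the mappable triple survives, renamed
```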
     Basic Methodologies
• Search Engines
• Information Extraction from Documents
  – Generic IE engines (e.g. NERs)
  – Domain specific IE engines
  – Wrappers
• Information Integration
  – Generic methodologies for II
  – Domain-specific methodologies
• Document Enrichment
  – From a KB to documents
  – Improved Hyper-navigation
  – Associate services




            Tools: Clew toolbox
• Armadillo: Automatic annotation of information
     – Dispersed in large repositories
     – Given a new ontology, annotate documents
         • Who works at the DCS in Sheffield?
• Melita: Active user-centred document annotation
     – Goal: producing annotations for a specific (set of) document(s)
     – Active learning to reduce the annotation burden
• Amilcare: User-driven Information Extraction
     – IE system made portable to new applications by providing examples
     – Works on mixed document types (Web docs)
     – Backbone of Armadillo and Melita

•   Active Document Enrichment
     – From document annotation to knowledge associated to documents


     At Work: The CS Site Task
• Task
  – Harvesting information about CS Researchers
• Goal:
  – Enriching documents about CS research
• Method:
  – Finding names of people working for the
    specific CS site
  – Finding their home pages & extracting data
  – Finding personal biblio pages & extracting
    references
      CS Services (1)
• Domain Ontology:
  – AKT reference ontology (part of)
• External Services
  – Google (via API)
  – Citeseer/NLDB Unitrier
  – HomepageSearch
  – Lists of CS departments
    • Including websites
  – Named Entity Recognizer

      CS Services (2)
• Clew:
  – Armadillo
    • Workflow implementation
    • Information Integration strategies
          – Unique identification of individuals
          – Rule based strategies based on Web bias
  – Amilcare Adaptive IE system
    • Domain/document specific IE




     Redundancy
• Information repeated in different surface
  formats
  – Databases/ontologies
  – Structured pages
     • e.g. produced by databases
  – Largely structured pages
     • Bibliography pages
  – Unstructured pages
     • Free texts

     Harvesting Strategy
• Unsupervised harvesting of information
• Redundancy used to bootstrap unsupervised learning
  – Starting point:
     • A user-defined lexicon, or sources that are easy to model/mine
         – Using APIs or wrappers
     • Collects seed examples
     • Annotates them in the corpus
     • Multiple strategies to combine evidence
  – Cycle:
     • Seed examples used to bootstrap learning
         – For progressively more complex modules
             » From induced wrappers to full-fledged IE systems
     • Produces more examples
     • Multiple strategies to combine evidence

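The seed-and-bootstrap cycle above can be sketched as follows. The "learner" is a trivial stand-in (it induces one-word left-context patterns), not one of the deck's actual IE engines, and the corpus and names are invented:

```python
# Sketch of the bootstrap cycle: seed examples annotate the corpus,
# a learner is trained on those annotations, and its new matches
# re-seed the next round.

def annotate(corpus, known):
    """Return (sentence, example) pairs where a known example occurs."""
    return [(s, k) for s in corpus for k in known if k in s]

def learn_contexts(annotations):
    """Induce naive patterns: the word immediately left of an example."""
    patterns = set()
    for sentence, example in annotations:
        left = sentence.split(example)[0].strip()
        if left:
            patterns.add(left.split()[-1])
    return patterns

def apply_patterns(corpus, patterns, known):
    """Extract new candidates: the two words following a pattern."""
    new = set()
    for s in corpus:
        words = s.replace(",", "").split()
        for i, w in enumerate(words):
            if w in patterns and i + 2 < len(words):
                cand = " ".join(words[i + 1:i + 3])
                if cand not in known:
                    new.add(cand)
    return new

corpus = [
    "Dr. Alice Smith works on ontologies",
    "Dr. Bob Jones works on extraction",
]
seeds = {"Alice Smith"}
for _ in range(2):  # each round may contribute new examples
    patterns = learn_contexts(annotate(corpus, seeds))
    seeds |= apply_patterns(corpus, patterns, seeds)
print(sorted(seeds))  # -> ['Alice Smith', 'Bob Jones']
```

In the real system, each round would hand the grown example set to a progressively more powerful module, from induced wrappers up to a full IE engine.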
   Finding People’s Names
• Names of people working at the site
  – Not only a Named Entity Recognition task!
• Strategy:
  – Identifying a handful of seed names
     • Multiple evidence to identify correct names
  – Bootstrap and reseeding learning cycle
     • Initial examples to seed learning from HTML lists and
       produce more examples
     • More examples needed to learn from free texts
         – “Enrico Motta is promoted to Professorship at KMI” means
           “EM works at KMI”


     Multiple Evidence for Seed Names
• Finding initial list of potential names
   – Using a NE Recognizer
   – Querying Citeseer for likely bigrams
• Deciding if they work in the dept
   – 3 potential situations:
       • Correct: people’s names and working in the dept
       • False: people’s names but not working in the dept
       • Wrong: not people’s names
   – Filter:
       •   Generic patterns confirm it is a NE
       •   A homepage exists on the site
       •   The hyperlink surrounding the name points to an internal page
       •   The name is found in HTML lists containing known names


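Combining these filters can be sketched as a simple evidence vote. The individual checks, the toy data, and the acceptance threshold are illustrative assumptions, not the deck's actual filter:

```python
# Sketch: accept a candidate name only when several independent
# evidence sources agree. All data and thresholds are invented.

SITE_HOMEPAGES = {"Alice Smith": "dcs.example.ac.uk"}   # hypothetical
KNOWN_NAME_LISTS = [{"Alice Smith", "Bob Jones"}]        # hypothetical

def ner_confirms(name):
    # Stand-in for "generic patterns confirm it is a NE":
    # here, simply two capitalised tokens.
    parts = name.split()
    return len(parts) == 2 and all(p[0].isupper() for p in parts)

def has_homepage(name, site):
    return SITE_HOMEPAGES.get(name) == site

def in_known_list(name):
    return any(name in lst for lst in KNOWN_NAME_LISTS)

def accept(name, site, threshold=2):
    """Keep a candidate only if enough independent evidence agrees."""
    votes = sum([ner_confirms(name),
                 has_homepage(name, site),
                 in_known_list(name)])
    return votes >= threshold

print(accept("Alice Smith", "dcs.example.ac.uk"))     # True (3 votes)
print(accept("Position Paper", "dcs.example.ac.uk"))  # False (1 vote)
```

Note that "Position Paper" passes the naive NE check alone; it is the combination of evidence that rejects it, which is the point of the filter.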
Learning other names
[Figure: workflow using the HomePageSearch service]
• Mines the site looking for people's names
• Uses:
   – Generic patterns (NER)
   – Citeseer for likely bigrams
• Looks for the HTML-structured lists of names
• Annotates known names
• Trains on annotations to discover the structure of the page
• Recovers all names and hyperlinks

Basket: people name lists (and hyperlinks to homepages)
           Experimental Results - People
• Sheffield
   – 116 correct, 8 wrong
   – Precision: 116 / 124 = 93.5%
   – Recall: 116 / 139 = 83.5%
   – F-measure: 88.2%
• Edinburgh
   – 153 correct, 10 wrong
   – Precision: 153 / 163 = 93.9%
   – Recall: 153 / 216 = 70.8%
   – F-measure: 80.7%
• Aberdeen
   – 63 correct, 2 wrong
   – Precision: 63 / 65 = 96.9%
   – Recall: 63 / 70 = 90.0%
   – F-measure: 93.3%

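The figures above follow the standard precision/recall/F-measure definitions; checking the Sheffield row:

```python
# Precision is correct/retrieved, recall is correct/total, and the
# F-measure is the harmonic mean of the two.
# Sheffield: 116 correct out of 116 + 8 = 124 retrieved, 139 in total.

def scores(correct, retrieved, total):
    p = correct / retrieved             # precision
    r = correct / total                 # recall
    f1 = 2 * p * r / (p + r)            # F-measure (harmonic mean)
    return p, r, f1

p, r, f1 = scores(116, 124, 139)
print(f"P={p:.1%} R={r:.1%} F={f1:.1%}")  # P=93.5% R=83.5% F=88.2%
```

The Edinburgh and Aberdeen rows reproduce the same way from their correct/wrong/total counts.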
Names: Experimental Results
[Figure: pipeline from Information Integration to Information Extraction]
• Seeds:
   – A. Schriffin
   – Eugenio Moggi
   – Peter Gray
• IE errors:
   – Speech and Hearing
   – European Network
   – Department Of
   – Position Paper
   – The Network
   – To System
      Paper Citations
• Available digital libraries are incomplete
• Use the available people names to:
  – Query digital libraries and get a list of papers
  – Use paper list returned to find personal
    bibliographic pages
     • Querying Google using titles
  – Use known examples to start a seed & learn
    cycle from those pages

        Finding Citations
• Authors:
   – Names in a specific context
   – Not to be mistaken for the editors of collections
• Titles:
   – Any sequence of words
   – Not to be mistaken for the titles of collections
• HTML structure does not help
 <li>Fabio Ciravegna and Yorick Wilks: “Designing Adaptive Information Extraction for
 the Semantic Web in Amilcare”, in S. Handschuh and S. Staab (eds), <b>Annotation
 for the Semantic Web</b>" in the Series "Frontiers in Artificial Intelligence and
 Applications" by IOS Press, Amsterdam, 2003.</li>


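The ambiguity can be made concrete by running a naive capitalised-bigram "name" pattern over the citation above (the regex is an illustrative stand-in for a generic NER, not the system's actual recognizer):

```python
import re

# The citation from the slide, stripped of HTML markup. A naive
# pattern matches editors and title fragments just as readily as the
# true authors, so author/editor roles must be learned from context.
citation = ('Fabio Ciravegna and Yorick Wilks: "Designing Adaptive '
            'Information Extraction for the Semantic Web in Amilcare", '
            'in S. Handschuh and S. Staab (eds), Annotation for the '
            'Semantic Web, IOS Press, Amsterdam, 2003.')

# An initial ("S.") or capitalised word, followed by a capitalised word.
name_pattern = re.compile(r"\b(?:[A-Z]\.|[A-Z][a-z]+) [A-Z][a-z]+\b")
matches = name_pattern.findall(citation)
print(matches)  # authors, editors, and title fragments, all mixed up
```

The authors (Fabio Ciravegna, Yorick Wilks), the editors (S. Handschuh, S. Staab), and pieces of the two titles all match the same surface pattern, which is why neither HTML structure nor local capitalisation suffices.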
Discovering Paper Citations
[Figure: workflow combining HomePageSearch and the people's names found earlier]
• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

Basket: name lists and hyperlinks, personal data, projects, and co-authoring information
Experimental Results
Data: 8 researchers from dcs.shef.ac.uk
[Figure: results of the Information Integration and Information Extraction stages]
     Computer Science Domain
• Papers: seven random test cases from different
  URLs in the academic domain




Annotating Documents




      ArTmadillo
• Mines the web to retrieve information on
  painters and their works




          Artists domain Evaluation
Artist       Method   Precision   Recall   F-measure
Caravaggio   II       100.0%      61.0%    75.8%
             IE       100.0%      98.8%    99.4%
Cezanne      II       100.0%      27.1%    42.7%
             IE        91.0%      42.6%    58.0%
Manet        II       100.0%      29.7%    45.8%
             IE       100.0%      40.6%    57.8%
Monet        II       100.0%      14.6%    25.5%
             IE        86.3%      48.5%    62.1%
Raphael      II       100.0%      59.9%    74.9%
             IE        96.5%      86.4%    91.2%
Renoir       II        94.7%      40.0%    56.2%
             IE        96.4%      60.0%    74.0%

Product Comparison




Conclusions
• Large scale automatic annotation tools
   – Allow large scale exploitation of Semantic Web
     Technologies
• Annotation means:
   – Extraction, Integration and enrichment
• Multidisciplinary activity
   – SE, IE, ML, II, WS, HCI
• Potential Applications
   – Web
   – Knowledge Management


    Challenges
• Further research needed:
  – Formalization/Modelling of resources
    • Ontology mapping
  – Workflow composition
  – Information extraction models
  – Combination of evidence
    • Of sources
    • Of extractors



      Challenges (ctd)
• HCI
   – Application definition
      • Result presentation for validation of methodologies
          – The system returns 1,000s of triples: how to evaluate them?
   – Information presentation (document annotation)
      • Intrusiveness:
          – How to avoid annoying users with too many annotations
      • Trust
          – Who do users trust?
             » Tracing preferred sources
          – Where does the information come from?
• Scalability
   – Large scale indexing systems
      • Millions of pages (not billions!)

Thank You
• Contact Information
  – F.Ciravegna@dcs.shef.ac.uk
  – www.dcs.shef.ac.uk/~fabio
• Web Intelligence
  – http://nlp.shef.ac.uk/wig/
• NLP Sheffield
  – http://nlp.shef.ac.uk/
• AKT Project
  – www.aktors.org
• Dot.Kom Project
  – www.dot-kom.org

								