051005_Popov_Slides.ppt

							Ontotext @ JRC



     5-6 Oct 2005
                           Semantic Web

•  The Semantic Web is the abstract representation of data on the WWW,
   based on RDF and other standards
• The Semantic Web (SW) is being developed by the W3C, in collaboration with a large number of
   researchers and industrial partners
http://www.w3.org/2001/sw/
http://www.SemanticWeb.org




                             Semantic Web (II)

•   "The Semantic Web is an extension of the current web in which information is
    given well-defined meaning, better enabling computers and people to work in
    cooperation." [Berners-Lee et al. 2001]

The spirit:
• Automatically processable metadata regarding:
     – the structure (syntax) and
     – the meaning (semantics)
   of the content;
• Presented in a standard form;
• Dynamic interpretation for unforeseen purposes.




                     Semantic Web: Languages

•   RDF(S) – the next slides
•   SHOE, XOL, etc – the pioneers
•   Topic Maps – a metadata language with limited impact
•   OIL – Ontology Interchange Language, the basis of the next two
    http://www.ontoknowledge.org/oil/
     – Description Logics-based multilayered language
•   DAML+OIL – the predecessor of OWL, no longer being developed
•   OWL – the W3C standard for Semantic Web ontology language,
    http://www.w3.org/2001/sw/WebOnt/
     – Extends RDF(S), but also constrains it
     – Has multiple layers (Lite, DL, Full)
     – Transitive/symmetric/etc properties, disjointness, cardinality restrictions




                     Semantic Web: Problems

•   Critical mass of metadata is necessary

•   Still a lack of consensus on many issues (like query languages)

•   Lack of practices at the proper scale and complexity

•   Lack of robust semantic (nowadays, RDFS) repositories, which:
     – Should be as flexible, multi-purpose and easy to use as HTTP servers
       and
     – As efficient in structured knowledge management as RDBMS




                   What are Sirma & Ontotext?

•   Established in 1992 as a Bulgarian AI Lab.
•   Current structure:
     – Sirma Group International Corp, Montreal, Canada;
     – 8 subsidiary companies; the most important ones follow below.
•   Sirma AI, Sofia
     – The R&D backbone of the group with two divisions:
     – Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy;
     – Ontotext Lab: Knowledge and Language Engineering.
•   EngView Systems, Montreal
     – CAD/CAM systems and applications.
•   WorkLogic.Com, Ottawa
     – Web-based collaboration, workflow, e-Gov.




     Software Development and Research since 1992

• Track record of success – large companies and government
  organizations in the US, Canada, Western Europe and Bulgaria;
• Top-3 Software Company in Bulgaria;
• About 70 developers;

• ISO 2001 Certificate;
• 1999 EIST prize winner;




                 Sirma Businesses and Domains

Diverse business, ranging from COTS products to custom projects,
   consultancy, and outsourcing services.
Major areas:
• AI – expert systems (besides Ontotext);
• B2B marketplaces;
• CAD/CAM (for packaging, quality control)
• e-Government, CSCW, Groupware, Workflow;
• Banking
• C3/C4 Systems (military, airport traffic);
• VOIP billing systems;
• e-Publishing, Proofing tools.




                               Ontotext Lab

                            An R&D lab of Sirma for
                    Knowledge and Language Engineering


Research and core technology development for
knowledge discovery, management, and engineering.


Specialized for applications in Semantic Web, Knowledge Management, and
   Web Services.


Aside from the scientific matters, most of us are just professional software
   developers.




       Leading Semantic Web Technology Provider

Ontotext is a leading Semantic Web technology provider, being:
• the developer of the KIM Semantic Annotation Platform;
• a co-developer of the GATE language engineering platform;
• a co-developer of the Sesame semantic repository and the OWLIM
   high-performance OWL reasoner;
• the developer of the WSMO4J semantic web services API;
• a partner in the SWAN Semantic Web Annotator project.
Ontotext takes part in most of the major European research projects in the field
   and is the most successful Bulgarian participant in FP6.




                                  Mission

•   A critical mass of research in a number of AI areas has made efficient KM
    almost possible.
•   The technology on the market is mostly of two sorts:
    – Expensive black boxes
    – Academic prototypes
Our mission is:
•   To develop and popularize open, skillfully engineered tools...
•   For Information Extraction and Knowledge Management,
•   Which considerably reduce the cost for implementation and use of KM
    applications.




                         Major Research Areas

We focus on building cutting-edge expertise and technology in the following
   areas:


•   ontology design, management, and alignment;
•   knowledge representation, reasoning;
•   information extraction (IE), applications in IR;
•   semantic web services;
•   upper-level ontologies and lexical semantics;
•   NLP: POS, gazetteers, co-reference resolution, named entity recognition
    (NER)
•   machine learning (HMM, NN, etc.)




               Academic & Technology Partners

•   NLP Group, Sheffield University, UK;

•   Digital Enterprise Research Institute (DERI),
    Institut für Informatik, Innsbruck, Austria, and
    National University of Ireland, Galway;

•   Aduna (Aidministrator) b.v., The Netherlands;

•   Linguistic Modelling Lab.
    CLPOI, Bulgarian Academy of Sciences;

•   British Telecommunications Plc, (BT), UK.

•   Forschungszentrum Informatik (FZI) and Institut AIFB
    Karlsruhe, Germany.



                               Customers



•   SemanticEdge GmbH, Berlin, Germany;

•   QinetiQ Ltd, UK;

•   Fairway Consultants, UK;




                        Research Projects

We were/are part of a number of FP5 research projects:

•   On-To-Knowledge - the project which invented OIL.
    Ontology Middleware Module and a DAML+OIL reasoner.

•   VISION - Towards Next Generation Knowledge Management.

•   OntoWeb - Ontology-based information exchange for knowledge
    management ….

•   SWWS - Semantic Web enabled Web Services.




                          Research Projects (II)

FP6 integrated projects that started Jan 2004, durations ~3 years:
•   SEKT: Semantic Knowledge Technologies. Targeting a synergy of
    Ontology and Metadata Technology, Knowledge Discovery and Human
    Language Technology.
•   DIP: Data, Information, and Process Integration with Semantic Web
    Services.
•   PrestoSpace: Preservation towards storage and access. Standardized
    Practices for Audiovisual Contents in Europe.
•   Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development
    Platforms for Web-Service Enabled Applications Using Semantic Web
    Technologies, Distributed Decision Support Units and Multi-Agent-Systems




                     Introduction to Ontologies

Despite the formal definitions, ontologies are:
•   Conceptual models or schemata
    – Represented in a formalism which allows:
         • unambiguous "semantic" interpretation;
         • inference.
•   Can be considered a combination of:
    – DB schema
    – XML Schema
    – OO-diagram (e.g. UML)
    – Subject hierarchy/taxonomy (think of Yahoo)
    – Business logic rules




                     Introduction to Ontologies (II)

•   Imagine a DB storing
            "John is a son of Mary".
•   It will be able to "answer" just:
     – Who are the sons of Mary? Whose son is John?
•   An ontology with a definition of the family relationships
    could infer:
     – John is a child of Mary (more general);
     – Mary is a woman;
     – Mary is the mother of John (inverse);
     – Mary is a relative of John (generalized inverse).
•   The above facts would remain "invisible" to a typical DB, whose
    model of the world is limited to data structures of strings and
    numbers.
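
The inferences listed on this slide can be reproduced with a handful of hand-written rules. The following is only a minimal sketch: the property names (sonOf, childOf, parentOf, motherOf, relativeOf) and the rules themselves are assumptions made for illustration, not part of any ontology named in these slides. The slide does not say how "Mary is a woman" is derived, so the sketch simply asserts it as an extra fact.

```python
# Toy rule-based inference over (subject, predicate, object) triples.
# Property names and rules are illustrative assumptions, not a real ontology.

def infer(facts):
    """Apply the rules repeatedly until no new triple is produced."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            if p == "sonOf":
                new.add((s, "childOf", o))        # more general property
            if p == "childOf":
                new.add((o, "parentOf", s))       # inverse property
                new.add((o, "relativeOf", s))     # generalized inverse
                new.add((s, "relativeOf", o))
            if p == "parentOf" and (s, "type", "Woman") in closure:
                new.add((s, "motherOf", o))       # a female parent is a mother
        if new <= closure:                        # fixpoint reached
            return closure
        closure |= new


# "Mary is a Woman" is asserted here as an assumption (see the text above).
facts = {("John", "sonOf", "Mary"), ("Mary", "type", "Woman")}
closure = infer(facts)

for triple in sorted(closure - facts):
    print(triple)
```

A plain DB holding only the sonOf row would answer none of the childOf, motherOf, or relativeOf questions; here they become simple membership tests on the closure.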




                                  Products

•   The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR
    and KM applications based on Semantic Web standards
•   An extension of the Sesame
    RDF(S) repository that adds
    a Knowledge Control System.
•   OMM integration options:
    Built-In, RMI, SOAP, HTTP.

[Figure: Knowledge Control System diagram. Legible labels: Meta-Information, Tracking Changes, Change Investigation, Access Control, Current User Info, Store.]

                                    Products

•   BOR – a DAML+OIL reasoner.
•   Proprietary GATE components:
      – Hash Gazetteer. A high-performance lookup tool.
      – Hidden Markov Model Learner. A stochastic module for
        filtering annotations, disambiguation, etc., based on
        confidence measures.
•   The News Collector is a web service, collecting and indexing articles from
    the top-10 global news wires:
      –   About 1000 articles/day, annotated and indexed using KIM;
      –   Used to validate the heuristics and resources of KIM;




                               Products (II)

•   The KIM Platform (the next slides), http://www.ontotext.kim.
•   SWWS Studio (http://swws.ontotext.com)
      – Semantic Web Service description development environment
      – Developed in the course of the SWWS project
      – Based on WSMO (http://www.wsmo.org)
•   WSMO4J (http://wsmo4j.sourceforge.net)
      – A WSMO API and a reference implementation
      – for building Semantic Web Services applications
      – Used in WSMO Studio, (http://www.wsmostudio.org/)
      – The basis for ORDI, used in OMWG (http://www.omwg.org)
      – Used in projects DIP, SEKT, Infrawebs



                                 OWLIM

• OWLIM is a high-performance OWL repository
• Storage and Inference Layer (SAIL) for the Sesame RDF database
• OWLIM performs OWL DLP reasoning
• It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining
  and "total materialization"
• In-memory reasoning and query evaluation
• OWLIM provides reliable persistence, based on RDF N-Triples
• OWLIM can manage millions of statements on desktop hardware
• Extremely fast upload and query evaluation even for huge ontologies and
  knowledge bases
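
To make "forward-chaining" and "total materialization" concrete, here is a small sketch of the general idea: all entailed statements are computed at load time, so that a query is a plain lookup. It is not IRRE or OWLIM code; it applies only two well-known RDFS rules (subclass transitivity and type inheritance), and the class name Entity is a hypothetical example.

```python
def materialize(explicit):
    """Forward-chain two RDFS rules to a fixpoint; return explicit + inferred."""
    store = set(explicit)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in store if p == "rdfs:subClassOf"]
        inferred = set()
        for a, b in sub:
            # Rule 1: A subClassOf B, B subClassOf D  =>  A subClassOf D
            for c, d in sub:
                if b == c:
                    inferred.add((a, "rdfs:subClassOf", d))
            # Rule 2: S type A, A subClassOf B  =>  S type B
            for s, p, o in store:
                if p == "rdf:type" and o == a:
                    inferred.add((s, "rdf:type", b))
        if not inferred <= store:
            store |= inferred
            changed = True
    return store


explicit = {
    ("City", "rdfs:subClassOf", "Location"),
    ("Location", "rdfs:subClassOf", "Entity"),   # "Entity" is a made-up class
    ("Glasgow", "rdf:type", "City"),
}
store = materialize(explicit)

# After materialization, answering "is Glasgow a Location?" needs no reasoning.
print(("Glasgow", "rdf:type", "Location") in store)
```

The trade-off of this strategy is that upload does more work and the store grows beyond the explicit statements, while query evaluation avoids any inference at query time.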




                     Scalability: Upload and Reasoning

[Figure: upload speed (statements/sec, axis 0 to 20000) against size of repository (0.5 to 10 million explicit statements). Series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB, 1/log.]


                     Scalability: Query Answering

[Figure: evaluation time of Q2 (msec, axis 0 to 400) against size of repository (0.5 to 10 million explicit statements). Series: 2Xeon1GB, 2Opt3GB, 2Opt5GB, PM512MB.]

• Q2: Pattern of 12 statement-joins and a LIKE literal constraint

                  OWLIM under the LUBM Benchmark
• The Lehigh University Benchmark (LUBM) evaluation is one of the most comprehensive benchmark
  experiments published recently (ISWC 2004, WSJ 2005)
• Synthetically generated OWL knowledge bases
• The biggest set generated is LUBM(50,0) – 6M explicit statements
• 14 queries, checking different inferences
• OWLIM on LUBM:
     – On a desktop machine OWLIM loads LUBM(50,0) in 10 min
     – The only other system known to load it takes 12 hours to do so
     – All the queries are answered correctly
• Based on this we can claim that:
     – OWLIM is the fastest OWL repository in the world!
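
As a rough check of what these figures imply, the load times can be converted into throughput. This treats the quoted numbers (6M explicit statements, 10 minutes, 12 hours) as exact, which they are probably not, so the results are only indicative:

```latex
\begin{align*}
\text{OWLIM load rate} &\approx \frac{6 \times 10^{6}\ \text{statements}}{10 \times 60\ \text{s}}
  = 10^{4}\ \text{statements/s},\\
\text{other system} &\approx \frac{6 \times 10^{6}\ \text{statements}}{12 \times 3600\ \text{s}}
  \approx 139\ \text{statements/s},\\
\text{speed-up} &\approx \frac{12 \times 60\ \text{min}}{10\ \text{min}} = 72 .
\end{align*}
```

The resulting rate of about 10,000 statements/sec falls within the 0 to 20000 range of the upload-speed chart on the earlier scalability slide.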




                                     JOCI

•   “Jobs & Contacts Intelligence”, Innovantage, Fairway Consultants
•   Gathering recruitment-related information from web-sites of UK
    organizations
•   Offering services on top of this data to recruitment agencies, job portals,
    and others.
•   JOCI uses KIM for information extraction (IE, text-mining)
•   JOCI makes use of a domain ontology to:
      – support the IE process,
      – structure the knowledge base with the obtained results, and
      – facilitate semantic queries.
•   Sirma is a shareholder in Fairway Consultants




                       JOCI Dataflow

[Figure: dataflow diagram. Labels: UK Web Space; Focused Crawler (Crawler, Classifier); Information Extraction (Single-Document IE, Object Consolidation); KIM Server (Semantic Repository, Document Store); Web UI.]

                  JOCI: Vacancy Consolidation/Matching

[Figure: graph linking Vacancy 1 and Vacancy 2 to a Consolidated Vacancy. Nodes: Vacancy 1, Vacancy 2, Consolidated Vacancy, "IT Applications Support Analyst", "Support Analyst", U.K., Scotland, Glasgow, Country, City, Location. Edge labels: locatedIn, hasJobTitle, sub-string, subRegionOf, type, subClassOf.]
                             JOCI Statistics

• The figures below are indicative and reflect an old state of the JOCI system:
     – The actual figures are to be announced after the launch of JOCI

• Web-sites inspected: 0.5M
• Web-sites with vacancy announcements: 30K
• Extracted vacancies: 100K




                           The KIM Platform

• A platform offering
        services and infrastructure for:
      – (semi-) automatic semantic annotation and
      – ontology population
      – semantic indexing and retrieval of content
      – query and navigation over the formal knowledge


• Based on Information Extraction technology




                           KIM: What's Inside?

The KIM Platform includes:


•   Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB


•   KIM Server – with a set of APIs for remote access and integration


•   Front-ends: Web-UI and plug-in for Internet Explorer.




                          The AIM of KIM

• Aim: to arm Semantic Web applications
     -   by providing a metadata generation technology
     -   in a standard, consistent, and scalable framework




  What Does KIM Do?
Semantic Annotation




Simple Usage: Highlight, Hyperlink, and…




Simple Usage: … Explore and Navigate




Simple Usage: … Enjoy a Hyperbolic Tree View




                         KIM is Based On…

KIM is based on the following open-source platforms:

• GATE – the most popular NLP and IE platform in the world, developed at the
  University of Sheffield. Ontotext is its biggest co-developer.
  www.gate.ac.uk and www.ontotext.com/gate

• OWLIM – OWL repository, compliant with
  Sesame RDF database from Aduna B.V.
  www.ontotext.com/owlim

• Lucene – an open-source IR engine by Apache. jakarta.apache.org/lucene/




                    How KIM Searches Better

KIM can match a query like:
Documents about a telecom company in Europe, John Smith, and a date in the
    first half of 2002.
With a document containing:
       "At its meeting on the 10th of May, the board of Vodafone appointed John G.
           Smith as CTO"
Classical IR could not match:
       – Vodafone with a "telecom company in Europe", since that requires knowing that:
              • Vodafone is a mobile operator, which is a sort of telecom;
              • Vodafone is in the UK, which is a part of Europe.
       – "10th of May" with a "date in the first half of 2002";
       – "John G. Smith" with "John Smith".
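
The three matches on this slide can be sketched as checks against a small knowledge base. Everything below is hypothetical: the dictionaries, class names, and the alias test are stand-ins invented for illustration and do not reflect KIM's actual API or ontology. The year 2002 for the "10th of May" is also an assumption (for example, taken from the document's publication date).

```python
from datetime import date

# Hypothetical mini knowledge base; all names are illustrative assumptions.
SUBCLASS = {"MobileOperator": "Telecom", "Telecom": "Company"}
PART_OF = {"UK": "Europe"}
ENTITY = {"Vodafone": {"cls": "MobileOperator", "location": "UK"}}


def is_a(cls, target):
    """Walk up the class hierarchy from cls looking for target."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS.get(cls)
    return False


def located_in(place, region):
    """Walk up the part-of chain from place looking for region."""
    while place is not None:
        if place == region:
            return True
        place = PART_OF.get(place)
    return False


def same_person(a, b):
    """Crude alias test: same first and last name, middle initials ignored."""
    ta, tb = a.split(), b.split()
    return (ta[0], ta[-1]) == (tb[0], tb[-1])


def matches(annotations):
    """Check the slide's query against the entities annotated in one document."""
    org_ok = any(
        is_a(ENTITY[e]["cls"], "Telecom")
        and located_in(ENTITY[e]["location"], "Europe")
        for e in annotations["organizations"] if e in ENTITY
    )
    person_ok = any(same_person(p, "John Smith") for p in annotations["people"])
    date_ok = any(date(2002, 1, 1) <= d <= date(2002, 6, 30)
                  for d in annotations["dates"])
    return org_ok and person_ok and date_ok


doc = {
    "organizations": ["Vodafone"],
    "people": ["John G. Smith"],
    "dates": [date(2002, 5, 10)],   # year assumed, see the text above
}
print(matches(doc))
```

A keyword index sees none of the query words ("telecom", "Europe", "2002") in the sentence; the match only succeeds because the annotations are compared at the level of entities and their classes.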




Entity Pattern Search




Pattern Search: Entity Results




Entity Pattern Search: KIM Explorer




                   Semantic Metadata in KIM…

•   Provides a specific metadata schema,
      – focusing on named entities (particulars),
      – as well as number and time-expressions, addresses, etc.,
      – everything “specific”, apart from the general concepts.
•   Defines specific tasks for generation and usage of the metadata which are
    well-understood and measurable.
•   Why not metadata about general things (universals)?
      – It is too complex…
      – but we leave the door open.
•   The particulars seem to provide a good 80/20 compromise.




                         World Knowledge in KIM

Rationale:
•   Provide common knowledge about world entities;
•   KIM bets on scale and avoids heavy semantics:
    minimum modeling of common sense, almost no axioms;
•   The ontology is encoded in OWL Lite and RDF.
•   In addition, a number of rules (generative axioms) are defined, e.g.:
    <X,locatedIn,Y> and <Y,subRegionOf,Z> =>
           <X,locatedIn,Z>


•   Axioms of this sort are supported by OWLIM and they provide a consistent
    mechanism for “custom” extensions to the OWL or RDF(S) semantics with respect
    to a particular ontology
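
The locatedIn rule above can be applied with a simple join repeated until nothing new is derived. This is a sketch of the rule's effect only, not of how OWLIM evaluates it; the sample triples reuse the place names from the JOCI vacancy figure, and the identifier Vacancy2 is made up for the example.

```python
def apply_located_in_rule(triples):
    """<X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z>, to a fixpoint."""
    store = set(triples)
    while True:
        derived = {
            (x, "locatedIn", z)
            for x, p1, y in store if p1 == "locatedIn"
            for y2, p2, z in store if p2 == "subRegionOf" and y2 == y
        }
        if derived <= store:      # nothing new: all consequences are present
            return store
        store |= derived


triples = {
    ("Vacancy2", "locatedIn", "Glasgow"),       # hypothetical identifier
    ("Glasgow", "subRegionOf", "Scotland"),
    ("Scotland", "subRegionOf", "U.K."),
}
result = apply_located_in_rule(triples)
print(sorted(t for t in result if t[1] == "locatedIn"))
```

Note that the rule needs two passes here: the first derives the Scotland statement, and only then can the U.K. statement be derived from it.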




                                       PROTON

•   Name. PROTON is an acronym for
    Proto Ontology
     – former names: BULO (basic upper-level ontology), GO (generic ontology);
     – not a Russian space rocket;
     – "proto" – used in the sense of "primary", "beginning", "giving rise to", rather than "first in
        time" or "oldest";
     – connotations: positive, fundamental, elemental, "in favour of", even romantic
        (like a science-fiction novel from the 1960s).
•   Intended usage. A basic upper-level ontology like PROTON is used for:
     – ontology population;
     – the knowledge modelling and integration strategy of a KM environment;
     – generation of domain, application, and other ontologies.




                               PROTON Design

•   Design principles:
     1. domain-independence;
     2. light-weight logical definitions;
     3. compliance with popular metadata standards;
     4. good coverage of concrete and/or named entities (e.g. people,
        organizations, numbers);
     5. no specific support for general concepts (such as “apple”, “love”, “walk”),
        however the design allows for such extensions




                                 Some Figures…

•   PROTON defines about
    250 classes and 100 properties
•   Providing coverage of most of the upper-level concepts necessary for semantic
    annotation, indexing, and retrieval
•   A modular architecture, allowing for great flexibility of usage and extension:
     – SYSTEM module - contains a few meta-level primitives (6 classes and 7
       properties); introduces the notion of 'entity', which can have aliases;
     – TOP module - the highest, most general, conceptual level, consisting of about 20
       classes;
     – UPPER module - over 200 general classes of entities, which often appear in
       multiple domains.




                      PROTON Ontology Language

•   The current version of the ontology is encoded in OWL Lite.
•   A few custom entailment rules (axioms) are also defined for use in tools that support
    them, for instance:
          Premise:
                     <xxx, protont:roleHolder, yyy>
                     <xxx, protont:roleIn, zzz>
                     <yyy, rdf:type, protont:Agent>
          Consequent:
                     <yyy, protont:involvedIn, zzz>
•   Axioms of this sort are interpreted by OWLIM
•   PROTON is portable to any OWL(Lite)-compliant tool.
•   PROTON can also be used without such axioms.
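
A rule with several premises, like the roleHolder/roleIn example above, can be evaluated by matching each premise pattern against the store while carrying variable bindings along. The sketch below is a generic pattern matcher written for illustration (variables are strings starting with "?"); it is not OWLIM's rule syntax or engine, and the `ex:` identifiers are hypothetical sample data.

```python
def match(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or fail if it is already bound differently.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def apply_rule(premises, consequent, store):
    """Return the consequents entailed by one pass of the rule over the store."""
    solutions = [{}]
    for pattern in premises:
        solutions = [b2 for b in solutions for t in store
                     if (b2 := match(pattern, t, b)) is not None]
    return {tuple(b.get(term, term) for term in consequent) for b in solutions}


# The premise and consequent patterns follow the slide; the data is made up.
premises = [
    ("?x", "protont:roleHolder", "?y"),
    ("?x", "protont:roleIn", "?z"),
    ("?y", "rdf:type", "protont:Agent"),
]
consequent = ("?y", "protont:involvedIn", "?z")

store = {
    ("ex:ctoRole", "protont:roleHolder", "ex:JohnSmith"),
    ("ex:ctoRole", "protont:roleIn", "ex:Vodafone"),
    ("ex:JohnSmith", "rdf:type", "protont:Agent"),
}
entailed = apply_rule(premises, consequent, store)
print(entailed)
```

Because the rule is just data (a list of patterns), a tool that does not support such axioms can ignore it and still load the OWL Lite part of the ontology.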




                        Other Standards: Relations

•   ADL Feature Type Thesaurus and GNS
     – the backbone of the Location branch;
     – in turn aligned with the geographic feature designators of the GNS database of
        NIMA;
     – PROTON is more coarse-grained, taking about 80 out of 300 types.
•   Dublin Core
     – the basic element set available as properties of protont:InformationResource and
        protont:Document classes;
     – the resource type vocabulary is mapped to sub-classes of InformationResource.
•   OpenCyc and WordNet– consulted and referred to in glosses.
•   ACE (Automatic Content Extraction) annotation types – covered.
•   FOAF – easy mapping is ensured (e.g. the Account class was added).
•   DOLCE, EuroWordnet Top, and others – consulted to various extent.




                   Other Standards: Compliance

•   Other models are not directly imported (for consistency reasons)
•   The mapping of the appropriate primitives is easy, on the basis of
     – a compliant design, and
     – formal notes in the PROTON glosses, which indicate the appropriate
       mappings.
•   For instance, in PROTON, a protont:inLanguage property is defined
     – as an equivalent of the dc:language element in Dublin Core
     – with a domain protont:InformationResource
     – and a range protont:Language




                                KIM World KB

A quasi-exhaustive coverage of the most popular entities in the world …
•   What a person is expected to have heard of beyond the horizons
    of their own country, profession, and hobbies.
•   Entities of general importance … like the ones that appear in the news …
KIM “knows”:
•   Locations: mountains, cities, roads, etc.
•   Organizations, all important sorts of: business, international, political,
    government, sport, academic…
•   Specific people, etc.




                 KIM World KB: Entity Description

•   Named entities (NEs) are represented by their semantic descriptions, consisting of:
     – Aliases (Florida & FL);
     – Relations with other entities (Person hasPosition Position);
     – Attributes (latitude & longitude of geographic entities);
     – Their proper class.




                 The Scale of KIM World KB

RDF Statements              Small KB     Full KB
 - explicit                  444,086   2,248,576
 - after inference         1,014,409   5,200,017
Instances
 - Entity:                    40,804     205,287
   - Location:                12,528      35,590
       - Country:                261         261
       - Province:             4,262       4,262
       - City:                 4,400       4,417
   - Organization:             8,339     146,969
       - Company:              7,848     146,262
   - Person:                   6,022       6,354
 - Alias:                     64,589     429,035
KIM IE Pipeline




                            JAPE Grammars


•   JAPE grammars are based on the last MUSE version
•   Class/instance information included
•   Better class granularity in grammars
•   Relation recognition grammars - LocatedIn and
    HasPositionWithinOrganization




                      Disambiguation & Filtering

•   Simple disambiguation (longest match), e.g. San Francisco Journal
•   Based on the main alias, e.g. “Beijing”
•   By priority of the class, instance or relative class priority
     – E.g. Brand “Microsoft” vs. Company “Microsoft Corp.”
     – We assign a priority (1-1000) to each class and instance
     – For pairs of classes we define relative priority
     – If the difference between the priorities is greater than a certain threshold,
          the possible reference to the entity with the lower priority is ignored
•   Still to be improved
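
The priority-based filtering described above could be sketched roughly as follows. This is a minimal illustration, not the KIM implementation: the `Candidate` type, the function `filter_by_priority`, the priority values, and the threshold are all hypothetical, and the relative priorities defined for pairs of classes are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible entity reference for an ambiguous piece of text."""
    label: str
    entity_class: str
    priority: int  # 1-1000, following the range given in the slide


def filter_by_priority(candidates, threshold):
    """Ignore candidates that lose to another candidate by more than `threshold`.

    A candidate is dropped if some other candidate's priority exceeds its own
    by more than the threshold; comparing against the best priority is enough.
    """
    if not candidates:
        return []
    best = max(c.priority for c in candidates)
    return [c for c in candidates if best - c.priority <= threshold]


# Hypothetical priorities, for illustration only
candidates = [
    Candidate("Microsoft", "Brand", 300),
    Candidate("Microsoft Corp.", "Company", 700),
]
kept = filter_by_priority(candidates, threshold=200)
```

Here the difference in priority (400) exceeds the threshold (200), so only the Company reading is kept. With a larger threshold, both readings would survive and would have to be resolved by other means.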




                         KIM Scaling on Data

• The Semantic Repository is based on OWLIM
• In our practical tests we observe perfect performance on top of:
      – 1.2M entity descriptions;
      – about 15M explicit statements;
      – over 30M statements after forward chaining.
• Document and Annotation storage and indexing with Lucene:
      – One million docs, processed on a $1000-worth machine;
      – retrieval in milliseconds.




                   Entity Ranking: a sketch for Jan-May 2004

No       Instance                                 Label                             Rank
1        Country_T.5                              United States                     0.032
4        Country_T.IZ                             Republic of Iraq                  0.011
6        Person_T.51                              George W. Bush                    0.010
9        Country_T.IS                             State of Israel                   0.006
11       DayOfWeek_T.4                            Tuesday                           0.005
12       NewsAgency_T.6                           The Associated Press              0.005
14       InternationalOrganization_T.13           United Nations                    0.005
27       Country_T.CH                             People's Republic of China        0.004
32       City_T.3068                              New York                          0.004
36       InternationalOrganization_T.18           European Union                    0.004
40       Person_T.115                             Ariel Sharon                      0.003
43       Country_T.JA                             Japan                             0.003
44       Country_T.UK                             United Kingdom                    0.003
45       CountryCapital_T.93                      Baghdad                           0.003


                 SWAN/KIM Cluster Architecture

• At present, KIM is used for massive semantic annotation in the context of
  the SWAN and SEKT projects
Here are some of its features:
• support for a virtually unlimited number of annotators
• centralized ontology storage and querying;
• centralized meta-data (annotations) and document storage, indexing, and
  querying;
• support for multiple crawlers (or other data sources);
• dynamic reconfiguration of the cluster (e.g. starting new crawlers or
  annotators on demand).




SWAN/KIM Cluster Console




                          SWAN Project:
                      Semantic Web Annotator
Large-scale annotation of human language for the Semantic Web using Human
  Language Technology (HLT).
Hosted by DERI (NUIG, Galway); the project also involves:
• the GATE team (from the University of Sheffield's NLP Group) and
• Ontotext Lab.
For more details, see http://deri.ie/projects/swan/


The current status:
• KIM Cluster of 7 servers in DERI
• Above 0.5TB shared storage
• 6 AMD64 Opterons, 6 Xeons, 36GB RAM




                     CoreDB: Name and Goals

•   CoreDB is a component of KIM

•   Stands for: Co-Occurrence and Ranking of Entities DB

•   In a nutshell, it is designed to allow fast queries of the sort:
     – Q1: the number of appearances of “UK” in documents during Jan 2005
     – Q2: all people co-occurring with John Smith and some bank institution
        in documents from the second half of 2003
     – Q3: Q2 + where the documents contain “fraud” and the name of the
        institution contains “capital”




                             CoreDB: Functionality

•   It allows asking in a structured manner for:
     – The number of references to entities in a (sub-)set of documents
     – The entities, which co-occur together with other entities
•   Entities can be constrained by:
     – Class (and its sub-classes)
     – Keyword/token in one of its names/aliases/labels
•   Documents can be constrained according to DC-like features:
     – Date (range; could be any date in the doc)
     – Type (exact match; could be any string)
     – Authors
     – Title and Sub-title
     – Keyword/token in the content, authors or the title fields
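
The two basic query types listed above (reference counts and co-occurrence) can be illustrated over a relational store. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the table layout, the column names, and the functions `count_references` and `co_occurring` are assumptions and do not reflect the actual CoreDB schema or API. Sub-class expansion and keyword constraints are omitted.

```python
import sqlite3

# Hypothetical schema: one row per entity occurrence in a document
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_date TEXT);
    CREATE TABLE entities (id INTEGER PRIMARY KEY, label TEXT, entity_class TEXT);
    CREATE TABLE occurrences (doc_id INTEGER, entity_id INTEGER);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [
    (1, "2005-01-10"), (2, "2005-01-20"), (3, "2005-02-05"),
])
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
    (1, "UK", "Country"), (2, "John Smith", "Person"), (3, "Jane Doe", "Person"),
])
conn.executemany("INSERT INTO occurrences VALUES (?, ?)", [
    (1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3),
])


def count_references(conn, label, start, end):
    """Number of occurrences of an entity in documents dated within [start, end]."""
    row = conn.execute("""
        SELECT COUNT(*)
        FROM occurrences o
        JOIN entities e ON e.id = o.entity_id
        JOIN documents d ON d.id = o.doc_id
        WHERE e.label = ? AND d.doc_date BETWEEN ? AND ?
    """, (label, start, end)).fetchone()
    return row[0]


def co_occurring(conn, label, entity_class):
    """Labels of entities of a given class that share a document with `label`."""
    rows = conn.execute("""
        SELECT DISTINCT e2.label
        FROM occurrences o1
        JOIN entities e1 ON e1.id = o1.entity_id
        JOIN occurrences o2 ON o2.doc_id = o1.doc_id
        JOIN entities e2 ON e2.id = o2.entity_id
        WHERE e1.label = ? AND e2.entity_class = ? AND e2.id <> e1.id
        ORDER BY e2.label
    """, (label, entity_class)).fetchall()
    return [r[0] for r in rows]


uk_in_january = count_references(conn, "UK", "2005-01-01", "2005-01-31")
people_with_smith = co_occurring(conn, "John Smith", "Person")
```

In this toy data set, "UK" occurs three times in January 2005 documents, and "Jane Doe" is the only other person sharing a document with "John Smith".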




                         The Scale of Ambition

•   The major point is to allow such queries in an *efficient* manner over data with
    the following cardinality:
     – 10^6 entities/terms
     – 10^7 documents
     – 10^2 entities occurring in an average document
•   This means managing and querying efficiently 10^9 entity occurrences


•   We have tested the current implementations with 10^7 occurrences, and they
    answer the basic queries in milliseconds.




                          CoreDB Applications

•   Detection of “associative” links between entities, based on co-occurrence
    in documents
     – This is an alternative to detecting strong links based on local context
       parsing
•   Ranking (measuring the popularity) of an entity over a set of documents
     – The ranking is as good/relevant/representative as the set of documents
       is
•   Computing timelines (changes over time) for entity ranking or
    co-occurrence
     – “How did our popularity in the IT press change during June?”
       (i.e. “What is the effect of this 1.5MEuro media campaign?”)
     – “How does the strength of association between organization X and RDF
       change over Q1?”
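
A timeline of the kind described above can be derived from dated entity occurrences. The slides do not define the ranking formula, so the sketch below assumes a simple stand-in: an entity's share of all entity occurrences in a given month. The function `monthly_share` and the sample data are hypothetical.

```python
from collections import Counter


def monthly_share(occurrences, entity):
    """Per-month share of all occurrences that refer to `entity`.

    occurrences: iterable of (entity_label, "YYYY-MM-DD") pairs.
    Returns a dict mapping "YYYY-MM" to a fraction between 0 and 1.
    """
    totals = Counter()  # all entity occurrences per month
    hits = Counter()    # occurrences of the requested entity per month
    for label, date in occurrences:
        month = date[:7]
        totals[month] += 1
        if label == entity:
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Hypothetical occurrence data, for illustration only
sample = [
    ("Ontotext", "2005-05-03"), ("RDF", "2005-05-04"),
    ("RDF", "2005-05-20"), ("W3C", "2005-05-21"),
    ("Ontotext", "2005-06-02"), ("Ontotext", "2005-06-15"),
    ("RDF", "2005-06-16"), ("W3C", "2005-06-30"),
]
timeline = monthly_share(sample, "Ontotext")
```

In this sample the share of "Ontotext" rises from 0.25 in May to 0.5 in June. The same bucketing could be applied to co-occurrence counts for a pair of entities to track the strength of an association over time.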




                               Implementation

•   It is a new component in the architecture of KIM
      – It has an API (part of the KIM API), which allows different implementations
•   There are now a couple of RDBMS-based implementations:
      – Derby (free, open-source, 100% Java, formerly Cloudscape from IBM)
      – ORACLE (v. 10g)
•   The Derby implementation does not allow for efficient searches involving
    keywords
•   The ORACLE implementation is also used for FTS-style indexing of the document
    contents
      – This makes possible an efficient combination of semantic and keyword search (which is
          already available through the SemanticQuery API)
•   In both RDBMS implementations:
      – Part of the ontology and the KB are replicated
      – So is part of the document- and index-related information




                                   Ontotext Facts

•   Founded in 2000
•   14 employees (permanent, without the shared personnel and associates)
•   Daily statistics for http://www.ontotext.com: over 150 visits and 2000 hits
•   Number of scientific publications: above 30
•   Number of projects running: 9
•   More than 20 partners we directly cooperate with on projects
•   Average age: about 28
•   Number of servers per developer: 0.7




        Ontotext Lab



        Robust Technology
   and Professional Services for
Knowledge and Language Engineering


    http://www.ontotext.com





						