Ontological Analysis &
Integration of Terminologies:
Towards An Environmental Reference Ontology

  Geri Steve, Aldo Gangemi, Domenico M. Pisanelli

 Istituto di Tecnologie Biomediche, CNR, Rome, Italy

                                                       Santa Fe 2K
Which part are you talking about?

• If my liver is part of my digestive system, and that
  system is part of me, is my liver part of me?
• If my liver is a part of me and I am part of the CNR,
  is my liver part of the CNR?

My liver is a component of my digestive system, while I am a
  member of the CNR: there is no rule for composing component
  and member relations.
Moreover, I am a body, but I am also a person. A living person
  depends on a body. Nevertheless, a living person can be a
  member of the CNR, but a body cannot.
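
The point about composition can be made concrete with a minimal sketch, assuming each part-of fact is tagged with its kind and that transitivity is applied only within one kind. The function name `closure` and the triple layout are illustrative assumptions, not part of ONIONS.

```python
from collections import defaultdict

def closure(edges):
    """Transitive closure of (kind, part, whole) facts.

    Only facts of the same kind are composed: component-of chains with
    component-of, but never with member-of.
    """
    by_kind = defaultdict(set)
    for kind, part, whole in edges:
        by_kind[kind].add((part, whole))
    result = set()
    for kind, pairs in by_kind.items():
        changed = True
        while changed:
            changed = False
            for a, b in list(pairs):
                for c, d in list(pairs):
                    if b == c and (a, d) not in pairs:
                        pairs.add((a, d))
                        changed = True
        result |= {(kind, a, b) for a, b in pairs}
    return result

edges = [
    ("component", "liver", "digestive-system"),
    ("component", "digestive-system", "me"),
    ("member", "me", "CNR"),
]
facts = closure(edges)
print(("component", "liver", "me") in facts)  # True
print(any(p == "liver" and w == "CNR" for _, p, w in facts))  # False
```

The liver is derived to be a component of me, but nothing relates the liver to the CNR, because the component and member facts are never chained.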

Object or place?

• Is a body region an object that one could cut, or a place?
• Is a gene a DNA fragment, or a DNA region (allele)?
• Is a river an orographic object, or the geographic place
  of a watercourse?

Despite many differences, these three cases seem
  analogous: they share a polysemy partly dependent on
  an abstract difference between objects and regions, and
  a related axiom specifying that objects must be located
  at some region
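
As a minimal sketch of the location axiom, assuming objects and regions are kept in two separate sets and location is a set of (object, region) pairs, one can list the objects that violate it. All names below are hypothetical examples.

```python
def unlocated_objects(objects, regions, located_at):
    """Return the objects that are not located at any known region."""
    return sorted(
        obj for obj in objects
        if not any(o == obj and r in regions for o, r in located_at)
    )

# Hypothetical object/region pairs mirroring the three cases above.
objects = {"forearm-tissue", "dna-fragment", "watercourse"}
regions = {"forearm-region", "locus", "river-bed-area"}
located_at = {
    ("forearm-tissue", "forearm-region"),
    ("dna-fragment", "locus"),
}
print(unlocated_objects(objects, regions, located_at))  # ['watercourse']
```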

River in the GEMET thesaurus

[Figure: thesaurus hierarchy with the terms sea, water (geog), water body, water reservoir, watercourse, hydrologic cycle, lake, surface water, brook, river, spring]

Should we worry about those
Even in presence of polysemous names, a standalone application using
  a local databank or terminological repository may be able to
  accomplish its task without serious flaws.
However, when it is integrated with another application, semantic
  mismatches constitute a serious obstacle for the agent or interface
  that is negotiating or sharing information.
The ever-increasing demand for data sharing has to rely on a solid
  conceptual foundation in order to give a semantics to the terabytes
  available in different databases and eventually traveling over the
Ontologies are currently recognized as the answer to the need for a
  conceptual foundation.

The advantages of ontologies

   to allow a more effective data and knowledge

   to facilitate knowledge re-use in decision support

   to give theoretical foundation to vocabulary
    standardization activity

Our task

We learn domain ontologies (in medicine, environment) by
  integrating the conceptual models that can be extracted
  from terminological sources

The goal is building Domain Reference Ontologies in the form
  of modular libraries of formal theories

In our ONIONS methodology, ontology learning needs both
   incremental bottom-up learning from sources, and
   incremental definition and reuse of general theories that
   can account for the intended meaning of terms

Integration of

[Figure: diagram with the labels "context" and "defining elements"]

Minimal history

The ONIONS methodology for ontology integration has been
 under development since the early 1990s to account for the
 problem of conceptual heterogeneity. It addresses some
 problems encountered in the context of the European
 project GALEN and the Italian projects SOLMC
 (Ontological and Linguistic Tools for Conceptual
 Modeling) and ONTOINT (Ontological Integration of

Some related research projects

   CYC anatomy


   HL7 vocabulary committee

   MED

 What is an ontology?

«A specification of a conceptualization»
                                              (Gruber, 1993)

«The subject of ontology is the study of the categories of
  things that exist or may exist in some domain. The product
  of such a study, called an ontology, is a catalog of the types
  of things that are assumed to exist in a domain of interest D
  from the perspective of a person who uses a language L for
  the purpose of talking about D. [...] »
                                                  (Sowa, 1997)

«A partial and indirect specification of a conceptualization»
                -restricted notion-            (Guarino, 1998)
What is an ontology (restricted notion)?

An ontology is a set of axioms that account for the intended
   meaning (the intended models) of a vocabulary (the namespace
   of a logical language)
A set of axioms usually only approximates such intended models,
   which in turn only approximate the conceptualization of
   vocabulary items
A conceptualization is a set of conceptual relations that range over
   a domain and a set of relevant states of affairs (possible worlds)
   for that domain
Therefore, a precise definition of "ontology" (in a restricted, formal
   sense) might be "a partial specification of the intended
   models of the conceptualization of a vocabulary"

Types of ontologies (broad notion)

   Catalog of normalized terms, e.g. a list of terms used in the reports from
    a laboratory: no taxonomy, no axioms, and no glosses
   Glossed catalog, e.g. a dictionary of medicine: a catalog with glosses.
   Thesaurus, e.g. many parts of the UMLS Metathesaurus, GEMET: a
    hierarchical collection of terms; the hierarchical link is usually
   Taxonomy, e.g. the ICD10: a collection of classes with a partial order
    induced by inclusion (classification)
   Axiomatized taxonomy, e.g. the GALEN Core Model: a taxonomy with
   Ontology library, e.g. the Ontolingua repository: a set of axiomatized
    taxonomies with relations among them. Each element of the library is a
    module, which can be included in another one. Also, a concept from one
    module can simply be used in another one. Ontology modules can be
    considered subdivisions of the namespace of a model

 From Data Integration to
 Conceptual Integration

• Heterogeneous texts
• Heterogeneous semi-structured texts (retrieval of
  web data types and descriptions)
• Heterogeneous databases (schema integration,
  information brokering)

=> In all these cases, heterogeneity concerns the
  conceptualization of the terminology used in the sources

Polysemy and overlapping

Since the primary causes of heterogeneity are
• polysemy (conceptual disalignment, difference of intended
   meaning of one name), and
• conceptual overlapping (different names having overlapping
that arise in the union of the vocabularies of any two sources,
   ontologies are a major component in providing semantic access
   to (and integration of) terminological resources
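
A minimal sketch of how the two causes could be spotted when two vocabularies are put together, assuming each source is a plain mapping from a name to a set standing in for its intended meaning. This toy representation and the function name `compare_vocabularies` are assumptions made for illustration only.

```python
def compare_vocabularies(a, b):
    """Return (polysemous names, overlapping name pairs) for two sources.

    a, b: dict mapping a term name to a set that stands in for the
    intended meaning of that name in the source.
    """
    # Polysemy: one name, different intended meanings in the two sources.
    polysemy = sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
    # Overlapping: different names whose intended meanings intersect.
    overlap = sorted(
        (na, nb) for na in a for nb in b
        if na != nb and a[na] & b[nb]
    )
    return polysemy, overlap

source_a = {"body region": {"cuttable-part"}, "river": {"watercourse"}}
source_b = {"body region": {"place"}, "stream": {"watercourse", "brook"}}
print(compare_vocabularies(source_a, source_b))
# (['body region'], [('river', 'stream')])
```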

Incidentally, polysemy is usually found within the same source as
   well (views, themes, homonyms):

Ontology Learning

•   From Natural Language
•   From Semi-structured Data
•   From Structured Data
•   From Terminologies

=> Integration of sources needs:
    (Principled) Conceptual Abstraction

Conceptual abstraction: an example

The domain ontology A has body region with the intended meaning of
   «loosely specified part of the body that can be cut, filled, etc.»
The domain ontology B has body region with the intended meaning of «region
   of the body at which body parts are located»

There is a metonymy acting on body region in A, whose intended meaning
   concerns body parts located at some region, although they are denoted by
   referring to the region itself (the intended meaning in B)
Hence, the metonymic name should be distinguished from the plain name,
   and correctly related to it

The distinction between objects (body parts) and regions, and the notion of a
   localization relation holding between objects and regions are both
   necessary to make the metonymy clear, and cannot be found in the
   specifications given in A or B. They have to be found in some generic
Ontology integration: conceptual issues

Ontology integration is – generally speaking – the construction of an ontology
   C that formally specifies the union of the vocabularies of two other
   ontologies A and B

To be sure that A and B can be integrated at some level, C has to commit to both
   A's and B's conceptualizations. In other words, the intension of the concepts
   in A and B should be mapped to the intension of C's concepts

Unfortunately, this cannot be achieved using only the conceptual relations specified
   in A and B for local tasks (for a specific context). The methodological principle
   adopted here is that generic ontologies reused from the philosophical,
   linguistic, mathematical, and AI literature must ground the comparison of different
   intensions. Our approach may be called principled conceptual integration

Aspects of integration

Three aspects of an ontology are taken into account:

•   the intended models of the conceptualizations of its vocabulary
•   the domain of interest of such models, i.e. the 'topic' of the
•   the namespace of the ontology

The most interesting case is when A and B are supposed to commit
  to the conceptualization of the same domain of interest or of two
  overlapping domains. In particular, A and B may be:

Some integration cases for the same

   Alternative ontologies: the intended models of the conceptualizations of A and
    B are different (they partially overlap or are completely disjoint) while the domain
    of interest is (mostly) the same. This is a typical case that requires integration:
    different descriptions of the same topic are to be integrated
   Truly overlapping ontologies: both the intended models of the
    conceptualizations of A and B and their domains of interest have a substantial
    overlap. This is another frequent case of required integration: descriptions of
    strongly related topics are to be integrated
   Equivalent ontologies with vocabulary mismatches: the intended models of
    the conceptualizations of A and B are the same, as well as the domain of
    interest, but the namespaces of A and B are overlapping or disjoint. This is the
    case of equivalent theories with alternative vocabularies
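
The three cases can be told apart with a minimal sketch, assuming each ontology is reduced to three plain sets (intended models, domain of interest, namespace). "Mostly the same" and "substantial overlap" are simplified here to set equality and non-empty intersection; the results "identical" and "unrelated" are not cases from the slide and are added only so that the function is total.

```python
def integration_case(a, b):
    """Classify two ontologies, each a dict of sets: 'models', 'domain', 'names'."""
    same_models = a["models"] == b["models"]
    same_domain = a["domain"] == b["domain"]
    if same_models and same_domain:
        # Same theory; only the vocabulary may differ.
        if a["names"] != b["names"]:
            return "equivalent with vocabulary mismatches"
        return "identical"  # assumption: not a case listed above
    if same_domain:
        # Same topic, different descriptions (models overlap or are disjoint).
        return "alternative"
    if a["models"] & b["models"] and a["domain"] & b["domain"]:
        return "truly overlapping"
    return "unrelated"  # assumption: not a case listed above

# Hypothetical toy ontologies.
a = {"models": {"m1", "m2"}, "domain": {"body"}, "names": {"body region"}}
b = {"models": {"m2", "m3"}, "domain": {"body"}, "names": {"body region"}}
c = {"models": {"m1", "m2"}, "domain": {"body"}, "names": {"corporeal zone"}}
d = {"models": {"m2", "m4"}, "domain": {"body", "cells"}, "names": {"cell"}}
print(integration_case(a, b))  # alternative
print(integration_case(a, c))  # equivalent with vocabulary mismatches
print(integration_case(a, d))  # truly overlapping
```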

Ontological integration: operational issues

Depending on the amount of change necessary to the operational integration of A and B,
   different levels of interoperability can be distinguished:

Mediation: it requires no changes to A and B, but only mapping relations that describe the
    equivalence (partial or total) of A's and B's elements to C's elements. This may result in
    weak interoperability, since usually the intended models of A and B only overlap: some
    concepts from A may not have a counterpart in B, and vice versa. This is the design
    choice for some recent information brokering architectures. However, such architectures
    have a weak commitment to a principled way of conceptual integration, possibly because of its
    additional cost
Alignment: it requires some change to fill the biggest gaps of A and B with respect to an ideal C that
    completely integrates A and B. Therefore, alignment requires at least a partial conceptual
    integration. It may support a limited interoperability; for example, deep inferences may be
Unification: it may require a major reorganization of A and B, which are 'harmonized'.
    Unification intervenes on the inferential features of the systems, and consists of a complete
    operational integration: everything that can be done in one system can be done in the other. It
    results in the most complete interoperability but requires a complete conceptual
    integration as well. From the conceptual viewpoint, unification consists of the adoption of C
    as a standard in the systems using A or B

Ontology integration: practical issues
•   Lack of hierarchies
•   Ambiguous hierarchies
•   Informality
•   Lack of modularity
•   Polysemy
•   Uncertain semantics
•   Prototypical descriptions
•   Ontological opaqueness
•   Lack of a (minimal) set of axioms
•   Confusing lexical clues
•   Awkward naming policy
•   'Remainder' partitions
•   'Exception' partitions
•   Terminological cycles
•   Meta-level soup
•   Low maintenance capabilities

Ontologies: some desiderata

•   An explicit taxonomy with subsumption among concepts
•   Semantic explicitness of links
•   Modularity of namespace
•   A stratified design of the modules
•   Absence of polysemy within a module
•   Disjointness of concepts within a module and within the top-level
•   A proper interface between the ontology namespace and one or more sets of lexical
•   Linguistically meaningful naming policy (cognitive transparency)
•   Rich documentation
•   Some minimal axiomatization to detail the difference among sibling concepts
•   Explicit linkage to concepts and relations from generic theories
•   Meta-level assignments to distinguish among the formal primitives assigned to
•   Languages and implementations that support the previous needs as well as the
    possibility of collaborative modeling

The ONIONS Methodology
ONIONS implementation is meant to provide extensive axiomatization, clear
   semantics, and ontological depth to a domain terminology

•   Extensive axiomatization is obtained through a conceptual analysis of the
    terminological sources and their representation in a logical language with
    a rigorous semantics

•   Ontological depth is obtained by reusing a library of generic ontologies,
    on which the axiomatization depends. Such a library may include multiple
    choices among partially incompatible ontologies. In particular, we suggest
    the importance of: mereology, or theory of parts; topology, or theory of
    wholes, connexity and boundaries; morphology, or theory of form and
    congruence; localization, or theory of regions; time theory; actors, or
    theory of participants in a process; dependence theory; and the theory of
    environmental niches

                            The main steps (I)
 0. Semantically opaque hierarchies and lists are pre-processed
  in order to create ‘clean’ taxonomies
 1. All concepts, relations, templates, rules, and axioms
  from a source ontology are represented in the ONIONS
  formalisms, currently Loom, Ontolingua, and OKBC
 2. When available, plain text descriptions are analyzed
  and axiomatized (text formalization)
 3. The union of such products is integrated by means of
  a set of generic ontologies. This is the most characteristic
  activity in ONIONS, which can be briefly described as follows:
   3.1. For any set of sibling concepts in a taxonomy, the conceptual difference
    between them is inferred, and this difference is formalized by axioms
    that reuse the relations and concepts already in the library. If no concept is
    available to represent the difference, new concepts are added to the library
   3.2. For any set of polysemous senses of a term, different concepts are stated
    and placed within the library according to their topic and to the available
    modules. (Polysemy occurs when two concepts with overlapping or disjoint
    intended models have the same name.)
   3.3. Often, polysemous senses of a term - as well as different 'alternative'
    concepts - are metonymically related. For example: process/outcome (as in
    inflammation), region/object (as in body region), etc. Alternatives must be
    properly defined by making explicit the relationship between them: e.g.
    "has-product" for inflammation, "location" for body-region
   3.4. When stating new concepts, the relations necessary to maintain the
    consistency with the existing concepts are instantiated. If conflicts arise with
    existing theories, a more general and more comprehensive theory is sought.
    If this is impracticable, an alternative theory is created
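
Steps 3.2 and 3.3 can be sketched as follows, assuming a toy library that records only the module of each concept and a set of relation triples. The class `Library`, the naming pattern `term-sense`, and the choice of modules for each sense are illustrative assumptions; only the relation names "has-product" and "location" come from the examples above.

```python
from dataclasses import dataclass, field

@dataclass
class Library:
    concepts: dict = field(default_factory=dict)  # concept name -> module
    relations: set = field(default_factory=set)   # (subject, relation, object)

    def split_polysemous(self, term, senses, link):
        """State one concept per sense of a term and relate two of the senses.

        senses: dict mapping a sense label to the module that receives it.
        link:   (sense_from, relation, sense_to) making the metonymy explicit.
        """
        for sense, module in senses.items():
            self.concepts[f"{term}-{sense}"] = module
        source, relation, target = link
        self.relations.add((f"{term}-{source}", relation, f"{term}-{target}"))

lib = Library()
lib.split_polysemous(
    "inflammation",
    {"process": "Biologic-Functions", "outcome": "Abnormalities"},
    ("process", "has-product", "outcome"),
)
lib.split_polysemous(
    "body-region",
    {"object": "Anatomy", "region": "Localization"},
    ("object", "location", "region"),
)
print(sorted(lib.concepts))
```

Each polysemous name ends up as two distinct concepts in (possibly) different modules, and the metonymic link between them is stored as an explicit relation rather than left implicit in the shared name.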

   3.5. Relevant integration cases. Since ONIONS requires the use of generic
    theories to axiomatize alternative theories, the integration of a concept C from
    an ontology O is performed by comparing C with the concepts D1, …, Dn already
    present in the evolving ontology library L, whose ontology set M1, …, Mn contains at
    least a significant subset of generic ontologies and the set of domain ontologies
    at that state in the evolution of L. The following cases appear relevant to the
   3.5.1. C's name is polysemous in O (internal polysemy). Iterate steps 3.2 to 3.4
   3.5.2. C's name is a homonym of the name of a Di. (Homonymy occurs when
    both the intended models and the domains of two concepts with the same name are
    disjoint.) Homonyms must be differentiated by modifying the name, or by
    preventing the homonyms from being included in the same module namespace
   3.5.3. C's name is a synonym of the name of a Di. (Synonymy is the converse of
    homonymy and occurs when two concepts with different names have both the same
    intended model and the same domain.) Synonyms must be preserved, or included
    in the set of lexical realizations related to the concept
   3.5.4. C is subsumed by some Di in L, but it has no total mapping onto any Dj in L.
    The gap in L must be filled by adding C as a subconcept of Di


   3.5.5. C is an intersection between two concepts Di and Dj in L. Solved
    by distinguishing types and roles, or different defining elements
   3.5.6. C has an alternative concept Di in L (same domain, but
    overlapping or disjoint intended models):
      If C metonymically depends on Di, C is properly related to Di
      If C and Di are different viewpoints on the same domain of interest,
         both concepts are kept; where appropriate, they are included in separate modules
      If the intended model of C is finer than Di's, Di is substituted with
      If the intended model of C is coarser than Di's, C is ignored (but
         track of it is kept for mapping between sources)
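
The definitions of polysemy (3.2), homonymy (3.5.2) and synonymy (3.5.3) can be read as a small decision function. This is only a sketch, assuming that the intended model and the domain of a concept are reduced to plain sets; the function name and the toy concepts are hypothetical, and two same-named concepts with identical models are treated as no conflict at all.

```python
def lexical_relation(c, d):
    """Compare two concepts given as dicts with 'name', 'model', 'domain'."""
    same_name = c["name"] == d["name"]
    models_disjoint = not (c["model"] & d["model"])
    domains_disjoint = not (c["domain"] & d["domain"])
    if same_name:
        if models_disjoint and domains_disjoint:
            return "homonymy"   # 3.5.2: same name, nothing else shared
        if c["model"] != d["model"]:
            return "polysemy"   # 3.2: same name, different intended models
        return None             # assumption: same name and model, no conflict
    if c["model"] == d["model"] and c["domain"] == d["domain"]:
        return "synonymy"       # 3.5.3: different names, same model and domain
    return None

plant_a = {"name": "plant", "model": {"oak", "fern"}, "domain": {"botany"}}
plant_b = {"name": "plant", "model": {"refinery"}, "domain": {"industry"}}
region_a = {"name": "body region", "model": {"cuttable-part"}, "domain": {"anatomy"}}
region_b = {"name": "body region", "model": {"place"}, "domain": {"anatomy"}}
stone_a = {"name": "kidney stone", "model": {"calculus"}, "domain": {"urology"}}
stone_b = {"name": "renal calculus", "model": {"calculus"}, "domain": {"urology"}}

print(lexical_relation(plant_a, plant_b))    # homonymy
print(lexical_relation(region_a, region_b))  # polysemy
print(lexical_relation(stone_a, stone_b))    # synonymy
```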


   4. The library of generic, intermediate, and domain ontologies should
    be stratified, i.e. domain modules should include intermediate
    modules, which in turn should include generic modules, so that each set of
    modules can be plugged into or unplugged from its more general set
    without affecting the coherence of the entire library
   5. The source ontologies are explicitly mapped to the integrated
    ontology, in order to allow interoperability. The only admitted
    mappings are equivalent and coarser equivalent. Formally: for any
    source ontology SO and an ontology IO that is supposed to result (also)
    from the integration of SO, for any concept $C_i$ in SO, there is a $D_i$ in IO
    such that $C_i^I = D_i^I$ (equivalence of possible interpretations), or there is
    a disjunctive concept (or $D_i$ $D_j$) in IO such that $C_i^I = D_i^I \cup D_j^I$
    (equivalence of possible interpretations to a disjunction of concepts,
    i.e. to a union of finer concepts)
   5.1. Partial mappings must have been already resolved through the
    methodology: if any, some step in the integration procedure must be
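
The admitted mappings of step 5 can be checked with a minimal sketch, assuming interpretations are represented as finite sets and that a coarser-equivalent mapping is a union of exactly two integrated concepts, as in the formula above. The function name `admitted_mapping` and the toy data are assumptions for illustration.

```python
from itertools import combinations

def admitted_mapping(source_ext, integrated):
    """Find an equivalent or coarser-equivalent mapping for a source concept.

    source_ext: set interpreting the source concept.
    integrated: dict mapping integrated concept names to their sets.
    Returns a tuple of one or two names, or None if only a partial
    mapping exists.
    """
    # Equivalent mapping: one integrated concept with the same interpretation.
    for name, ext in integrated.items():
        if ext == source_ext:
            return (name,)
    # Coarser equivalent: the union of two finer integrated concepts.
    for (n1, e1), (n2, e2) in combinations(sorted(integrated.items()), 2):
        if e1 | e2 == source_ext:
            return (n1, n2)
    return None

integrated = {"river": {1, 2}, "brook": {3}, "lake": {4}}
print(admitted_mapping({1, 2}, integrated))     # ('river',)
print(admitted_mapping({1, 2, 3}, integrated))  # ('brook', 'river')
print(admitted_mapping({1}, integrated))        # None
```

A `None` result corresponds to a partial mapping, which according to step 5.1 should not survive the integration procedure.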

Ambiguous hierarchies

[Figure: hierarchy diagram with the nodes Entity, Conceptual Entity, Finding, Event, Phenomenon or Process, Natural Phenomenon, Injury or Poisoning, Biologic Function, Pathologic Function, fractures, malunion and nonunion of fracture, ununited fractures]

A principled formalization

(defconcept ununited-fracture
  :is-primitive
  (:and fracture
        (:some morphology
               (:and bone
                     (:or (:some embodies malunion)
                          (:not integral))))
        (:some dependently-postdates fracture)
        (:all interpretant clinical-condition)))

Some UMLS concepts pertaining to the intersection:
Amino Acid, Peptide, or Protein & Carbohydrate

(|hamster oviduct-specific glycoprotein|)
(|Par j I|)
(|collapsing factor|)
(|BDV 18K glycoprotein|)
(|SI-gene-associated glycoprotein, Nicotiana|)
(|FdI allergen|)
(|sca gene product|)
(|EPV20 protein|)
(|lubricin|)
(|Pluritene|)
(|Par h 1 allergen|)
(|Wnt11 gene product|)
(|mannose-bovine serum albumin conjugate|)
(|acrosome granule lysin|)
(|sulfatide activator|)
(|vaccinia virus A34R protein|)

=> More than 118,000 UMLS concepts (25%) are classified under an intersection
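
A minimal sketch of how such intersections could be counted, assuming a mapping from each concept to the set of semantic types it is assigned. The function name and the type assignments below are illustrative assumptions, not data taken from the Metathesaurus.

```python
from collections import Counter

def intersection_counts(concept_types):
    """Count concepts per intersection, i.e. per set of two or more types."""
    counts = Counter()
    for types in concept_types.values():
        if len(types) > 1:
            counts[" & ".join(sorted(types))] += 1
    return counts

# Illustrative assignments only.
concept_types = {
    "lubricin": {"Amino Acid, Peptide, or Protein", "Carbohydrate"},
    "Pluritene": {"Amino Acid, Peptide, or Protein", "Carbohydrate"},
    "glucose": {"Carbohydrate"},
}
print(intersection_counts(concept_types))
```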

Ontological analysis of the intersection

(defconcept |Amino Acid, Peptide, or Protein & Carbohydrate|

 "834 instances. This conjunct includes two sibling types.
  A protein containing a carbohydrate."

 :annotations ((Sugg.Name "carbohydrate-containing-protein")
                   (onto-status integrated))

 :is-primitive (:and protein
                      (:some has-component carbohydrate))
 :context :substances)

Names of anatomical morphologies are often polysemous:
• Both a condition and the function that caused the condition ("inflammation",
    "ulcer", "fracture", "wound", "hyperplasia")
• Both an object and the function that produced the object ("neoplasm",
• Both an object O and the condition created in another object O' by O
For example: "the fracture has been caused by a fall" vs. "the fracture is transverse";
    "the obstruction occurred in the jejunum" vs. "the obstruction has been removed"
Conceptual analysis brings out other issues concerning morphologies:
• The dependence between a morphological condition, a function, and the related
    organ. For example, an "ulcer" (as a condition) of a stomach implies that the
    stomach embodies an ulceration function (an ulcer as a function)
• The mereological import of morphologies: some are featured by an organ, some
    only by a part of an organ. For instance, an "ectopic heart" is wholly ectopic, but
    an "ulcerated stomach" is only partly ulcerated

         Morphologies analyzed
   a property ("color", "consistency", "thickness", "size", "number", "shape")
   a condition:
        a topologically relevant condition:
               an alteration of connection:
                      that creates a configuration (a new property) in an object ("fracture",
                      in the holey interior of an object ("obstruction")
                      between several objects ("fusion")
               an alteration of the boundary between an object holey interior and the object
                      creating a configuration in the boundary ("cavitation", "ulcer")
                      producing a substance flow ("hemorrhage", "ulcer")
        an abnormal placement ("dislocation", "ectopia", "absence")
        a form alteration condition ("deformity", "hyperplasia", "hypoplasia")
        a condition involving the alteration of several properties ("inflammation", "eruption")
   an abnormal, foreign object ("mass", "neoplasm", "calculus", "obstruction")

Making relations explicit

[Figure: diagram with the nodes Region, Health-Condition, Member, Procedure, Group, Guideline and the relation labels uniquely-located, target, target-population]

    Medical source ontologies

•   The UMLS top-level (1998 edition: 132 "semantic types", 91
    "relations", and 412 "templates"),
•   The SNOMED-III top-level (510 "terms" and 25 "links"),
•   The GMN top-level (708 "terms"),
•   The ICD10 top-level (185 "terms"), and
•   The GALEN Core Model v.5h (2,730 "entities", 413 "attributes" and
    1,692 axioms), etc.

•   The 1998 edition of the UMLS Metathesaurus (476,000 "concepts",
    93,000 explicit templates, and 599,000 thesaurus-like templates)

The current ON9.2 library

[Figure: module graph of the library. Modules shown: Metaontology, Equality, Dependence, Structuring-Concepts, Layers, Granularity, Top-Level, Mereology, Unrestricted-Time, Topology, Meronymy, Actors, Morphology, Localization, Assessment, Representation, Units, Quantities, Positions, Diagrams, Topics, Physical-Concepts, Topo-Morphology, Social-Objects, Artifacts, Planning, Substances, Abstract-Objects, Anatomy, Biologic-Functions, Procedures, Natural-Kinds, Medical-Procedures, Molecular-Biology, Body-Directions, Abnormalities, Biologic-Substances, Web-Notions, Clin-Act, Guidelines]

The current top-level

[Figure: taxonomy diagram with the concepts continuant, occurrent, social-object, topic, abstract-object, act, biologic-function, non-biologic-function, substance, language, notion, human-caused-phenomenon, natural-phenomenon, organism, anatomical-structure, physiologic-function, pathologic-function]

Tool for representation
Tool for representation and classification
Tool for intermediate representation and
Tool for browsing and editing


   ON9.2: integration of the medical top levels within a library of
    generic theories. It includes a set of 50 modules with about 1,500
    concepts. It is available in both Ontolingua and Loom languages

   Explicitation of the Metathesaurus terminological knowledge:
    intersections of UMLS semantic types, relations defined by
    sources (IS_A and other relations)

   Integration of the Metathesaurus intersections within ON9.2

   Contextualization of the Metathesaurus

   An integrated model of clinical guidelines

 What is a Domain Reference Ontology?

An ontology usable to build new ontologies in a
  domain, or to plug existing ontologies in it
Our research in medical conceptual structures aims
  at defining a Medical Reference Ontology (library)
The current research in environmental metadata
  could be reconsidered as the construction of an
  Environmental Reference Ontology
We are confident that our methodology is suitable
  for this task without substantial revision
Warning: at first sight, conceptual heterogeneity in the
  environmental domain seems harder than in medicine

"Es gibt nichts Praktischeres als eine gute Theorie"

"There is nothing more practical than a good theory"

                             (Ludwig von Boltzmann)

For generalities, the library, and conceptual investigations:
Gangemi A, Pisanelli DM, Steve G, "An overview of the ONIONS
    project: Applying ontologies to the integration of medical
    terminologies", Data and Knowledge Engineering, 31 (1999), 183-
For the investigation of the UMLS:
Pisanelli DM, Gangemi A, Steve G, "An Ontological Analysis of the UMLS
    Metathesaurus", Journal of the American Medical Informatics Association,
    vol. 5 (symposium supplement), 1998
For the pre-processing of informal terminological repositories:
Steve G, Gangemi A, Pisanelli DM, "Integrating Medical Terminologies with
    ONIONS Methodology", in Kangassalo H, Charrel JP (eds.) Information
    Modelling and Knowledge Bases VIII, Amsterdam, IOS Press 1997
For the integration of clinical guidelines:
Pisanelli DM, Gangemi A, Steve G, "Toward a Standard for Guideline
    Representation: an Ontological Approach", Journal of the American Medical
    Informatics Association, vol. 6 (symposium supplement), 1999