
NIF as a Multi-Model Semantic Information System - CRBS


NIF as a Multi-Model Semantic Information System
 Part 1: Relational, XML, RDF and OWL models

                Amarnath Gupta

        University of California San Diego
   Preamble – 1
 As we design and extend the NIF system we recognize that
   Users will give us data in any form that is convenient for them
     Standard data may be stored in a flat file
     Web service output can be in XML
     Semantic Web enthusiasts may represent data using proper RDF
   However, regardless of the form in which data may be
     The NIF system must treat them (query, index, relate, ...) in a
      uniform manner
     The NIF system must utilize the underlying systems to
      query/access/index data
Preamble – 2
 In this presentation we intend to
   Explain our perspective on these different data models
   Provide a background on the data models we consider
   Offer a sense of the “semantic character” of these data models
   Present our design philosophy on
     Where to keep them separate
     Where to transform them into a common model
What is a Data Model?
 A conceptual data model
   A formal representation of the users’/application’s mental model
    of data elements and their relationships that should be put in a
    database, manipulated, queried and operated upon
 A logical data model
   A formal description of the data model in a logical structure that a
    computer can use to perform the queries and other operations.
    In many cases, the same conceptual model can be
    represented by different logical models
 A physical data model
   An implementable version of the data model in terms of data
    structures, access structures (e.g., indices) and the set of low-level
    operations that a system needs to perform on the data
        A Conceptual Model
[Figure: a conceptual model drawn in ORM notation (ORM Model – Terry Halpin); diagram labels: Object, Constraint, Value, n-ary Role]
A Logical Data Model
 A formal specification of
   The structure of the data
      The structure tells us how the data is organized
            (123, “Purkinje Cell”, Cerebellum)
            (828, Hippocampus, “Hilar Cell”)
            The pair above is not structured: the fields appear in a different order in each tuple
      Often the structure of the data, together with some constraints, represents
       some semantics
      If the data are not structured (like free text), the techniques for handling
       them will be different.
   Operations on this structure
      Every data model is based on some mathematical principles that define what
       you can do with the data
   the nature of data values
      Data domains and data types
   operations on data values
        The Relational Data Model
        Table: Neurons (relation name)
        NeuronID   NeuronName      BrainRegion     NeuroTransmitter   Current
        1          Purkinje Cell   Cerebellum      Glutamate          Transient Na+
        2          Hilar Cell      Dentate Gyrus   GABA               Ca2+
        (Column headers are attribute names; each row is a tuple.)
 Attribute domain: all possible values the attribute can take
 Attribute value: cannot be complex
 Candidate key: a set of columns that uniquely determines a row
 The relational model is a set (or bag) of tuples
    Metadata stored in a separate catalog which is also relational
    First order constraints
 All queries are about some combination of
    Selecting rows, columns
    Combining tables by union, intersection, join
    Computing data or aggregate functions
    Grouping and sorting
 A query can return only values
    Relation names and attribute names cannot serve as variables in a query
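The query operations listed above can be tried on the Neurons table with a short script. This is only a minimal sketch using Python's built-in sqlite3 module; the function name `build_neurons_db` and the choice of SQLite are assumptions made for illustration, not part of the NIF system.

```python
import sqlite3

def build_neurons_db():
    """Create an in-memory database holding the Neurons table from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Neurons (
            NeuronID INTEGER PRIMARY KEY,
            NeuronName TEXT,
            BrainRegion TEXT,
            NeuroTransmitter TEXT,
            "Current" TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO Neurons VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Purkinje Cell", "Cerebellum", "Glutamate", "Transient Na+"),
            (2, "Hilar Cell", "Dentate Gyrus", "GABA", "Ca2+"),
        ],
    )
    return conn

conn = build_neurons_db()

# Selecting rows and columns.
gaba_names = conn.execute(
    "SELECT NeuronName FROM Neurons WHERE NeuroTransmitter = ?", ("GABA",)
).fetchall()

# Grouping, aggregating and sorting.
per_region = conn.execute(
    "SELECT BrainRegion, COUNT(*) FROM Neurons"
    " GROUP BY BrainRegion ORDER BY BrainRegion"
).fetchall()

# The catalog is itself relational, but it is queried separately from the data.
catalog_tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(gaba_names, per_region, catalog_tables)
```

Note that the table and column names are fixed text in each query; only values such as "GABA" can be passed in as parameters, which matches the point that relation and attribute names cannot serve as query variables.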
   Object Relational Model
 Eases some of the problems of the classical relational model
   Data values can be of arbitrary data types
       Sets (e.g., multiple currents for a neuron)
       Tuples (e.g., references ordered by year)
       Time-series (e.g., raw EEG data)
       Spatial Data (e.g., atlases in CCDB)
    Each data type can have its own operations
       Find all data points within a neighborhood of a spatial location

 Queries still return only values
   Catalog queries and data queries cannot be mixed in a single query
 All industrial-strength DBMSs use some version of this model
 One needs to be a skilled DB programmer to develop custom applications
  on this model
       XML (Two Perspectives)
        Document Community
            data = linear text documents
            mark up (annotate) text pieces to describe context, structure,
             semantics of the marked text

<physiologicalCondition> Oxidative stress </physiologicalCondition> has been proposed to be
involved in the <biologicalProcess context="disease"> pathogenesis </biologicalProcess> of
<disease> Parkinson's disease</disease> (PD). A plausible source of <physiologicalCondition>
oxidative stress </physiologicalCondition> in <brainRegion> nigral </brainRegion> <neuron>
dopaminergic neurons </neuron> is the redox reactions that specifically involve <chemical>
dopamine </chemical> and produce various <chemical context="biologicalAgent"> toxic
</chemical> molecules.
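As a rough illustration of the document view, the marked-up pieces can be pulled out of the text, and the linear text can be recovered by dropping the markup. The sketch below uses Python's standard xml.etree.ElementTree; the `<sentence>` wrapper element and the helper names are assumptions added so that the fragment parses as a single XML tree.

```python
import xml.etree.ElementTree as ET

# A shortened version of the marked-up sentence, wrapped in a hypothetical root.
MARKED_UP = (
    "<sentence>"
    "<physiologicalCondition>Oxidative stress</physiologicalCondition> has been "
    'proposed to be involved in the <biologicalProcess context="disease">'
    "pathogenesis</biologicalProcess> of <disease>Parkinson's disease</disease>."
    "</sentence>"
)

def annotations(xml_text):
    """Return (tag, marked text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, (el.text or "").strip()) for el in root.iter() if el is not root]

def plain_text(xml_text):
    """Recover the linear text by dropping the markup."""
    return "".join(ET.fromstring(xml_text).itertext())

print(annotations(MARKED_UP))
print(plain_text(MARKED_UP))
```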
          XML (Two Perspectives)
           Database Community
              XML as a (most prominent) example of the semi-structured
              data model
             => captures the whole spectrum from highly structured, regular
              data to unstructured data (relational, object-oriented, marked
              up text, ...)
From the CARMEN group:
<?xml version="1.0" encoding="utf-8"?>
          <description>A new annotation file </description>
          <interval group_id="04">
                     <eventNote timeOffset="1237888.230" attachedFile="sound1.wmv"
                              application="realplayer">Text message for the event start.</eventNote>
                     <eventNote timeOffset="18958585.232">Text message for the event
</NDTF_Annotation >
      XML as a Logical Data Model
 XML is a tree-structured data model
   Nodes
     Element nodes
        Children can be ordered
        Recursive elements (parts under parts)
     Attribute nodes
     Mandatory or optional
   Edges
     Sub-element edges
     Attribute edges
     IDRef edges
   Constraints
     References
     Value restrictions, OneOf
     Cardinality
• Trees are more flexible than tables
• Any number of nodes can be added anywhere without breaking the model
XML as a Logical Data Model
• XML has its own schema language
    • Lets you specify a complex type system
• A database is a collection of XML trees
 Storing XML
   Mostly relational with some very clever indexing to encode the
    hierarchy, tree paths, and order
 Querying XML
   Elements, attribute names, values and structure can be queried
   Multiple trees can be joined by value
   Example (XPath)
     Find images of the spinal column
     //image[.//structurelabel/text()="SPINAL COLUMN"]/ish_image_path
        is a tree query
   XQuery and full-text XQuery
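A tree query of this kind can be sketched with Python's standard xml.etree.ElementTree. The sample document, the file paths in it, and the function name `image_paths` are hypothetical; ElementTree supports only a subset of XPath, so the predicate here tests a direct child element rather than any descendant.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the element names come from the slide.
DOC = """
<images>
  <image>
    <structurelabel>SPINAL COLUMN</structurelabel>
    <ish_image_path>img/0001.jpg</ish_image_path>
  </image>
  <image>
    <structurelabel>CEREBELLUM</structurelabel>
    <ish_image_path>img/0002.jpg</ish_image_path>
  </image>
</images>
"""

def image_paths(xml_text, label):
    """Return the image paths of every <image> whose structure label matches."""
    root = ET.fromstring(xml_text)
    # The query mixes structure (image, ish_image_path) with a value test.
    query = f".//image[structurelabel='{label}']/ish_image_path"
    return [el.text for el in root.findall(query)]

print(image_paths(DOC, "SPINAL COLUMN"))
```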
Misusing and Abusing XML
 Using XML if your data is relational
   It will result in flat trees that will suffer from complex querying
 Encoding orders and hierarchies that need special parsing
  <Brand_Mixtures count="2">
         <Brand_Mixture_1> Apo-Levocarb (carbidopa + levodopa) </Brand_Mixture_1>
         <Brand_Mixture_2> Apo-Levocarb CR Controlled-Release Tablets (carbidopa + levodopa)
 Using implicit multi-valuedness
  <atomArray atomID="a1 a2 a3" elementType="O N C" hydrogenCount="1 1 3">
        <array dictRef="cml:calcCharge" dataType="xsd:decimal" units="cml:electron">0.2 -0.3
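The cost of implicit multi-valuedness shows up as soon as the data has to be read: an XML parser returns each attribute as one string, and the application must split and align the values itself. A minimal sketch, assuming the atomArray element is closed immediately (the slide's fragment is truncated) and using a hypothetical helper `unpack_atoms`:

```python
import xml.etree.ElementTree as ET

# Assumption: the element is self-closed here so that it parses on its own.
ATOM_ARRAY = '<atomArray atomID="a1 a2 a3" elementType="O N C" hydrogenCount="1 1 3"/>'

def unpack_atoms(xml_text):
    """Split the space-separated attribute strings into one record per atom."""
    el = ET.fromstring(xml_text)
    ids = el.get("atomID").split()
    elements = el.get("elementType").split()
    hydrogens = [int(h) for h in el.get("hydrogenCount").split()]
    # The XML model cannot enforce that the three lists line up; we must check.
    if not (len(ids) == len(elements) == len(hydrogens)):
        raise ValueError("attribute lists have different lengths")
    return [
        {"atomID": i, "elementType": e, "hydrogenCount": h}
        for i, e, h in zip(ids, elements, hydrogens)
    ]

print(unpack_atoms(ATOM_ARRAY))
```

None of this structure is visible to an XML query language, which sees only three opaque strings.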
Expressing Semantics in XML
 Adorning elements with Namespaces
   A namespace is identified by a unique URI (Uniform Resource Identifier)
      To disambiguate between two elements that happen to share the same name
      To group elements relating to a common idea together
  <item xmlns:bp="">
  <bp:protein ID="Protein1">
        <bp:unificationXref rdf:ID="Xref1">
The Problem with XML Semantics
                 Two different XML representations of the same kind of information may not be easily unifiable
                 What did XML not
   Resource Description Framework (RDF)
[Figure: RDF graph of a statement. Node labels: URI(membrane-protein), URI(CNTFR-a), URI(modulates), URI(protein-mediated toxicity); edge labels: rdf:type, rdf:subject, rdf:object]

 The Basic Constructs of RDF
 RDF meta-model basic elements
   Defined in the rdf namespace, except rdfs:Resource (RDF Schema namespace)
 Types (or classes)
   rdfs:Resource – everything that can be identified (with a URI)
   rdf:Property – specialization of a resource expressing a binary
    relation between two resources
   rdf:Statement – a triple with properties rdf:subject, rdf:predicate, rdf:object
 Properties
   rdf:type - subject is an instance of that category or class defined
    by the value
   rdf:subject, rdf:predicate, rdf:object – relate a resource of type rdf:Statement
    to the elements of its triple.
      Relational Data vis-à-vis RDF

 Node to edge ratio is
  relatively small in many
 The number of relationships need not be fixed at design time
   The general tendency is to keep the number of edge labels small
 Graph-based operations can be performed on RDF; in relational data they
  require an unspecified number of joins
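To see why graph operations need an unspecified number of joins, consider following one edge label until no new nodes appear. The sketch below treats RDF data as a plain list of (subject, predicate, object) triples; the triples and the function name `reachable` are hypothetical examples, not NIF data.

```python
from collections import deque

# Hypothetical triples; each hop along :part-of would be one more
# self-join if the same data were stored in a relational table.
TRIPLES = [
    (":DentateGyrus", ":part-of", ":Hippocampus"),
    (":Hippocampus", ":part-of", ":TemporalLobe"),
    (":TemporalLobe", ":part-of", ":Cerebrum"),
    (":HilarCell", ":located-in", ":DentateGyrus"),
]

def reachable(triples, start, predicate):
    """Return every node reachable from start by following predicate edges."""
    successors = {}
    for s, p, o in triples:
        if p == predicate:
            successors.setdefault(s, []).append(o)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(reachable(TRIPLES, ":DentateGyrus", ":part-of")))
```

The number of loop iterations depends on the data, not on the query text, which is exactly what a fixed sequence of relational joins cannot express.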
RDF Blank Nodes
 RDF allows one to create anonymous objects whose existence is
  known but details are not
 There exists some neuron to which both NeuronX and NeuronY
 <neurons:NeuronX rdf:about="">
                 <neurons:Neuron rdf:nodeID="n1"/>
 <neurons:NeuronY rdf:about="">
                 <neurons:Neuron rdf:nodeID="n1"/>
  RDF Schema
 Declaration of vocabularies
   classes, properties, and relationships defined by a particular community
     rdfs:Class, rdfs:subClassOf
   Property-related
     rdfs:subPropertyOf
   relationship of properties to classes
     rdfs:domain, rdfs:range

 Provides substructure for inferences based on existing triples
   NOT prescriptive, but descriptive
     This is different from XML Schema

 Schema language is an expression of basic RDF model
   uses meta-model constructs: resources, statements, properties
   schemas are “legal” RDF graphs and can be expressed in RDF/XML
Examples of RDF Inferencing

 From this we can infer
     (:alice rdf:type :parent)
     (:betty rdf:type :parent)
     (:eve rdf:type :female-person)
     (:charles rdf:type :person)
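The premises behind these conclusions are not preserved in this text, so the sketch below uses hypothetical ones. It shows how the rdfs:domain, rdfs:range and rdfs:subClassOf rules can derive rdf:type triples of this kind; the property names (:has-child, :has-daughter), the input triples and the function `rdfs_closure` are assumptions for illustration only.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema and data triples (not taken from the original slide).
TRIPLES = [
    (":has-child", "rdfs:domain", ":parent"),
    (":has-child", "rdfs:range", ":person"),
    (":has-daughter", "rdfs:range", ":female-person"),
    (":female-person", "rdfs:subClassOf", ":person"),
    (":alice", ":has-child", ":betty"),
    (":betty", ":has-child", ":charles"),
    (":alice", ":has-daughter", ":eve"),
]

def rdfs_closure(triples):
    """Apply the domain, range and subClassOf rules until nothing new appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, RDF_TYPE, o2))   # subject gets the domain class
                if p2 == "rdfs:range" and s2 == p:
                    new.add((o, RDF_TYPE, o2))   # object gets the range class
                if p == RDF_TYPE and p2 == "rdfs:subClassOf" and s2 == o:
                    new.add((s, RDF_TYPE, o2))   # instance of subclass is instance of superclass
        if new <= facts:
            return facts
        facts |= new

inferred = rdfs_closure(TRIPLES) - set(TRIPLES)
for triple in sorted(inferred):
    print(triple)
```

The rules only add triples; they never reject data, which is the sense in which RDF Schema is descriptive rather than prescriptive.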
RDF as a Logical Data Model
 RDF does not distinguish between different relationships
   Instance-to-type
   Instance-to-instance
   Type-to-instance
 No transitivity inference is possible over, say, rdf:type
 RDF (as well as XML) has lost the notion of the abstract data type
  like spatial object or time
   Operations on object types do not mix well with RDF
 Constraints like uniqueness, 1-to-1 relationships, cannot be expressed
 SPARQL, the query language for RDF, is
   An edge-only language – it cannot express the // construct of XPath
   Blank nodes are treated as variables that are not output in the results
   Parts of the language are undecidable!
A problem is undecidable if it can be proved that there can be no algorithm to solve it
 Components of an OWL Ontology
   Vocabulary (concepts)
   Structure (attributes of concepts and hierarchy)
   Concept-to-concept, concept-to-data, property-to-property
   Logical characteristics of relationships
     Domain and range restrictions
     Properties of relations (symmetry, transitivity)
     Cardinality of relations
 Open world vs. Closed world assumptions
   Contrast to most reasoning systems that assume anything absent from
    knowledge base is not true
   Need to maintain monotonicity with tolerance for contradictions
 OWL Classes
   rdfs:subClassOf rdfs:Class (the class of all classes)
Basic OWL Constructs
 Creating OWL Classes
   disjointWith
      Neurons are not glial cells
   sameClassAs (equivalence)
      Class Gabaergic neuron is exactly the same class as neurons which
       have GABA as neurotransmitter
   Enumerations (on instances)
      Class Cerebellar lobules are Lobule I, Lobule II, …
   Boolean set semantics (on classes)
      Union (logical disjunction)
         Class nerve cell is union of neuron, glial cell
      Intersection (logical conjunction of class with properties)
         Class hippocampal neurons is conjunction of things of class
          Neuron and have property (has-soma-located-in) (hippocampus
          union any class that is (part-of) hippocampus)
      complementOf (logical negation)
         Class ‘benign tumor’ is the complement of class ‘malignant tumor’
Properties of OWL Properties
 Transitive Property
   P(x,y) and P(y,z) → P(x,z)     Example: subclassOf
 SymmetricProperty
   P(x,y) iff P(y,x)     Example: is_functionally_related_to
 Functional Property
   P(x,y) and P(x,z) → y=z     Example: soma_located_in
 inverseOf
   P1(x,y) iff P2(y,x)     Example: regulates ↔ is_regulated_by
 InverseFunctional Property
   P(y,x) and P(z,x) → y=z     Example: is_isoform_of
 Cardinality
   Only 0 or 1 in OWL Lite; any non-negative integer in OWL DL and OWL Full
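The property characteristics above can be read as small inference rules over sets of (x, y) pairs. The following is a rough sketch under that reading, not a description of how any OWL reasoner is implemented; all function names and sample values are hypothetical.

```python
def close_transitive(pairs):
    """P(x,y) and P(y,z) -> P(x,z), repeated until no new pair appears."""
    closure = set(pairs)
    while True:
        extra = {(x, z) for x, y in closure for y2, z in closure if y == y2}
        if extra <= closure:
            return closure
        closure |= extra

def close_symmetric(pairs):
    """P(x,y) iff P(y,x)."""
    return set(pairs) | {(y, x) for x, y in pairs}

def inverse_of(pairs):
    """P1(x,y) iff P2(y,x): derive the pairs of the inverse property."""
    return {(y, x) for x, y in pairs}

def functional_equalities(pairs):
    """P(x,y) and P(x,z) -> y = z: return the {y, z} sets inferred to be equal."""
    values = {}
    for x, y in pairs:
        values.setdefault(x, set()).add(y)
    return {
        frozenset((y, z))
        for ys in values.values()
        for y in ys
        for z in ys
        if y != z
    }

print(close_transitive({("A", "B"), ("B", "C")}))
print(functional_equalities({("n1", "CA1"), ("n1", "CornuAmmonis1")}))
```

Note that a functional property with two different values is not reported as an error here: under the open world reading, the two values are inferred to denote the same thing.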
Instances in OWL
 Instances are distinct from Classes
 In RDF there is no distinction between classes and instances
   <Species, type, Class>
   <Lion, type, Species>
   <MyLion, type, Lion>
   All three triples are allowed together in RDF
 OWL DL restrictions
   Type separation
     Class cannot also be an individual or property
     Property cannot also be an individual or class
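Type separation can be illustrated with a simplified check that looks for names used both as a class and as an individual. This is only a sketch over the three example triples, with a hypothetical function name; it is far from a full OWL DL validator.

```python
TYPE = "type"
CLASS = "Class"

TRIPLES = [
    ("Species", TYPE, CLASS),
    ("Lion", TYPE, "Species"),
    ("MyLion", TYPE, "Lion"),
]

def class_and_individual(triples):
    """Return names that appear both as a class and as an individual."""
    # A name is a class if declared as one, or if something is an instance of it.
    classes = {s for s, p, o in triples if p == TYPE and o == CLASS}
    classes |= {o for s, p, o in triples if p == TYPE and o != CLASS}
    # A name is an individual if it is an instance of some non-meta class.
    individuals = {s for s, p, o in triples if p == TYPE and o != CLASS}
    return classes & individuals

print(class_and_individual(TRIPLES))
```

Here Lion is flagged because it is an instance of Species and, at the same time, the class of MyLion; RDF accepts this, while the type separation rule does not.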
A Rough Comparison


    RDF and OWL do not represent n-ary roles cleanly
Querying OWL
 There are several languages in the making
   SPARQL engines (e.g., Virtuoso) are used often
   Pellet is used for reasoning tasks
     Subsumption
     Consistency
 New, more advanced languages like nSPARQL are coming up
   vSPARQL is being developed to enable views on SPARQL,
    which will lead to nested SPARQL queries
 Our goal
   Develop a query processor for these advanced languages
   Part of OntoQuest, our ontological information management system
Where does NIF stand in this?
 Not every model is directly inter-convertible with every other
 NIF is designed to
   Work with multiple models
     Ensure that the modeling capability and query capability of every model is
      preserved in its native form
     Queries in our system get translated to queries in the native forms of the
      databases we federate
   Express the local semantics of any data appropriately by
     Augmenting the semantic model of the data
     Connecting the data to NIF’s ontology
       Extending the NIF ontology in the process
   Develop a mechanism to create a common integrated model over
    these models
      this model is an ontological graph that incorporates object and temporal
   Example of An Ontological Extension
 Representing time and events
      Phenotypes, physiology, …
   Instants, intervals, and periods
      Temporal granularity of observation
   Events
      Multi-temporal observations based on conditions on properties
      Modeling states, objects in state, and state transitions
   One-only, repeatable, and time deictic events
   (Slide callout: TOWL and Temporal ORM)
   Subevents
   History of objects, events, roles
   Subtype migration, Temporal roles and role migration
      Progression of disease, symptom or recovery states
      Repeatability
