The Semantic eScience Framework

					  XInformatics; bridging the gap between
       science and discipline neutral
    cyberinfrastructure with semantics:
The Journey from 2004 to 2010 and Beyond
                 Peter Fox
     Tetherless World Constellation, RPI
          Marine Biology Lab 2010

• The origins of this effort, putting the X in
• Why a framework and not a system?
• Semantics in 2004
• The design and development methods
• Ontologies and the software and production!
• Semantics between 2004 and ~ 2009
• Discussion of the expressivity and
  implementability balance and one more …
• Since it is 2010 … what we are up to
                 Tetherless World Constellation   2

Scientists should be able to access a global, distributed
    knowledge base of scientific data that:
    • appears to be integrated
    • appears to be locally available
But… data is obtained by multiple instruments, using
    various protocols, in differing vocabularies, using
    (sometimes unstated) assumptions, with inconsistent
    (or non-existent) meta-data. It may be inconsistent,
    incomplete, evolving, and distributed
And… there exist(ed) significant levels of semantic
   heterogeneity, large-scale data, complex data types,
   legacy systems, inflexible and unsustainable
   implementation technology…
              Origins and a preview

• In 2000-2001 the need for capturing and preserving
  knowledge in science data became very clear but the
  barriers were high
• In 2004 we started a virtual observatory project based
  on semantic technologies
• Use case driven – in solar and solar-terrestrial physics
  with an emphasis on instrument-based measurements
  and real data pipelines; we needed implementations
• We knew we also needed integration and provenance
  (but that came later)
• We aimed to push semantics into our systems to build
  new ‘prototypes’ but we ‘failed’ ;-)
Content: Coupling Energetics and Dynamics of Atmospheric
Community data archive for observations and models of Earth's upper
atmosphere and geophysical indices and needed to interpret them.
capabilities by models, …
Content: Mauna Loa Solar Observatory
Near real-time data from Hawaii from a variety of solar instruments.
Source for space weather, solar variability, and basic solar
Other content used too – CISM – Center for Integrated Space Weather Modeling
          Virtual Observatories
Make data and tools quickly and easily accessible to a
 wide audience.
Operationally, virtual observatories need to find the
  right balance of data/model holdings, portals and
  client software that researchers can use without
  effort or interference, as if all the materials were
  available on their local computer in their
  preferred language: i.e. appear to be local and
  integrated.
Likely to provide controlled vocabularies that may be
   used for interoperation in appropriate domains along
   with database interfaces for access and storage and
   “smart” tools for evolution and maintenance.
             Early days of VxOs

[Diagram: virtual observatories (VO2, VO3 shown) above databases DB1, DB2, DB3, …, DBn]

The Astronomy approach; data-types as a service
[Diagram: VO App1, VO App2 and VO App3 above a VO layer that spans DB1, DB2, DB3, …, DBn. Diagram labels: interoperability; VOTable; Image Access; Spectrum; Simple Time; Protocol]
• OGC: {WFS, WCS, WMS} and SWE {SOS, SPS, SAS} use the same approach
• Lightweight semantics
• Limited meaning, hard
• Limited extensibility
• Under review
                            Mind the Gap!

• There is/was still a gap between science and the underlying
  infrastructure and technology that is available
• Cyberinfrastructure is the new research
  environment(s) that support advanced data
  acquisition, data storage, data management, data
  integration, data mining, data visualization and other
  computing and information processing services over
  the Internet.

Informatics - information science includes the
   science of (data and) information, the practice of
   information processing, and the engineering of
   information systems. Informatics studies the
   structure, behavior, and interactions of natural and
   artificial systems that store, process and
   communicate (data and) information. It also
   develops its own conceptual and theoretical
   foundations. Since computers, individuals and
   organizations all process information, informatics
   has computational, cognitive and social aspects,
   including study of the social impact of information
   technologies. Wikipedia.
              Progression after progression
[Diagram, left to right: IT Cyber Infrastructure; Cyber Informatics; Core Informatics; Science Informatics; Science, Societal Benefit. Label beneath: Requirements]

• CI = OPeNDAP server running over HTTP/HTTPS, wiki, databases,
• Cyberinformatics = Data (product) and service ontologies, triple stores
• Core informatics = Reasoning engine (Pellet), OWL, and much more
• Science (X) informatics = Use cases, science domain terms/ vocabularies,
  concepts in an ontology
          Frameworks vs. Systems

• Prior to 2005, we built systems
• Rough definitions
  – Systems have very well-defined entry and exit
    points. A user tends to know when they are using
    one. Options for extensions are limited and
    usually require engineering
  – Frameworks have many entry and use points. A
    user often does not know when they are using
    one. Extension points are part of the design
• You don’t have to agree; this was our view
                     In 2004

• 2004 – OWL was a W3C Recommendation!!
• Protégé 2.x and the Protégé-Java-OWL API
• SWOOP was a viable editor
• Jena and the Jena API were in good shape
• Pellet worked
• SPARQL was still a twinkle in the RDF working
  group’s eye
• Semantics were still the realm of computer
  scientists – luckily we had one of the best
                              Ontology Spectrum

[Figure: ontology spectrum, in order of increasing expressivity]
   Catalog/ID
   Terms/glossary
   Thesauri “narrower term” relation
   Informal is-a
   Formal is-a
   Formal instance
   Frames (properties)
   Value Restrs.
   Selected Logical constraints (disjointness, inverse, …)
   General Logical constraints

 Originally from AAAI 1999- Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty;
 – updated by McGuinness.
 Description in:

         Design and Development

• We made a conscious decision only to develop
  ontologies that were required to answer
  specific use cases
• We made a conscious effort to use whatever
  ontologies were available**
• We were pretty sure that rules would be
• We ignored query

           Science and technical
                 use cases
Find data which represents the state of the neutral
   atmosphere anywhere above 100km and toward the
   arctic circle (above 45N) at any time of high
   geomagnetic activity.
   – Extract information from the use-case - encode knowledge
   – Translate this into a complete query for data - inference
     and integration of data from instruments, indices and

Provide semantically-enabled, smart data query
  services via a SOAP web service for the Virtual Ionosphere-
  Thermosphere-Mesosphere Observatory that
  retrieve data, filtered by constraints on Instrument,
  Date-Time, and Parameter in any order and with
  constraints included in any combination.
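
The second use case (constraints on Instrument, Date-Time and Parameter, in any order and any combination) can be sketched in a few lines of Python. This is only an illustration, not the VSTO service itself: the `DataRecord` fields, the `query` function and the catalogue entries are all hypothetical names invented here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DataRecord:
    """Hypothetical catalogue entry; the field names are illustrative only."""
    instrument: str
    parameter: str
    day: date


def query(records: Iterable[DataRecord],
          instrument: Optional[str] = None,
          parameter: Optional[str] = None,
          date_range: Optional[Tuple[date, date]] = None) -> List[DataRecord]:
    """Return the records that satisfy every constraint that was supplied.

    Constraints are keyword arguments, so a caller can give them in any
    order and in any combination; an omitted constraint matches everything.
    """
    matches = []
    for rec in records:
        if instrument is not None and rec.instrument != instrument:
            continue
        if parameter is not None and rec.parameter != parameter:
            continue
        if date_range is not None and not (date_range[0] <= rec.day <= date_range[1]):
            continue
        matches.append(rec)
    return matches


# Made-up sample data, for illustration only.
CATALOGUE = [
    DataRecord("FabryPerot", "NeutralTemperature", date(2000, 1, 5)),
    DataRecord("FabryPerot", "NeutralWind", date(2000, 1, 6)),
    DataRecord("IncoherentScatterRadar", "ElectronDensity", date(2000, 2, 1)),
]
```

For example, `query(CATALOGUE, parameter="NeutralTemperature", instrument="FabryPerot")` and the same call with the two keywords swapped select the same record.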
                       Use Case example
• Plot the neutral temperature from the Millstone-Hill
  Fabry Perot, operating in the non-vertical mode during
  January 2000 as a time series.
• Objects:
   –   Neutral temperature is a (temperature is a) parameter
   –   Millstone Hill is a (ground-based observatory is an) observatory
   –   Fabry-Perot is an interferometer is an optical instrument is an instrument
   –   Non-vertical mode is an instrument operating mode
   –   January 2000 is a date-time range
   –   Time is an independent variable/ coordinate
   –   Time series is a data plot is a data product

            Knowledge representation

• Statements as triples: {subject-predicate-object}
      interferometer is-a optical instrument
      Fabry-Perot is-a interferometer
      Optical instrument has focal length
      Optical instrument is-a instrument
      Instrument has instrument operating mode
      Instrument has measured parameter
      Instrument operating mode has measured parameter
      NeutralTemperature is-a temperature
      Temperature is-a parameter
• A query*: select all optical instruments which have
  operating mode vertical
• An inference: infer operating modes for a Fabry-Perot
  Interferometer which measures neutral temperature
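
A minimal sketch of the triples, the query and the inference above, in plain Python rather than an RDF library or a reasoner such as Pellet. The class-level is-a statements come from the slide. The predicate names `hasOperatingMode` and `hasMeasuredParameter`, the `Radar` instrument, and the assignment of modes and parameters are assumptions added so that there is something to query.

```python
# Statements as (subject, predicate, object) triples.
TRIPLES = {
    # Class hierarchy taken from the slide.
    ("Interferometer", "is-a", "OpticalInstrument"),
    ("FabryPerot", "is-a", "Interferometer"),
    ("OpticalInstrument", "is-a", "Instrument"),
    ("NeutralTemperature", "is-a", "Temperature"),
    ("Temperature", "is-a", "Parameter"),
    # Hypothetical facts, invented for this example.
    ("Radar", "is-a", "Instrument"),
    ("FabryPerot", "hasOperatingMode", "Vertical"),
    ("FabryPerot", "hasOperatingMode", "NonVertical"),
    ("Radar", "hasOperatingMode", "Vertical"),
    ("NonVertical", "hasMeasuredParameter", "NeutralTemperature"),
    ("Vertical", "hasMeasuredParameter", "NeutralWind"),
}


def superclasses(term, triples):
    """Return term plus every class reachable from it by following is-a."""
    seen = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "is-a" and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def optical_instruments_with_mode(mode, triples):
    """Query: all optical instruments which have the given operating mode."""
    candidates = {s for s, p, o in triples
                  if p == "hasOperatingMode" and o == mode}
    return sorted(s for s in candidates
                  if "OpticalInstrument" in superclasses(s, triples))


def modes_measuring(instrument, parameter, triples):
    """Inference: operating modes of an instrument that measure a parameter."""
    modes = {o for s, p, o in triples
             if s == instrument and p == "hasOperatingMode"}
    return sorted(m for m in modes
                  if (m, "hasMeasuredParameter", parameter) in triples)
```

With this sample data, `optical_instruments_with_mode("Vertical", TRIPLES)` excludes `Radar` because it is not reachable from `OpticalInstrument` through is-a, even though it has a vertical mode.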
[Diagram: layered architecture, top to bottom]
Added value: Education, clearinghouses, other services, disciplines,
Semantic mediation layer - mid-upper-level
Portal, Web Serv., VO API
Added value: Semantic query, hypothesis and inference
Query, access and use of data
Mediation Layer
• Ontology - capturing concepts of Parameters,
  Instruments, Date/Time, Data Product (and
  associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data schema,
• Allows queries, reasoning, analysis, new hypothesis
  generation, testing, explanation, etc.
Semantic mediation layer - VSTO - low level
DB1, DB2, DB3, …, DBn
                                   Fox - APAC 2007, Driving e-research:
                                           Grids and Semantics
Semantic filtering by domain or instrument

Partial exposure of Instrument class hierarchy - users seem to LIKE THIS

Inferred plot type and return required axes data

Semantic Web Services

OWL document returned using VSTO ontology - can be used both syntactically or

               Semantic Web Benefits
• Unified/ abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of
  selections from eight to three
• Generates only syntactically correct queries: which could not always be ensured in
  previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our
  application has the opportunity to only expose coherent query (portal and
• Semantic integration: in the past users had to remember (and maintain codes)
  to account for numerous different ways to combine and plot the data whereas
  now semantic mediation provides the level of sensible data integration
  required, and exposed as smart web services
    – understanding of coordinate systems, relationships, data synthesis, transformations.
    – returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional
  research associates and those from outside the fields)
Semantic Web Methodology and
Technology Development Process
           Developing ontologies
• Use cases and small team (7-8; 2-3 domain/ data experts,
  2 knowledge experts, 1 software engineer, 1 facilitator, 1
• Identify classes and minimal properties (leverage
  controlled vocab.)
   –   Start with narrower terms, generalize when needed or possible
   –   Adopt a suitable conceptual decomposition (e.g. SWEET)
   –   Import modules when concepts are orthogonal
   –   Add service classes and properties where needed
• Review, vet, publish
• Only code them (in RDF or OWL) when needed (CMAP, …)
• Ontologies: small and modular
Species validation

Expressivity VSTO 1.0

Expressivity VSTO dev. version

       Ontologies and the software

• Protégé 2.x and then 3.x built from our
  ontology on the web
• Java class generation
• Eclipse as a development environment
• Leveraged a portal code base (from the Earth
  System Grid project)


           Implementation choices

• Our big challenge was time – in use cases and in
  the representation
  – Depending on the level of granularity there were >
    200,000 day-time records, and > 70,000,000 sub-day
    time intervals – no triple store could handle this**
• We descoped our effort to delay use cases such
  as: find all neutral temperature data around the
  summer solstice for the last decade
• We chose a minimal time encoding in the
  ontology and delegated that to a relational DB
• Reasoning in finite time does not mean 3-4 secs!
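
The choice to keep time out of the triple store and delegate it to a relational DB can be sketched with Python's built-in `sqlite3` module. This is an assumption-laden illustration, not the VSTO schema: the `coverage` table, its columns and the dataset ids are hypothetical. The idea is that the ontology identifies which datasets are relevant, and an indexed SQL range query answers the time part.

```python
import sqlite3


def build_time_index(rows):
    """Create an in-memory table of (dataset_id, start_utc, end_utc) rows.

    Timestamps are ISO 8601 strings in one fixed format, so plain string
    comparison orders them chronologically.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE coverage ("
        "dataset_id TEXT NOT NULL, "
        "start_utc TEXT NOT NULL, "
        "end_utc TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO coverage VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_coverage ON coverage (start_utc, end_utc)")
    conn.commit()
    return conn


def datasets_overlapping(conn, start_utc, end_utc):
    """Return ids of datasets whose interval overlaps [start_utc, end_utc)."""
    cursor = conn.execute(
        "SELECT DISTINCT dataset_id FROM coverage "
        "WHERE start_utc < ? AND end_utc > ? "
        "ORDER BY dataset_id",
        (end_utc, start_utc),
    )
    return [row[0] for row in cursor]


# Made-up intervals, for illustration only.
SAMPLE_ROWS = [
    ("fpi_2000_01", "2000-01-01T00:00:00", "2000-02-01T00:00:00"),
    ("fpi_2000_06", "2000-06-01T00:00:00", "2000-07-01T00:00:00"),
]
```

A use case such as "all neutral temperature data around the summer solstice" would then combine the ontology's answer (which datasets hold neutral temperature) with a call like `datasets_overlapping(conn, "2000-06-14T00:00:00", "2000-06-28T00:00:00")`.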
VSTO - semantics and ontologies in an
operational environment:

                                                  Web Service

          Implications and OWL 1.0

• Lack of numeric support meant that the
  rules and procedural logic were implemented
  in Java, i.e. in the code
• On several occasions the tools (not to be
  named) pushed us into OWL-Full, introduced
  inconsistencies, etc.
• Finally, they stabilized, and in 2005 (and again
  in 2006 and twice in 2007) we had stable
•   Highlights:
    – Fewer clicks to data
    – Auto identification and retrieval of independent variables & plotting support
    – Faster
    – Support for finding instruments without specifying the id (includes finding
       data from instruments that the user did not know to ask for)
•   Questions (potentially with 35 responses)
    – What do you like about the new searching interface? (9)
    – Are you finding the data you need? (35: Yes=34, No=1)
    – What is the single biggest difference? (8)
    – How do you like to search for data? Browse, type a query, visual? (10,
       Browse=7, Type=0, Visual=3)
    – What other concepts are you interested in using for search, e.g. time of high
       solar activity, campaign, feature, phenomenon, others? (5, all of these)
    – Do the interface and services deliver the functionality, speed, flexibility you
       require? (30, Yes=30, No=0)
    – How often do you use the interface in your normal work? (19, Daily=13,
       Monthly=4, Longer=2)
    – Are there places where the interface/ services fail to perform as desired? (5,
       Yes=1, No=4)
• We need the ability to evolve the ontology and not
  break the framework
• As we broaden re-use of these ontologies and creation
  of new ones
   – We needed visual tools like CMAP Ontology Editor
   – We needed the visual tools to work with the editing/
     plugin tools – they do not
   – We needed to use natural language forms; this ended
     up being sparse, but that need will increase
   – Need tools aimed at software engineers and domain
     scientists: three-pronged approach and interoperable:
      • OWL in editors (e.g. Protégé, SWOOP, etc.)
      • Visual (e.g. CMAP/COE)
      • Natural Language (e.g. Rabbit, CL, Peng)

• Support for collaborative feedback, evolution
• Change management
• Support for ‘comments’ and ‘annotations’, i.e.
• Package management: creation, dependency,
  consistency checking

               Semantics between
                 2004 and 2009
•   Ontologies were needed for data integration
•   and provenance
•   and mediation for data mining
•   Protégé 3.x and then 4.0 came out
•   SWOOP development was interrupted
•   Cmap added OWL predicate support*
•   SPARQL became a recommendation
•   Triple stores exploded in use and capability
•   Linked Open Data started to take off
•   Pellet 2.0 came out
•   We invaded OWLED 2006, 2007, and 2009 (2010
    papers went in yesterday)
Semantic Web Layers

Other projects – ontologies for faceted search

For data integration

Ontology packaging

            Discussion of E versus I

• We had to expand the balance to now include
  maintainability (/ evolvability)
• E-M-I briefly
  – E.g. modularization has become essential to facilitate
    ontology packaging -> need to take advantage of OWL 2
  – Separation of class and instances
     • Makes visual development possible
     • Also facilitates SPARQL end-point approaches
• As tools and applications improve we reconsider
  our past choices
  – Adding time** back into VSTO and moving to OWL 2
                   So far in 2010

• Recently funded to take our developments into a
  configurable semantic data framework (SDF), thus we will
  push ontology languages and tools in new ways:
• OWL 2 – RL in particular
   – Annotations
   – Property chaining
• SPARQL (yawn)
• RIF – probably not for a while but we like Jena
  and SWRL a lot!
• However, the tools still lag behind – especially for
  visual and natural language development
• One of the primary goals of VSTO 2.0 is to modularize the
  VSTO ontology, e.g., an instrument module does not require
  any other classes besides the instrument and maybe an
  instrument operating mode to substantiate what an
  instrument is.
• The problem with modularization, however, is that although a
  subset may substantiate a concept, that concept, especially in
  VSTO, has a number of relations linking it with other concepts
  within the ontology. For instance, an instrument in the
  instrument module may measure a number of parameters in the
  parameter module, or have a time coverage that would be
  defined in the time module.

• Each observatory that the VSTO integrates data for will import
  only the modules that are appropriate for the observatory's
• There are also some modules that will always be required,
  regardless of the domain, like the instrument, parameter, and
  time modules. Each observatory ontology has its own way of
  linking these modular concepts, which will be called link
  properties.
• This presents a problem, as the VSTO portal may not know
  which link property to use to associate an instrument with a
  set of parameters or a time coverage, as it becomes the
  responsibility of the ontology for the respective observatory
  to define the link properties.
            ‘Interfaces’ or ‘Extensions’

• This is where the VSTO interface ontology comes in. It doesn't
  have to be called the VSTO interface; it could be VSTO link
  properties, or anything for that matter.
• The purpose of this ontology is to define a few link properties
  that will be required for navigation to data in the VSTO portal.
  For instance, the guided workflows, as they work now, would
  require a number of link properties. E.g. for the Start by
  Instrument workflow, the VSTO interface would require an
  instrument and time coverage link property to get from step 1
  to step 2 in the workflow.

            ‘Interfaces’ or ‘Extensions’

• In the case that an instrument of the CEDAR
  observatory is selected in step 1, this link property
  could be created in a rule-based logic as…
   – ( Instrument_1 hasInstrumentOperatingMode IOM_1 ^ IOM_1
     hasDataset Dataset_1 ^ Dataset_1 hasTimeCoverage TimeInterval_1 )
     => Instrument_1 hasTimeCoverage TimeInterval_1
• Of course, this would have to be done for all
  instrument operating modes and all datasets
  associated with those operating modes to determine
  the full time coverage of an instrument.
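
The rule above, applied over all operating modes and all datasets, can be sketched as a small forward-chaining step in Python. The property names are those on the slide. The function name `infer_time_coverage` and the sample facts (`IOM_2`, `Dataset_2`, `TimeInterval_2`) are assumptions added for illustration, and this is not how a rule engine or an OWL 2 property chain would be implemented internally.

```python
def infer_time_coverage(triples):
    """Derive (instrument, hasTimeCoverage, interval) triples.

    Implements the chain
        instrument hasInstrumentOperatingMode mode
        mode hasDataset dataset
        dataset hasTimeCoverage interval
        => instrument hasTimeCoverage interval
    for every operating mode and every dataset of each instrument.
    """
    modes_of, datasets_of, coverage_of = {}, {}, {}
    for s, p, o in triples:
        if p == "hasInstrumentOperatingMode":
            modes_of.setdefault(s, set()).add(o)
        elif p == "hasDataset":
            datasets_of.setdefault(s, set()).add(o)
        elif p == "hasTimeCoverage":
            coverage_of.setdefault(s, set()).add(o)

    inferred = set()
    for instrument, modes in modes_of.items():
        for mode in modes:
            for dataset in datasets_of.get(mode, ()):
                for interval in coverage_of.get(dataset, ()):
                    inferred.add((instrument, "hasTimeCoverage", interval))
    return inferred


# Made-up facts, for illustration only.
FACTS = {
    ("Instrument_1", "hasInstrumentOperatingMode", "IOM_1"),
    ("Instrument_1", "hasInstrumentOperatingMode", "IOM_2"),
    ("IOM_1", "hasDataset", "Dataset_1"),
    ("IOM_2", "hasDataset", "Dataset_2"),
    ("Dataset_1", "hasTimeCoverage", "TimeInterval_1"),
    ("Dataset_2", "hasTimeCoverage", "TimeInterval_2"),
}
```

The full time coverage of `Instrument_1` is then the set of intervals in `infer_time_coverage(FACTS)`, one per dataset reachable through its operating modes.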

              OWL 2 considerations
• What's good?:
   – new syntactic sugar to simplify ontology
   – ability to compare numerics
• OWL 2 QL Synopsis:
   – focused on ontology interoperability with database
     systems where scalable reasoning and query answering
     over large numbers of instances is the most important task
• Why is it a good match?:
   – as in the synopsis above: query answering over a large
     number of time instances will have to be performed

              OWL 2 considerations
• Why isn't it a good match?:
   – does not support enumerations, a feature required by
     some concepts in VSTO
   – does not support functional properties, a feature required
     by some properties in VSTO
   – does not support property inclusions involving property
     chains, a feature we hope to utilize to define rules for
   – does not support keys, a feature we hope to add when
     Protégé 4.1 is released (along with support for creation of

              OWL 2 considerations
• OWL 2 RL Synopsis:
   – focused on ontology interoperability with rule extended
     DBMSs where scalable reasoning over large datasets is the
     most important task
• Likely current choice:
   – supports all OWL features currently required by VSTO,
     including enumerations and functional properties
   – supports property inclusions involving property chains, so
     potential for rules can be addressed, namely for reasoning
     over time intervals
   – supports keys
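
The profile comparison on these slides amounts to a set difference between what VSTO needs and what each profile offers. A minimal sketch follows, recording only the four features the slides discuss. The feature labels and the `missing_features` helper are hypothetical, and the support sets restate the slides' claims, not the full OWL 2 Profiles specification.

```python
# Features the slides say VSTO requires or hopes to use.
VSTO_NEEDS = {
    "enumerations",
    "functional properties",
    "property chains",
    "keys",
}

# Support as stated on the slides, restricted to the features above.
PROFILE_SUPPORT = {
    "OWL 2 QL": set(),
    "OWL 2 RL": {
        "enumerations",
        "functional properties",
        "property chains",
        "keys",
    },
}


def missing_features(profile):
    """Return the needed features that the given profile does not support."""
    return sorted(VSTO_NEEDS - PROFILE_SUPPORT[profile])


def candidate_profiles():
    """Return the profiles that cover every needed feature."""
    return sorted(name for name in PROFILE_SUPPORT
                  if not missing_features(name))
```

Under these assumptions `candidate_profiles()` leaves only OWL 2 RL, which matches the "likely current choice" above.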

            Back to Semantic Data
• With the substantial adoption of semantics in
  science data applications
  – There is a need for a higher level of application/
    tool infrastructure
  – Others are experiencing the same lessons with
    ontology and application development
• We have aggregated our efforts into a:
  Semantic eScience Framework (SESF)*
  – Configurable, i.e. ontology loadable and driven
High-level architecture

Provenance aware faceted search

             Inference vs. Query

• The real power of semantic web in science is
  likely to lie in the ability to balance
  implementation choices between inference
  (RDFS and OWL) and query (even SPARQL)
• It is clear to us that the effect upon
  expressivity and maintainability will be an
  essential consideration
  – Recall the OWL-QL – OWL RL findings
• Also depends on how dynamic the KB is…
                 I.e. SDF vs LOD

• Linked open data – RDFS and SPARQL
• Emergent ontology versus, well, an engineered
  – Current chaos due to owl:sameAs
  – Dynamic content
• One of the present challenges for us is to
  accommodate the web of data into emerging
  needs for federated search and access as SDFs are
• And yes, there is RDFS (2.0) to consider
• We set out to build a prototype and ended up with a
  production semantic data framework
   – Language and tools served us well
• Even with modest expressivity we challenged the tools
  of the time and made many compromises
• All along the way, we evaluated our ontology
  developments and implementations to gauge the
  benefits of semantics
• Maintainability, esp. modularization is driving new
  expressivity needs
• Xinformatics is the key - we continue to need to bridge
  the computer science and application communities

             Further Information

• Contacts:
