					                Enabling Semantic Interoperability for Earth Science Data
                     Final Report to NASA Earth Science Technology Office (ESTO)

                                               Rob Raskin
                                        Jet Propulsion Laboratory


  Abstract- Data interoperability across heterogeneous systems can be
hampered by differences in terminology, particularly when multiple
scientific communities are involved. To reconcile differences in
semantics, a common semantic framework was created through the
development of Earth science ontologies. Such a shared understanding of
concepts enables ontology-aware software tools to understand the meaning
of terms in documents and web pages.

  This report updates last year's Semantic Web for Earth and
Environmental Terminology (SWEET) prototype. For the recent work, we
incorporated concepts of other funded initiatives such as ESML, ESMF,
grid computing, and OGC. We also created a system to update the SWEET
knowledge base as needed, from gazetteers and other on-line Web sources.
An accompanying search tool supports system-wide search and, ultimately,
a wide range of semantically-based web services.

  This report repeats some background material from last year’s report
so that it conveys a self-contained understanding of the subject. It
concludes with road maps for various technology initiatives.

                1. INTRODUCTION

  Earth system science data originate from many disciplines, spanning
several community standards, terminologies, and data formats. Several
initiatives are underway to develop a common infrastructure to improve
data interoperability across the disciplines. Examples include the Earth
Science Markup Language (ESML), the Earth System Modeling Framework
(ESMF), and the Open GIS Consortium (OGC). Key to the success of these
initiatives is the development of a common semantic framework. Such a
framework enables dataset and science concepts to be understood by
software tools. The framework goes beyond data interoperability by
supporting knowledge reuse, or the exchange of conceptual knowledge
within and across these disciplines.

  This framework can be achieved through the "Semantic Web" (Fensel, et
al., 2003), an ambitious extension to the existing WWW environment,
coordinated by the World Wide Web Consortium (W3C). The Semantic Web
encodes common sense knowledge directly into web pages themselves, using
broadly agreed upon namespaces and ontologies to define terms and their
mutual relationships.

  The motivation of our task is to improve semantic understanding of web
resources by software tools, with specific application to discovery and
use of Earth science data. Semantic understanding of text by automated
tools is enabled through the combined use of i) ontologies and ii)
software tools that can interpret the ontologies. An ontology is a formal
representation of technical concepts and their interrelations in a form
that supports domain knowledge. Generally, an ontology is hierarchical,
with child concepts having explicit properties to specialize their parent
concept(s).

  A Semantic Web emerges if terms on web pages are associated with
corresponding elements in ontologies. This is accomplished by placing an
XML tag around a term to identify its associated
ontology namespace. A search tool potentially can use these metadata tags
to distinguish different uses of the same term (e.g. “fall” as a season
vs. “fall” as a downward motion) to eliminate false hits. It also can
locate resources without having an exact keyword match, because a term
such as “El Nino” has an equivalent definition in terms of its defining
scientific components.

  To support potential Semantic Web activities, we developed a collection
of ontologies for the Earth and environmental sciences and supporting
areas. We created a common sense knowledge base of the Earth sciences
using the Web Ontology Language (OWL) [1], a standard adopted by the W3C.
We use these ontologies in a prototype search tool that improves
performance by creating additional relevant search terms based on the
underlying semantics. We demonstrate how such a knowledge base can be
“virtual” by adding a wrapper around remote, dynamic data repositories.

1.1 SEMANTIC INTEROPERABILITY

  In the early days of computing, an initial level of data
interoperability resulted when data structures (arrays) created on one
computer system were readable by another computer. Data formats such as
HDF emerged to extend this level of interoperability to more complex data
structures and across vendor platforms, and enabled the preservation of
variable names. The Internet later brought on protocols such as DODS,
which supported modification of the data structure (subset extraction)
during the transfer. Exchanges of this type say nothing about the
scientific interpretation of the data on the receiving end. A variable
name is assigned to a data structure, but human intervention is required
to make sense of it.

  The HDF-EOS format remedied the semantic interoperability problem for
independent variables by standardizing the naming convention of spatial
and temporal parameters. The Open GIS Consortium (OGC) provides a similar
level of spatial/temporal interoperability in its Web Map Service (WMS)
and Web Coverage Service (WCS) protocols. The HDF-EOS and OGC solutions
enable a data seeker to query and access data by spatial/temporal
parameters rather than by array row/columns (which would require human
intervention). Thus a software tool understanding these conventions can
access any HDF-EOS or OGC-compliant dataset and be guaranteed that the
spatial-temporal interpretation is known.

  Semantic interoperability for dependent variables has generally meant
the use of controlled keywords. For instance, the NASA GCMD defines
approximately 1000 controlled keywords, each with a dictionary
definition. Such a representation does not support the computer reasoning
that would be required to respond to general queries or chain services
together. It does not support inheritance of concepts for knowledge
reuse, does not provide a rich expression of the relationships between
the keywords, and is not directly extendable by the user. This project
addresses a more scalable solution to semantic interoperability in the
context of the Earth sciences.

                2. ONTOLOGY DEVELOPMENT

  An ontology is a formal representation of technical concepts and their
interrelations in a form that captures domain knowledge. Generally, an
ontology is hierarchical, with child concepts having explicit properties
to specialize their parent concept(s). Thus, “hydrosphere” is the parent
concept of “surface water”, which is a parent of “river”, which is a
parent of “Mississippi River”, etc. In this paper, we describe our
experiences with the development of Earth and environmental science
ontologies.

  In the initial year of ESTO funding, we created the Semantic Web for
Earth and Environmental Terminology (SWEET) [2] to prototype how a
Semantic Web can be implemented in the Earth sciences. We used the terms
in the Global Change Master Directory (GCMD) [3] as a starting point in
manually populating the ontologies, but reorganized and expanded the
concepts to form a scalable framework. Later, we incorporated an
analogous keyword list used in the Earth System Modeling Framework
(ESMF) [4].

Earth Realm
  The “spheres” of the Earth constitute an EarthRealm ontology, based
upon the physical properties of the planet. Elements of this ontology
include “atmosphere”, “ocean”, and “solid earth”, and associated
subrealms (such as “ocean floor” and “atmospheric boundary layer”). The
subrealms generally are distinguished from their parent classes based on
the property of altitude, e.g., “troposphere” is the subclass of
“atmosphere” where elevation is between 0 and 15 km.

Non-Living Element (Substance)
  This ontology includes the non-living building blocks of nature, such
as: particles, electromagnetic radiation, and chemical compounds.

Living Element
  This ontology includes plant and animal species, imported from the
GCMD “biosphere” taxonomy.

Physical Property
  A separate ontology was developed for physical properties that might
be associated with any component of EarthRealm, NonLivingElements, or
LivingElements. PhysicalProperties include “temperature”, “pressure”,
“height”, “albedo”, etc.

Units
  Units are defined using Unidata’s UDUnits. The resulting ontology
includes conversion factors between various units. Prefixed units such
as km are defined as a special case of m with the appropriate conversion
factor.

Numerical Entity
  Numerical extents include: interval, point, 0, R^2, … Numerical
relations include: greaterThan, max, … We defined multidimensional
concepts such as coordinate systems, mathematical operators and
functions.

Temporal Entity
  Time is essentially a numerical scale with terminology specific to the
temporal domain. We developed a time ontology in which the temporal
extents and relations are special cases of numeric extents and
relations, respectively. Temporal extents include: duration, season,
century, 1996, … Temporal relations include: after, before, …

Spatial Entity
  Space is essentially a 3-D numerical scale with terminology specific
to the spatial domain. We developed a space ontology in which the
spatial extents and relations are special cases of numeric extents and
relations, respectively. Spatial extents include: country, Antarctica,
equator, … Spatial relations include: above, northOf, …

Phenomena
  A phenomena ontology is used to define transient events. A phenomenon
crosses the bounds of other ontology elements. Examples include:
hurricane, earthquake, El Nino, volcano, terrorist event, and each has
associated Time, Space, EarthRealms, NonLivingElements, LivingElements,
etc. We also include specific instances of recent phenomena.

Human Activities
  This ontology is included for representing the impacts of
environmental phenomena on human activities such as commerce, fisheries,
etc.

Data
  The data ontology provides support for dataset concepts, including
representation, storage, modeling, format, resources, grid computing,
and distribution.

2.1. ONTOLOGIES AS A UNIFYING KNOWLEDGE FRAMEWORK
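  As a concrete picture of the ontologies listed above, the following
Python sketch models a few of their ideas: a parent-child hierarchy in
which a child specializes its parent through a property restriction, a
prefixed unit defined through a conversion factor, and a spatial relation
treated as a special case of a numeric relation. It is only an
illustration under stated assumptions. The class and function names
(Concept, to_meters, above, in_subrealm) are hypothetical and are not
taken from SWEET, which is encoded in OWL rather than Python.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Tuple


@dataclass
class Concept:
    """One node of a hierarchy (illustrative only, not SWEET's encoding)."""
    name: str
    parent: Optional["Concept"] = None
    # Property restrictions that specialize the parent concept,
    # e.g. an altitude range in km.
    restrictions: dict = field(default_factory=dict)

    def ancestors(self) -> Iterator["Concept"]:
        """Walk up the hierarchy from the direct parent to the root."""
        node = self.parent
        while node is not None:
            yield node
            node = node.parent

    def is_a(self, other: "Concept") -> bool:
        """True if this concept is `other` or one of its descendants."""
        return other is self or any(a is other for a in self.ancestors())


# EarthRealm: "troposphere" is the subclass of "atmosphere" where the
# elevation is between 0 and 15 km (values from the text above).
earth_realm = Concept("EarthRealm")
atmosphere = Concept("atmosphere", earth_realm)
troposphere = Concept("troposphere", atmosphere, {"altitude_km": (0.0, 15.0)})

# Units: km is a special case of m with a conversion factor.
UNIT_TO_METERS = {"m": 1.0, "km": 1000.0}


def to_meters(value: float, unit: str) -> float:
    return value * UNIT_TO_METERS[unit]


def greater_than(a: float, b: float) -> bool:
    """The generic numeric relation."""
    return a > b


def above(a: Tuple[float, str], b: Tuple[float, str]) -> bool:
    """Spatial relation 'above' as 'greater than' applied to altitudes."""
    return greater_than(to_meters(*a), to_meters(*b))


def in_subrealm(concept: Concept, altitude_km: float) -> bool:
    """Check an altitude against the concept's altitude restriction."""
    low, high = concept.restrictions["altitude_km"]
    return low <= altitude_km <= high
```

  With these definitions, troposphere.is_a(earth_realm) holds through
the parent chain, and above((2, "km"), (1500, "m")) compares the two
altitudes only after converting both to meters.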
  The first several ontologies listed above represent orthogonal
concepts (or dimensions), often called facets. Traversing down the tree
associated with a facet follows the scientific path of reductionism by
adding details to more abstract concepts.

  A completely different type of ontology is encountered in
“phenomena”, as this category is synergetic rather than orthogonal to
the others. The phenomena entries describe synthesizing concepts that
utilize elements from the other ontologies (e.g., a hurricane is
associated with particular coastal areas, and is characterized by high
winds, rainfall, flood impacts, etc.). Thus, phenomena are defined in
terms of combinations of elements from the faceted concepts. The “Human
activities” ontology also is a unifying, rather than reductionist,
collection.

  Taken together, these two complementary approaches mirror the
scientist’s dual processes of reductionism and synthesis. This structure
provides a relatively complete framework for capturing scientific
knowledge. Using OWL, we relate concepts in these two approaches.
Generally, unifying concepts are built up and defined in terms of
individual facets. Alternatively, facets can be defined through
projection operations on unifying concepts.

2.2 ONTOLOGY LANGUAGES

  An ontology is expressed using a language that is typically a
specialization of XML. XML is widely supported by existing software
tools and is platform-independent. The World Wide Web Consortium (W3C)
has adopted two XML languages as its standard method of representing
ontologies: the Resource Description Framework (RDF) and the Web
Ontology Language (OWL). Each of these languages is rich enough to
express the hierarchical structures inherent in knowledge
representation. RDF specializes XML by standardizing meanings for:
class, subclass, property, subproperty, domain, range, etc. OWL is a
further specialization of RDF; it adds standard meaning for:
cardinality, inverse properties, synonyms, and many more concepts in
three versions: OWL Lite, OWL DL, and OWL Full. The four languages (RDF,
OWL Lite, OWL DL, OWL Full) offer a nested set of language capabilities.
We adopted OWL Full due to its anticipated widespread acceptance over
the coming years. Our ontologies initially were written in the DARPA
Agent Markup Language (DAML), a predecessor to OWL, and we later
converted them to OWL Full.

  OWL has support for numbers only through a W3C specification [5].
This spec defines number types (e.g., real numbers, unsigned integer)
and some abilities to create derivations of these types (e.g. the closed
interval between 0 and 1). It contains no operations or relations on
these numbers. This is a deficiency, because basic scientific concepts
are defined in terms of numeric concepts. For example, “brighter”,
“higher”, “later”, and “more northerly” are special cases of the
“greater than” relation, when applied in specific domains. In
particular, spectral regions are defined in terms of wavelength (e.g.
visible light is between roughly 0.4 and 0.7 micrometers), atmospheric
layers are defined by altitude (e.g. troposphere is between 0 and 15
km), etc. This specification also has no notion of a multidimensional
space R^n.

  Repositories of OWL ontologies exist to enable the work of others to
be extended. However, at present there are no ontologies supporting
numeric operations (e.g. “greater than”, “max”). Several spatial and
temporal ontologies exist, but these ontologies do not exploit the fact
that space and time are numerical scales. Therefore, the numerical,
space, time, and event ontologies that we developed for SWEET will be
submitted to a general OWL ontology library.

  XML-based languages such as OWL are well suited to data and model
exchange, but are less practical for storage and query of large
ontologies. Existing database management systems provide the needed
functionality in storage and indexing of
robust ontologies, including support for data integrity, concurrency
control, etc. Consequently, we adopted the PostgreSQL object-relational
DBMS to store the names and parent-child relations of our ontology
elements. We created two-way translators between the internal DBMS
representation and the standard XML representation of OWL properties. By
placing all term declarations in the DBMS, any search for terms is very
rapid.

  For representation of spatial concepts, we used bounding polygons to
describe regions, where possible. Polygons are a native datatype in
PostgreSQL.

       3. SEMANTIC INTEROPERABILITY WITH OTHER INITIATIVES

  The Earth Science Markup Language (ESML) combines an XML-based
language for describing datasets with an API read library. Its XML tags
are of two types: syntactic (for reading data) and semantic (for
interpreting data). ESML no longer maintains semantic tags within its
libraries and relies instead on external ontologies to provide that
functionality. Thus, SWEET tags may be used to provide the semantic
content of any ESML file. Examples include: science subject, geographic
coordinate system, scaling factors & offsets, etc.

  ESMF is an effort to make large Earth System models interoperable.
Model interoperability involves knowing input/output compatibility and
parameter tables. We defined within SWEET the model parameters required
to ascertain model interoperability. ESMF also uses the list of 350
variable names defined under the CF/Standard name conventions. Most of
these terms are concatenations of several terms (e.g.
temperature_at_top_of_boundary_layer). We mapped the terms to the SWEET
ontology, so that this list of terms could grow more naturally. We are
working with Cecelia DeLuca, Project Associate at UCAR, to ensure
compatibility between ESMF and SWEET, though there is no formal
commitment from the ESMF project to use our ontology at this time.

  The Earth System Grid (ESG) [6] is a DOE-funded project to use grid
computing in support of Earth system modeling. We incorporated the grid
concepts into the SWEET data ontology and are working with Line
Pouchard, ESG Project Associate at ORNL, to achieve consistency between
ESG and SWEET.

  The Open GIS Consortium uses standard representations for coordinate
systems and geometric objects. While it was not practical to include all
of these entities in the SWEET spatial ontology, we included the more
widely used ones.

          4. DYNAMIC ACCESS TO ONTOLOGY ELEMENTS

  Many Earth science facts reside in large external databases. We
created OWL wrappers to make the contents of several of these databases
accessible as if they were local ontology elements. The databases
include three gazetteers: the CIA World Factbook [7], the Getty
Thesaurus [8], and the Calle Global Gazetteer [9]. Gazetteers translate
vernacular names to and from geographic coordinates. We added polygon
boundaries to many gazetteer entries that otherwise contained only
rectangular bounding boxes. Also included are the USGS real-time list of
earthquakes [10] and the Heavens Above real-time list of satellite
locations [11]. A Web Map Server (WMS) [12] import capability was added
to acquire images and maps accessible through WMS-compliant servers. A
map-based interface demonstrates all of these capabilities by querying
the external sources in response to user requests.

  The gazetteer entries generally include fields for a bounding
rectangle but not for bounding polygons. The polygon information is
available separately from other sources for state and international
boundaries. We inserted the bounding polygon data into the internal
SWEET database. In many
cases, the size of the polygon exceeded what could be stored natively in
PostgreSQL, and we reduced the spatial resolution.

              5. INTELLIGENT SEARCH ENGINE

  A search tool that is aided by an ontology can locate resources
without having an exact keyword match. To demonstrate this capability,
we created a search tool that consults the SWEET ontology to find
related terms. These terms may be synonymous (same as), more specific
(child of), or less specific (parent of) than those requested. The tool
then submits the union of these terms to the GCMD search tool and
presents the results. The results confirmed that the expanded query
found additional relevant entries, relative to the exact keyword search.
The search tool is implemented as a web service using RDQL (RDF Data
Query Language). Once the synonyms and parent-child relationships have
been discovered, the augmented query returns the resulting GCMD DIF
summaries. An extension of this search tool will be incorporated into
the Earth Science Information Partners (ESIP) Federation Interactive
Network for Discovery (FIND).

                  6. ROADMAPS

  The following mini-strategies describe opportunities for exploiting
semantic interoperability in future NASA and ESTO work. These
recommendations may be difficult to implement immediately, due to the
tendency of scientists to retain their narrow disciplinary perspectives.
But emerging demands for cross-disciplinary science and automated data
services will rely heavily on semantic interoperability.

Data Mining
  The target of a data mining algorithm generally is a phenomenon of
interest, as defined within the mining algorithm. The definition of the
phenomenon often is hidden or incomplete, as there is no standard
language for its expression. A complete description (including
spatial/temporal resolution, model assumptions, etc.) is required to
enable community comparisons and review. By defining a target concept in
terms of ontology concepts, such a representation can be articulated.
This functionality is particularly important in on-board processing
systems, where knowledge must be reused to identify targets of interest
for enhanced data collection. It is recommended that data mining
activities be required to use the full expressive capabilities of an
ontology. Without a formal requirement, it is unlikely that algorithm
developers will voluntarily contribute this information.

Web Services
  Based on current business application trends, it is likely that a wide
range of Web services will be established in the Earth sciences to
locate, acquire and use data. WSDL and UDDI are currently used to
describe and advertise services, respectively. WSDL and UDDI address the
semantics of requests only to the extent that ontologies are referenced
in the service descriptions. It is recommended that future data-oriented
Web service descriptions be required to use the full expressive
capabilities of an ontology. This suggestion is especially pertinent in
a grid computing environment, where ontologies can describe what
services may be chained together and how this choreography is
implemented.

Science Domain Specialist Involvement
  Obtaining review of existing SWEET ontologies has been very difficult.
This situation is due in part to the limited tools available for
ontology visualization, as the dimensionality of the semantic space is
very large. Dedicated workshops focusing on 3-D walkthroughs of the
semantic space might inspire much greater community involvement in the
review process. Making this happen will require investments in the
relevant 3-D visualization technologies and support of workshops for
domain specialists.
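
  Returning to the search tool of Section 5, the term expansion it
performs can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the SWEET implementation,
which the report describes as a web service over the ontology database.
The PARENT table reuses the hydrosphere example from Section 2, while the
synonym entry, the function names, and the OR-joined query string are
hypothetical and do not reflect the actual GCMD query syntax.

```python
# Hypothetical fragment of an ontology, stored as child -> parent links.
# The hierarchy follows the "hydrosphere" example of Section 2.
PARENT = {
    "surface water": "hydrosphere",
    "river": "surface water",
    "Mississippi River": "river",
}

# Illustrative synonym table; this entry is an assumption, not SWEET data.
SYNONYMS = {"river": {"stream"}}


def expand(term: str) -> set:
    """Return the term plus its synonyms, direct children, and parent."""
    related = {term}
    related |= SYNONYMS.get(term, set())                          # same as
    related |= {child for child, parent in PARENT.items()
                if parent == term}                                # child of
    if term in PARENT:
        related.add(PARENT[term])                                 # parent of
    return related


def build_query(term: str) -> str:
    """Join the expanded terms into one query (assumed OR syntax)."""
    return " OR ".join(sorted(expand(term)))
```

  Calling expand("river") yields the requested term together with one
synonym, one more specific term, and one less specific term, which is the
union that would then be passed on to a keyword search.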
Standards
  Currently, NASA-funded Earth science data products must be classified
using the GCMD Science Keywords. It is recommended that this requirement
be relaxed to allow an alternative classification, such as
representation in a SWEET ontology. This recommendation is of secondary
importance because we provide a transformation table on our Web site
between GCMD and SWEET representations.

                 REFERENCES

Fensel, D., J. Hendler, H. Lieberman, W. Wahlster
(Eds.), 2003, Spinning the Semantic Web, MIT
Press, Cambridge, 479 pp.

           INTERNET REFERENCES

[1] OWL. http://www.w3.org/TR/owl-ref

[2] SWEET. http://sweet.jpl.nasa.gov

[3] GCMD Science Keywords and Directory
Keywords. http://gcmd.nasa.gov/Resources/valids

[4] CF Standard name table.
http://www.cgd.ucar.edu/cms/eaton/cf-metadata/standard_name.html

[5] XML Schema Part 2: Datatypes.
http://www.w3.org/TR/xmlschema-2

[6] Earth System Grid. http://earthsystemgrid.org

[7] CIA World Factbook.
http://www.cia.gov/cia/publications/factbook/

[8] Getty Thesaurus of Place Names.
http://www.getty.edu/research/conducting_research/vocabularies/tgn/

[9] Calle Global Gazetteer.
http://www.calle.com/world

[10] Earthquake List for World.
http://earthquake.usgs.gov/recenteqsww/Quakes/quakes_all.html

[11] Heavens Above. http://www.heavens-above.com

[12] Web Map Server. http://opengis.org