Informatics at UCSB MSI and NCEAS (PowerPoint)

					 http://knb.ecoinformatics.org                 http://seek.ecoinformatics.org



Science Environment for Ecological Knowledge: EcoGrid
                            Matthew B. Jones
        National Center for Ecological Analysis and Synthesis
               University of California Santa Barbara

Research Objectives

   Access to ecological, environmental, and biodiversity data
       Enable data sharing & re-use
       Enhance data discovery at global scales


   Scalable analysis and synthesis
       Taxonomic, Spatial, Temporal, Conceptual integration of data
            Address data heterogeneity issues
       Enable communication and collaboration for analysis
       Enable re-use of analytical components

   Collaborators
       NCEAS, UNM, SDSC, U Kansas
       Vermont, Napier, ASU, UNC
                          SEEK Components

Science Environment for Ecological Knowledge

   Kepler
            Modeling scientific workflows
   EcoGrid
            Making diverse environmental data systems interoperate
   Semantic Mediation System
            “Smart” data discovery and integration

   Knowledge Representation WG
   Taxon WG
   BEAM WG
   Education, Outreach, Training
                         Scientific Workflows

   Model the way scientists work with their data now
       Mentally coordinate export and import of data among software systems


   Workflows emphasize data flow

   Output generation includes creating appropriate metadata
       The analysis workflow itself becomes metadata
       The workflow describes the data lineage as it has been transformed
       Derived data sets can be stored in EcoGrid with provenance
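
The idea that the workflow itself becomes metadata can be pictured with a minimal Python sketch. The names `Step` and `run_workflow`, the record layout, and the sample values are hypothetical and chosen for illustration; they are not Kepler or EcoGrid APIs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Step:
    name: str                    # human-readable label recorded in the lineage
    func: Callable[[Any], Any]   # transformation applied to the data

def run_workflow(data: Any, steps: List[Step], source_id: str) -> Dict[str, Any]:
    """Run each step in order and record the lineage alongside the output."""
    lineage = []
    for step in steps:
        data = step.func(data)
        lineage.append(step.name)
    # The derived data set carries its own provenance as metadata.
    return {"data": data, "metadata": {"source": source_id, "lineage": lineage}}

steps = [
    Step("drop missing values", lambda xs: [x for x in xs if x is not None]),
    Step("convert cm to m", lambda xs: [x / 100 for x in xs]),
]
result = run_workflow([150, None, 250], steps, source_id="example.1.1")
```

Archiving `result` as a unit would keep the derived values and the steps that produced them together.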




   [Figure labels: "Query EcoGrid to find data"; "Archive output to EcoGrid with workflow metadata"]
                       Kepler: scientific workflows




• Collaborative effort of SEEK, SciDAC/SDM, GEON, Ptolemy Project
Kepler understands EML data
Kepler: molecular biology example
                           SEEK EcoGrid

   Goal: allow diverse environmental data systems to interoperate
       Hides complexity of underlying systems using lightweight interfaces
       We have standardized data via EML, need standard APIs
       Integrate diverse data networks from ecology, biodiversity, and
        environmental sciences


   Data systems
       Any system can implement these interfaces
       Prototyping using:
            Metacat, SRB, DiGIR, Xanthoria, etc.


   Supports multiple metadata standards
       EML, Darwin Core as foci
                        EcoGrid client interactions

   Modes of interaction
       Client-server
       Fully distributed
       Peer-to-peer



   EcoGrid Registry
       Node discovery
       Service discovery



   Aggregation services
       Centralized access
       Reliability
       Data preservation
                       EcoGrid Query Interfaces

   Provides a mechanism for search and retrieval of metadata
    and federated data
   Supports third party interaction with search results –
    forwarding of result set identifiers to another service
    instance for retrieval
   Different levels of compliance
       Low barrier for participation
       Bulk of data will be accessible through Type I
                    Query Interfaces Implemented

   Initial prototype to support query and retrieval
    from:
       Storage Resource Broker (SRB)
       Metacat
       Distributed Generic Information Retrieval (DiGIR)
       Xanthoria


   Encourage additional experimentation with and
    feedback based on other system implementations
               EcoGrid Query Level I

   Basic, entry level exposure of data and
    metadata for EcoGrid and SEEK
   Response contains data – intended for direct
    communications rather than 3rd party
    indirection


ResultsetType query(SessionID,QueryType)

byte[] get(SessionID,objectID)
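
A rough Python sketch of the two Level I calls follows. The class `InMemoryEcoGridNode` is a hypothetical stand-in, and the query argument is simplified to a genus string rather than a full QueryType document; neither is part of the real EcoGrid service.

```python
from typing import Dict, List

class InMemoryEcoGridNode:
    """Toy stand-in for a Level I node (assumption, not the real service)."""

    def __init__(self, objects: Dict[str, bytes], records: List[dict]):
        self._objects = objects
        self._records = records

    def query(self, session_id: str, genus: str) -> List[dict]:
        # Level I: the response carries the matching records directly.
        return [r for r in self._records if r["Genus"] == genus]

    def get(self, session_id: str, object_id: str) -> bytes:
        # Return the raw bytes of one stored object.
        return self._objects[object_id]

node = InMemoryEcoGridNode(
    objects={"mvz1": b"<eml/>"},
    records=[{"id": "mvz1", "Genus": "Peromyscus"},
             {"id": "mvz2", "Genus": "Mus"}],
)
hits = node.query("session-1", "Peromyscus")
payload = node.get("session-1", hits[0]["id"])
```

The caller receives data in the response itself, so no third party needs to be involved.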
                Query Conditions

 Language independent representation of a query
  structure
 Transformed into the appropriate native language
  of the data store
Example:
<AND>
  <condition operator="LIKE"
             concept="ScientificName">peromyscus%
  </condition>
  <condition operator="NOT EQUALS"
             concept="DecimalLatitude">NULL
  </condition>
</AND>
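
As an illustration of transforming the query structure into a native language, the sketch below turns a condition tree like the one above into a parameterized SQL WHERE clause. The function `to_sql`, the `OR` element, the `EQUALS` operator, and the mapping of concepts to column names are assumptions made for the example; a real data store would define its own mapping.

```python
import xml.etree.ElementTree as ET

QUERY = """
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
"""

def to_sql(node):
    """Recursively translate a condition tree into (where_clause, parameters)."""
    if node.tag in ("AND", "OR"):
        parts, params = [], []
        for child in node:
            clause, child_params = to_sql(child)
            parts.append(clause)
            params.extend(child_params)
        return "(" + f" {node.tag} ".join(parts) + ")", params
    concept = node.get("concept")
    operator = node.get("operator")
    value = (node.text or "").strip()
    if not concept.isidentifier():
        raise ValueError(f"unsafe concept name: {concept!r}")
    if value == "NULL":
        # SQL tests for NULL with IS / IS NOT rather than = or <>.
        null_ops = {"EQUALS": "IS NULL", "NOT EQUALS": "IS NOT NULL"}
        return f"{concept} {null_ops[operator]}", []
    sql_op = {"LIKE": "LIKE", "EQUALS": "=", "NOT EQUALS": "<>"}[operator]
    return f"{concept} {sql_op} ?", [value]

where, params = to_sql(ET.fromstring(QUERY))
```

Values travel as bound parameters rather than being pasted into the SQL text, which keeps the translation safe against injection.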
                     Specifying the Resultset

    Specify the list of concepts (fields) to be returned in the
     resultset
    Simple paths used to identify elements or document
     subtrees
    Effectively flattens the structure of the records, but allows
     generic representation

Example:
    <returnfield>/ScientificName</returnfield>
    <returnfield>/Longitude</returnfield>
    <returnfield>/Latitude</returnfield>
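
How simple paths flatten a record can be sketched as follows. The helper `select_fields`, the nested `Collector` element, and the sample values are hypothetical and made up for illustration.

```python
def select_fields(record: dict, paths: list) -> dict:
    """Flatten a nested record down to the requested simple paths."""
    flat = {}
    for path in paths:
        node = record
        for key in path.strip("/").split("/"):
            # Walk one level down; stop if the path leaves the record.
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        flat[path] = node
    return flat

record = {
    "ScientificName": "Peromyscus leucopus",
    "Longitude": -119.7,
    "Latitude": 34.4,
    "Collector": {"Name": "unknown"},   # nested element, not requested below
}
row = select_fields(record, ["/ScientificName", "/Longitude", "/Latitude"])
```

The nested structure of the source record is lost in `row`, but every source can return the same generic shape.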

                  Full Query Example

<egq:query queryId="query-digir.1.1"
    system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
    <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
    <returnfield>/ScientificName</returnfield>
    <returnfield>/Longitude</returnfield>
    <returnfield>/Latitude</returnfield>
    <title>Peromyscus genus query</title>
    <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>


                          Query Result Set Structure
<rs:resultset resultsetId="foo.1.1"
    system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1
    ../../src/xsd/resultset.xsd">

   <resultsetMetadata>
       <sendTime>2003-05-02T16:45:50-09:00</sendTime>
       <startRecord>1</startRecord>
       <endRecord>2</endRecord>
       <recordCount>2</recordCount>
       <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
       <system
        id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
   </resultsetMetadata>

    <record number="1"
             system="1"
             identifier="mvz1">
        <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
        <returnField name="Longitude">100</returnField>
        <returnField name="Latitude">200</returnField>
     </record>
     …
</rs:resultset>
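
A client might read such a resultset into rows as sketched below, using Python's standard `xml.etree.ElementTree`. The function `parse_resultset` and the output row layout are assumptions, and the embedded document is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET

RS_NS = "ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
DOC = f"""<rs:resultset resultsetId="foo.1.1" xmlns:rs="{RS_NS}">
  <resultsetMetadata>
    <recordCount>1</recordCount>
    <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
  </resultsetMetadata>
  <record number="1" system="1" identifier="mvz1">
    <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
    <returnField name="Longitude">100</returnField>
    <returnField name="Latitude">200</returnField>
  </record>
</rs:resultset>"""

def parse_resultset(xml_text: str) -> list:
    """Turn a resultset document into a list of flat dictionaries."""
    root = ET.fromstring(xml_text)
    # Map the system ids declared in the metadata to their source URLs.
    systems = {s.get("id"): s.text
               for s in root.findall("resultsetMetadata/system")}
    rows = []
    for rec in root.findall("record"):
        row = {f.get("name"): f.text for f in rec.findall("returnField")}
        row["identifier"] = rec.get("identifier")
        row["source"] = systems.get(rec.get("system"))
        rows.append(row)
    return rows

rows = parse_resultset(DOC)
```

Each row keeps a pointer to the system it came from, which matters once results from several sources are merged.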
                EcoGrid Query Level II

   More detailed handling of results
   Uses resultset identifiers (RSIDs) to identify resultsets:
    handles that can be passed to a third party

RSID search(SessionID,query)

Resultset retrieve(SessionID,RSID,start,numrecs)

query decodeResultsetIdentifier(SessionID,RSID)

statusinfo getResultStatus(SessionID)

int transfer(SessionID,sourceURL,destURL,ObjectID)
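
The search and retrieve pair can be sketched in Python as follows. The class `LevelTwoNode`, the use of a UUID as the RSID, the predicate-style query, and the 1-based `start` are assumptions for illustration only; the remaining Level II calls are omitted.

```python
import uuid

class LevelTwoNode:
    """Toy Level II node: search returns a handle, retrieve pages through it."""

    def __init__(self, records):
        self._records = records
        self._resultsets = {}   # RSID -> (query, matching records)

    def search(self, session_id, query):
        # Evaluate the query once and hand back only an identifier.
        matches = [r for r in self._records if query(r)]
        rsid = str(uuid.uuid4())
        self._resultsets[rsid] = (query, matches)
        return rsid

    def retrieve(self, session_id, rsid, start, numrecs):
        # start is treated as 1-based here (an assumption).
        _, matches = self._resultsets[rsid]
        return matches[start - 1:start - 1 + numrecs]

node = LevelTwoNode([{"n": i} for i in range(1, 6)])
rsid = node.search("session-1", lambda r: r["n"] % 2 == 1)   # n = 1, 3, 5
# A different party holding only the RSID can fetch a page of results.
page = node.retrieve("session-2", rsid, start=2, numrecs=2)
```

Because the RSID is an opaque handle, the party that runs the search need not be the one that downloads the records.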
               EcoGrid Write

   Used to push data back to sources (e.g.
    publishing EML documents)
   Depends on the availability of an
    authentication and access control system

put(sessionID, objectID, object, type)

delete(sessionID,objectID)
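
The dependence of put and delete on access control can be illustrated with a minimal sketch. The class `WritableStore` and its allow-list of session identifiers are hypothetical; the actual authentication and access control system is not described here.

```python
class WritableStore:
    """Toy store whose put/delete require a session with write permission."""

    def __init__(self, writers):
        self._writers = set(writers)   # session ids allowed to modify data
        self._objects = {}             # object id -> (type, bytes)

    def _check(self, session_id):
        if session_id not in self._writers:
            raise PermissionError(f"session {session_id!r} may not write")

    def put(self, session_id, object_id, obj: bytes, obj_type: str):
        self._check(session_id)
        self._objects[object_id] = (obj_type, obj)

    def delete(self, session_id, object_id):
        self._check(session_id)
        del self._objects[object_id]

store = WritableStore(writers={"alice-session"})
store.put("alice-session", "doc.1.1", b"<eml/>", "EML")
try:
    store.delete("guest-session", "doc.1.1")
    denied = False
except PermissionError:
    denied = True
```

Every write goes through the same permission check, so the store never changes on behalf of an unauthorized session.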
                       Data Instance Query

   New requirement to support direct query and retrieval with
    arbitrary data sets
   Generally no common schemas between different instances

   Could either
       Push data instance to service that can query object (e.g. the SRB)
       Implement interface at the data instance location
   Simple JDBC / SQL interface?
    dbSchema getDataSchema(sessionID,objectID)

    dbResultset search(sessionID,objectID,SQL)
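
The two proposed calls can be tried out with Python's built-in `sqlite3` module standing in for JDBC. The class `DataInstance`, the table `obs`, its columns, and the sample rows are all made up for this sketch.

```python
import sqlite3

class DataInstance:
    """Toy data instance backed by an in-memory SQLite table."""

    def __init__(self):
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE obs (species TEXT, abundance INTEGER)")
        self._db.executemany(
            "INSERT INTO obs VALUES (?, ?)",
            [("Peromyscus leucopus", 4), ("Mus musculus", 7)],
        )

    def get_data_schema(self, session_id, object_id):
        if not object_id.isidentifier():
            raise ValueError(f"unsafe object id: {object_id!r}")
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        rows = self._db.execute(f"PRAGMA table_info({object_id})").fetchall()
        return [(r[1], r[2]) for r in rows]

    def search(self, session_id, object_id, sql, params=()):
        # object_id is unused here because the toy instance has one table.
        return self._db.execute(sql, params).fetchall()

inst = DataInstance()
schema = inst.get_data_schema("session-1", "obs")
rows = inst.search("session-1", "obs",
                   "SELECT species FROM obs WHERE abundance > ?", (5,))
```

Asking for the schema first lets a client build a sensible query even when data sets share no common structure.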
                     Building the EcoGrid


   [Map figure; site labels shown: NTL, AND, HBR, VCR, LUQ]

   Node types in legend
       Metacat node, SRB node, VegBank node, DiGIR node, Xanthoria node, Legacy system

   Networks
       LTER Network (24)
       Natural History Collections (>> 100)
       Organization of Biological Field Stations (180)
       UC Natural Reserve System (36)
       Partnership for Interdisciplinary Studies of Coastal Oceans (4)
       Multi-agency Rocky Intertidal Network (60)
Metadata-driven analysis cycle
                  Acknowledgements

This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154,
9904777, 0131178, 9905838, 0129792, and 0225676.
The National Center for Ecological Analysis and Synthesis, a Center
funded by NSF (Grant Number 0072909), the University of California,
and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term
Ecological Research Network Office), San Diego Supercomputer
Center, University of Kansas (Center for Biodiversity Research)

				