Informatics at UCSB MSI and NCEAS

                   Kepler: A Workflow Tool for
             Heterogeneous Ecological Data Analysis
                                    Chad Berkley

              National Center for Ecological Analysis and Synthesis (NCEAS),
                          University of California Santa Barbara
          Long Term Ecological Research Network Office, University of New Mexico
                                   University of Kansas
                             San Diego Supercomputer Center

Edinburgh, Scotland                    December 4, 2003

   • Quick history
   • SEEK overview
   • Ecological Metadata Language
   • Using workflows in Ecology
   • Workflow editing with Kepler
   • Future visions

   • Late 1990s – patterns noticed in the problems
     surrounding data synthesis at NCEAS
   • 1999 – Michener et al. paper on ecological metadata
   • 2000 – Knowledge Network for Biocomplexity (KNB)
       • Morpho, Metacat, Ecological Metadata Language
       • Some footholds into workflow creation and execution
   • 2003 – Science Environment for Ecological
     Knowledge (SEEK) Grant
       • Continues the work done on the KNB grant
       • Emphasis on using metadata for advanced data

      SEEK approach

   • General approach to specific ecological problems
   • Data described with adequate metadata in a
     grid-accessible repository
   • Reasoning engine (ontology based) to locate
     and extract data and processes
   • Modeling system to put it all together and
     control execution flow

        SEEK Components

   • Ecogrid
       • Analysis Library
       • Metadata and data repository
   • Semantic Mediation System
       • Controlled semantic vocabulary
       • Ontological discovery system
   • Analysis and Modeling System (Kepler)
       • Workflow control system
       • Utilizes resources from other components

[Figure: SEEK Architecture]

      Ecological Metadata Language

   • Common language for archiving and
     transport of datasets
   • XML based
   • Designed for/by the ecological community
   • Describes physical and logical structure of data
   • Also includes project, literature and
     software information
   • SEEK will add semantic information
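To make the "XML based" point concrete, here is a minimal sketch that reads attribute descriptions out of a heavily simplified, EML-like fragment. The element names follow the general shape of EML (dataTable, attributeList, attribute), but the fragment is illustrative only and not schema-valid; the sample table and the helper name describe_tables are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, EML-like fragment for illustration only; real EML documents
# are namespaced and carry far more detail than this.
EML_LIKE = """
<dataset>
  <title>Example plant survey</title>
  <dataTable>
    <entityName>plants.csv</entityName>
    <attributeList>
      <attribute>
        <attributeName>site</attributeName>
        <storageType>string</storageType>
      </attribute>
      <attribute>
        <attributeName>count</attributeName>
        <storageType>integer</storageType>
      </attribute>
    </attributeList>
  </dataTable>
</dataset>
"""


def describe_tables(xml_text):
    """Return {entity name: [(attribute name, storage type), ...]}."""
    root = ET.fromstring(xml_text)
    tables = {}
    for table in root.iter("dataTable"):
        name = table.findtext("entityName")
        attrs = [
            (a.findtext("attributeName"), a.findtext("storageType"))
            for a in table.iter("attribute")
        ]
        tables[name] = attrs
    return tables


tables = describe_tables(EML_LIKE)
```

Because the description is structured, a program can find out which columns a table has and how they are stored without opening the data file itself.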

       Workflows in SEEK

   • In the SEEK model, data ingestion/cleaning is
     metadata driven (specifically with EML)
   • Output generation includes creating appropriate metadata
   • The analysis pipeline itself becomes metadata

        Metadata-driven data ingestion

   • Key information needed to read and machine-process
     a data file is in the metadata
       • File descriptors (CSV, Excel, RDBMS, etc.)
       • Entity (table) and Attribute (column) descriptions
            • Name
            • Type (integer, float, string, etc.)
            • Codes (missing values, nulls, etc.)
            • In the future, this will include semantic typing
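The items above can be sketched as a small reader whose behavior comes entirely from a metadata description. The descriptor below is a hypothetical, hand-written dictionary standing in for what EML would supply; its keys, the sample data, and the function name ingest are assumptions, not part of EML or Kepler.

```python
import csv
import io

# Hypothetical descriptor standing in for the file, entity, and attribute
# information that EML metadata would provide.
METADATA = {
    "delimiter": ",",
    "header_lines": 1,
    "attributes": [
        {"name": "site", "type": "string", "missing": []},
        {"name": "count", "type": "integer", "missing": ["-999"]},
        {"name": "area_m2", "type": "float", "missing": ["NA"]},
    ],
}

# Map declared attribute types to Python conversion functions.
CASTS = {"string": str, "integer": int, "float": float}


def ingest(text, metadata):
    """Parse delimited text into typed rows; missing-value codes become None."""
    reader = csv.reader(io.StringIO(text), delimiter=metadata["delimiter"])
    rows = list(reader)[metadata["header_lines"]:]
    records = []
    for row in rows:
        record = {}
        for attr, raw in zip(metadata["attributes"], row):
            if raw in attr["missing"]:
                record[attr["name"]] = None
            else:
                record[attr["name"]] = CASTS[attr["type"]](raw)
        records.append(record)
    return records


RAW = "site,count,area_m2\nA,12,4.0\nB,-999,NA\n"
records = ingest(RAW, METADATA)
```

Nothing about the file layout is hard-coded in ingest. Pointing it at a different table only requires a different descriptor.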

       Metadata revision

   • Metadata is revised following any transformation of the data
   • Versioning of metadata and data is very important
   • This process results in a lineage of the data file
     as it has been transformed
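One way to picture this is a helper that never changes a dataset in place. Each transformation yields a new version whose metadata records what was done. This is a minimal sketch under assumed names (apply_step, the version and lineage keys); it is not how EML or Kepler store lineage.

```python
import copy


def apply_step(dataset, description, transform):
    """Apply a transformation and return a new dataset version.

    The input dataset is left untouched; the copy gets an incremented
    version number and one more lineage entry.
    """
    new = copy.deepcopy(dataset)
    new["rows"] = transform(new["rows"])
    new["metadata"]["version"] = dataset["metadata"]["version"] + 1
    new["metadata"]["lineage"].append(description)
    return new


def drop_missing(rows):
    """Remove rows whose count is missing."""
    return [r for r in rows if r["count"] is not None]


original = {
    "rows": [{"site": "A", "count": 12}, {"site": "B", "count": None}],
    "metadata": {"version": 1, "lineage": []},
}

cleaned = apply_step(original, "dropped rows with missing count", drop_missing)
```

Reading the lineage list of any version tells a later user which steps produced it, while the earlier version remains available for comparison.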

      Typical ecological workflow example

   • Workflows can automate the integration process
     if data is described with adequate structured metadata

       Homogeneous data integration

   • Integration of homogeneous or mostly homogeneous data
     via EML metadata is relatively straightforward
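As a rough illustration of why the homogeneous case is straightforward: if two tables carry identical attribute descriptions, integration reduces to checking the descriptions and appending rows. The table layout and the function name integrate_homogeneous are assumptions for this sketch.

```python
def integrate_homogeneous(tables):
    """Concatenate tables whose attribute names and types all match."""
    schema = tables[0]["attributes"]
    merged = []
    for table in tables:
        if table["attributes"] != schema:
            raise ValueError("attribute descriptions differ; not homogeneous")
        merged.extend(table["rows"])
    return {"attributes": schema, "rows": merged}


SCHEMA = [("site", "string"), ("count", "integer")]
survey_2001 = {"attributes": SCHEMA, "rows": [("A", 12), ("B", 7)]}
survey_2002 = {"attributes": SCHEMA, "rows": [("A", 15)]}

combined = integrate_homogeneous([survey_2001, survey_2002])
```

The check is purely structural. It says nothing about whether the two surveys measured the same thing in the same way, which is where the heterogeneous case becomes harder.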

        Heterogeneous data integration

   • Integration of heterogeneous data requires much more
     advanced metadata and processing
       • Attributes must be semantically typed
       • Collection protocols must be known
       • Units and measurement scale must be known
       • Measurement mechanics must be known (i.e. that
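A small sketch of the "units must be known" requirement: two sites record plot area in different units, and only the unit recorded in the metadata makes it possible to bring them onto a common scale before combining. The table layout, unit labels, and function name are assumptions; the derived quantity (count divided by area) is chosen here only as an example.

```python
# Conversion factors to square metres (1 hectare = 10,000 square metres).
TO_SQUARE_METRES = {"squareMeter": 1.0, "hectare": 10_000.0}


def to_density_per_m2(table):
    """Return (site, density) pairs with density in individuals per square metre."""
    factor = TO_SQUARE_METRES[table["units"]["area"]]
    return [
        (row["site"], row["count"] / (row["area"] * factor))
        for row in table["rows"]
    ]


site_a = {
    "units": {"area": "squareMeter"},
    "rows": [{"site": "A", "count": 20, "area": 4.0}],
}
site_b = {
    "units": {"area": "hectare"},
    "rows": [{"site": "B", "count": 5000, "area": 0.5}],
}

combined = to_density_per_m2(site_a) + to_density_per_m2(site_b)
```

Without the unit labels the two area columns would look interchangeable, and the combined densities would be wrong by a factor of 10,000 for one site.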

        Semantic typing and ontologies

   • Label data with semantic types
   • Label inputs and outputs of analytical components with semantic types

   [Diagram: Data, Ontology, Workflow Components]

   • Use the Semantic Mediation System (SMS) to generate transformation steps
       • Beware analytical constraints
   • Use SMS to discover relevant components
   • Ontology – specification of a conceptualization (a knowledge map)
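The discovery idea can be sketched with a toy ontology: if data and component inputs are both labeled with terms from the same "is-a" hierarchy, finding components that can accept a dataset becomes a lookup. Every term, component name, and function below is hypothetical, and this is not the SMS implementation.

```python
# Toy ontology: each term maps to its parent ("is-a"); all terms are hypothetical.
IS_A = {
    "PlantDensity": "Density",
    "Density": "Measurement",
    "Biomass": "Measurement",
    "Measurement": None,
}

# Hypothetical analytical components with semantically typed ports.
COMPONENTS = {
    "density_summary": {"input": "Density", "output": "Measurement"},
    "biomass_model": {"input": "Biomass", "output": "Biomass"},
}


def is_subtype(term, ancestor):
    """True if term equals ancestor or reaches it by following is-a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def discover(data_type):
    """Names of components whose input accepts data of the given semantic type."""
    return sorted(
        name
        for name, ports in COMPONENTS.items()
        if is_subtype(data_type, ports["input"])
    )


matches = discover("PlantDensity")
```

Data labeled PlantDensity is accepted by a component that asks for Density, because the ontology says one is a kind of the other. Matching on column names alone would not establish that.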

        Measurement Ontology

   • Density is part of a larger measurement ontology
   • SEEK's intent is to create one or more community-created ecological ontologies
   • Creates a controlled vocabulary for ecological metadata
   • More about this in Bertram's talk

      About Kepler

   • Kepler is the name of the SEEK/SDM
     additions to the Ptolemy modeling system
   • Ptolemy was designed by the UC Berkeley
     EECS department
   • Primary use is modeling EE circuits
   • Free, open source, pure Java
   • Flexible design GUI for building workflows
   • A Kepler model consists of linked “actors”
     (which correspond to workflow steps)
   • Timing is controlled by a “director”
   • All actors are written in Java but can call
     other applications (such as SAS and MATLAB)
     or native-language code via JNI
   • Actors can call arbitrary Web (or Grid) services
   • Ptolemy already has a very large inventory
     of actors
   • Easy to use, drag 'n' drop interface
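The actor and director roles described above can be pictured with a very small sketch. It is written in Python for brevity and only mirrors the concept; real Kepler actors are Java classes, and the class names here (Actor, SequentialDirector) are assumptions, not the Ptolemy API.

```python
class Actor:
    """A workflow step: consumes one input value and produces one output value."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, value):
        return self.func(value)


class SequentialDirector:
    """Decides when each actor fires; here, once each, in linked order."""

    def __init__(self, actors):
        self.actors = actors

    def run(self, value):
        trace = []
        for actor in self.actors:
            value = actor.fire(value)
            trace.append(actor.name)
        return value, trace


model = SequentialDirector(
    [
        Actor("drop_missing", lambda xs: [x for x in xs if x is not None]),
        Actor("mean", lambda xs: sum(xs) / len(xs)),
    ]
)
result, trace = model.run([2.0, None, 4.0])
```

The actors know nothing about scheduling. Swapping in a different director would change when and how often they fire without touching the steps themselves.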

       SEEK Contributions to Kepler (so far)

   • EML data ingestion actor
   • Actor design tool

      EML data ingestion actor

   • Ingests any data format described by EML
   • Converts raw data to Kepler format
   • Data can then be operated on with other actors
   • Produces one output port for each attribute in the data
   • Individual attributes can then be mapped to other actors
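The "one output port per attribute" idea amounts to turning a row-oriented table into one stream of values per column. A minimal sketch follows; a dictionary of lists stands in for ports, and the function name is an assumption rather than anything from the actual actor.

```python
def ports_from_table(attribute_names, rows):
    """Return {attribute name: values}, one entry ("port") per attribute."""
    ports = {name: [] for name in attribute_names}
    for row in rows:
        for name, value in zip(attribute_names, row):
            ports[name].append(value)
    return ports


ports = ports_from_table(["site", "count"], [("A", 12), ("B", 7)])
```

A downstream step that only needs the counts can then be connected to ports["count"] and ignore the rest of the table.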

[Figure: Ptolemy model with EML ingestion actor]

      Actor design tool

   • Allows “place-holder” actors to be defined on the
     fly by non-programmers during workflow creation
   • Domain scientists can thereby create workflows
     without programming knowledge
   • Workflows created with these actors can be
     executed once their functionality is implemented
     by a programmer
   • Allows quick prototyping of workflows by domain scientists
   • “Place-holder” actors can still be linked to other
     working actors
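The place-holder idea can be sketched as an object that declares a step's name and ports up front and refuses to run until behavior is attached. This Python sketch is only an analogy for the concept; the class and method names are assumptions and do not reflect the actual design tool.

```python
class PlaceholderActor:
    """Declares a step's name and ports now; behavior is attached later."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = inputs
        self.outputs = outputs
        self.implementation = None

    def implement(self, func):
        """Attach the function that does the actual work."""
        self.implementation = func

    def fire(self, **kwargs):
        if self.implementation is None:
            raise NotImplementedError(f"{self.name} has no implementation yet")
        return self.implementation(**kwargs)


# A domain scientist sketches the step without writing its logic.
density = PlaceholderActor(
    "compute_density", inputs=["count", "area"], outputs=["density"]
)
```

The declared inputs and outputs are enough to wire the step into a workflow. A programmer can later call density.implement(...) with a function taking count and area, after which the same workflow becomes executable.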

[Figure: Ptolemy and dynamically created actor]

      How domain scientists will benefit

   • More fully automated integration systems
   • A library of pre-defined analytical
     processes which can be executed on
     heterogeneous data
   • Semantic data discovery and processing
   • Automated unit and measurement scale conversion
   • A fuller understanding of cross-site
     research implications

More info:
Questions? IRC: #seek

This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777,
and 0225676 to NCEAS and its collaborators.
The National Center for Ecological Analysis and Synthesis, a Center funded
by NSF (Grant Number 0072909), the University of California, and the UC
Santa Barbara campus.
Primary Collaborators: University of New Mexico (Long Term Ecological
Research Network Office), San Diego Supercomputer Center, University of
Kansas (Center for Biodiversity Research)
