					          The Kepler Project
    Overview, Status, and Future Directions

                 Matthew B. Jones
        on behalf of the Kepler Project team

National Center for Ecological Analysis and Synthesis
      University of California, Santa Barbara
    The Kepler Project

•   Goals

    • Produce an open-source scientific workflow system
        •    enable scientists to design scientific workflows and execute
             them

    • Support scientists in a variety of disciplines
        •    e.g., biology, ecology, astronomy

    • Important features
         •   access to scientific data
         •   flexible means for executing complex analyses
         •   enable use of Grid-based approaches to distributed computation
         •   semantic models of scientific tasks
         •   effective UI for workflow design
      SWDB                 Aug 29, 2004
   Kepler Collaboration

• Open-source
   • Builds on Ptolemy II
• Collaborators
   • SEEK Project
   • SciDAC SDM Center
   • Ptolemy Project
   • GEON Project
   • ROADNet Project
   • Resurgence Project
• Goals
   • Create powerful analytical
         tools that are useful
         across disciplines
    •    Ecology, Biology,
         Engineering, Geology,
         Physics, Chemistry,
         Astronomy, …
    Usage statistics

•   Source code access
     • 154 people accessed source code
     • 30 members have write permission

•   Kepler downloads
     • Total = 9204
     • Beta = 6675
     • Chart legend: red = Windows, blue = Macintosh

•   Projects using Kepler
     • SEEK (ecology)
     • SciDAC (molecular bio, ...)
     • CPES (plasma simulation)
     • GEON (geosciences)
     • CiPRes (phylogenetics)
     • CalIT2
     • ROADNet (real-time data)
     • LOOKING (oceanography)
     • CAMERA (metagenomics)
     • Resurgence (computational chemistry)
     • NORIA (ocean observing CI)
     • NEON (ecology observing CI)
     • ChIP-chip (genomics)
     • COMET (environmental science)
     • Cheshire Digital Library (archival)
     • Digital preservation (DIGARCH)
     • Cell Biology (Scripps)
     • DART (X-Ray crystallography)
     • Ocean Life
     • Assembling the Tree of Life project
     • Processing Phylodata (pPOD)
     • FermiLab (particle physics)
    Kepler advances

•   Data and Actor search
     • EarthGrid data access system
     • Kepler Component Library
•   Kepler Archive (KAR) format
     • Integrated support for LSID identifiers for all objects
•   Object Manager and cache
•   Web service execution
•   RExpression & MatlabExpression actors
•   Redesigned user interface
•   Authentication subsystem
•   Null-value handling
    More advances

• Documentation
• Collection-oriented workflows (COMAD)
• Domain-specific actors for case studies
     • e.g., GARP, phylogenetics actors
•   Provenance system
•   Grid computing support
     • NIMROD, Globus, ssh, ...
•   Semantics support
     • annotation, search, workflow validation, integration



    Distributed execution

•   Opportunities for parallel execution
     • Fine-grained parallelism
     • Coarse-grained parallelism
        •    Few or no cycles
        •    Limited dependencies among components
        •    ‘Trivially parallel’
        •    Many science problems fit this mold
               •   parameter sweep, iteration of stochastic models
•   Current ‘plumbing’ approaches to distributed execution
     • workflow acts as a controller
        •    stages data resources
        •    writes job description files
        •    controls execution of jobs on nodes
     • requires expert understanding of the Grid system
•   Scientists need to focus on just the computations
     • try to avoid plumbing as much as possible
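
The 'trivially parallel' case described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Kepler code: `run_model` is a hypothetical stochastic model, the parameter names are invented for the example, and a local thread pool stands in for remote Grid nodes. The point is only that each run depends on nothing but its own parameters, so the runs can be dispatched independently.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random


def run_model(growth_rate, seed, steps=50):
    """Hypothetical stochastic model: noisy exponential growth of a population."""
    rng = random.Random(seed)  # private generator, so runs share no state
    population = 100.0
    for _ in range(steps):
        population *= 1.0 + growth_rate + rng.gauss(0.0, 0.01)
    return population


def parameter_sweep(growth_rates, seeds, max_workers=4):
    """Run every (growth_rate, seed) combination independently and collect results."""
    jobs = list(product(growth_rates, seeds))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # No job depends on another, so they may run in any order, on any worker.
        results = list(pool.map(lambda job: run_model(*job), jobs))
    return dict(zip(jobs, results))


# Sweep three growth rates, repeating each with five random seeds.
sweep = parameter_sweep(growth_rates=[0.00, 0.01, 0.02], seeds=range(5))
```

In a real deployment the pool would be replaced by whatever submits jobs to remote nodes; the 'plumbing' (staging data, writing job descriptions) is exactly the part this sketch leaves out.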
Distributed Kepler

• Higher-order component for executing a model on one or more
    remote nodes
•   Master and slave controllers handle setup and communication
    among nodes, and establish data channels
•   Extremely easy for scientists to use
      •    requires no knowledge of grid computing systems




[Figure: a component with IN and OUT ports, a Master Controller, and a Slave Controller]
    Data Management

•   Need for integrated management of external data
     • EarthGrid access is partial and needs refactoring
     • Include other data sources, such as JDBC, OpeNDAP, etc.
     • Data needs to be a first-class object in Kepler, not just
         represented as an actor
     •   Need support for data versioning to support provenance

•   Example: need to pass data by reference
     • workflows contain large data tokens (hundreds of megabytes)
     • intelligent handling of unique identifiers (e.g., LSID)


[Figure: actors A and B exchanging a Token; labels: {1,5,2} and ref-276]
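
Passing data by reference, as described on this slide, can be sketched as follows. This is an illustrative sketch only, not Kepler's token API: `RefToken`, `DataStore`, `actor_a`, and `actor_b` are hypothetical names, and a simple counter stands in for a real identifier scheme such as LSID. The upstream actor places the data in a store and sends only a small token; the downstream actor resolves the token when it needs the data.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class RefToken:
    """Small token that names a data object instead of carrying it."""
    ref: str


class DataStore:
    """Hypothetical shared store mapping identifiers to data objects."""

    def __init__(self):
        self._objects = {}
        self._ids = count(1)

    def put(self, data):
        """Register a data object and return a token that refers to it."""
        ref = f"ref-{next(self._ids)}"
        self._objects[ref] = data
        return RefToken(ref)

    def resolve(self, token):
        """Return the stored object itself; no copy is made."""
        return self._objects[token.ref]


store = DataStore()


def actor_a():
    """Upstream actor: produces data but emits only a reference to it."""
    data = [1, 5, 2]  # stands in for a token of hundreds of megabytes
    return store.put(data)


def actor_b(token):
    """Downstream actor: resolves the reference only when the data is needed."""
    return sum(store.resolve(token))


token = actor_a()
total = actor_b(token)
```

For distributed execution the store would have to be reachable from every node, which is where globally unique identifiers become important.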
    New projects: REAP

•   Management and Analysis of Environmental Observatory Data using
    the Kepler Scientific Workflow System

•   Extend Kepler to:
     • Manage and monitor sensor networks
     • Consume data from sensors
     • Integrate sensor data handling with data archive handling
     •   Terrestrial ecology and oceanography use cases



•   PIs
     • Jones, Ludäscher, Altintas, Schildhauer, Estrin, Reichman,
       Seabloom, Baru, Gallagher, Potter, Cornillon, Borer, Hosseini
•   Institutions
     • UCSB, UCD, UCSD, UCLA, OSU, OpeNDAP
REAP breakdown




    New projects: ChIP-chip

•   A Collaborative Scientific Workflow Environment for
    Accelerating Genome-Scale Biological Research
     • CS/IT: Ludäscher, Bowers, McPhillips
     • Bio: Peggy Farnham, Mark Bieda
•   Integrate a web-based "experiment workspace" environment with a
    flexible scientific workflow system
•   Support rapid prototyping and easy addition of new "methods"
     • templates
        •    details of how key steps are performed are left out until runtime,
             then one or more specific algorithms or data are bound late
        •    which is the best motif-finding algorithm?
        •    parts of a workflow have a similar set of steps
        •    need to compare results from parallel analyses
•   Support client-server (i.e., enterprise) deployments for
    group/lab-wide collaboration
     • different people have different roles [Software dev (Tim),
       Bioinformatics specialist (Mark), Biologist (Peggy)]
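
The template idea above (fixed steps, with one step bound to a concrete algorithm only at run time) can be sketched in Python. Everything here is a hypothetical illustration: `run_template` is not a Kepler construct, and the two toy k-mer functions are stand-ins invented for the example, not implementations of MEME, MDscan, or any real motif finder. The sketch shows how the same template can be run with several candidate algorithms so their results can be compared side by side.

```python
from collections import Counter


def kmers(seq, k):
    """All substrings of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def most_common_kmer(sequences, k=3):
    """Toy algorithm 1: the k-mer(s) with the highest total count."""
    counts = Counter(kmer for seq in sequences for kmer in kmers(seq, k))
    top = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == top}


def kmers_in_every_sequence(sequences, k=3):
    """Toy algorithm 2: k-mers that occur in every input sequence."""
    return set.intersection(*(set(kmers(seq, k)) for seq in sequences))


def run_template(sequences, find_motifs):
    """Template: the cleaning and reporting steps are fixed; `find_motifs` is bound late."""
    cleaned = [s.strip().upper() for s in sequences]
    return sorted(find_motifs(cleaned))


# Bind each candidate algorithm into the same template and keep results by name.
sequences = ["acgtacg", "ttacgta", "gacgtt"]
candidates = [most_common_kmer, kmers_in_every_sequence]
results = {algo.__name__: run_template(sequences, algo) for algo in candidates}
```

Because the template never names a specific algorithm, adding a new "method" means adding one function to `candidates`, with no change to the shared steps.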
[Figure from Bowers and McPhillips: layered architecture of the ChIP-chip workflow environment. Diagram node labels T1, T2, F1, F2, F3 appear in the workflow specification area.]

•   Experiment Workspace (setup, run, and manage)
     • role shown alongside: Peggy, "biologist"
     • functions: Setup "protocol", Import/Export Data, Data Display and
       Visualization, Run Experiment
     • experiment repository
•   Workflow Automation (configuration and execution support)
     • Kepler Workflow Engine
     • functions: Configuration Management, Execution Management,
       Monitoring Support, Provenance Tracking
     • provenance repository
•   Workflow Specification (workflow design and template creation)
     • role shown alongside: Mark, "bioinformatics specialist"
     • (1) select design template and configure
     • (2) generate optimized executable workflow
     • design repository
•   Component Specification (wrapping, integration, and creation of components)
     • role shown alongside: Tim, "software developer"
     • component repository
•   External components and services
     • ChIP-chip Data Analysis (ChIPOTle, HMM, …)
     • Motif Finding Algorithms (MEME, MDscan, …)
     • Visualization Packages and Statistics Tools
     • Public Databases & Services (GenBank, David, TransFac, …)
    Kepler C.O.R.E. proposal

•   Development of Kepler CORE -- A Comprehensive,
    Open, Robust, and Extensible Scientific Workflow
    Infrastructure
     • Ludäscher, Altintas, Bowers, Jones, McPhillips

•   Goals
     • Reliable
        •    refactored build
        •    more modular design
        •    improved engineering practices
     • Independently extensible
     • Open architecture, open project
        •    improved governance
Kepler C.O.R.E. -- Extensibility




Kepler C.O.R.E. -- Governance




    Kepler C.O.R.E. -- Sustainability

•   How does Kepler persist?

    • Now, via research grants
        •    unsustainable for production purposes


    • Future
        •    new models for financial support
               •   support contracts?
               •   extension contracts?
               •   new science domains?
               •   continued research dollars?
               •   foundations?
        •    exploring a 501(c)(3) organization that can sustain Kepler and similar
             open-source initiatives
    Acknowledgements

•   Funding
     • The National Science Foundation under Grant Numbers 9980154,
         9904777, 0131178, 9905838, 0129792, 0225676, and 0619060.
     •   The National Center for Ecological Analysis and Synthesis, a Center
         funded by NSF (Grant Number 0072909), the University of
         California, and the UC Santa Barbara campus.
     •   The Andrew W. Mellon Foundation
     •   The Department of Energy
•   Collaborators
     • NCEAS (UC Santa Barbara), University of New Mexico (Long Term
         Ecological Research Network Office), San Diego Supercomputer
         Center, University of Kansas, University of Vermont, University of
         North Carolina, Napier University, Arizona State University, UC
         Davis
•   Kepler contributors
     • SEEK, Ptolemy II, SDM/SciDAC, GEON, ROADNet, EOL, Resurgence