Semantic Workflows:
Metadata Meets Computational Experiments

Yolanda Gil, PhD
Information Sciences Institute
and Department of Computer Science
University of Southern California
gil@isi.edu

With Ewa Deelman, Jihie Kim,
Varun Ratnakar, Paul Groth, Gonzalo Florez,
Pedro Gonzalez, Joshua Moody
Scientific Analysis

Complex processes involving a variety of algorithms and software
Issues in Managing Complex Scientific
Applications
Managing a variety of software packages
 • Installation
 • Documentation
Conversion routines
 • Each lab has software to convert across packages
New algorithms/software available all the time
 • The cost of installing and understanding them may be a deterrent
May require large-scale computations
 • Some packages include treatment of large-scale computations
 • May or may not be transparent to the user
Individual software components are only the ingredients of the
overall analysis method
 • Need to capture the method that combines the components
Managing Scientific Applications as
Computational Workflows
Emerging paradigm for large-scale and large-scope
scientific inquiry
 • Large-scope science integrates diverse models, phenomena,
   and disciplines
 • “In-silico experimentation”
Workflows provide a formalization of scientific analyses
 • The analysis routines to be executed, the data flow among them,
   and relevant execution details
 • The overall “method” used to produce a new result
Workflows can represent semantic constraints on components and
parameters that can be used to assist users in creating valid analyses
Workflow systems can reason about constraints and requirements,
manage their execution, and automatically keep provenance records
for new results
         NSF Workshop on Challenges of Scientific
         Workflows (2006, Gil and Deelman co-chairs)
Despite investments in CyberInfrastructure as an enabler of a
significant paradigm change in science:
 • Exponential growth in compute, sensors, data storage, and networks,
   BUT the growth of science has not followed the same exponential
 • Reproducibility, key to the scientific method, is threatened
What is missing:
 • Perceived importance of capturing and sharing process in accelerating
   the pace of scientific advances
 • Process (method/protocol) is increasingly complex and highly distributed
Workflows are emerging as a paradigm for process-model driven
science that captures the analysis itself
Workflows need to be first-class citizens in scientific
CyberInfrastructure
 • Enable reproducibility
 • Accelerate scientific progress by automating processes
Interdisciplinary and intradisciplinary research challenges
Report available at http://www.isi.edu/nsf-workflows06
     Wings/Pegasus
    Workflow System
           for
Large-Scale Data Analysis


 The Wings/Pegasus Workflow System
 [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming]


WINGS: Knowledge-based workflow environment
www.isi.edu/ikcap/wings
 • Ontology-based reasoning on workflows and data (W3C’s OWL)
 • Workflow library of useful analyses
 • Proactive assistance + automation
 • Execution-independent workflows

Pegasus: Automated workflow refinement and execution
pegasus.isi.edu
 • Optimize for performance, cost, reliability
 • Assign execution resources
 • Manage execution through DAGMan
 • Daily operational use in many domains

Grid services
condor.wisc.edu / www.globus.org
 • Sharing of distributed resources
 • Remote job submission
 • Scalable service-oriented architecture
 • Commercial quality, open source
Physics-Based Seismic Hazard Analysis: SCEC’s
CyberShake [Slide from T. Jordan of SCEC]
[Figure: components feeding into a Seismic Hazard Model: seismicity,
paleoseismology, local site effects, geologic structure, faults, stress
transfer, crustal motion, crustal deformation, seismic velocity
structure, and rupture dynamics.]

InSAR Image of the Hector Mine Earthquake
 • A satellite-generated Interferometric Synthetic Aperture Radar
   (InSAR) image of the 1999 Hector Mine earthquake.
 • Shows the displacement field in the direction of radar imaging.
 • Each fringe (e.g., from red to red) corresponds to a few
   centimeters of displacement.
    A Wings Workflow Template for Seismic
    Hazard Analysis
[Figure: workflow template. Legend: single file, file collection,
nested file collection, application component, component collection.]

 • Intensional descriptions of data sets
 • Intensional descriptions of parallel computations
 • Querying results of other data creation subworkflows
 • Rich metadata descriptions for all data products

Data flow: Rupture Variations -> Strain Green Tensors -> Seismograms
-> Spectral acceleration
Wings/Pegasus Workflows for Seismic Hazard
Analysis
[Gil et al 07]
Input data: a site and an earthquake forecast model
 • Thousands of possible fault ruptures and rupture variations,
   each a file, unevenly distributed
 • ~110,000 rupture variations to be simulated for that site
High-level template combines 11 application codes
8,048 application nodes in the workflow instance generated by Wings
Provenance records kept for 100,000 workflow data products
 • Generated more than 2M triples of metadata
24,135 nodes in the executable workflow generated by Pegasus,
including:
 • Data stage-in jobs, data stage-out jobs, and data registration jobs
Executed on USC’s HPCC cluster (1,820 nodes with dual processors),
but only <144 were available
 • Including MPI jobs, each running on hundreds of processors
   for 25-33 hours
 • Runtime was 1.9 CPU years
Pegasus: Large Scale Distributed Execution
    [Deelman et al 06] Best Paper Award, IEEE Int’l Conference on e-Science, 2006
[Chart: number of jobs per day over 23 days (261,823 jobs total) and
number of CPU hours per day (15,706 hours total, 1.8 years).]

• Pegasus managed 1.8 years of computation in 23 days on NSF’s
  TeraGrid
• Processed 20 TB of data with 260,000 jobs
• Daily operations of NSF’s TeraGrid shared resources managed
  using Globus services
• Pegasus managed ~840,000 individual tasks in a workflow over a
  period of three weeks
Background:
Benefits of Workflow Systems
Managing execution
 • Dependencies among steps
 • Failure recovery
Managing distributed computation
 • Move data when needed
Managing large data sets
 • Efficiency, reliability
Security and access control
 • Remote job submission
Provenance recording
 • Low-cost, high-fidelity reproducibility
This Talk:
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
 • Automation of workflow execution
 • Managing distributed computation
 • Managing large data sets
 • Security and access control
 • Provenance recording
 • Low-cost, high-fidelity reproducibility

Semantics and reasoning:
 • Workflow retrieval and discovery
 • Automation of workflow generation
 • Systematic exploration of the design space
 • Validation of workflows
 • Automated generation of metadata
 • Guarantees of data pedigree
 • “Conceptual” reproducibility
   Parallel Data Processing in
 Workflows with Accuracy/Quality
            Tradeoffs

Thanks to Vijay Kumar, Tahsin Kurc, Joel Saltz, and
the rest of OSU’s Multiscale Computing Lab and
Emory’s Center for Comprehensive Informatics groups
Accuracy/Quality Tradeoffs in Large-Scale
Biomedical Image Analysis
NC: Neuroblastoma Classification
(from OSU’s BMI [Kong et al 07])
 • Diverse user queries:
    – “Minimize time, 60% accuracy”
    – “Most accurate classification of a region within 30 minutes”
    – “Classify image regions with some minimum accuracy, but
      obtain higher accuracy for feature-rich regions”
 • Easily parallelizable computations: per image chunk

PIQ: Pixel Intensity Quantification
(from the National Center for Microscopy and Imaging Research
[Chow et al 06])
 • Terabyte-sized, out-of-core image data
 • Need to minimize execution time while preserving the highest
   output quality
 • Some operations are parallelizable; others must operate on
   entire images
Architecture [Hall et al 07]
Wings: workflow representation and reasoning
Pegasus/Condor/Globus: workflow mapping and execution on
available resources
DataCutter (DCMPI): execution within a site (e.g., a cluster)
using efficient filter/stream data processing
Trade-off module: explores and selects parameter values
Starting point: high-level dataflow [Sharma et al 07]

Pipeline stages: Chunkisize -> Z-projection -> Normalize ->
Auto-align -> Stitch -> Warp
 • 3D image layers
 • Chunked into tiles
 • Projected into the Z plane
 • Normalized and aligned
 • Stitched back into a single 2D image

Tiles and parallelism are not explicit!
Workflow Template Sketch
Controllable parameter: p is the number of chunks in an image layer

Legend: regular codes; parallel codes (MPI) with one computation
node, which will be expanded to several computation nodes.

 • Format conversion: set of z layers, each with t tiles
   (.IMG -> .DIM)
 • Chunkisize: produces a set of stacks of chunks (collections of
   z layers, each with p chunks, each chunk with t tiles)
 • Z-projection: p nodes (one per chunk stack), producing sets of
   chunks for the z-projection
 • Generate magic tiles: internally iterates over each chunk of
   the z-projection, producing 2 magic tiles
 • Normalize (z-projection branch): p nodes (one per chunk of the
   z-projection), producing sets of normalized chunks
 • Auto-align: internally iterates over each chunk of the
   normalized z-projection, producing scores per pair of chunks
 • MST: produces a graph
 • Normalize (layers branch): p * z nodes (one per chunk per
   layer), producing sets of normalized chunks for all layers
 • Stitch: internally iterates over each chunk of each layer,
   producing sets of stitched normalized chunks for all layers
Managing Parallelism with WINGS
Workflows
Workflow Template (edited in Wings) vs. Workflow Instance
(automatically generated by Wings)

Many degrees of freedom, many tradeoffs:
 • Processors available: amount and capabilities
 • Data processing operations: reduce I/O and communication overhead
 • Parameters of application components: vary accuracy and performance
 • Workflow parameters: degree of parallel branching based on chunking
   data into smaller datasets
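The last tradeoff, the degree of parallel branching driven by a chunking parameter, can be illustrated with a toy template expansion. This is a minimal sketch with hypothetical structures, not the Wings API:

```python
# Hypothetical sketch: expanding a workflow template into an instance.
# A step marked parallel-over-chunks is expanded into p branch nodes,
# so the chunking parameter p sets the degree of parallel branching.

def expand_template(template, p):
    """Expand parallel steps into p branch nodes; copy the rest."""
    instance = []
    for step in template:
        if step.get("parallel_over_chunks"):
            instance.extend(
                {"name": f"{step['name']}_{i}", "component": step["component"]}
                for i in range(p)
            )
        else:
            instance.append({"name": step["name"], "component": step["component"]})
    return instance

template = [
    {"name": "Chunkisize", "component": "Chunkisize"},
    {"name": "Normalize", "component": "Normalize", "parallel_over_chunks": True},
    {"name": "Stitch", "component": "Stitch"},
]

# With p = 4 chunks per layer, the Normalize step expands into 4 nodes.
instance = expand_template(template, p=4)
print(len(instance))  # 6 nodes: 1 + 4 + 1
```

Varying p trades off per-node work against scheduling and I/O overhead, which is exactly the kind of parameter the trade-off module explores.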
Exploring Tradeoffs [Kumar et al, forthcoming]




Workflows for Genetic Studies
    of Mental Disorders
In collaboration with Chris Mason, Ewa Deelman, Varun Ratnakar,
                           Gaurang Mehta




Workflow Portal for Genetic Studies of Mental
Disorders
 Work with Christopher Mason from Yale and Cornell




Workflows for Population Studies
Work with Christopher Mason from Yale and Cornell
Workflows for common analysis types
 • Association tests
 • CNV detection
 • Variant discovery
 • Family-based association analysis (TDT)
Workflow components based on widely used software
 • Plink (Purcell, Harvard)
 • R (Chambers et al.)
 • PennCNV (Penn) -- Hidden Markov Models
 • Gnosis (State, Yale) -- sliding windows
 • Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams
 • Structure (Pritchard, Chicago) -- structured association
 • FastLink (Schaffer, NCBI)
 • BWA: Burrows-Wheeler Aligner (Li & Durbin)
 • SAMTools
  Workflow Portal for Population Studies in
  Genomics
[Portal screenshots: Association Test, Transmission Disequilibrium
Test (TDT), Variant Discovery from Resequencing, CNV Detection]
Why Semantics:
Choosing an Analysis
• What type of analysis is appropriate for my data?
• What workflow is appropriate for my data?
• How do I set up the workflow parameters?

Annotations on the analysis types:
 • Association tests are best for large datasets that are not
   within a family
 • TDT analysis requires no less than 100 families
 • Variant discovery is used for genomic data from the same
   individual
Why Semantics:
Choosing a Workflow
• What type of analysis is appropriate for my data?
• What workflow is appropriate for my data?
• How do I set up the workflow parameters?

Association Test workflow annotations:
 • Uses CMH association
 • Applies population stratification to remove outliers
 • Assumes outliers have been removed
 • Uses structured association
Transmission Disequilibrium Test (TDT) workflow annotations:
 • Uses a standard test
 • Incorporates parental phenotype information
Why Semantics:
Configuring Parameters
• What type of analysis is appropriate for my data?
• What workflow is appropriate for my data?
• How do I set up the workflow parameters?

Association Test parameter annotations:
 • Max individuals per cluster (“mc”) and merge distance p-value
   constraint (“ppc”)
 • Max population
 • If Affymetrix data, set the cutoff (“miss”) to 94%; if Illumina, 98%
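A sketch of how such a parameter-setup rule could be encoded. The function name and metadata keys are hypothetical; the cutoff values are the ones stated on the slide:

```python
# Hypothetical sketch of a parameter-setup rule: the genotyping
# platform recorded in the input data's metadata determines the
# call-rate cutoff ("miss").

def configure_parameters(input_metadata):
    """Derive workflow parameter values from input-data metadata."""
    params = {}
    platform = input_metadata.get("platform")
    if platform == "Affymetrix":
        params["miss"] = 0.94   # slide: Affymetrix data -> 94% cutoff
    elif platform == "Illumina":
        params["miss"] = 0.98   # slide: Illumina data -> 98% cutoff
    return params

print(configure_parameters({"platform": "Illumina"}))  # {'miss': 0.98}
```

Encoding the rule once in the workflow system spares every user from remembering it.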
There is documentation to answer all these
questions…




But the Problem is:
• Will a user be inclined to read it all?
• Would a user know what subset to read?
• A user may read about something, but will she remember it?
• How can a user be sure she is aware of all the information
  relevant to making a given choice correctly?
• When a new and better version/algorithm/workflow comes along,
  will a user have time to read up on it?

This can lead to:
 1) meaningless results
 2) suboptimal analyses
 • There is a cost to advisors/colleagues/reviewers/the public
    Our Approach: Semantic Workflows [Kim et al CCPEJ 08;
    Gil et al IEEE eScience 09; Gil et al K-CAP 09; Kim et al IUI 06; Gil et al IEEE IS 2010]

Workflows augmented with semantic constraints
 • Each workflow constituent has a variable associated with it
    – Nodes, links, workflow components, datasets
    – Workflow variables can represent collections of data as well as
      classes of software components
 • Constraints are used to restrict variables, and include:
    – Metadata properties of datasets
    – Constraints across workflow variables
 • Incorporate the function of workflow components: how data is transformed
Reasoning about semantic constraints in a workflow
 • Algorithms for semantic enrichment of workflow templates
 • Algorithms for matching queries against workflow catalogs
 • Algorithms for generating workflows from high-level user requests
 • Algorithms for generating metadata of new data products
 • Algorithms for assisting users with the creation of valid workflow templates
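As a rough illustration (hypothetical names, not the actual Wings/OWL data model), workflow variables and the constraints that restrict them can be held as triples:

```python
# Hypothetical sketch: constraints as (variable, property, value)
# triples over workflow variables, as in the Wings-style examples
# elsewhere in this deck (e.g., TestData isDiscrete false).

constraints = [
    ("TestData", "isDiscrete", False),
    ("TrainingData", "isDiscrete", False),
]

def restrict(var, prop, value, constraints):
    """Add a constraint unless it contradicts an existing one."""
    for (v, p, val) in constraints:
        if v == var and p == prop and val != value:
            raise ValueError(f"Inconsistent constraint on {var}.{prop}")
    constraints.append((var, prop, value))

# Restricting a variable with a compatible property succeeds;
# asserting TestData isDiscrete True would raise an error.
restrict("TrainingData", "hasMissingValues", True, constraints)
```

The reasoning algorithms listed above operate over exactly this kind of store: adding derived triples, detecting contradictions, and matching queries against them.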
    Semantic
    Workflows
    in WINGS
Workflow templates
Dataflow diagram
 • Each constituent (node, link, component, dataset) has a
   corresponding variable
Semantic properties
 • Constraints on workflow variables, e.g.:
     (TestData dcdom:isDiscrete false)
     (TrainingData dcdom:isDiscrete false)
Semantic Constraints as Metadata Properties

• Constraints on reusable template (shown below)
• Constraints on current user request (shown above)

[modelerInput_not_equal_to_classifierInput:
  (:modelerInput wflow:hasDataBinding ?ds1)
  (:classifierInput wflow:hasDataBinding ?ds2)
  equal(?ds1, ?ds2)
  (?t rdf:type wflow:WorkflowTemplate)
  -> (?t wflow:isInvalid "true"^^xsd:boolean)]
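A minimal sketch of what this rule checks, written in plain Python rather than the Jena rule syntax used above (the template structure here is hypothetical):

```python
# Hypothetical sketch of the validity rule above: a template whose
# modeler input and classifier input are bound to the same dataset
# is marked invalid (the modeler must not be tested on its own
# training data binding).

def is_invalid(template):
    """True if modelerInput and classifierInput share a data binding."""
    bindings = template["bindings"]
    return bindings.get("modelerInput") == bindings.get("classifierInput")

t = {"bindings": {"modelerInput": "dataset1.csv",
                  "classifierInput": "dataset1.csv"}}
print(is_invalid(t))  # True: same dataset bound to both inputs
```

In Wings this check is done by the rule engine over RDF data bindings, so the same constraint applies uniformly to every template in the library.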
Three Uses of Semantic Reasoning
Semantic enrichment of workflow templates
 • Enables workflow retrieval (discovery)
Workflow generation algorithm
 • Enables users to start with a high-level specification of their
   analysis as an underspecified request
Metadata generation
 • Enables automatic generation of descriptions of new workflow
   data products for future reuse and provenance recording
        The Need for Semantic Enrichment of
        Workflow Templates
Semantic information is not provided when creating the workflow
 • Only components and dataflow links
 • We don’t expect users to specify any properties of datasets in
   the workflow
    – E.g., when a user adds a NaiveBayesModeler, he wouldn’t be
      expected to define that its output would be a NaiveBayesModel,
      or a BayesModel (superclass), or a non-human-readable Model
However, retrieval queries are often based on metadata properties
of datasets (either inputs or results)
 • E.g., “Find workflows that can normalize data which is continuous and
   has missing values [<- constraints on inputs] to create a decision tree
   model [<- constraint on intermediate data products]”
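A toy sketch of such metadata-based retrieval, assuming templates have already been enriched with derived constraints. The template names and properties below are invented for illustration:

```python
# Hypothetical sketch: each enriched template carries derived
# metadata constraints on its dataset variables; a retrieval query
# is a set of required variable properties.

templates = {
    "modelThenClassify": {
        "TrainingData": {"isContinuous": True, "hasMissingValues": True},
        "Model": {"type": "DecisionTreeModel"},
    },
    "clusterOnly": {
        "InputData": {"isContinuous": False},
    },
}

def matches(template, query):
    """Every queried variable/property must be entailed by the template."""
    return all(
        template.get(var, {}).get(prop) == val
        for var, props in query.items()
        for prop, val in props.items()
    )

# "Normalize continuous data with missing values to create a
# decision tree model" as a structured query:
query = {"TrainingData": {"isContinuous": True, "hasMissingValues": True},
         "Model": {"type": "DecisionTreeModel"}}
hits = [name for name, t in templates.items() if matches(t, query)]
print(hits)  # ['modelThenClassify']
```

Without enrichment, the derived properties would not exist on the templates and this query would find nothing, which is why the next slides walk through the enrichment algorithm.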
    An Algorithm for Semantic Enrichment of
    Workflow Templates [Gil et al K-CAP 09]
Phase 1: Goal Regression
 • Starting from the final products, traverse the workflow
   backwards
 • For each node, query the component catalog for metadata
   constraints on inputs
Phase 2: Forward Projection
 • Starting from the input datasets, traverse the workflow
   forwards
 • For each node, query the component catalog for metadata
   constraints on outputs
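The two phases above can be sketched over a simple linear workflow. The catalog entries, property names, and the boolean "transfers" flag are illustrative stand-ins, not the actual Wings component catalog:

```python
# Hypothetical sketch of two-phase semantic enrichment on a linear
# workflow. Each catalog entry lists constraints the component
# imposes on its input, and whether it transfers its input's
# metadata to its output (cf. the samplerTransfer-style rules).

catalog = {
    "RandomSampleN": {"input_constraints": {}, "transfers": True},
    "ID3Modeler": {"input_constraints": {"isDiscrete": True}, "transfers": True},
}

workflow = ["RandomSampleN", "ID3Modeler"]  # dataset flows left to right

def enrich(workflow, catalog):
    n = len(workflow)
    data = [dict() for _ in range(n + 1)]  # data[i]: dataset before step i
    # Phase 1: goal regression -- traverse backwards from the final product
    for i in reversed(range(n)):
        entry = catalog[workflow[i]]
        if entry["transfers"]:
            data[i].update(data[i + 1])  # output constraints hold on input
        data[i].update(entry["input_constraints"])
    # Phase 2: forward projection -- traverse forwards from the inputs
    for i in range(n):
        if catalog[workflow[i]]["transfers"]:
            data[i + 1].update(data[i])  # input metadata carries to output
    return data

data = enrich(workflow, catalog)
print(data[0])  # {'isDiscrete': True}: the raw input must be discrete
```

Here the ID3 modeler's requirement that its training data be discrete regresses through the sampler to the workflow input, and then projects forward onto the final product, mirroring the rule traces on the next slides.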
       Semantic Enrichment Algorithm (Phase 1)



Rule in Component Catalog:
[modelerSpecialCase2:
  (?c rdf:type pcdom:ID3ModelerClass)
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "trainingData")
  -> (?idv dcdom:isDiscrete "true"^^xsd:boolean)]

[Workflow figure with nodes Model5, Model6, Model7; derived
constraint: ?Dataset4 dcdom:isDiscrete true]
        Semantic Enrichment Algorithm (Phase 1)


Rule in Component Catalog:
[samplerTransfer:
  (?c rdf:type pcdom:RandomSampleNClass)
  (?c pc:hasOutput ?odv)
  (?odv pc:hasArgumentID "randomSampleNOutputData")
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "randomSampleNInputData")
  (?odv ?p ?val)
  (?p rdfs:subPropertyOf dc:hasMetrics)
  -> (?idv ?p ?val)]

[Workflow figure with nodes Model5, Model6, Model7; derived
constraints: ?Dataset3 dcdom:isDiscrete true,
?Dataset4 dcdom:isDiscrete true]
        Semantic Enrichment Algorithm (Phase 1)
Rule in Component Catalog:
[normalizerTransfer:
  (?c rdf:type pcdom:NormalizeClass)
  (?c pc:hasOutput ?odv)
  (?odv pc:hasArgumentID "normalizeOutputData")
  (?c pc:hasInput ?idv)
  (?idv pc:hasArgumentID "normalizeInputData")
  (?odv ?p ?val)
  (?p rdfs:subPropertyOf dc:hasMetrics)
  -> (?idv ?p ?val)]

[Workflow figure with nodes Model5, Model6, Model7; derived
constraints: ?TrainingData dcdom:isDiscrete true,
?Dataset3 dcdom:isDiscrete true, ?Dataset4 dcdom:isDiscrete true]
       Semantic Enrichment Algorithm (Phase 2)
Rule in Component Catalog:

[modelerTransferFwdData:
 (?c rdf:type pcdom:ModelerClass)
 (?c pc:hasOutput ?odv)
 (?odv pc:hasArgumentID "outputModel")
 (?c pc:hasInput ?idv)
 (?idv pc:hasArgumentID "trainingData")
 (?idv ?p ?val)
 (?p rdfs:subPropertyOf dc:hasDataMetrics)
 notEqual(?p dcdom:isSampled)
 -> (?odv ?p ?val)]

[Workflow diagram: Model5, Model6, Model7; annotations: ?TrainingData dcdom:isDiscrete true, ?Dataset3 dcdom:isDiscrete true, ?Dataset4 dcdom:isDiscrete true, ?Model5 dcdom:isDiscrete true, ?Model6 dcdom:isDiscrete true, ?Model7 dcdom:isDiscrete true]
                                                                                                    39
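The Phase 2 rule above propagates forward: a trained model inherits the data metrics of its trainingData input, except properties such as isSampled that do not carry over to a model. A hedged Python sketch of that filter, with hypothetical names in place of the actual Wings classes:

```python
# Sketch of the Phase 2 forward rule (hypothetical structures, not the
# actual Wings code): modelerTransferFwdData copies every data metric
# from the training data to the output model, excluding isSampled.
EXCLUDED = {"isSampled"}   # corresponds to notEqual(?p dcdom:isSampled)

def transfer_forward(training_metadata):
    """Return the metadata a trained model inherits from its training data."""
    return {p: v for p, v in training_metadata.items() if p not in EXCLUDED}

training = {"isDiscrete": True, "isSampled": True}
for model in ("Model5", "Model6", "Model7"):
    # Each modeler output inherits isDiscrete=True but not isSampled.
    print(model, transfer_forward(training))
```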
        Semantic Enrichment Algorithm (Phase 2)
Rule in Component Catalog:

[voteClassifierTransferDataFwd10:
 (?c rdf:type pcdom:VoteClassifierClass)
 (?c pc:hasInput ?idvmodel1)
 (?idvmodel1 pc:hasArgumentID "voteInput1")
 (?c pc:hasInput ?idvmodel2)
 (?idvmodel2 pc:hasArgumentID "voteInput2")
 (?c pc:hasInput ?idvmodel3)
 (?idvmodel3 pc:hasArgumentID "voteInput3")
 (?c pc:hasInput ?idvdata)
 (?idvdata pc:hasArgumentID "voteInputData")
 (?idvmodel1 dcdom:isDiscrete ?val1)
 (?idvmodel2 dcdom:isDiscrete ?val2)
 (?idvmodel3 dcdom:isDiscrete ?val3)
 equal(?val1, ?val2), equal(?val2, ?val3)
 -> (?idvdata dcdom:isDiscrete ?val1)]

[Workflow diagram: Model5, Model6, Model7; annotations: ?TrainingData dcdom:isDiscrete true, ?Dataset3 dcdom:isDiscrete true, ?Dataset4 dcdom:isDiscrete true, ?TestData dcdom:isDiscrete true, ?Model5 dcdom:isDiscrete true, ?Model6 dcdom:isDiscrete true, ?Model7 dcdom:isDiscrete true]
                                                                                                           40
        Result: Semantically Enriched Workflow
        Template
   Support semantic queries
    •   Template will be retrieved only for discrete datasets
   Many other metadata constraints, e.g.:
    •   Model7 is human readable
    •   Dataset4 is normalized
    •   …

[Workflow diagram: Model5, Model6, Model7; annotations: ?TrainingData dcdom:isDiscrete true, ?Dataset3 dcdom:isDiscrete true, ?Dataset4 dcdom:isDiscrete true, ?TestData dcdom:isDiscrete true, ?Model5 dcdom:isDiscrete true, ?Model6 dcdom:isDiscrete true, ?Model7 dcdom:isDiscrete true]
                                                                                                 41
Three Uses of Semantic Reasoning
   Semantic enrichment of workflow templates
    •   Enables workflow retrieval (discovery)
   Workflow generation algorithm
    •   Enables users to start with a high level specification of their
        analysis as an underspecified request
   Metadata generation
    •   Enables automatic generation of descriptions of new workflow
        data products for future reuse and provenance recording




                                                                          42
      The Need for Semantic Reasoning for
      Workflow Generation
   Users provide high level (underspecified) input to the system




[Screenshot of a workflow request; callouts: unspecified parameter value, underspecified algorithm]
                                                                    43
Automatic Template-Based Workflow Generation Algorithm

Input: Workflow request
  •   Needs to specify:
       – High-level workflow template to be used
       – Additional constraints on data
  •   User does not need to specify:
       – Data sources to be used
       – Particular component to be used
       – Parameter settings for the components
Output: workflow instances ready to submit for execution

By reasoning about semantic constraints of workflow & components and the properties of datasets, the algorithm:
  •   Maintains a pool of possible workflow candidates
  •   Elaborates existing candidates by propagating constraints
  •   Creates a new candidate when several choices are possible for datasets, components, and parameters
  •   Eliminates invalid workflow candidates

[Pipeline diagram: unified well-formed request → Seed workflow from request → seeded workflows → Find input data requirements → binding-ready workflows → Data source selection → bound workflows → Parameter selection → configured workflows → Workflow instantiation → workflow instances → Workflow grounding → ground workflows → Workflow ranking → top-k workflows → Workflow mapping]
                                                                                                             44
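The candidate-pool strategy described above can be sketched in a few lines of Python. This is an illustrative toy with hypothetical choice points and a made-up constraint, not the actual Wings algorithm: the request leaves data sources, components, and parameters open; the pool is elaborated with one candidate per combination, and candidates violating a semantic constraint are eliminated.

```python
# Toy sketch of the candidate-pool strategy (hypothetical names and a
# made-up constraint, not the actual Wings generation algorithm).
from itertools import product

# Hypothetical choice points left open by the workflow request.
data_sources = ["sourceA", "sourceB"]
components   = ["ID3", "J48"]
parameters   = [{"depth": 5}, {"depth": 10}]

def valid(candidate):
    """Example semantic constraint: ID3 only handles sourceA's discrete data."""
    data, comp, params = candidate
    return not (comp == "ID3" and data == "sourceB")

# Elaborate the pool: create a new candidate for every possible choice...
pool = list(product(data_sources, components, parameters))
# ...then eliminate candidates that violate the semantic constraint.
pool = [c for c in pool if valid(c)]
print(len(pool))  # 6 of the 8 combinations survive
```

The real system interleaves elaboration and pruning stage by stage (data selection, component selection, parameter selection) rather than enumerating the full cross product, which keeps the pool small.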
Automatic Workflow Generation: Exploring the
Workflow Design Space




[Diagram of candidate workflows (✔ = valid, ✗ = eliminated); callouts: eliminates invalid workflows, assigns sensible parameter values, suggests the use of novel algorithms]

                                                     45
Three Uses of Semantic Reasoning
   Semantic enrichment of workflow templates
    •   Enables workflow retrieval (discovery)
   Workflow generation algorithm
    •   Enables users to start with a high level specification of their
        analysis as an underspecified request
   Metadata generation
    •   Enables automatic generation of descriptions of new workflow
        data products for future reuse and provenance recording




                                                                          46
   Metadata Generation:
   An Example in a Seismic Hazard Workflow

[Workflow diagram (labels reconstructed from overlapping text):
  RVM input 127_6.rvm (source_id: 127, rupture_id: 6)
  Rupture_variation inputs 127_6.txt.variation-s0000-h0000, 127_6.txt.variation-s0000-h0001, … (source_id: 127, rupture_id: 6, slip_realization_#: 0, hypo_center_#: 1)
  SGT inputs FD_SGT/PAS_1/A/SGT161, … (site_name: PAS, tensor_direction: 1, time_period: A, xyz_volume_id: 161)
  SeismogramGeneration component, producing
  Seismogram output Seismogram_PAS_127_6.grm (site_name: PAS, source_id: 127, rupture_id: 6)]
                                                                                     47
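The seismogram example shows the pattern: metadata for a new data product is derived from the metadata of the inputs that produced it. A minimal sketch, with a hypothetical `generate_metadata` helper (not CyberShake's or Wings' actual code), where named properties that are consistent across the inputs carry over to the output:

```python
# Hedged sketch of metadata generation (hypothetical helper, not the
# actual CyberShake/Wings implementation): derive the output product's
# metadata from the metadata of its inputs.

def generate_metadata(inputs, keep):
    """Keep each named property whose value is consistent across inputs."""
    out = {}
    for prop in keep:
        values = {md[prop] for md in inputs if prop in md}
        if len(values) == 1:            # present and consistent
            out[prop] = values.pop()
    return out

rupture_variation = {"source_id": 127, "rupture_id": 6,
                     "slip_realization_#": 0, "hypo_center_#": 1}
sgt = {"site_name": "PAS", "source_id": 127, "tensor_direction": 1}

seismogram = generate_metadata([rupture_variation, sgt],
                               keep=["site_name", "source_id", "rupture_id"])
print(seismogram)   # {'site_name': 'PAS', 'source_id': 127, 'rupture_id': 6}
```

This is how Seismogram_PAS_127_6.grm ends up described by site_name, source_id, and rupture_id without any manual annotation.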
Conclusions:
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
 Automation of workflow execution
 Managing distributed computation
 Managing large data sets
 Security and access control
 Provenance recording
 Low-cost, high-fidelity reproducibility

Semantics and reasoning:
 Workflow retrieval and discovery
 Automation of workflow generation
 Systematic exploration of design space
 Validation of workflows
 Automated generation of metadata
 Guarantees of data pedigree
 “Conceptual” reproducibility

                                                          48