

Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey
   Presented by: Idowu(Id) Adewale
   Instructor:   Dr. Rob Simmonds
   Course:       Introduction to Grid Computing
                          (CPSC 601.90)
   Introduction
       Importance of Virtual Data Concepts
       Motivations
       Definition of terms
   Technology
       Chimera Virtual Data System
   Application
       Sloan Digital Sky Survey
   Future Work
       Research challenges & plans
       Importance of Virtual Data
   To track all aspects of data capture, production,
    transformation and analysis
   It helps to audit, validate, reproduce, and/or rerun
    with corrections, various data transformations.
                Motivations (1)
   “I’ve come across some interesting data, but I need to
    understand the nature of the corrections applied when it was
    constructed before I can trust it for my purposes.”
   “I’ve detected a calibration error in an instrument and want
    to know which derived data to recompute.”
   “I want to search an astronomical database for galaxies with
    certain characteristics. If a program that performs this
    analysis exists, I won’t have to write one from scratch.”
   “I want to apply an astronomical analysis program to millions
    of objects. If the results already exist, I’ll save weeks of
    computation.”
   [Figure: triangle relating Data, Transformation, and Derivation;
    edge labels: product-of (Data, Transformation), execution-of
    (Transformation, Derivation), consumed-by/ (Data, Derivation)]
                Motivations (2)
   Data track-ability and result audit-ability
   Universally sought by GriPhyN applications
   Data can be sent along with its recipe
   Repair and correction of data
     Rebuild data products—c.f., “make”
   Workflow management
     A new, structured paradigm for organizing, locating,
      specifying, and requesting data products
   Performance optimizations
     Ability to re-create data rather than move it
                Virtual Data Grid
   [Figure: Virtual Data Grid architecture. Labels shown:
    production manager, researcher, science review, discovery, sharing;
    virtual data catalogs and virtual data index;
    workflow executor (DAGman), request planner,
    request executor (Condor-G, GRAM), request predictor (Prophesy);
    Grid Monitor, op center;
    Data Grid: replica location service, storage elements, detector,
    simulation data;
    Computing Grid]
Chimera Virtual Data System
   It is a database-driven approach to tracking data-
    intensive collaborations
   It helps scientists to discover computational
    procedures developed by others and the data
    produced by these procedures
   It is a system that combines a virtual data catalog with
    a virtual data language interpreter
   It is coupled with distributed Data Grid services to
    enable on-demand execution of computation
    schedules constructed from database queries
Sloan Digital Sky Survey (SDSS)
   SDSS is a digital imaging survey mapping a quarter of
    the sky in five colors, with greater sensitivity than
    previous large sky surveys
   Forms of Digital Sky Data Online:
       Large collection (~10TB) of images
       Smaller set of catalogs (~2TB)
   SDSS contains measurements of each of 250,000,000
    detected objects
Identifying Clusters of Galaxies

   Galaxy clusters are the largest gravitationally dominated
    structures in the universe
   The analysis involves
       Sophisticated algorithms
       Large amounts of computation
       Use of distributed computing and storage resources provided by
        Data Grids
GriPhyN tools for Virtual Data Grid

   GriPhyN toolkit (VDT V1.0) includes:
     Globus Toolkit

     Condor and Condor-G

     Grid Data Mirroring Package

   It is proposed that Chimera VDS be included in VDT
   Importance of Chimera in VDT
     It supports the capture and reuse of information on
        how data is generated by computations
          Chimera Virtual Data System

   Virtual data catalog
    – Transformations, derivations, data
   Virtual data language
    – Catalog definitions
   Query tool
   Applications include browsers and data analysis applications

   [Figure: Chimera architecture. Virtual Data Language statements pass
    through the VDL Interpreter (manipulate derivations and
    transformations) into the Virtual Data Catalog (implements Chimera
    Virtual Data Schema). Task Graphs (compute and data movement tasks,
    with dependencies) go to Data Grid Resources (distributed execution
    and data management): GriPhyN VDT, Replica Catalog, Globus Toolkit,
    etc.]
         Virtual Data Catalog (VDC)
   Tracks how data is derived with sufficient precision
    that one can create and re-create data from this record
   It allows the implementation of a new class of “virtual
    data management operations”. For example,
       Rematerialize data products that were deleted
       Generate data products that were never created
       Regenerate data when data dependencies or
        transformation programs change
       Create replicas of data products at remote locations
        when recreation is more efficient than data transfer
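
The last operation above comes down to a cost comparison. A minimal sketch, assuming a very simple cost model (the function name, the model, and the numbers are illustrative assumptions, not part of Chimera):

```python
def replica_strategy(size_bytes, link_bits_per_s, recompute_seconds):
    """Return "recreate" or "transfer", whichever is estimated to be faster.

    Assumed cost model: transfer time is size divided by link bandwidth;
    the recompute time is an estimate supplied by the caller.
    """
    transfer_seconds = size_bytes * 8 / link_bits_per_s
    return "recreate" if recompute_seconds < transfer_seconds else "transfer"


# A 10 GB product takes about 1600 s to copy over a 50 Mb/s link,
# but only about 133 s over a 600 Mb/s link.
slow_link = replica_strategy(10 * 10**9, 50 * 10**6, recompute_seconds=600)
fast_link = replica_strategy(10 * 10**9, 600 * 10**6, recompute_seconds=600)
```

With a 600 s recompute estimate, the slow link favors recreating the product at the remote site, while the fast link favors copying it.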
        Virtual Data Language (VDL)
   VDL captures and formalizes descriptions of how
    a program can be invoked and records its
    potential/actual invocations
   Transformation is the abstract description of how
    a program is to be invoked:
       The parameters needed
       The files it reads as inputs
       The environment required
   Derivation is the invocation of a transformation
    with a specific set of input values/files
     Transformations and Derivations
   Transformation
    – Abstract template of program invocation
    – Similar to "function definition"
   Derivation
    – Formal invocation of a Transformation
    – Similar to "function call"
    – Stores past and future:
        A record of how data products were generated
        A recipe of how data products can be generated
   Invocation (future)
    – Record of each Derivation execution
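
The function definition / function call analogy can be sketched in Python. This is not VDL syntax; the class layout, the executable name, and the file names are assumptions made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (the "function definition")."""
    name: str
    executable: str
    formal_args: tuple  # ordered names of the formal parameters


@dataclass
class Derivation:
    """A transformation bound to actual values (the "function call")."""
    transformation: Transformation
    actual_args: dict  # formal parameter name -> actual value or file name

    def __post_init__(self):
        # Reject a derivation that leaves a formal parameter unbound.
        missing = [a for a in self.transformation.formal_args
                   if a not in self.actual_args]
        if missing:
            raise ValueError(f"missing arguments: {missing}")

    def command_line(self):
        """Expand the template into the argument list that would be run."""
        t = self.transformation
        return [t.executable] + [str(self.actual_args[a]) for a in t.formal_args]


# Hypothetical executable and file names, for illustration only.
brg_search = Transformation("brgSearch", "brgSearch", ("field_file", "brg_file"))
d1 = Derivation(brg_search,
                {"field_file": "field-0001.par", "brg_file": "brg-0001.par"})
```

One transformation can back many derivations, one per set of actual inputs, which is what lets the catalog serve as both a record and a recipe.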
                    VDL Query

   This function allows a user or application to search
    the VDC for derivation or transformation definitions
   The search can be made by criteria such as
     Input filenames

     Output filenames

     Transformation name

     Application name
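
A search over these criteria can be sketched as a filter on catalog entries. The catalog here is an in-memory list of dictionaries with made-up entries; the real VDC is a database, and its schema and query interface are not shown in these slides:

```python
def search_catalog(catalog, *, input_file=None, output_file=None,
                   transformation=None, application=None):
    """Return the entries that match every criterion that was given."""
    def matches(entry):
        return ((input_file is None or input_file in entry["inputs"])
                and (output_file is None or output_file in entry["outputs"])
                and (transformation is None
                     or entry["transformation"] == transformation)
                and (application is None
                     or entry["application"] == application))
    return [entry for entry in catalog if matches(entry)]


# Hypothetical catalog entries, for illustration only.
catalog = [
    {"name": "d1", "transformation": "fieldPrep", "application": "maxBcg",
     "inputs": ["tsObj-1"], "outputs": ["field-1"]},
    {"name": "d2", "transformation": "brgSearch", "application": "maxBcg",
     "inputs": ["field-1"], "outputs": ["brg-1"]},
]
consumers = search_catalog(catalog, input_file="field-1")
```

Searching by input filename answers the "which derived data do I recompute?" question from the motivations; searching by output filename answers "how was this file made?".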
                Request Planner
   It generates a Directed Acyclic Graph (DAG) for use by
    DAGman to manage the computation of the object and
    its dependencies
   Algorithm for creating the DAG:
     Find the derivation that contains the requested file as output
     For each input of the associated transformation
         Find the derivation that contains it as output
     Iterate until all the dependencies are resolved
   Once the DAG is generated, DAGman dynamically
    schedules distributed computations via Condor-G to
    create the requested file(s)
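
The DAG-creation steps above can be sketched as a backward walk from the requested file. This is an illustrative sketch, not the Chimera planner: the dictionary layout, the `existing` set (files that are already materialized, such as raw data), and the derivation names are assumptions.

```python
def plan_dag(target, derivations, existing=frozenset()):
    """Map each derivation needed for `target` to the set of its parents.

    derivations: {name: {"inputs": [...], "outputs": [...]}}
    existing: files that already exist and need no derivation to be run.
    """
    producer = {out: name
                for name, d in derivations.items() for out in d["outputs"]}

    def producer_of(filename):
        if filename not in producer:
            raise KeyError(f"no derivation produces {filename!r}")
        return producer[filename]

    dag = {}
    pending = [target]
    while pending:                      # iterate until all dependencies resolve
        filename = pending.pop()
        if filename in existing:
            continue
        name = producer_of(filename)    # derivation with this file as output
        if name in dag:
            continue
        dag[name] = set()
        for inp in derivations[name]["inputs"]:
            if inp in existing:
                continue
            dag[name].add(producer_of(inp))
            pending.append(inp)
    return dag


# Hypothetical derivations for one field of the cluster finding pipeline.
derivations = {
    "d1": {"inputs": ["tsObj-1"], "outputs": ["field-1"]},
    "d2": {"inputs": ["field-1"], "outputs": ["brg-1"]},
    "d3": {"inputs": ["field-1", "brg-1"], "outputs": ["core-1"]},
}
full_plan = plan_dag("core-1", derivations, existing={"tsObj-1"})
cached_plan = plan_dag("core-1", derivations, existing={"tsObj-1", "field-1"})
```

When `field-1` already exists, the plan shrinks to the two derivations that still have to run, which is the reuse that virtual data is meant to enable.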
                   Condor DAGs

   The dependencies between Condor jobs are specified
    using a DAG data structure
   DAGman manages dependencies automatically
     – (e.g., “Don’t run job B until job A has
       completed successfully.”)
   Each job is a “node” in the DAG
   A node can have any number of parent or children nodes
   No loops

   [Figure: diamond-shaped DAG; Job A is the parent of Job B and Job C,
    which are both parents of Job D]
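
The ordering rule for the four-job diamond (A, then B and C, then D) can be illustrated with Python's standard `graphlib` module (Python 3.9+). This only demonstrates dependency ordering; it is not how DAGman is implemented or configured, and the helper names are made up:

```python
from graphlib import CycleError, TopologicalSorter

# Each job maps to the set of parent jobs that must finish first.
diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_in_waves(parents):
    """Group jobs into waves; a job appears only after all its parents."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)                 # pretend every job succeeded
    return waves


def has_loop(parents):
    """Return True if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(parents).prepare()
    except CycleError:
        return True
    return False


waves = run_in_waves(diamond)
```

Jobs in the same wave have no dependency on each other, so a scheduler is free to run them in parallel.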
Applications of Chimera Data System

   Reconstruction of simulated collision event data from a
    high-energy physics experiment
   Searching digital sky survey data for galaxy clusters
   It can be used to plan and orchestrate the workflow of
    thousands of tasks on a data grid comprising hundreds of
    computers
Finding Clusters of Galaxies in SDSS Data

   In practice, calculations are performed at the location
    of each galaxy
   For each galaxy, the likelihood that it is a luminous
    Bright Red Galaxy (BRG) is calculated
   If that likelihood is above a threshold, the likelihood
    that it is a Brightest Cluster Galaxy (BCG) is calculated
   The BCG likelihood is computed by weighting the BRG likelihood
    by the number of galaxies in the neighborhood
   Lastly, check whether this galaxy is the most likely BCG in the
    area; if so, there is a cluster present at this point
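
The steps above can be sketched as a toy procedure. The slides do not give the actual likelihood functions, so this sketch assumes each galaxy already carries a BRG likelihood, uses flat 2-D positions, and applies the plainest possible weighting (likelihood times neighbor count); all names and numbers are illustrative:

```python
import math


def find_cluster_centers(galaxies, radius, threshold):
    """galaxies: list of (x, y, brg_likelihood). Returns indices of centers."""
    def neighbors(i):
        xi, yi, _ = galaxies[i]
        return [j for j, (x, y, _) in enumerate(galaxies)
                if j != i and math.hypot(x - xi, y - yi) <= radius]

    # Steps 1-3: keep galaxies above the BRG threshold and weight each
    # BRG likelihood by the number of galaxies in the neighborhood.
    bcg_score = {}
    for i, (_, _, brg) in enumerate(galaxies):
        if brg > threshold:
            bcg_score[i] = brg * len(neighbors(i))

    # Step 4: a galaxy marks a cluster if no neighbor has a higher score.
    centers = []
    for i, score in bcg_score.items():
        rivals = [bcg_score[j] for j in neighbors(i) if j in bcg_score]
        if score > 0 and all(score > r for r in rivals):
            centers.append(i)
    return centers


# Three galaxies close together and one isolated galaxy.
sample = [(0.0, 0.0, 0.9), (0.1, 0.0, 0.6), (0.0, 0.1, 0.2), (5.0, 5.0, 0.8)]
centers = find_cluster_centers(sample, radius=1.0, threshold=0.5)
```

In this sample only galaxy 0 is reported: galaxy 1 loses to its higher-scoring neighbor, galaxy 2 is below the threshold, and the isolated galaxy 3 has no neighbors to weight its likelihood.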
                The Cluster Finding Pipeline

   maxBcg is a series of transformations
   [Figure: four parallel chains, tsObj -(1)-> Field -(2)-> BRG, feed two
    Core files via transformation 3; the Core files feed one Cluster file
    via transformation 4, which yields the Catalog via transformation 5]
Interesting intermediate data reuse is made possible
by Chimera, if the problem of data discovery is solved.
  Cluster finding works with 1 Mpc radius apertures, in order to have
  dynamic range in a measure estimate. If one were instead looking for
  the sites of gravitational lensing, one would maximize the mass surface
  density by using a 0.25 Mpc radius. That search could start at
  transformation 3, reusing the existing field and BRG files. It is here
  that virtual data comes in.
    Cluster Finding Pipeline (Contd)

   All the nodes represent data files and the arrows the
    transformations
   The first and second stages are straightforward
    in that each output file corresponds to one input file
   The third and fourth stages operate in a “buffer zone”
    around the target field
   The bcgSearch stage (transformation 3) needs to take
    both field files and brg files as inputs
    Cluster Finding Transformations
   FieldPrep: extracts from the full data set the required
    measurements on the galaxies of interest
   BrgSearch: calculates the unweighted BCG likelihood
    of each galaxy (using BRG likelihood)
   BcgSearch: calculates the weighted BCG likelihood
    for each galaxy. This is the heart of the algorithm
   BcgCoalesce: determines whether a galaxy is the
    most likely galaxy in the neighborhood
   GetCatalog: removes extraneous data and stores the
    result in a compact form
Small SDSS Cluster-Finding DAG
           System Architecture
   The SDSS software environment was integrated with
    Chimera and a test grid was constructed
   Bulk data production is performed in parallel with
    interactive use
   The virtual data grid acts like a large-scale cache
                                   Grid Details
   The experimental Grid consisted of four large Condor
    pools, listed below. The University of Chicago was
    used as the master pool, for job submission and
    persistent data storage, and all the other pools were
    used for the large amount of computation involved in
    the cluster finding algorithm

     Site                                   CPUs                 Ethernet           Disk
     University of Chicago                  110 @ 300-450 MHz    10 and 100 Mb/s    360 GB
     University of Wisconsin at Madison     400 @ 1 GHz          10 and 100 Mb/s    ~500 GB
     University of Wisconsin at Milwaukee   296 @ 1 GHz          100 Mb/s           1000 GB
     University of Florida                  40 @ 1 GHz           100 Mb/s           80 GB

   WAN link speeds shown in the figure: 50 Mb/s, 600 Mb/s, and 100 Mb/s
                     Chimera Application:
               Sloan Digital Sky Survey Analysis
   Size distribution of galaxy clusters?
   [Figure: galaxy cluster size distribution, plotted against the
    number of galaxies (axis ticks at 1, 10, 100)]
   Chimera Virtual Data System
    + GriPhyN Virtual Data Toolkit
    + iVDGL Data Grid (many CPUs)
    Experimentation and Results
   By applying Chimera virtual data concepts to SDSS, the
    10 TB result was stored as 2 TB of data catalogs in the form of
    transformations, derivations, and data files.
   Experimentation with the cluster finding challenge problem
    to date has involved the analysis of 1350 square degrees of
    SDSS data (some 42,000 fields) using Chimera and the GriPhyN
    Virtual Data Toolkit. This represents approximately 20% of the
    ultimate survey coverage (42,000 of 210,000 fields), as summarized
    in the table below:
     Metric                                     # stripes   # fields

     Expected coverage of survey (by 2005)         45       210,000

     Total data available today                    14       50,400

     Total data suitable for cluster finding       13       42,000

     Total cluster finding workflow generated      13       42,000

     Total cluster finding workflow                7        14,280
        Virtual Data Research Issues
   Representation
    –   Metadata: how is it created, stored, propagated?
    –   What knowledge must be represented? How?
    –   Capturing notions of data approximation
    –   Higher-order knowledge: virtual transformations
   VDC as a community resource
    – Automating data capture
    – Access control, sharing and privacy issues
    – Quality control
   Data derivation
    – Query estimation and request planning
Virtual Data Research Issues (contd.)

   “Engineering” issues
    – Dynamic (runtime-computed) dependencies
    – Large dependent sets
    – Extensions to other data models: relational, OO
    – Virtual data browsers
    – XML vs. relational databases & query languages
            For More Information
   GriPhyN project
   Chimera virtual data system
    – “Chimera: A Virtual Data System for Representing,
      Querying, and Automating Data Derivation,”
      SSDBM, July 2002
    – “Applying Chimera Virtual Data Concepts to
      Cluster Finding in the Sloan Sky Survey”, SC’02,
      November 2002.
Thank You

