Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey
Presented by: Idowu (Id) Adewale
Instructor: Dr. Rob Simmonds
Course: Introduction to Grid Computing (CPSC 601.90)

Overview
- Introduction: importance of virtual data concepts, motivations, definition of terms
- Technology: Chimera Virtual Data System
- Application: Sloan Digital Sky Survey
- Future work: research challenges and plans

Importance of Virtual Data Concepts
- To track all aspects of data capture, production, transformation, and analysis
- To audit, validate, reproduce, and/or rerun (with corrections) the various data transformations

Motivations (1)
- “I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”
- “I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”
- “I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”
- “I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”
(Diagram: Data is the product of a Derivation; a Derivation is an execution of a Transformation; data is consumed by and generated by Transformations.)
Motivations (2)
- Data track-ability and result audit-ability
  - Universally sought by GriPhyN applications
  - Data can be sent along with its recipe
- Repair and correction of data
  - Rebuild data products (cf. “make”)
- Workflow management
  - A new, structured paradigm for organizing, locating, specifying, and requesting data products
- Performance optimizations
  - Ability to re-create data rather than move it

Virtual Data Grid
(Diagram: researchers and production managers discover, share, and compose virtual data through virtual data catalogs and a virtual data index; a request planner and request executor (DAGman, Condor-G, GRAM), with a request predictor (Prophesy), drive workflows over Data Grid services: replica location, storage elements, transport, and resource management; simulation, data analysis, and detector data feed a Computing Grid monitored by a grid operations center.)

Chimera Virtual Data System
- A database-driven approach to tracking data-intensive collaborations
- Helps scientists discover computational procedures developed by others, and the data produced by those procedures
- Combines a virtual data catalog with a virtual data language interpreter
- Coupled with distributed Data Grid services to enable on-demand execution of computation schedules constructed from database queries

Sloan Digital Sky Survey (SDSS)
- A digital imaging survey mapping a quarter of the sky in five colors, to greater magnitudes than previous large sky surveys
- Forms of digital sky data online:
  - A large collection (~10 TB) of images
  - A smaller set of catalogs (~2 TB) containing measurements of each of the ~250,000,000 detected objects

Identifying Clusters of Galaxies
- Galaxy clusters are the largest gravitationally dominated structures in the universe
- The analysis involves:
  - Sophisticated algorithms
  - Large amounts of computation
  - Distributed computing and storage resources provided by Data Grids
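As a concrete (if simplified) picture of the database-driven approach just described, the sketch below models a catalog that records which procedure produced which file, so collaborators can discover both the procedures and their products. The class and field names are illustrative assumptions, not Chimera's actual schema:

```python
from dataclasses import dataclass, field

# Minimal sketch of a virtual data catalog: transformations (procedures)
# and derivations (their recorded invocations). All names here are
# illustrative assumptions, not Chimera's real schema.

@dataclass
class Transformation:
    name: str
    description: str = ""

@dataclass
class Derivation:
    transformation: str                        # name of the procedure invoked
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

class VirtualDataCatalog:
    def __init__(self):
        self.transformations = {}
        self.derivations = []

    def register(self, tr):
        self.transformations[tr.name] = tr

    def record(self, dv):
        self.derivations.append(dv)

    def producer_of(self, filename):
        """Discover the derivation (the recipe) that generated `filename`."""
        for dv in self.derivations:
            if filename in dv.outputs:
                return dv
        return None

vdc = VirtualDataCatalog()
vdc.register(Transformation("calibrate", "apply instrument corrections"))
vdc.record(Derivation("calibrate", ["raw01.fits"], ["cal01.fits"]))
recipe = vdc.producer_of("cal01.fits")
# recipe.transformation == "calibrate"; recipe.inputs == ["raw01.fits"]
```

A lookup like `producer_of` is what lets data travel "along with its recipe": given only a filename, the catalog can answer how it was made.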
GriPhyN Tools for the Virtual Data Grid
- The GriPhyN toolkit (VDT v1.0) includes:
  - Globus Toolkit
  - Condor and Condor-G
  - Grid Data Mirroring Package
- It is proposed that the Chimera VDS be included in the VDT

Importance of Chimera in the VDT
- It supports the capture and reuse of information on how data is generated by computations

Chimera Virtual Data System (components)
(Diagram: applications, including GriPhyN VDT browsers and data-analysis applications, submit task graphs (compute and data-movement tasks with dependencies) and catalog definitions in the Virtual Data Language; the VDL interpreter manipulates derivations and transformations in the Virtual Data Catalog, which implements the Chimera virtual data schema; a query tool returns XML; Data Grid resources (Replica Catalog, DAGman, Globus Toolkit, etc.) provide distributed execution and data management.)

Virtual Data Catalog (VDC)
- Tracks how data is derived, with sufficient precision that one can create and re-create data from this knowledge
- Allows the implementation of a new class of “virtual data management operations”.
For example:
- Rematerialize data products that were deleted
- Generate data products that were never created
- Regenerate data when data dependencies or transformation programs change
- Create replicas of data products at remote locations when recreation is more efficient than data transfer

Virtual Data Language (VDL)
- VDL captures and formalizes descriptions of how a program can be invoked, and records its potential and actual invocations
- A transformation is the abstract description of how a program is to be invoked:
  - The parameters needed
  - The files it reads as inputs
  - The environment required
- A derivation is the invocation of a transformation with a specific set of input values/files

Transformations and Derivations
- Transformation
  - Abstract template of a program invocation
  - Similar to a “function definition”
- Derivation
  - Formal invocation of a transformation
  - Similar to a “function call”
  - Stores past and future: a record of how data products were generated, and a recipe for how data products can be generated
- Invocation
  - Record of each derivation execution

VDL Query
- Allows a user or application to search the VDC for derivation or transformation definitions
- Searches can be made by criteria such as:
  - Input filenames
  - Output filenames
  - Transformation name
  - Application name

Request Planner
- Generates a Directed Acyclic Graph (DAG) for use by DAGman to manage the computation of the requested object and its dependencies
- Algorithm for creating the DAG:
  - Find the derivation that contains the requested file as output
  - For each input of the associated transformation, find the derivation that contains it as output
  - Iterate until all dependencies are resolved
- Once the DAG is generated, DAGman dynamically schedules distributed computations via Condor-G to create the requested file(s)

Condor DAGs
- The dependencies between Condor jobs are specified using a DAG data structure
- The DAG manages dependencies automatically (e.g., “Don’t run job B until job A has completed successfully.”)
- Each job is a “node” in the DAG; a node may have any number of parent or child nodes; no loops are allowed
(Diagram: Job A feeds Jobs B and C, which both feed Job D.)
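The Request Planner algorithm can be sketched in a few lines of Python. The catalog layout here (a dict mapping each output file to its derivation name and input files) is an assumption for illustration; the real planner reads the VDC and emits a DAGman submission rather than Python sets:

```python
# Sketch of the Request Planner's DAG construction: start from the
# requested file, find the derivation that outputs it, recurse on that
# derivation's inputs until all dependencies are resolved.

def build_dag(target, derivations, existing=()):
    """Return (nodes, edges) of the DAG that materializes `target`.

    derivations: maps an output filename -> (derivation_name, [input files])
    existing:    files already materialized, which need no computation
    """
    nodes, edges = set(), set()

    def resolve(filename):
        if filename in existing or filename not in derivations:
            return  # raw input, or an already-materialized product
        dv, inputs = derivations[filename]
        nodes.add(dv)
        for inp in inputs:
            if inp in derivations and inp not in existing:
                parent_dv = derivations[inp][0]
                edges.add((parent_dv, dv))  # parent job must finish first
            resolve(inp)

    resolve(target)
    return nodes, edges

# Toy two-step pipeline: raw.dat -> brg.dat -> cluster.dat
catalog = {
    "brg.dat": ("brgSearch", ["raw.dat"]),
    "cluster.dat": ("bcgSearch", ["brg.dat"]),
}
nodes, edges = build_dag("cluster.dat", catalog)
# nodes == {"brgSearch", "bcgSearch"}; edges == {("brgSearch", "bcgSearch")}
```

Passing `existing=("brg.dat",)` prunes the upstream job, which is exactly the "rematerialize only what is missing" behavior the VDC enables.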
Applications of the Chimera Data System
- Reconstruction of simulated collision-event data from a high-energy physics experiment
- Searching digital sky survey data for galactic clusters
- Planning and orchestrating the workflow of thousands of tasks on a data grid comprising hundreds of computers

Finding Clusters of Galaxies in SDSS Data
- In practice, calculations are performed at the location of a galaxy
- For each galaxy, the likelihood that it is a luminous Bright Red Galaxy (BRG) is calculated
- If that likelihood is above a threshold, the likelihood that it is a Brightest Cluster Galaxy (BCG) is computed, by weighting the BRG likelihood by the number of galaxies in the neighborhood
- Finally, check whether this galaxy is the most likely in the area; if so, a cluster is present at this point

The Cluster-Finding Pipeline
- maxBcg is a series of transformations
(Diagram: tsObj files pass through transformations 1 (Field), 2 (BRG), 3 (Core), and 4, with transformation 5 producing the Cluster Catalog; several field/BRG streams feed each core step.)
- Interesting intermediate data reuse is made possible by Chimera, if the problem of data discovery is solved
- Cluster finding works with 1 Mpc-radius apertures, in order to have dynamic range in the estimate
- If one were instead looking for sites of gravitational lensing, one would maximize the mass surface density by using a 0.25 Mpc radius; this would start at transformation 3. It is here that virtual data comes in.
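The per-galaxy decision procedure just described can be sketched as follows. The likelihood values and the threshold are hypothetical stand-ins for the actual maxBcg computations; only the control flow follows the slides:

```python
# Sketch of the per-galaxy cluster test: BRG likelihood -> threshold ->
# neighborhood-weighted BCG likelihood -> local-maximum check.
# The data values and threshold below are illustrative, not survey values.

BRG_THRESHOLD = 0.5  # assumed cutoff, not a value from the survey

def brg_likelihood(galaxy):
    return galaxy["brg"]       # stand-in: a precomputed BRG likelihood

def neighbor_count(galaxy):
    return galaxy["neighbors"] # stand-in: neighbors within the 1 Mpc aperture

def is_cluster_center(galaxy, galaxies):
    brg = brg_likelihood(galaxy)
    if brg < BRG_THRESHOLD:
        return False  # not a likely luminous red galaxy
    # Weighted BCG likelihood: BRG weighted by neighborhood richness
    bcg = brg * neighbor_count(galaxy)
    # A cluster is present only if this galaxy is the most likely in the area
    rivals = [brg_likelihood(g) * neighbor_count(g)
              for g in galaxies if brg_likelihood(g) >= BRG_THRESHOLD]
    return bcg >= max(rivals)

galaxies = [
    {"id": 1, "brg": 0.9, "neighbors": 12},  # bcg = 10.8 -> cluster center
    {"id": 2, "brg": 0.7, "neighbors": 3},   # bcg = 2.1  -> outranked
    {"id": 3, "brg": 0.2, "neighbors": 20},  # below BRG cutoff
]
```

In the real pipeline these steps are split across the brgSearch, bcgSearch, and coalescing transformations so each stage's output can be cached and reused.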
Cluster-Finding Pipeline (Contd)
- Nodes represent data files; arrows represent transformations
- The first and second stages are straightforward: each output file corresponds to one input file
- The third and fourth stages operate in a “buffer zone” around the target field
- The bcgSearch stage (transformation 3) takes both field files and brg files as inputs

Cluster-Finding Transformations
- FieldPrep: extracts from the full data set the required measurements on the galaxies of interest
- BrgSearch: calculates the unweighted BCG likelihood of each galaxy (using the BRG likelihood)
- BcgSearch: calculates the weighted BCG likelihood for each galaxy; this is the heart of the algorithm
- BcgCoalesce: determines whether a galaxy is the most likely galaxy in its neighborhood
- GetCatalog: removes extraneous data and stores the result in a compact form

Small SDSS Cluster-Finding DAG
(Figure: a small example cluster-finding DAG.)

System Architecture
- The SDSS software environment was integrated with Chimera and a test grid was constructed
- Bulk data production is performed in parallel with interactive use
- The virtual data grid acts like a large-scale cache

Grid Details
- The experimental grid consisted of four large Condor pools: the University of Chicago (used as the master pool, for job submission and persistent data storage), the University of Wisconsin at Madison, the University of Wisconsin at Milwaukee, and the University of Florida; the other pools supplied the large amount of computation involved in the cluster-finding algorithm
(Diagram: the four pools, connected by WAN links of 50-600 Mb/s, range from 40 CPUs @ 1 GHz with 80 GB disk, through 110 CPUs @ 300-450 MHz with 360 GB disk and 296 CPUs @ 1 GHz with 1000 GB disk, to 400 CPUs @ 1 GHz with ~500 GB disk; the pools use 10 Mb/s and 100 Mb/s Ethernet internally.)

Chimera Application: Sloan Digital Sky Survey Analysis
- Size distribution of galaxy clusters?
(Figure: log-log histogram of the galaxy-cluster size distribution, cluster count versus number of galaxies per cluster.)
- Produced with the Chimera Virtual Data System + the GriPhyN Virtual Data Toolkit + the iVDGL Data Grid (many CPUs)

Experimentation and Results
- Applying the Chimera virtual data concept to SDSS, the 10 TB result was stored as 2 TB of data catalogs in the form of transformations, derivations, and data files
- Experimentation with the cluster-finding challenge problem to date has involved the analysis of 1,350 square degrees of SDSS data (some 42,000 fields) using Chimera and the GriPhyN Virtual Data Toolkit
- This represents approximately 20% of the ultimate survey coverage, as summarized in the table below:

  Metric                                     # stripes   # fields
  Expected coverage of survey (by 2005)             45    210,000
  Total data available today                        14     50,400
  Total data suitable for cluster finding           13     42,000
  Total cluster-finding workflow generated          13     42,000
  Total cluster-finding workflow processed           7     14,280

Virtual Data Research Issues
- Representation
  - Metadata: how is it created, stored, propagated?
  - What knowledge must be represented? How?
  - Capturing notions of data approximation
  - Higher-order knowledge: virtual transformations
- VDC as a community resource
  - Automating data capture
  - Access control, sharing, and privacy issues
  - Quality control
- Data derivation
  - Query estimation and request planning

Virtual Data Research Issues (contd.)
- “Engineering” issues
  - Dynamic (runtime-computed) dependencies
  - Large dependent sets
  - Extensions to other data models: relational, OO
  - Virtual data browsers
  - XML vs. relational databases and query languages

For More Information
- GriPhyN project: www.griphyn.org
- Chimera virtual data system: www.griphyn.org/chimera
  - “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation,” SSDBM, July 2002
  - “Applying Chimera Virtual Data Concepts to Cluster Finding in the Sloan Sky Survey,” SC’02, November 2002

Thank You
Questions?