					                                                         Final Version: Thursday, 01 November 2001

                           DØ Analysis Grid Application
                       Use Case and Requirements Document
                              UK Experiments Board

   1) Introduction

This document proposes a DØ grid project that is achievable on the timescale of 1-2 years and that
can demonstrate results and benefit to the experiment in the next 6 months. An additional document
has also been prepared describing a combined DØ and CDF project based on SAM [i].
The DØ experiment is an active Grid-based experiment today. A worldwide data grid (SAM [ii]) is already
operational and being used for analysis of both Detector and Monte Carlo data on a day-to-day basis.
DØ is determined to stay at the forefront of Grid development by making best use of Grid tools as
they are developed and contributing our experience and components of value to the wider
community. It is clear that the Grid concept and software will be given its most thorough testing by
active experiments with users who require immediate access to their data and an effective analysis
environment in which to run their jobs. DØ presents the possibility of extending an
initial working data grid system to integrate with emerging grid standards, and of having a fully
functional Grid within the next two years (possibly earlier), thus demonstrating that the concept is
possible and helping to ensure continued funding to the field.
The essential elements of a Grid system are the transparent use of widely distributed compute and
storage resources to solve a common problem for a Virtual Organization - in this case the DØ
experiment, which must make full use of the computing and human resources of its 70
geographically distributed collaborating institutions. The compute and storage resources may not
be fully owned by DØ and may appear and disappear on the Grid as available to participate in the
work of the experiment. For the user of a Grid system transparency means that he/she may submit
jobs that analyse or create datasets (set of files) without knowing where the datasets of interest may
reside, and without knowing or controlling where the job will run. Ideally the Grid system not only
knows where all copies of the data required reside in the system, where created data should be sent
for storage in the system, but also can use knowledge of the state of the entire Grid system to
determine whether to move a job to data residing on disk at some location, or to fetch data for use
by a job, or to carry out a combination of these two activities – all transparent to the user. It can
also provide feedback to the user on the state of their job and estimations for completion time or
cost of the job.
No one has built such a completely transparent and fully functional Grid system, or has the tools to
do so at this time. The SAM Grid system already has a large number of elements of an ideal Grid
system. Computing and Storage resources are partitioned into Stations – a logical collection of
computers, networks, disks with optional access to a Mass Storage System from one or more
computers, that work together and are administered as a unit of SAM and where data may reside on
disk cache and jobs can be run. SAM stations register themselves in the DØ Grid when they are
running and available to perform services for local or remote users. The system gives a high degree
of transparency with respect to access to data and will transparently resolve user specified (via a
meta-data query language) data into a set of data files and reliably move the data files to the job. It
also provides a high degree of transparency in storage of data files, that may be reliably stored in
any permanent storage location of the data grid, routed from one Station to another to achieve this
as necessary. Currently the system provides only for job submission at a Station where the user has
been granted both access and resources (i.e. ability to run jobs, disk cache space for their group,
etc.). The job currently will run only on the Station at which it is submitted – transparently using
the batch system available on that Station. A variety of underlying batch systems including LSF,
PBS, Condor, and FBS are currently transparently supported behind the user's “sam submit” command.
Future extensions to the DØ Grid system are discussed in more detail below and have three main thrusts.
The first is towards interoperability between DØ-owned resources and shared resources. This
involves publishing information about the system and data files in a standard form and using
interfaces and protocols that emerge as common ones from the various Grid projects. It is apparent
that Globus security infrastructure, Globus Grid information services and file transfer services will
be some of the common middleware components that facilitate interoperability of Grid systems.
The second thrust is towards development of new capabilities in job submission and resource
management that will provide the missing transparency in job processing. This is a major piece of
work that has many levels of functionality and requires many components in order to attain a fully
functional, robust and transparent system. Components from various Grid projects, including
DataGrid and PPDG will be integrated bit by bit to attain the final goals, with the DØ SAM Grid
system providing a working framework for implementing and testing each new level of
functionality and sophistication. As mentioned earlier this will provide essential feedback on the
viability and utility of the Grid middleware being developed and will require active participation by
DØ and Fermilab people in multiple Grid middleware projects and probably in certain Global Grid
Forum working groups.
The third thrust is towards enhancing the diagnostic and monitoring capabilities in the system,
through use of emerging grid standards for monitoring frameworks and tools. This monitoring
thrust is less well developed in its explicit goals and deliverables at this time.

   2) Specification of SAM Grid enhancements

   1   Use of standard middleware to promote interoperability
       1.1 Use of Globus security infrastructure for Station to Station services, such as file transfer
           and job submission. Interoperability with Fermilab Kerberos security infrastructure and
           use of recognized Certificate Authorities on both sides of the Atlantic, in conjunction
           with work of other Grid projects.
       1.2 Use of GridFTP as one of the supported file transfer protocols
       1.3 Use of future enhanced Globus security infrastructure, including Community
           Authorisation Service
       1.4 Use of Globus as one supported system for job submission.
       1.5 Use of Condor and extensions as one supported system for job submission
       1.6 Publish availability and status of SAM station resources and services using standard
           Grid middleware components and services, when they emerge.
       1.7 Publish catalog of data files and their replicas using standard or standards emerging from
           PPDG and DataGrid
   2   Additional Grid functionality for Job specification, submission and tracking
       2.1 Use of full Condor services for migration and checkpointing of jobs – as much as is
           possible with DØ software and the DØ software framework. This may require work on
           both Condor and DØ software to achieve full functionality

       2.2 Building incrementally enhanced Job specification language and job submission services
           that ensure co-location of job execution and data files and reliably execute a chain of job
           processing, with dependencies between job steps. The first step in this is expected to be
           work in conjunction with the Condor team to provide for specification and execution of
           a Directed Acyclic Graph of jobs using an extended version of the DAGMAN product
           that CMS is testing for their MC job execution.
   3   Enhancing Monitoring and Diagnostic capabilities
       3.1 Extensions to existing system of logging all activities in the system to both local and
           central log files - as demanded by robustness and increased use of system.
       3.2 Incorporation of emerging Grid Monitoring Architecture and monitoring tools. Little
           exists on this at this point and this work will involve working with other Grid projects
           and participating in Global Grid Forum working groups
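
Item 2.2 above calls for executing a Directed Acyclic Graph of job steps with dependencies between them. As a minimal sketch of that idea (this is not DAGMan and not DØ code), the Python below orders the steps of an MC processing chain so that each step runs only after the steps it depends on. The step names follow the production chain described in Section 5; the dictionary layout and the function name are illustrative assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each step lists the steps it depends on.
# Step names mirror the MC chain in this document; the layout is an assumption.
CHAIN = {
    "generate": [],
    "d0gstar": ["generate"],
    "d0sim": ["d0gstar"],
    "d0reco": ["d0sim"],
    "recoanalyze": ["d0reco"],
}


def execution_order(dag):
    """Return the steps in an order that respects every dependency.

    Raises graphlib.CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in execution_order(CHAIN):
        print(step)
```

A real job description language would also carry the executable, its parameters and the input Dataset for each node; the sketch only shows the ordering constraint.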

   3) Proposed DØ Grid Applications
DØ proposes to work on the following Grid applications.
   1. To upgrade the distributed Monte Carlo production system as a short term use case
      for demonstrating, in an incremental fashion, essential components of the Grid. In particular
      this involves demonstrating transparent job submission to a number of DØ processing
      centers (SAM stations), reliable execution of jobs, transparent data access and data storage
      and an interface for users to understand and monitor the state of their MC requests and the
      resultant MC jobs and data that satisfy their requests.
   2. To demonstrate analysis of data products (both MC and Detector Data) on desktop systems
      in the UK, using an enhanced version of the DØ SAM Grid system that incrementally
      incorporates Grid middleware components. This will not only demonstrate active use of a
      Grid for analysis but will also eventually demonstrate interoperability between the DØ Grid
      and other emerging Grid testbeds.

DØ Data Model
To fully understand the proposed DØ Grid applications some background information is required
about the DØ data and analysis model. We will describe in outline form the structure of the
different types of data, their storage and access methods, and the analysis method.
There are four broad classifications of data:
   1. Physics Data: This data is obtained directly or derived from the detector and is used for
      physics analyses. The minimum sized object for storage is an event, which is stored as
      classes in random access files (EVPACK). It comes in several flavours:
          - Raw Data, which is the unprocessed data obtained from the detector. The Raw data
            is written to disk and tape. Typically it will only be available on tape after
            reconstruction. Events are sorted into physical streams based on the triggers.
          - Reconstructed Data, which has been processed by the DØ reconstruction program.
            There are several output forms:
              - EDU250 (STA), which contains the raw data and all reconstructed objects and is
                to be used for debugging purposes. Only a small fraction of data will be stored
                in this format.
              - EDU50 (DST), which contains all the reconstructed objects and some other
                detector information. It contains enough information for most analyses. All of
                the data is available on tape, with some fraction on disk.
              - The thumbnail, a summary file containing a minimal set of information about
                each event. The thumbnail will be stored in root format.
            The files are associated via the file that the reconstruction program was run on.
            Note that reconstruction can be run on the EDU50 format file as well.
   2. Calibration and Monitoring Data: This is data that is used for calibration. It is generally
      processed on collection and summary data is stored in the DØ database. This is used as input
      parameters for the Reconstruction and checking the validity of the data. This includes
      luminosity and trigger information.
   3. MC Data
   4. Metadata: This describes the relationships between the various data files, calibration,
      trigger, run summary and MC production, etc.

   4) DØ Applications and their Data Access
The typical way of reading and analysing data in DØ is to read sequentially through all of the events
in a data file. The name Sequential Access through Metadata (SAM) thus came to be used to
characterize the data handling approach. So the entire system (rather than just the software
components, as it should be) is frequently referred to as “SAM”.
At DØ all data analysis, MC event generation, detector simulation, and reconstruction are carried
out within a standard framework and all retrieval and storage of data files uses SAM. In general a
user will run a program that carries out a series of selection and analysis tasks on a set of data or
MC events. These events are stored in files of similar events. Analysis jobs are controlled using a
combination of command line parameters and run control parameter (rcp) files. They may require
access to other conditions data and this is provided via independent manager packages that access
the conditions database through a database server.
Files of all types and data contents are entered and accessed from the data store via their unique
identifier (at DØ it is the file name). The data store is a logical concept and consists of several
storage locations worldwide, using hard disk and tape libraries.
Programs access the data by specifying a “Dataset” that they wish to use. They do this by means of
formulating queries based on meta-data stored in a central Oracle database. The information in the
database describes the data in the Data Store, as well as other files that existed at some stage of data
processing and from which the Data Store data were derived. Run and configuration data, input
parameters, full processing history, luminosity and calibration data are also stored in relational
database tables and their complex relationships to data files are recorded. Other fairly static data,
such as geometry and magnetic field data, are kept in flat files in cvs packages. Queries that specify
the data of interest are made using a fairly sophisticated query language that can be specified either
as a command or through a separate web-based interface. The Dataset (i.e. list of files) that results
from executing the stored query itself is named, versioned and saved and may be re-used.
Programs are run, via a sam submit command, specifying their input Dataset either by name and
version, or by means of the stored query that is to be executed at this time to determine the Dataset.
The framework I/O packages transparently read files from SAM if their input filename is set to
SAMInput:. They are fed a stream of files from the Dataset, with the SAM system taking care that
the files are made available, transported to a local disk cache and that all bookkeeping and tracking
of which programs have 'seen' which files is properly handled. Storing back of resultant data files
is likewise handled transparently, with meta-data describing the resultant data file produced
automatically and the final storage destination for such a file determined automatically through
mapping tables in the SAM configuration database. Many robustness, retry and recovery features
are also implemented to allow for processing of large datasets and to allow for file delivery
problems and retries using alternate locations of files, if available.
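
The Dataset mechanism described above (a metadata query resolved into a list of files, then named, versioned and saved for re-use) can be illustrated with a small sketch. This is a toy in-memory stand-in, not the SAM schema or query language: the class name, the `file_name`, `tier` and `run` fields, and the keyword-argument query form are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetStore:
    """Toy metadata catalogue (hypothetical; not the SAM database layout)."""

    catalogue: list                                 # one metadata dict per file
    snapshots: dict = field(default_factory=dict)   # (name, version) -> file list

    def resolve(self, **constraints):
        """Return the names of files whose metadata matches every constraint."""
        return sorted(
            f["file_name"]
            for f in self.catalogue
            if all(f.get(key) == value for key, value in constraints.items())
        )

    def define(self, name, **constraints):
        """Run the query now and save the resulting file list as the next version."""
        latest = max((v for (n, v) in self.snapshots if n == name), default=0)
        version = latest + 1
        self.snapshots[(name, version)] = self.resolve(**constraints)
        return version


if __name__ == "__main__":
    store = DatasetStore([
        {"file_name": "a.raw", "tier": "raw", "run": 1},
        {"file_name": "b.dst", "tier": "dst", "run": 1},
    ])
    version = store.define("run1-dst", tier="dst", run=1)
    print(version, store.snapshots[("run1-dst", version)])
```

Saving the resolved list under a name and version is what lets a later job re-use exactly the same files, while re-running the stored query picks up files catalogued since.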

Additional manager packages are available in the DØ framework and these give access to the
conditions database for Run, Calibration, Trigger and other parameter information. These provide
network access via CORBA-based servers and so DØ framework applications do not require direct
SQL access to databases.

   5) The DØ Monte Carlo Production System and the Proposed Application

The DØ Monte Carlo Production System consists of the following steps [iii]:
     I. Job Request System
    II. Job Distribution System
   III. A Job Control system for running the executables (and the executables themselves). This
        Job Control system comes in two parts:
           a. Job Builder
           b. Job Runner
    IV. Submitting the Metadata describing the job to the database and storing the Job Output in
        the DØ data store.
     V. Notifying the Job Requestor that their Job is Complete.

   I   Job Request System
   Current: This consists of an informal list of required MC production work and manual
   decisions about what to run and at which processing station to run the job. The SAM database
   contains tables for storing MC requests, together with all of the relevant physics parameters.
   The same tables of detailed descriptive parameters are also used to characterize each MC file
   produced, including intermediate files that are used in the MC job processing chain but not
   permanently stored.
   Proposed Work: The user will be presented with a standard interface (GUI/Web) to the Job
   Request system that will allow them to define the Monte Carlo events to be generated at a
   remote facility. This request will be stored in a database. An interface will be provided to report
   on the status of the request.
   II Job Distribution System
   Current: Providing initial generator files and email to processing station asking them to start the
   job.
   Proposed Work: The Metadata description of the request needs to be translated into a Job
   Description Language. This step will also split the request into appropriately sized chunks,
   prioritise the request, and produce a job control script that will be used for processing the
   request. The jobs will then be in essence 'submitted' to a DØ Grid job queue and will be handled
   by an available processing station or stations. The distribution system will allow DØ to manage
   the resources of the experiment in an automatic manner, sending MC requests to available resources.
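
The splitting of a request into appropriately sized chunks can be sketched as follows. This is a minimal illustration under stated assumptions: a request is taken to be a number of events to generate, and the function name, the job identifier format and the tuple layout are hypothetical rather than part of any existing DØ tool.

```python
def split_request(request_id, total_events, chunk_size):
    """Split a request into (job_id, first_event, n_events) tuples.

    Every chunk holds chunk_size events except possibly the last one.
    """
    if total_events < 0 or chunk_size <= 0:
        raise ValueError("total_events must be >= 0 and chunk_size must be > 0")
    jobs = []
    for index, first in enumerate(range(0, total_events, chunk_size)):
        n_events = min(chunk_size, total_events - first)
        jobs.append((f"{request_id}-{index:03d}", first, n_events))
    return jobs


if __name__ == "__main__":
    for job in split_request("req42", 2500, 1000):
        print(job)
```

Each tuple would then be turned into one job description and placed on the job queue; prioritisation is deliberately left out of the sketch.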

       III     The Job Control System
       The Job Control System is divided into two parts. The first is the job builder that has to
       know all the DØ specific information, and the second is the Job Runner, which can be a
       generic job control program. The Job Control System is a standard DØ interface to set
       standard variables for the MC production. The MC production chain is as follows: Event
       Generation using a standard generator such as Pythia, Herwig, and Isajet, that has been
       integrated into the DØ framework; DØgstar, a detector simulation written using the standard
       Geant package; DØsim, which digitizes the Geant output, introduces electronic noise, and
       overlays additional events to simulate a real beam-beam crossing; DØreco, the standard
       reconstruction package; and Recoanalyze which produces a standard root-file of objects to
       be returned to the user. The output of DØreco and Recoanalyze are kept in the DØ data store
       and the rest of the production output is discarded. The Job Control System then produces
       Metadata describing all stages of the production process.
       Current: The mc_runjob package is used. Each site has its own batch system and scripts
       for actually running the job.
       Proposed Work: It is proposed to evolve and incrementally enhance the Job Runner part, in
       conjunction with PPDG work. This includes working with CMS people who are using a
       variant of mc_runjob for their MC production system and with Condor people. This RunJob
       part of the Job Control system should also be applicable eventually to a chain of processing
       of any sort, including an analysis chain for detector data. Eventually this will involve
       interruption of a chain of processing steps and migration of it to another resource.
       IV Submitting the meta-data describing the job and storing results
       Current: Once a job is complete, the results are made available to the requestor and the rest
       of the collaboration by storing the Metadata and submitting the resulting files to the data
       store. The files can be stored centrally or locally until they are required for an analysis at
       another physical location.
       Future: Communication of preferences for file storage in the job description and ability for
       this final stage of processing to also be handled as yet another processing step.

       V User Notification
       Current: Email or look at web page created manually or by each processing station.
       Future: Incorporate status and enquiry tools into the user's view of the entire request system.

    6) Use Cases
Statement: Both use cases will use the same architecture. MC Production is a more controlled
problem for developing appropriate solutions. A user analysis job requires full functionality and
seamless integration.

Monte Carlo Production
    UC1.1 The user, at their desktop, has to identify themselves as a valid DØ user who has
      permission to submit MC requests. This would be via a desktop command line interface or
      via a web interface to a centralized DØ system.
    UC1.2 The User invokes a MC Job Submission Request tool that will run on the desktop. This
      tool will query the DØ database for valid Metadata options. The User will then be presented
      with a menu that allows them to select which Metadata options they wish to use and that
      will fully define the Job. The Metadata then needs to be stored in a database and allocated a
      Request Identifier. The User's Job submission is now complete and the Request tool is closed.
          1. The user will need to define the input of the Job. It can be one of two things:
                   i. Instructions to generate 4-vectors with a standard MC generator. In this case
                      a set of instructions will need to be provided for the generator.
                  ii. A set of files containing 4-vectors to be processed that is to be provided by
                      the user. In this case the files have to be submitted to SAM so that the
                      files and their locations are available and can be delivered to a remote site for
                      processing.
         2. The user will have to define which detector description will be used. This file will
             have to be made available at the remote site.
         3. The user will have to define what type of events will be overlaid at the digitisation
             stage to simulate pile-up and additional proton-antiproton collisions. These files will
             have to be declared in SAM and in most cases the remote sites will have a collection
             of files available. If not, the files will have to be sent to the node along with the job.
         4. Does the reconstruction program need any special configuration files other than the
             default inputs? If so these have to be defined and made available (this is typically
             done by requiring that the code is released). In some cases the reconstruction
             program may need to access a database for alignment and configuration information.
         5. If a trigger simulation is required then a trigger list has to be defined and made
             available. This may be done via a remote file or by accessing a database.
   UC1.3 The Metadata submitted to the database has to be analysed to determine what resources
     are required to process the job. This will probably occur at a centralised location and will
     not be carried out by the user.
         1. What executables and runtime environment are required?
         2. Does the job require any input files? Where are they located?
         3. An estimation of the resources required is made.
   UC1.4 The submitted Metadata needs to be processed to produce a Job description. These job
     description files will be used to run the jobs.
         1. The Request may need to be divided into smaller (sub-)requests into chunks that can
             be run remotely.
         2. The job description file needs to state exactly what programs are required. It needs
             to state the required runtime environment and any special requirements of the site
             that will be running the job.
   UC1.5 The jobs are 'submitted' to a centralized Grid based job queue and are now available
     for processing.
   UC1.6 Each remote resource site runs a process that interrogates the queue manager (which
     may be remote from it) to ask for any pending requests. If any are present one will be
     transferred via Grid services.
         1. The remote site will need to determine if it has the required software and
         2. If it does not have the required software it needs to install it.
   UC1.7 The remote facility will access any required input files via SAM/Grid.
   UC1.8 The remote facility processes the Job.
   UC1.9 On completion of each stage of processing the metadata describing that stage is
     generated and declared to the metadata database.
   UC1.10 The job status is updated in the database automatically providing the user with status
     on their jobs via a web/GUI interface.
   UC1.11 On completion of Job, store output data in the DØ data store.
   UC1.12 Notify User of Job Completion.
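
Steps UC1.5 and UC1.6 describe a pull model: jobs wait in a central queue and each remote site asks for work, then checks whether it has the required software. A minimal sketch of that interaction is given below. It is an in-memory illustration only; the class, its method names and the software labels are hypothetical, and a real system would reach the queue manager through Grid services rather than a local object.

```python
from collections import deque


class JobQueue:
    """Toy central job queue that remote sites poll for work (hypothetical)."""

    def __init__(self):
        self._pending = deque()

    def submit(self, job_id, required_software):
        """Add a job together with the software it needs at the execution site."""
        self._pending.append((job_id, frozenset(required_software)))

    def request_job(self, installed_software):
        """Hand out the oldest pending job, or None if the queue is empty.

        Returns (job_id, missing), where missing lists the software the
        requesting site must install before it can run the job.
        """
        if not self._pending:
            return None
        job_id, required = self._pending.popleft()
        missing = sorted(required - set(installed_software))
        return job_id, missing


if __name__ == "__main__":
    queue = JobQueue()
    queue.submit("req42-000", ["d0sim", "d0reco"])
    print(queue.request_job(["d0sim"]))
```

Returning the missing software with the job mirrors the two sub-steps of UC1.6: the site first determines what it lacks and then installs it before processing.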

   7) User Analysis Job
   UC2.1 The user decides on a physics project. This project could be one of a number of things: a
     physics measurement (e.g. top quark mass), calibration of the detector, alignment, event
     selection, etc.
         - The choice of project will affect the resources required, the data selection, the
             triggers, the MC samples, etc.
   UC2.2 User defines a SAM Dataset, i.e. a Metadata description of a list of files to be analysed.

   UC2.3 The user writes an analysis program to investigate a SAM Dataset and links it with the
     DØ framework (which is linked to SAM tools automatically).
   UC2.4 User defines a SAM Dataset, i.e. a Metadata description of a list of files to be analysed.
   UC2.5 User submits analysis program with SAM Dataset as the input to a DØ-Grid queue for
     processing (creating a Job).
   UC2.6 The Job is analysed to determine the optimum location for execution.
   UC2.7 The data files and additional metadata required for the Job are prepared and delivered to
     the execution site.
   UC2.8 The job starts executing and processing files.
   UC2.9 For some reason file delivery is interrupted (failed tape read, failed network connection,
     etc.).
   UC2.10 The Job is checkpointed and stopped, freeing resources until file delivery is able to
     resume. The user is notified of the delay.
   UC2.11 File delivery resumes at some node and the Job resumes, and the user is notified.
   UC2.12 Job completes and output is delivered to the user.
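
Steps UC2.6 and UC2.7 involve choosing where a Job should run and which files must still be delivered there. The sketch below shows one simple heuristic, assuming that the only criterion is how many of the Dataset's files a station already holds in its disk cache. That criterion, the function name and the station names are illustrative assumptions; the document itself expects a real broker to weigh much more information (see R.8).

```python
def choose_station(dataset_files, station_caches):
    """Pick the station that already caches the most files of the Dataset.

    station_caches maps a station name to the files in its disk cache.
    Returns (station, files_to_fetch). Ties go to the alphabetically first
    station so that the choice is deterministic.
    """
    if not station_caches:
        raise ValueError("no stations available")
    wanted = set(dataset_files)
    best = max(
        sorted(station_caches),
        key=lambda name: len(wanted & set(station_caches[name])),
    )
    to_fetch = sorted(wanted - set(station_caches[best]))
    return best, to_fetch


if __name__ == "__main__":
    caches = {"station-a": ["f1"], "station-b": ["f1", "f2"], "station-c": []}
    print(choose_station(["f1", "f2", "f3"], caches))
```

The job moves to the data where most of it already resides, and only the remaining files are fetched, which is the combination of the two activities described in the Introduction.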

   8) Requirements

   R.1 The Condor Universe or equivalent should be able to support dynamically linked libraries.
   R.2 Grid security needs to be compatible with and interfaced to Kerberos security used and
       required by Fermilab. As a direct result of this solution a DØ Grid Community should be
       formed (this allows use of all DØ resources).
   R.3 GridFtp and Grid I/O need to be compatible with the data format used by DØ. Preferably
       they should be blind to the format so the format is not dependent on transfer method.
   R.4 Any metadata catalogue requirements should be flexible enough to allow use of pre-existing
       Metadata systems.
   R.5 The data catalogue needs to maintain the information on links between related files, i.e. if
       a file is created from another file this information needs to be maintained in the file
   R.6 GRID tools must provide effective status reports and error checking. The status of any given
       job must be made available.
   R.7 We require real time network monitoring and alternative resources in case of failure.
   R.8 Resource brokerage must be adaptable, work in real time, and have access to, as a minimum,
       information on: network connectivity, site resources, available environments, available
       software resources, reliability, etc.
   R.9 Ability to prioritise jobs, and for an authorised user to change their priority.
   R.10 Need Database access methods from remote sites integrated into the Grid software.
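
Requirement R.5 asks the catalogue to keep the links between a file and the files it was created from. A minimal sketch of such a lineage record and of a query over it is shown below. The mapping of each file to its direct parents, the function name and the file names are assumptions for illustration and do not reflect the SAM database tables.

```python
def ancestors(file_name, parents):
    """Return every file the given file was derived from, directly or indirectly.

    parents maps a file name to the list of files it was created from.
    """
    seen = set()
    stack = [file_name]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:      # guard against revisiting shared parents
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Hypothetical lineage: a thumbnail made from a DST made from a raw file.
    lineage = {"thumb.root": ["reco.dst"], "reco.dst": ["run1.raw"]}
    print(sorted(ancestors("thumb.root", lineage)))
```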

   9) DØ Grid Working Group
DØ is establishing a Grid working group and there is a Fermilab/DØ SAM Grid development Team
(~3.5FTEs) including 2 positions funded through the PPDG project, as well as effort from other
core SAM developers. Together they will be responsible for implementing and testing Grid
enhancements to SAM and for carrying out the application developments that demonstrate
advanced Grid features. There are a large number of groups interested in this and development on
some of the tools has already begun.
Two other SAM development teams, each of approx 3 FTEs, will continue to work on support and
enhancements to the SAM system to meet DØ requirements for reliability, robustness, and intuitive
access to data for Analysis. The number of SAM users (>300 registered) is growing daily and
requires continual reassessment of the tuning parameters of SAM. Currently there are 14 active
SAM stations with more sites being commissioned on a regular basis. This active use of SAM
produces valuable feedback and additional requirements as people use the system in new ways and
make increasing demands on it.

   10) Collaboration
DØ is actively engaged in joint development efforts on the Grid. These collaborative efforts involve
many groups and will be actively pursued and developed over the next 12 months. Listed below are
some of the more active collaborations:
    - Particle Physics Data Grid [iv]
       DØ-SAM is a collaborator on PPDG and is working on joint projects with the Condor
       developers on extensions to SAM and Condor aligned with the projects described above.
       Several members of PPDG are active developers of the SAM package.
    - Condor
       We are working with the Condor group to integrate the Condor batch system with our
       Runjob. We are exploring the possibility of adapting Runjob and Condor to encourage more
       common software to be available through standard Grid tools.
    - Globus
       Discussions on integrating SAM into Globus as a file catalogue. (This is at the idea stage. It
       is not happening now – it could be part of a proposed work package.)
    - Fermi CMS Computing Group
       The Fermilab CMS group is investigating the possibility of using SAM as the file and
       metadata catalogue for CMS. They are developing a version of Runjob for use at CMS.
    - iVDGL (International Virtual Data-Grid Laboratory)
       This US Grid project has DØ members who are playing an active part in this effort.
    - IGMB – InterGrid Management Board
       Active DØ collaborators are working in this group.
    - EUDG
       PPDG (and hence DØ) is in constant communication with EU Datagrid in the areas of
       resource management and scheduling, with the aim of converging on compatibility.
The Fermilab Computing Division has stated that it is willing to discuss adapting the SAM package
for use by other experiments. In particular SAM boasts a very complete File catalogue and metadata
system that is well suited to High Energy Physics and with a little effort can be adapted to other
The Runjob package is also being developed by the Fermilab CMS computing group with active
involvement from members of the Condor group. We are separating the package into two parts. The
first is a generic Job submission package giving the user a straightforward interface for operating
Condor and the second is an experiment specific job description tool.

   11) Future UK Work Effort and Manpower Requirements

This is a copy of the request in the summary document i.
We request a total of 6 additional FTEs to work on the development of SAM and user applications
of the experiments. The UK CDF and DØ groups are currently supplying approximately 2 FTEs to
the development and maintenance of SAM within the context of the UK groups.

Four of the additional FTEs will work on integrating core Grid functionality into SAM as
described in Section 2). These positions would not be working on user applications. These
FTEs would work in conjunction with the SAM project team at Fermilab, the PPDG and their
collaborators. These posts will concentrate on the projects essential for
meeting the proposed deliverables given in Section 12) and for meeting the goal of having a functional
Grid, built with modern Grid tools, as proof that the Grid will work for future experiments.

The remaining two FTEs will work on developing user applications with one FTE per experiment.

DØ proposes to develop its Monte Carlo Production system as the user application.

CDF will allocate the user application FTE to work on the development of an interface for CDF
user applications to SAM. The UK groups will supply an additional 1 FTE to work on the
integration of CDF code with SAM. The FTEs working on Core projects will also offer support
where required.

     12) Deliverables

Month 3         CDF Pilot project ends with test of Globus tools and SAM. A deployed
                SAM station at one or more CDF institutes should be available to
                transfer files for data analysis.
Month 6         Integration of Globus Security infrastructure and GridFTP into SAM
                and deployment at several UK stations interoperating with Fermilab
                and other SAM stations.
                First demonstration of MC production system using Request interface
                and automated job submission to one SAM station with limited
                intelligence in job distribution and load balancing.
Month 12        Fully commissioned MC Production System with reliable execution of
                jobs, splitting into sub-jobs as necessary, intelligent job distribution and
                load balancing, taking into account the economics of data movement
                versus job movement.
Month 24        A fully robust production quality system, excellent monitoring and
                interoperability with other Grid projects and EU DataGrid with sharing
                of some resources. Updated Fabric capable of handling the data on each

  CDF and DØ Executive Summary, Thursday, 01 November 2001

Appendix A
                                SAM and the Particle Physics Data Grid

          Lauri Loebel-Carpenter, Lee Lueking, Carmenita Moore, Ruth Pordes, Julie Trumbo, Sinisa Veseli,
                         Igor Terekhov, Matthew Vranicar, Stephen White, Victoria White

     1) Computing Division, Fermi National Accelerator Laboratory, Batavia, IL, USA

    The D0 experiment’s data and job management system software, SAM, is an operational prototype of many
    of the concepts being developed for Grid computing. We explain how the components of SAM map into
    the Data Grid architecture. We discuss the future use of Grid components to either replace existing
    components of SAM or to extend its functionality and utility, work that is being carried out as part of the Particle
    Physics Data Grid (PPDG) project.

1. Introduction
The D0 data handling system, SAM [1], was built for the “virtual organization” D0, consisting of 500 physicists in 72
institutions in 18 countries. Its purpose is to provide a worldwide system of shareable computing and storage resources
that can together be brought to bear to solve the common problem of extracting physics results from about a Petabyte of
measured and simulated data (c. 2003). The goal of the system is to provide a large degree of transparency to the user
who makes requests for datasets (collections) of relevant data and submits jobs that execute Monte Carlo simulation,
reconstruction or analysis programs on available computing resources. Transparency in storage and delivery of data is
currently in a more advanced state than transparency in the submission of jobs. Programs executed in the context of
SAM transform data by consuming data file(s) and producing resultant data file(s) of a different data content or ‘data
tier’. Data files are read-only and are never modified or versioned.

These data handling and job control services, typical of a data grid, are provided by a collection of servers using
CORBA communication. The software components are D0-specific prototypical implementations of some of those
identified in Data Grid Architecture documents [3], [4], [5]. Some of these components will be replaced by ‘standard’
data grid components emanating from the various grid research projects, including PPDG [2], [5], [6]. Others will be
modified to conform to Grid protocols and APIs. Additional functional components and services will be integrated into
the SAM system. This work forms the D0/SAM component of the Particle Physics Data Grid project.

2. The Fabric
The fabric on which SAM currently operates consists of compute systems and storage systems at Fermilab,
France/Lyon-IN2P3, UK/Lancaster, Netherlands/Nikhef, Czech Republic/Prague, US/UTA, US/Columbia, US/MSU,
UK/Imperial College. Many other sites are expected to provide additional compute and storage resources when the
experiment moves from commissioning to physics data taking. Storage systems consist of disk storage elements at all
locations and robotically controlled tape libraries at Fermilab, Lyon and Nikhef - with Lancaster about to be added to
this list. All storage elements support the basic functions of storing or retrieving a file. Some support parallel transfer
protocols, currently via bbftp [8]. The underlying storage management systems for tape storage elements are different at
Fermilab, Lyon and Nikhef. Currently only the Fermilab tape storage management system, Enstore [7], provides the
ability to assign priorities and file placement instructions to file requests and provides reports about placement of data
on tape, queue wait time, transfer time and other information that can be used for resource management.

Catalogs of metadata, locations (replica catalog), data transformations (processing history), SAM configuration and
policy, as well as databases of detector calibration, detector configuration, and other parameters, are implemented as
Oracle relational databases. Other information required to characterize and reproduce data products refers to versioned
code and parameter files residing in a CVS code repository.

Whereas BaBar and the LHC experiments subdivide their fabric into centers at specific geographic locations (e.g. Tier 0,
Tier 1, Tier A), the D0 fabric is organized into physical groupings of compute, storage and network resources that are
operated and managed together for a common purpose. These are termed “Stations”. At Fermilab there are currently six
production SAM Stations named: datalogger, d0farm, central-analysis, linux-build-1, linux-analysis-cluster-1 and
clued0 (a cluster of Linux desktop machines). Tape storage elements are designated as directly accessible from certain
stations. Their access from other stations must be defined via a route through one or more SAM stations which provide
caching and forwarding services. There is an accessibility matrix for each compute element and each storage element
managed or accessed by a Station.[11]
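
The routing rule described above can be illustrated with a small sketch. The station names are taken from the list above, but the accessibility table, the tape element name `fermilab-tape` and the function `find_route` are hypothetical and are not SAM's actual data model. A breadth-first search stands in for whatever route selection SAM really performs.

```python
from collections import deque

# Hypothetical accessibility data: which stations reach a storage element
# directly, and which stations forward (cache and relay) files to which others.
DIRECT_ACCESS = {"fermilab-tape": {"central-analysis", "d0farm"}}
FORWARDS_TO = {
    "central-analysis": {"clued0", "linux-analysis-cluster-1"},
    "d0farm": {"central-analysis"},
    "clued0": set(),
    "linux-analysis-cluster-1": set(),
}

def find_route(storage_element, destination):
    """Return the shortest chain of stations from storage to destination, or None."""
    queue = deque([s] for s in sorted(DIRECT_ACCESS.get(storage_element, ())))
    seen = set()
    while queue:
        path = queue.popleft()
        station = path[-1]
        if station == destination:
            return path
        if station in seen:
            continue
        seen.add(station)
        for nxt in sorted(FORWARDS_TO.get(station, ())):
            queue.append(path + [nxt])
    return None  # no chain of forwarding stations reaches the destination
```

In this sketch a request from clued0 for a file on the tape element is served through central-analysis, which is the kind of caching and forwarding route the text describes.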

Disk storage elements may be either Station-managed or externally managed. Station-managed elements together create logical
disk caches that are administered for a particular physics group. While all files residing in the Station’s disk storage
elements are in fact accessible by any group permitted to use the Station resources, it is expected that the files for one
physics group will be largely disjoint from those of another and, therefore, caching and “pinning” of files proceed
according to the separate quotas and policies of each physics group.

Although Stations are intended to own resource partitions, they do not necessarily have exclusive control over them. For
example, on the 176-node SGI system at Fermilab, at times, we run both a ‘central-analysis’ station and a
‘central-archive’ station, sharing compute elements, but with distinct and disjoint disk storage elements. This also allows us to
run “development” or “test” grids that share compute elements and tape storage systems but have separate files,
catalogs and station-managed disk.

3. SAM Architecture and Components
In the diagram below the software components of SAM are illustrated and categorized in dependent layers, according to
recent documents that attempt to characterize the architecture of a Grid or Data Grid system [3], [4], [5]. We identify (in
bold) various components of the SAM system that will either be extended, completely replaced, or added to the
system as a result of work carried out by PPDG and other grid projects and the integration of that work into the SAM
system, as part of the PPDG work plan.

[Figure: SAM software components by layer, top to bottom. A name in “quotes” is the SAM-given software component name.
In the original figure, marking indicates components that will be replaced, enhanced, or added using PPDG and Grid tools.]

  Client Applications:          Web; Command line; D0 Framework C++ codes; Python codes, Java codes
  Collective Services:          Request Formulator and Planner “Dataset Editor”; Request Manager “Project Master”;
                                Cache Manager “Station Master”; Job Manager “Station Master”;
                                Storage Manager “File Storage Server”; SAM Resource Management “Optimiser”;
                                Batch Systems (LSF, FBS, PBS); Job Services; Data Mover “Stager”;
                                Significant Event Logger; Naming Service; Catalog Manager; Database Manager
  Connectivity and Resource:    CORBA; UDP; Catalog protocols; File transfer protocols (ftp, bbftp, rcp); GridFTP;
                                Mass Storage systems protocols (e.g. encp, hpss)
  Authentication and Security:  SAM-specific user, group, node, station registration; GSI; bbftp ‘cookie’
  Fabric:                       Tape Storage Elements; Disk Storage Elements; Compute Elements; LANs and WANs;
                                Code Repository; Resource and Services Catalog; Replica Catalog; Meta-data Catalog

4. Connectivity and Resource Services
The current system uses rather primitive user identification and authentication mechanisms. Unix user names, SAM
physics groups, nodes, domains and stations are registered and maintained centrally in the SAM configuration catalog.
Valid combinations of these must be provided to obtain services. The bbftp [8] file transfer protocol implements its own
form of authentication. Station servers at one station provide service on behalf of their local users and are ‘trusted’ by
other Station servers or Database Servers.
Use of the Globus core Security Infrastructure services is a planned PPDG enhancement of the system [12].

Service registration and discovery is implemented using a CORBA naming service, with namespace by station name.
APIs to services in SAM are all defined using CORBA Interface Definition Language and have multiple language
bindings (C++, Python, Java) and, in many cases, a shell interface. This includes services that provide access to
information in various catalogs. UDP is used for significant event logging services and for certain Request Manager
control messages. Rcp, Kerberized rcp, bbftp and encp [7] provide file transfer. Work is currently underway
to provide a more scalable and robust CORBA naming service, possibly by replicating the naming service in multiple
locations and adding persistence.
Use of GridFTP and other standard protocols to access storage elements is a planned PPDG modification to the
system. Integration with grid monitoring tools and approaches is a PPDG area of research. Registration of
resources and services using a standardized Grid registration or enquiry protocol, in addition to the current
mechanism, is a PPDG enhancement to the system.

5. Software Components for Collective Services
Database Servers provide access to the Replica Catalog, Metadata Catalog, SAM Resource and configuration catalog
and Transformation catalog. Currently multiple database servers provide services. They may run on any machine with
SQLnet. Each is capable of providing all catalog services, but the workload is partitioned administratively. All catalogs
currently have their persistent format as tables in a central Oracle database, a detail that is hidden from their clients.
Replication of some catalogs in two or more locations worldwide is a planned enhancement to the system.
Database servers will need to be implemented that adapt SAM-specific APIs and catalog protocols onto Grid
catalog APIs using PPDG-supported Grid protocols so that information may be published and retrieved in the
wider Physics Data Grid that spans several virtual organizations.

A central Logging server receives significant events. This will be refined to receive only summary level information,
with more detailed monitoring and significant event information held at each site. Work in the context of PPDG will
examine how to use a Grid Monitoring Architecture and tools.

Each disk storage element has a Data Mover (“Stager”) associated with it that provides services to transfer or erase a
file by using the appropriate protocol for the source and destination storage elements involved.
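
A minimal sketch of how a Data Mover might select a protocol from the source and destination storage elements follows. The protocol names appear in this paper, but the storage element kinds, the pairing rules in `PROTOCOL_FOR` and the function `choose_protocol` are assumptions made for illustration only.

```python
# Hypothetical table: protocol chosen per (source kind, destination kind).
# The protocol names come from the text; the pairing rules are assumptions.
PROTOCOL_FOR = {
    ("disk", "disk"): "rcp",
    ("disk", "remote-disk"): "bbftp",
    ("remote-disk", "disk"): "bbftp",
    ("disk", "enstore-tape"): "encp",
    ("enstore-tape", "disk"): "encp",
}

def choose_protocol(source_kind, destination_kind):
    """Pick the transfer protocol for one source/destination pair."""
    try:
        return PROTOCOL_FOR[(source_kind, destination_kind)]
    except KeyError:
        raise ValueError(
            f"no protocol configured for {source_kind} -> {destination_kind}"
        )
```

Keeping the choice in a table rather than in code means a new protocol such as GridFTP could be introduced by changing entries, without touching the callers.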

Resource manager services are provided by an “Optimization” service. File transfer actions are prioritized and
authorized prior to being executed. The current primitive functionality of re-ordering and grouping file requests,
primarily to optimize access to tapes, will need to be greatly extended, redesigned and re-implemented to deal better
with co-location of data with computing elements and with fair-share and policy-driven use of all computing,
storage and network resources. This is a major component of the SAM/PPDG work, to be carried out in
collaboration with the Condor team [15].
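
The re-ordering and grouping of file requests to optimize tape access can be sketched as below. The request shape (file name, tape label, position on tape) and the ordering policy are assumptions for illustration. The paper does not describe the actual algorithm used by the SAM optimizer.

```python
from collections import defaultdict

def order_requests(requests):
    """Group file requests by tape volume so that each tape is mounted once.

    `requests` is a list of (file_name, tape_label, position_on_tape) tuples,
    a hypothetical shape chosen for this sketch.
    """
    by_tape = defaultdict(list)
    for name, tape, position in requests:
        by_tape[tape].append((position, name))
    ordered = []
    # Assumed policy: serve the tape with the most pending requests first,
    # then read each tape front to back.
    for tape in sorted(by_tape, key=lambda t: (-len(by_tape[t]), t)):
        for position, name in sorted(by_tape[tape]):
            ordered.append(name)
    return ordered
```

A policy-driven replacement would add further sort keys, for example group fair-share or the location of the requesting compute element, in place of the simple request count used here.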

A Request Formulator and Planner “Dataset Definition Editor” provides the user with a natural and mathematical
language formulation of queries based on metadata. Immediate feedback on the size of the resultant file set is given, but
feedback on the likely time to gain access to that dataset is not yet available.
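
As an illustration of defining a dataset by a metadata query with immediate size feedback, consider the following sketch. The catalog records, the field names and the helper functions `where`, `both`, `either` and `define_dataset` are all hypothetical. They are not the syntax of the Dataset Definition Editor.

```python
# Hypothetical metadata records; field names are illustrative only.
CATALOG = [
    {"name": "f1", "data_tier": "raw", "run": 101, "size_mb": 800},
    {"name": "f2", "data_tier": "reconstructed", "run": 101, "size_mb": 300},
    {"name": "f3", "data_tier": "reconstructed", "run": 102, "size_mb": 350},
]

def where(field, value):
    """Predicate that is true when a record's field equals the value."""
    return lambda record: record[field] == value

def both(p, q):
    """Logical AND of two predicates."""
    return lambda record: p(record) and q(record)

def either(p, q):
    """Logical OR of two predicates."""
    return lambda record: p(record) or q(record)

def define_dataset(predicate, catalog=CATALOG):
    """Return matching file names and their total size (the immediate feedback)."""
    files = [r for r in catalog if predicate(r)]
    return [r["name"] for r in files], sum(r["size_mb"] for r in files)
```

For example, `define_dataset(both(where("data_tier", "reconstructed"), where("run", 101)))` selects the single reconstructed file of run 101 and reports its size.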

Each Station Cache Manager provides caching services, including the ability to “pin” files, and implements the
designated caching and “pinning” policies for each physics group.
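
The interaction of a per-group quota with pinning can be sketched as follows. The class `GroupCache` and its least-recently-used eviction rule are assumptions for illustration. The paper does not specify which replacement policies SAM implements.

```python
from collections import OrderedDict

class GroupCache:
    """Least-recently-used disk cache for one physics group, with pinning."""

    def __init__(self, quota_mb):
        self.quota_mb = quota_mb
        self.files = OrderedDict()   # name -> size_mb, oldest first
        self.pinned = set()

    def used_mb(self):
        return sum(self.files.values())

    def pin(self, name):
        self.pinned.add(name)

    def unpin(self, name):
        self.pinned.discard(name)

    def add(self, name, size_mb):
        """Cache a file, evicting unpinned files oldest-first; False if no room."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as recently used
            return True
        free = self.quota_mb - self.used_mb()
        victims = []
        for candidate in [n for n in self.files if n not in self.pinned]:
            if free >= size_mb:
                break
            victims.append(candidate)
            free += self.files[candidate]
        if free < size_mb:
            return False  # pinned files and the quota leave no room
        for victim in victims:
            del self.files[victim]
        self.files[name] = size_mb
        return True
```

One such object per physics group would keep the quotas and pinning decisions of the groups separate, as the text requires.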

A Station Job Manager provides services to execute a user application, script, or series of jobs, potentially as parallel
processes, either interactively or by use of a local batch system. Currently supported batch systems are LSF[13] and
FBS[14], with adapters for PBS and Condor[15] in preparation and test. The station Cache Manager and Job Manager
are implemented as a single “Station Master” server.
Job submission and synchronization between job execution and data delivery are currently part of
SAM. Jobs are put on hold in batch system queues until data files are available to the job. At
present jobs submitted at one station may only be run using the batch system(s) available at that
station.
The specification of user jobs, including their characteristics and input datasets, is a major component of the
PPDG work. The intention is to provide Grid job services components that replace the SAM job services
components. This will support job submission (including composite and parallel jobs) to suitable SAM Station(s)
and eventually any available Grid computing resource.
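
The synchronization between data delivery and job execution can be sketched as a hold queue. The class `HeldJobQueue` is hypothetical, and it makes the simplifying assumption that a job is released only when all of its input files have arrived. The paper does not say whether SAM waits for all files or only for the first.

```python
class HeldJobQueue:
    """Holds jobs until every input file of the job has been delivered."""

    def __init__(self):
        self.waiting = {}       # job id -> set of files still missing
        self.released = []      # job ids in the order they became runnable

    def submit(self, job_id, input_files, available=()):
        """Queue a job; release it at once if nothing is missing."""
        missing = set(input_files) - set(available)
        if missing:
            self.waiting[job_id] = missing
        else:
            self.released.append(job_id)

    def file_delivered(self, file_name):
        """Record a delivered file and release jobs that needed only that file."""
        for job_id in list(self.waiting):
            self.waiting[job_id].discard(file_name)
            if not self.waiting[job_id]:
                del self.waiting[job_id]
                self.released.append(job_id)
```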

Request Managers arrange for pre-staging of file replicas and handle all bookkeeping of file consumption, including
errors, retries and restarts. This “Project Master” server, under the control of the “Station Master”, is executed for each
dataset that is to be delivered and consumed by a user job. Each Request Manager, working together with its Station
Cache Manager and Data Movers, implements a robust file replication service for those files of a requested dataset
that are not in Station storage elements accessible to the particular compute elements involved. This file replication
service handles inter-Station transfers of data and intra-Station transfers between storage elements. Each station
currently has a ‘preferred location’ for its files, with non-preferred locations, if used, chosen randomly.
PPDG Resource Management components are needed, along with Grid job services components, to provide
optimized and cost effective replication and to assure co-location of jobs and data.
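
The current replica selection rule (preferred location first, otherwise a random choice) is simple enough to state in a few lines. The function name `choose_location` and the location labels used with it are hypothetical.

```python
import random

def choose_location(replica_locations, preferred, rng=random):
    """Return the preferred location if it holds a replica, else a random one."""
    if not replica_locations:
        raise LookupError("file has no known replica")
    if preferred in replica_locations:
        return preferred
    # Sort first so that the choice is reproducible for a seeded generator.
    return rng.choice(sorted(replica_locations))
```

A cost-aware resource manager would replace the final random choice with a ranking of the candidate locations, which is the kind of optimization the PPDG work is intended to supply.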

Storage Manager services are provided by a Station’s “File Storage Server” that responds directly to user commands to
store files in tape and disk storage elements. It also services requests from other “File Storage Servers”. All file storage
requests must be accompanied by metadata that identifies not only the nature of the file and its data contents, but also
the parent files and the processing history by which this file was obtained. For example a typical client storage
command would be “sam store –” [--dest=]”. Metadata parameters for various data products are
currently implemented using Python classes and are extensible. Final storage destinations may be automatically
determined based on the file metadata, or may be explicitly specified. Access control is by user, group and station and is
not yet as stringent as necessary. In an attempt to encourage use of the system we have probably left it too open for
users to store files in the system, and on tape.
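
Since metadata parameters are implemented as Python classes, the requirement that each stored file carry its parentage can be sketched as below. The class `FileMetadata`, its fields and the function `validate_for_storage` are hypothetical and do not reproduce the actual SAM metadata classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileMetadata:
    """Hypothetical metadata record that must accompany a store request."""
    file_name: str
    data_tier: str                 # e.g. "raw" or "reconstructed"
    application: str               # program and version that produced the file
    parents: tuple = ()            # names of the files this one was derived from

def validate_for_storage(meta, catalog):
    """Reject a store request unless the file's lineage is already catalogued."""
    if meta.file_name in catalog:
        # Data files are read-only and never versioned, so names are unique.
        raise ValueError("file already stored: " + meta.file_name)
    unknown = [p for p in meta.parents if p not in catalog]
    if unknown:
        raise ValueError(f"unknown parent files: {unknown}")
    catalog[meta.file_name] = meta
    return meta
```

Here `catalog` is an ordinary dictionary standing in for the metadata catalog. Recording the parents at store time is what makes the processing history of any file recoverable later.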

6. Client Applications
SAM APIs have multiple language bindings and so Client Applications may be user commands, web applications, C++
reconstruction or analysis code or Python or Java code. In keeping with the goal of transparent data access, the D0 C++
application framework[16] hides the data handling system from the user. It provides access to files in a dataset (in
random order) via the same I/O mechanisms used for access to single named files[17]. A designator SAMInput: or
SAMOutput: achieves this and metadata for output files is created. Richer metadata is required in the future for all data
products. The metadata catalog itself now supports an expandable and rich set of physics-related information, but the
production jobs and Monte Carlo jobs currently provide only limited metadata, certainly less than required for full support
of virtual data. Work is underway to make a ROOT client application with transparent access to SAM files. With
expanded metadata will come capabilities for a more physics-friendly dataset definition language and tools. Users
submit processing or analysis jobs via commands such as
“sam submit --script=mystuff --dataset=mydataset --group=mygroup --jobpars=…..”.
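
The idea of reaching a whole dataset through the same input mechanism as a single named file can be sketched as follows. The D0 framework is C++ and its real interface is not shown in this paper, so the Python function `resolve_input`, the assumption that a dataset name directly follows the SAMInput: designator, and the `dataset_files` mapping are all hypothetical.

```python
SAM_INPUT_PREFIX = "SAMInput:"

def resolve_input(spec, dataset_files):
    """Expand an input specification into the list of files to read.

    `dataset_files` is a hypothetical mapping from dataset name to file names,
    standing in for a query to the data handling system.
    """
    if spec.startswith(SAM_INPUT_PREFIX):
        dataset = spec[len(SAM_INPUT_PREFIX):]
        return list(dataset_files[dataset])
    return [spec]  # an ordinary single named file
```

The calling code sees a list of files in both cases, which is the transparency the framework aims for.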

7. Robustness and Scalability – Experiences so far
Almost all of the Monte Carlo data produced so far has been created on Farm nodes outside of Fermilab and the
resulting reconstructed or root-tuple files (more than 20TB) have been stored using the File Storage Server mechanism
described above - from stations at Nikhef, Lancaster, Prague and Lyon. The system is in constant use for Online system
data logging (OSF1), farm production processing at Fermilab on a 90 node Linux Farm system, and for multiple
purposes of monitoring, analysis and program debugging on the central analysis machine (SGI-IRIX) and several
clusters of Linux workstations. The remote sites are ready and able to use their Stations to retrieve either Raw or
processed physics data from Fermilab. More details about the operation of the whole system are given in other papers
presented at this conference [9], [18]. Typically more than 30 datasets are simultaneously being served up on
“central-analysis”, with file replication only as needed, within the 2.4 TB of disk cache.

There are currently single points of failure in the system and these must be eliminated. Recently the implementation of
the CORBA naming service that we chose has proved unreliable and troublesome. The Online System’s data logging
station uses an optional mode of operation where physical movement of files may proceed even if the catalogs are not
available; cataloging requests are queued. The central Oracle database server, although clearly a single point of failure
and in the end non-scalable, has proved to be robust and highly available. Plans to replicate certain catalogs must
nevertheless continue. Users have been given considerable freedom to execute ‘requests’ that may result in complex
and lengthy database queries. We need to do more work to prevent wild queries and also to provide additional database
server connections to the database to handle peak demands, which can happen when, for example, all students in a
tutorial define a gigantic dataset at the same moment. The file replication is not completely watertight against all forms
of failure, nor against the bad behavior of certain file transfer protocols that may return success even in the face of failure.
Encouragingly, we have observed robust behavior following network failures while storing files from Nikhef to
Fermilab, where the process halted and then seamlessly continued after a configurable retry period. Tuning the various
retry and other parameters is currently more of an art than a science and would be greatly aided by a more
sophisticated Grid Monitoring environment, as well as a better user interface for administrators to adjust the various
tuning knobs.
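
Two of the points above, a configurable retry period and protocols that report success after a failed copy, suggest a retry loop that trusts only an independent check. The sketch below is an illustration under that assumption. The function `transfer_with_retry` and its parameters are hypothetical and are not taken from SAM.

```python
import time

def transfer_with_retry(transfer, verify, retries=3, retry_period_s=60.0,
                        sleep=time.sleep):
    """Run `transfer` until `verify` confirms the result, or retries run out.

    `transfer` and `verify` are caller-supplied callables (hypothetical names).
    The return status of `transfer` is ignored on purpose, because some
    protocols report success even when the copy failed; only `verify`
    (for example a size or checksum comparison) is trusted.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, retries + 1):
        try:
            transfer()
        except OSError:
            pass  # e.g. a network failure; fall through to the retry wait
        else:
            if verify():
                return attempt
        if attempt < retries:
            sleep(retry_period_s)
    raise RuntimeError(f"transfer not verified after {retries} attempts")
```

The `sleep` argument is injectable so that the retry period can be tuned or replaced in tests without waiting in real time.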

One of the most difficult tasks has proved to be the evolution of the system while maintaining backwards compatibility
and without requiring that all Stations upgrade to the newest version of software simultaneously. That is a challenge
that we have met only some of the time.

8. Conclusions
The system is operational and working quite well and can be viewed, including browsing most of its catalogs,
installation guides, users guides and administrators guides, at the SAM home page [1]. Integration of standard grid
components, via PPDG, will greatly enhance the functionality of the system, while providing a real-life, in-use,
vertically integrated Grid system as a testbed for such components.

1.    The SAM home page.
2.    The Particle Physics Data Grid (PPDG) home page.
3.    I. Foster, C. Kesselman, S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations. To be
      published in Intl. J. Supercomputer Applications, 2001.
4.    Data Grid Reference Architecture.
5.    DataGrid Architecture.
6.    The GriPhyN project home page.
7.    The Enstore home page.
8.    The BaBar parallel file transfer protocol bbftp home page.
9.    L. Carpenter, SAM Overview and Operation at the D0 Experiment. Submitted to CHEP2001, September 2001.
10.   I. Terekhov et al., SAM for D0 – a fully distributed data access system. Talk presented at VII International Workshop
      on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2000), October 2000, Proceedings.
11.   I. Terekhov et al., Distributed Data Access and Resource Management in the D0 SAM System.
      The Tenth IEEE International Symposium on High Performance Distributed Computing,
      San Francisco, California, August 7-9, 2001, in Proceedings.
12.   The Globus Project home page.
13.   LSF workload management from Platform Computing.
14.   I. Mandrichenko et al., Farms Batch System and Fermi Inter-Process Communication toolkit, CHEP 2000 proceedings.
15.   The Condor project home page.
16.   J. Kowalkowski, The D0 Framework. Talk presented at CHEP2000, February 2000, Padova, Italy.
17.   The D0 object model.
18.   V. White, The D0 Data Handling System. Submitted to CHEP2001, September 2001.
