					                         CERN-LHCC-xyz-xyz
                               xyz June 2005




LHC Computing Grid

 Technical Design Report

       DRAFT 0.7
         12-April-2005




AUTHORS


I. Bird1, Kors Bos2, N. Brook3, D. Duellmann1, C. Eck1, I. Fisk4, D. Foster1, B. Gibbard5, M.
Girone1, C. Grandi6, F. Grey1, A. Heiss7, F. Hemmer1, S. Jarp1, R. Jones8, D. Kelsey9, J.
Knobloch1, M. Lamanna1, H. Marten7, P. Mato Vila1, F. Ould-Saada10, B. Panzer-Steindel1, L.
Perini11, L. Robertson1, Y. Schutz12, U. Schwickerath7, J. Shiers1, T. Wenaus5




1 (CERN, Geneva, Switzerland)
2 (NIKHEF, Amsterdam, Netherlands)
3 (University of Bristol, United Kingdom)
4 (Fermi National Accelerator Laboratory, Batavia, Illinois, United States of America)
5 (Brookhaven National Laboratory (BNL), Upton, New York, United States of America)
6 (INFN Bologna, Italy)
7 (Forschungszentrum Karlsruhe, Institut für Wissenschaftliches Rechnen, Karlsruhe, Germany)
8 (School of Physics and Chemistry, Lancaster University, United Kingdom)
9 (Rutherford-Appleton Laboratory, Chilton, United Kingdom)
10 (Department of Physics, University of Oslo, Norway)
11 (INFN Milano, Italy)
12 (SUBATECH Laboratoire de Physique Subatomique et des Technologies Associées, Nantes,
France)
EXECUTIVE SUMMARY
Jürgen Knobloch, François Grey


TABLE OF CONTENTS
1    INTRODUCTION ............................................................................................................ 1
2    EXPERIMENTS’ REQUIREMENTS .............................................................................. 1
     2.1 Logical Dataflow and Workflow............................................................................. 1
           2.1.1     ALICE ..................................................................................................... 1
           2.1.2     ATLAS .................................................................................................... 1
           2.1.3     CMS ......................................................................................................... 5
           2.1.4     LHCb ....................................................................................................... 6
     2.2 Resource Expectations ............................................................................................ 8
           2.2.1     ALICE ..................................................................................................... 8
           2.2.2     ATLAS .................................................................................................... 8
           2.2.3     CMS ......................................................................................................... 11
           2.2.4     LHCb ....................................................................................................... 12
     2.3 Baseline Requirements ............................................................................................ 13
           2.3.1     ALICE ..................................................................................................... 13
           2.3.2     ATLAS .................................................................................................... 13
           2.3.3     CMS ......................................................................................................... 13
           2.3.4     LHCb ....................................................................................................... 16
3    LHC COMPUTING GRID ARCHITECTURE ................................................................ 16
     3.1 Grid architecture in general – dataflow – aspect of several grids,
     functionality and services .................................................................................................. 16
     3.2 Network ................................................................................................................... 17
           3.2.1     Introduction ............................................................................................. 17
           3.2.2     Tiers ......................................................................................................... 18
           3.2.3     LHC network traffic ................................................................................ 18
           3.2.4     Provisioning ............................................................................................. 19
           3.2.5     Physical connectivity (layer1) ................................................................. 19
           3.2.6     Logical connectivity (layer 2 and 3) ........................................................ 19
           3.2.7     IP addressing............................................................................................ 21
           3.2.8     BGP Routing (Routed-T1)....................................................................... 22
           3.2.9     Lightpath (Lightpath-T1) ......................................................................... 23
           3.2.10 T1 to T1 transit ........................................................................................ 23
           3.2.11 Security .................................................................................................... 24
           3.2.12 Operations................................................................................................ 24
           3.2.13 Glossary ................................................................................................... 24
           3.2.14 Bandwidth requirements .......................................................................... 24
     3.3 Tier-0 Architecture .................................................................................................. 25
     3.4 Tier-1 Architecture(s), distributed Tier-1................................................................ 32
     3.5 Tier-2 – simulation, function of end-user analysis .................................................. 33
           3.5.1     Tier-1 Services for Tier-2 Regional Centres ........................................... 33
     3.6 Security ................................................................................................................... 42
4    GRID MANAGEMENT ................................................................................................... 42
     4.1 Network ................................................................................................................... 42
     4.2 Operations & Centre SLAs (Merged) ..................................................................... 43
           4.2.1     Security Operations – Draft (Dave) ......................................................... 43
     4.3 User Registration and VO management - Security (abstract, Dave) ........................ 43
5    SOFTWARE ASPECTS ................................................................................................... 43
     5.1 Operating systems ................................................................................................... 44
     5.2 Middleware, interoperability & standards ............................................................... 44




    5.3   NorduGrid ............................................................................................................... 45
    5.4   Grid Standards and Interoperability ........................................................................ 47
          5.4.1       Overview ................................................................................................. 47
          5.4.2       ARC and interoperability......................................................................... 48
    5.5 Common applications and tools .............................................................................. 49
          5.5.1       Persistency Framework (POOL and Conditions Database) ..................... 49
          5.5.2       SEAL and PI ............................................................................................ 49
          5.5.3       Software Process and Infrastructure (SPI) ............................................... 50
          5.5.4       ROOT ...................................................................................................... 50
    5.6 Analysis support ...................................................................................................... 50
    References ......................................................................................................................... 55
    5.7 Data bases – distributed deployment ....................................................................... 56
           5.7.1       Database Services at CERN T0 ................................................ 56
           5.7.2       Distributed Services at Tier-1 and higher .................................. 57
    5.8 Lifecycle support – management of deployment and versioning............................ 58
6   TECHNOLOGY ............................................................................................................... 59
    6.1 Status and expected evolution ................................................................................. 59
          6.1.1       Processors ................................................................................................ 60
          6.1.2       Infiniband ................................................................................................ 62
    6.2 Choice of initial solutions ....................................................................................... 63
          6.2.1       Software : Batch Systems ....................................................................... 63
          6.2.2       Software : Mass Storage System ............................................................ 64
          6.2.3       Software : Management System ............................................................. 64
          6.2.4       Software : File System ............................................................................ 65
          6.2.5       Software : Operating System .................................................................. 65
          6.2.6       Hardware : CPU Server ......................................................................... 66
          6.2.7       Hardware : Disk Storage........................................................................ 67
          6.2.8       Hardware : Tape Storage ....................................................................... 68
          6.2.9       Hardware : Network ............................................................................. 68
    6.3 Hardware lifecycle .................................................................................................. 69
7   PROTOTYPES AND EVALUATIONS - ........................................................................ 69
    7.1 Data challenges ....................................................................................................... 70
          7.1.1       Data Challenges (Farid)..................................................................... 70
          7.1.2       ATLAS Data Challenges ......................................................................... 70
          7.1.3       CMS ......................................................................................................... 72
          7.1.4       LHCb & LCG .......................................................................................... 76
    7.2 Service challenges ................................................................................................... 81
          7.2.1       Summary of Tier0/1/2 Roles ................................................................... 82
          7.2.2       Overall Workplan .................................................................................... 82
          7.2.3       CERN / Tier0 Workplan .......................................................................... 83
          7.2.4       Tier1 Workplan........................................................................................ 83
          7.2.5       Tier2 Workplan........................................................................................ 84
          7.2.6       Network Workplan .................................................................................. 86
          7.2.7       Experiment Workplan.............................................................................. 86
          7.2.8       Selection of Software Components ......................................................... 87
          7.2.9       Service Workplan – Coordination with Data Challenges ........................ 87
          7.2.10 Current Production Setup ........................................................................ 87
          7.2.11 Security Service Challenges (Draft, Dave) .............................................. 88
    7.3 Results of Service Challenge 1 & 2 ........................................................................ 88





            7.3.1    Goals of Service Challenge 3 .................................................................. 89
            7.3.2    Goals of Service Challenge 4 .................................................................. 90
            7.3.3    Timeline and Deliverables ....................................................................... 90
            7.3.4    Summary.................................................................................................. 91
            7.3.5    References ............................................................................................... 91
     EGEE DJRA1.1 EGEE Middleware Architecture.
     https://edms.cern.ch/document/476451/ ............................................................................ 91
8    START-UP SCENARIO .................................................................................................. 91
     8.1 Pilot run ................................................................................................................... 93
     8.2 Initial running .......................................................................................................... 93
9    RESOURCES ................................................................................................................... 93
     9.1 Minimal Computing Resource and Service Levels to qualify for
     membership of the LHC Computing Grid Collaboration .................................................. 96
     9.2 Tier-2 Services ........................................................................................................ 99
     9.3 Grid Operations Services ........................................................................................ 100
     9.4 Costing .................................................................................................................... 101
10   INTERACTIONS/DEPENDENCIES/BOUNDARIES .................................................... 104
     10.1 Network Services .................................................................................................... 104
     10.2 Grid Software .......................................................................................................... 105
     10.3 Globus, Condor and the Virtual Data Toolkit ......................................................... 105
     10.4 The gLite Toolkit of the EGEE Project ................................................................... 105
     10.5 The Nordugrid Project............................................................................................. 106
     10.6 Operation of Grid Infrastructure ............................................................................. 106
     10.7 EGEE/LCG ............................................................................................................. 106
     10.8 Open Science Grid .................................................................................................. 107
     10.9 The Nordic Data Grid Facility ................................................................................ 107
     10.10 (new section 10.11 - Security Policy and Procedures) draft, Dave ......................... 107
     10.11 DAQ systems .......................................................................................................... 108
11   MILESTONES .................................................................................................................. 108
12   RESPONSIBILITIES – ORGANISATION, PARTICIPATING
INSTITUTIONS .......................................................................................................................... 108
13   GLOSSARY – ACRONYMS – DEFINITIONS .............................................................. 110







1     INTRODUCTION
Jürgen Knobloch, Les Robertson, François Grey



2     EXPERIMENTS’ REQUIREMENTS
Nick Brook
Abstract
This section summarises the salient requirements of the experiments that drive the LCG
project's resource needs and software/middleware deliverables. These needs will be compared
and contrasted.
An outline is given of the experiments' computing models, defining the roles envisaged for
CERN, the Tier-1s and the Tier-2s, as well as the data-processing steps involved in the data and
workflows. This is then mapped onto the resource requirements for CPU, fast storage and
long-term storage at each Tier level. In addition, peak rates to MSS and network estimates are
given. A brief description is given of the required functionality of the middleware needed to
implement the experiments' computing models.


2.1     Logical Dataflow and Workflow
2.1.1 ALICE
2.1.2 ATLAS
2.1.2.1 Principal Real Data Sets
The source of the input real data for the computing model is primarily the Event Filter (EF).
Data passing directly from the online to offsite facilities for monitoring and calibration
purposes will be discussed only briefly, as they have little impact on the total resources
required, and also require further clarification. While the possibility of other locations for part
of the EF is to be retained, the baseline assumption is that the EF resides at the ATLAS pit.
Other arrangements have little impact on the computing model except on the network
requirements from the ATLAS pit area. The input data to the EF will require approximately
10 × 10 Gbps links with very high reliability (and a large disk buffer in case of failures). The
output data requires an average 320 MB/s (3 Gbps) link connecting it to the first-pass
processing facility. Remote event filtering would require upwards of 10 Gbps to the remote
site, the precise bandwidth depending on the fraction of the Event Filter load migrated away
from the ATLAS pit.
The baseline model assumes a single primary stream containing all physics events flowing
from the Event Filter to Tier-0. Several other auxiliary streams are also planned, the most
important of which is a calibration hot-line containing calibration trigger events (which would
most likely include certain physics event classes). This stream is required to produce
calibrations of sufficient quality to allow a useful first-pass processing of the main stream
with minimum latency. A working target (which remains to be shown to be achievable) is to
process 50% of the data within 8 hours and 90% within 24 hours.
Two other auxiliary streams are planned. The first is an express-line of physics triggers
containing about 5% of the full data rate. These will allow both the tuning of physics and
detector algorithms and also a rapid alert on some high-profile physics triggers. It is to be
stressed that any physics based on this stream must be validated with the ‘standard’ versions
of the events in the primary physics stream. However, such a hot-line should lead to improved
reconstruction. It is intended to make much of the early raw-data access in the model point to
this and the calibration streams. The fractional rate of the express stream will vary with time,
and will be discussed in the context of the commissioning. The last minor stream contains





pathological events, for instance those that fail in the event filter. These may pass the standard
Tier-0 processing, but if not they will attract the attention of the development team.
The following assumptions are made about the data flow between the EF and the Tier-0 input
buffer:
     •   Event Filter processors send their outputs to one of 30-50 SubFarm Output managers
         (SFOs).
     •   Events are written to files in bytestream format by SFOs.
     •   SFOs are equivalent to one another, and do not sort physics events by trigger or type.
     •   In normal operation, SFOs will fill a file to a specified size or event count threshold,
         then close the file and open a new one.
     •   Files are the unit of transfer from the Event Filter to the Tier-0 centre.
     •   Files are eligible for transfer as soon as they are filled and closed.
     •   The data acquisition system assigns run numbers and event numbers, and provides
         event timestamps.
     •   At run boundaries, files written by SFOs are closed and transferred to the Tier-0
         centre: no RAW data file from the Event Filter contains events from more than one
         run.
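
The file-handling rules above can be illustrated with a short sketch. The following Python fragment is purely illustrative: the class name and the size/event thresholds are hypothetical and do not correspond to the actual ATLAS DAQ software.

```python
# Illustrative sketch of the SFO file-handling rules listed above.
# All names and thresholds here are hypothetical, not actual ATLAS DAQ code.

MAX_FILE_BYTES = 2 * 1024**3   # assumed size threshold per file
MAX_FILE_EVENTS = 1000         # assumed event-count threshold per file

class SubFarmOutput:
    """One of the 30-50 SFOs; fills files and releases them for transfer."""

    def __init__(self, sfo_id):
        self.sfo_id = sfo_id
        self.current = self._new_file(run_number=None)
        self.ready_for_transfer = []   # closed files, eligible for Tier-0 transfer

    def _new_file(self, run_number):
        return {"sfo": self.sfo_id, "run": run_number, "events": 0, "bytes": 0}

    def write_event(self, run_number, event_bytes):
        # A file never mixes events from more than one run.
        if self.current["run"] not in (None, run_number):
            self.close_current()
        if self.current["run"] is None:
            self.current["run"] = run_number
        self.current["events"] += 1
        self.current["bytes"] += event_bytes
        # Close the file once either threshold is reached.
        if (self.current["bytes"] >= MAX_FILE_BYTES
                or self.current["events"] >= MAX_FILE_EVENTS):
            self.close_current()

    def close_current(self):
        if self.current["events"]:
            # Files become eligible for transfer as soon as they are closed.
            self.ready_for_transfer.append(self.current)
        self.current = self._new_file(run_number=None)
```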

On arrival at the input disk buffer of the first-pass processing facility (henceforth known as
the Tier-0), the following steps are taken with the raw data file:
     •   the file is copied to Castor tape at CERN;
     •   it is copied to permanent mass storage in one of the Tier-1s;
     •   calibration and alignment procedures are run on the corresponding calibration stream
         events;
     •   the express stream is reconstructed with the best-estimate calibrations available;
     •   once appropriate calibrations are in place, first-pass reconstruction (‘prompt’
         reconstruction) is run on the primary event stream (containing all physics triggers),
         and the derived sets archived into Castor (these are known as the ‘primary’ data sets,
         subsequent reprocessing giving rise to better versions that supersede them);
     a) two instances of the derived ESD are exported to external Tier-1 facilities; each Tier-
         1 site assumes principal responsibility for its fraction of such data, and retains a
         replica of another equal fraction of the ESD for which another Tier-1 site is
         principally responsible. Tier-1 sites make current ESD available on disk.13 ESD
         distribution from CERN occurs at completion of first-pass reconstruction processing
         of each file. As physics applications may need to navigate from ESD to RAW data, it
         is convenient to use the same placement rules for ESD as for RAW, i.e., if a site hosts
         specific RAW events, then it also hosts the corresponding ESD. The proposed “one
         file in, one file out” model for ESD production jobs makes achieving such
         correspondence simpler.
     b) the derived AOD is archived via the CERN analysis facility and an instance is
         shipped to each of the external Tier-1s (i.e. there is a full copy at each Tier-1);
     c) the AOD copy at each Tier-1 is replicated and shared between the associated Tier-2
         facilities;
     d) the derived TAG is archived into Castor and an instance is copied to each Tier-1.
         These copies are then replicated to each Tier-2 in full.
The Tier-0 facility performs the first-pass processing, and is also used in the production of the
first-pass calibrations.

The Tier-1 facilities perform all later re-reconstruction of the RAW data to produce new ESD,
AOD and primary TAG versions. They are also potential additional capacity to be employed
if there is a backlog of first-pass processing at the Tier-0. Note that this implies a degree of
control over the Tier-1 environment and processing that is comparable to that at the Tier-0.

13 At least one Tier-1 site proposes to host the entire ESD. This is not precluded, but the site would
nonetheless, like every other Tier-1, assume principal responsibility for its agreed fraction of the ESD.




Selected ESD will also be copied to Tier-2 sites for specialized purposes. The AOD and TAG
distribution models are similar, but employ different replication infrastructure because TAG
data are database-resident. AOD and TAG distribution from CERN occur upon completion of
first-pass reconstruction processing of each run.
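
As an illustration of the distribution rules described in this section (RAW and its corresponding ESD co-located at a Tier-1, a second ESD replica held by another Tier-1, and AOD and TAG copied to every Tier-1), the sketch below applies a simple round-robin placement. The site names and the round-robin assignment are assumptions made for illustration only; the real assignment follows the agreed ATLAS shares.

```python
# Illustrative placement sketch for one RAW file and its derived data sets.
# The Tier-1 list and the round-robin rule are assumptions for illustration;
# the real assignment follows the agreed ATLAS shares.

TIER1_SITES = ["T1-A", "T1-B", "T1-C", "T1-D", "T1-E", "T1-F"]  # hypothetical names

def place_raw_and_derived(file_index):
    """Return where RAW, ESD, AOD and TAG copies of one file would live."""
    primary = TIER1_SITES[file_index % len(TIER1_SITES)]
    # Second ESD replica at the "next" Tier-1, which is principally
    # responsible for its own fraction of the ESD.
    secondary = TIER1_SITES[(file_index + 1) % len(TIER1_SITES)]
    return {
        "RAW": ["CERN-Castor", primary],        # tape at CERN plus one Tier-1 copy
        "ESD": [primary, secondary],            # same placement rule as for RAW
        "AOD": ["CERN-AF"] + TIER1_SITES,       # a full copy at each Tier-1
        "TAG": ["CERN-Castor"] + TIER1_SITES,   # then replicated to each Tier-2 in full
    }

print(place_raw_and_derived(0))
```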
2.1.2.2 Simulated Data
Simulated data are assumed to be produced in the external Tier-2 facilities. Once produced,
the simulated data must be available for the whole collaboration on an essentially 24/7 basis,
as for real data. This requirement implies that the simulated data should be concentrated at the
Tier-1 facilities unless the lower tiers can guarantee the required level of access. However, it
is assumed that all of the required derived datasets (ESD, AOD and TAG) are produced
together at the same site, and then transported to their eventual storage location. In general,
the storage and analysis of simulated data are best handled through the Tier-1 facilities by
default, although some larger Tier-2 facilities may wish to share this load, with appropriate
credit.
In addition to the official datasets requested by ATLAS and available to all, there will be
additional samples generated locally to test and optimise the simulation procedures and to
support local analyses.
2.1.2.3 Analysis Data
The analysis activity is divided into two components. The first is a scheduled activity run
through the working groups, analysing the ESD and other samples and extracting new TAG
selections and working group enhanced AOD sets or n-tuple equivalents. The jobs involved
will be developed at Tier-2 sites using small sub-samples in a chaotic manner, but will be
approved for running over the large data sets by physics group organisers. It is assumed there
are ~20 physics groups at any given time, and that each will run over the full sample four
times in each year. It is also assumed that only two of these runs will be retained, one current
and one previous.
The second class of user analysis is chaotic in nature and run by individuals. It will be mainly
undertaken in the Tier-2 facilities, and includes direct analysis of AOD and small ESD sets
and analysis of Derived Physics Datasets (DPD). We envisage ~30 Tier-2 facilities of various
sizes, with an active physics community of ~600 users accessing the non-CERN facilities.
The CERN Analysis Facility will also provide chaotic analysis capacity, but with a higher-
than-usual number of ATLAS users (~100). It will not have the simulation responsibilities
required of a normal Tier-2.
It is estimated that for the analysis of DPD, some 25 passes over 1% of the events collected
each year would require only 92 SI2k per user. (It should be kept in mind that the majority of
the user analysis work is to be done in Tier-3s.) Assuming the user reconstructs one
thousandth of the physics events once per year, this requires a more substantial 1.8 kSI2k per
user. It is assumed the user also undertakes CPU-intensive work requiring an additional 12.8
kSI2k per user, equivalent to the community of 600 people running private simulations each
year equal to 20% of the data-taking rate. Such private simulation is in fact observed in some
experiments, although it must be stressed that all samples to be used in published papers must
become part of the official simulation sets and have known provenance. It is assumed that
each user requires 1 TB of storage by 2007/2008, with a similar amount archived.
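
The per-user figures above are of the form (events processed per year × cost per event) / (seconds per year). The sketch below shows that arithmetic with placeholder values; the per-event costs and the number of events per year used here are assumptions for illustration and are not taken from the ATLAS computing model.

```python
# Back-of-the-envelope per-user CPU estimate of the kind quoted above.
# The parameter values below are placeholders, not ATLAS computing-model inputs.

SECONDS_PER_YEAR = 3.15e7        # wall-clock seconds in a year
EVENTS_PER_YEAR  = 2.0e9         # assumed recorded physics events per year

def sustained_kSI2k(events_processed, ksi2k_sec_per_event,
                    seconds_per_year=SECONDS_PER_YEAR):
    """Average CPU (kSI2k) needed to process the given events within a year."""
    return events_processed * ksi2k_sec_per_event / seconds_per_year

# Example: 25 passes over 1% of a year's events at an assumed DPD-analysis cost.
dpd_analysis = sustained_kSI2k(25 * 0.01 * EVENTS_PER_YEAR,
                               ksi2k_sec_per_event=0.005)   # assumed cost

# Example: re-reconstructing one thousandth of the events once per year.
private_reco = sustained_kSI2k(0.001 * EVENTS_PER_YEAR,
                               ksi2k_sec_per_event=15.0)    # assumed cost

print(f"DPD analysis : {dpd_analysis * 1000:.0f} SI2k per user")
print(f"Private reco : {private_reco:.1f} kSI2k per user")
```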
In order to perform the analysis, a Grid-based distributed analysis system has been developed.
It provides the means to return a description of the output dataset and to enable the user to
access quickly the associated summary data. Complete results are available after the job
finishes and partial results are available during processing. The system ensures that all jobs,
output datasets and associated provenance information (including transformations) are
recorded in the catalogues. In addition, users have the opportunity to assign metadata and
annotations to datasets as well as jobs and transformations to aid in future selection.
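
A minimal sketch of the book-keeping this implies is shown below: each output dataset carries a link to the job and transformation that produced it, plus user-supplied metadata. The schema and class names are invented for illustration and do not correspond to an actual ATLAS or LCG catalogue layout.

```python
# Minimal sketch of provenance records as described above; the schema is
# invented for illustration and is not an actual ATLAS/LCG catalogue layout.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Transformation:              # the code/configuration applied to the input
    name: str
    version: str

@dataclass
class Job:
    job_id: str
    transformation: Transformation
    input_datasets: list
    metadata: dict = field(default_factory=dict)   # user annotations

@dataclass
class Dataset:
    name: str
    produced_by: Optional[Job] = None              # provenance link
    metadata: dict = field(default_factory=dict)   # user annotations

# Example: record that a selection job produced a new derived dataset.
tf  = Transformation(name="AODSelection", version="1.0")
job = Job(job_id="job-0001", transformation=tf,
          input_datasets=["aod.physics.2008"],
          metadata={"group": "higgs"})
out = Dataset(name="dpd.higgs.2008", produced_by=job,
              metadata={"comment": "loose lepton selection"})
```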
2.1.2.4 Non-Event Data
Calibration and alignment processing refers to the processes that generate ‘non-event’ data
that are needed for the reconstruction of ATLAS event data, including processing in the




trigger/event filter system, prompt reconstruction and subsequent later reconstruction passes.
These ‘non-event’ data (i.e. calibration or alignment files) are generally produced by processing
some raw data from one or more sub-detectors, rather than being raw data themselves, so e.g.
Detector Control Systems (DCS) data are not included here. The input raw data can be in the
event stream (either normal physics events or special calibration triggers) or can be processed
directly in the subdetector readout systems. The output calibration and alignment data will be
stored in the conditions database, and may be fed back to the online system for use in
subsequent data-taking, as well as being used for later reconstruction passes.
Calibration and alignment activities impact the computing model in several ways. Some
calibration will be performed online, and require dedicated triggers, CPU and disk resources
for the storage of intermediate data, which will be provided by the event filter farm or a
separate dedicated online farm. Other calibration processing will be carried out using the
recorded raw data before prompt reconstruction of that data can begin, introducing significant
latency in the prompt reconstruction at Tier 0. Further processing will be performed using the
output of prompt reconstruction, requiring access to AOD, ESD and in some cases even RAW
data, and leading to improved calibration data that must be distributed for subsequent
reconstruction passes and user data analysis.
Various types of calibration and alignment processing can be distinguished:
     1. Processing directly in the subdetector readout system (the RODs). In this case, the
         processing is done using partial event fragments from one subdetector only, and these
         raw data fragments do not need to be passed up through the standard ATLAS DAQ
         chain into the event stream (except for debugging). This mode of operation can be
         used in dedicated standalone calibration runs, or using special triggers during normal
         physics data-taking.
     2. Processing in the EF system, with algorithms either consuming dedicated calibration
         triggers (identified in the level 1 trigger or HLT), or ‘spying’ on physics events as
         part of the normal processing. In particular, an algorithm running at the end of a chain
         of event filter algorithms would have access to all the reconstructed information (e.g.
         tracks) produced during event filter processing, which may be an ideal point to
         perform some types of calibration or monitoring tasks. If the calibration events are
         identified at level 1 or 2, the event filter architecture allows such events to be sent to
         dedicated sub-farms, or even for remote processing at outside institutes.
     3. Processing after the event filter, but before prompt reconstruction. Event bytestream
         RAW data files will be copied from the event filter to the Tier-0 input buffer disk as
         soon as they are ready, and could then be processed by dedicated calibration tasks
         running in advance of prompt reconstruction. This could be done using part of the
         Tier-0 resources, or event files could also be sent to remote institutes for processing,
         the calibration results being sent back for use in later prompt reconstruction, provided
         the latency and network reliability issues can be kept under control.
     4. Processing offline after prompt reconstruction. This would most likely run on outside
         Tier-1 or Tier-2 centres associated with the subdetector calibration communities,
         leaving CERN computing resources free to concentrate on other tasks. RAW data,
         ESD and AOD will all be distributed outside CERN, though data from more than one
         centre would be needed to process a complete sample due to the ‘round robin’
         distribution of RAW and ESD to Tier-1 centres.
All of these processing types will be used by one or more of the ATLAS subdetectors; the
detailed calibration plans for each subdetector are still evolving. The present emphasis is on
understanding the subdetector requirements, and ensuring they are compatible with the
various constraints imposed by the different types of online and offline processing.

In principle, offline calibration and alignment processing is no different to any other type of
physics analysis activity and could be treated as such. In practice, many calibration activities
will need access to large event samples of ESD or even RAW data, and so will involve
resource-intensive passes through large amounts of data on the Tier-1s or even the Tier-0
facility. Such activities will have to be carefully planned and managed in a similar way to




bulk physics group productions. At present, the offline calibration processing needs are not
sufficiently well understood, though the recent definitions of the contents of DRD, ESD and
AOD should help subdetectors in defining what processing tasks they need to do, and how
they will accomplish them.


2.1.3 CMS
2.1.3.1 Event Data Description and Flow
The CMS DAQ system writes DAQ-RAW events (1.5 MB) to the High Level Trigger (HLT)
farm input buffer. The HLT farm writes RAW events (1.5 MB) at a rate of 150 Hz. RAW
events are classified in O(50) primary datasets depending on their trigger history (with a
predicted overlap of less than 10%). Primary dataset definition is immutable. An additional
express-line is also written with events that will be reconstructed with high priority. Primary
datasets are grouped into O(10) online streams in order to optimize their transfer to the Offline
farm and the subsequent reconstruction process. Data transfer from the HLT to the Tier-0 farm
must happen in real-time at a rate of 225 MB/s.
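
The 225 MB/s figure follows directly from the HLT output rate and the RAW event size, as the short check below illustrates.

```python
# Cross-check of the HLT-to-Tier-0 transfer rate quoted above.
hlt_rate_hz  = 150     # RAW events per second out of the HLT farm
raw_event_mb = 1.5     # RAW event size in MB

transfer_rate = hlt_rate_hz * raw_event_mb
print(f"HLT to Tier-0 transfer rate: {transfer_rate:.0f} MB/s")   # 225 MB/s
```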
Heavy-Ion data at the same total rate (225 MB/s) will be partially processed in real-time on the
Tier-0 farm. Full processing of the Heavy-ion data is expected to occupy the Tier-0 during
much of the LHC downtime (between annual LHC pp running periods).
The first event reconstruction is performed without delay on the Tier-0 farm which writes
RECO events (0.25 MB). RAW and RECO versions of each primary dataset are archived on
the Tier-0 MSS and transferred to a Tier-1 which takes custodial responsibility for them.
Transfer to other Tier-1 centres is subject to additional bandwidth being available. Thus RAW
and RECO are available either in the Tier-0 archive or in at least one Tier-1 centre.
The Tier-1 centres produce Analysis Object Data (AOD, 0.05 MB) (AOD production may
also be performed at the Tier-0 depending on time, calibration requirements etc), which are
derived from RECO events and contain a copy of all the high-level physics objects plus a
summary of other RECO information sufficient to support typical analysis actions (for
example re-evaluation of calorimeter cluster positions or track refitting, but not pattern
recognition). Additional processing (skimming) of RAW, RECO and AOD data at the Tier-1
centres will be triggered by Physics Groups requests and will produce custom versions of
AOD as well as TAGS (0.01 MB) which contain high level physics objects and pointers to
events (e.g. run and event number) and which allow their rapid identification for further
study. Only very limited analysis activities from individual users are foreseen at the Tier-1
centre.
The Tier-1 centre is responsible for bulk re-processing of RAW data, which is foreseen to
happen about twice per year.
Selected skimmed data, all AOD of selected primary streams, and a fraction of RECO and
RAW events are transferred to Tier-2 centres which support iterative analysis of authorized
groups of users. Grouping is expected to be done not only on a geographical but also on a
logical basis, e.g. supporting physicists performing the same analysis or the same detector
studies.
CMS will have about 6 Tier-1 centres and about 25 Tier-2 centres. The CERN centre will host
the Tier-0, a Tier-1 without real-data custodial responsibility, and a Tier-2 which will be about
3 times the size of a standard Tier-2. The CERN Tier-1 will allow direct access to about 1/6 of the
RAW and RECO data and will host the simulated data coming from about 1/6 of the CMS Tier-2 centres. The
CERN Tier-2 will be a facility useable by any CMS member, but the priority allocation will
be determined by the CMS management to ensure that it is used in the most effective way to
meet the experiment priorities; particularly those that can profit from its close physical and
temporal location to the experiment.





2.1.3.2 Simulation
CMS intends to produce as much simulated data as real data. A simulated event size is about
2 MB. Simulation tasks are performed on distributed resources, mainly at the Tier-2 centres.
The simulated data are stored at at least one Tier-1 centre, which takes custodial
responsibility for them. Further distribution and processing of simulated data follow the
same procedure as for real data.
2.1.3.3 Non-event data
CMS will have 4 kinds of non-event data: Construction data, Equipment management data,
Configuration data and Conditions data.
Construction data include all information about the sub-detector construction up to the start
of integration. They have been available since the beginning of CMS and have to remain
available for the lifetime of the experiment. Part of the construction data is also included in
other kinds of data (e.g. initial calibration in the configuration data).
Equipment management data include the detector geometry and location as well as information
about electronic equipment. They need to be available online.
Configuration data comprise the sub-detector-specific information needed to configure the
front-end electronics. They are also needed for reconstruction and re-reconstruction.
Conditions data are all the parameters describing run conditions and logging. They are
produced by the detector front-end. Most of the conditions data stay at the experiment and are
not used for off-line reconstruction, but part of them need to be available for analysis.
At the CMS experiment site there are two database systems. The Online Master Data Storage
(OMDS) database is directly connected to the detector and makes available configuration data
to the detector and receives conditions data from the detector front-end. The Offline
Reconstruction Conditions DB ONline subset (ORCON) database is a replica of OMDS,
but synchronization between the two is automatic only for the conditions data
coming from the detector, while configuration data are copied manually from ORCON to
OMDS. ORCON is automatically replicated at the Tier-0 centre: the Offline Reconstruction
Conditions DB OFfline subset (ORCOFF) is the master copy for the non-event data system.
The relevant parts of ORCOFF that are needed for analysis, reconstruction and calibration
activities are replicated at the different CMS computing centres using technologies such as
3D.
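
The flow between the online and offline condition databases described above can be summarised in a small sketch. The chain (OMDS ↔ ORCON → ORCOFF → Tier centres) follows the text; the function names and dictionary layout are invented for illustration.

```python
# Sketch of the CMS non-event-data flow described above. The chain
# (OMDS <-> ORCON -> ORCOFF -> Tier centres) follows the text; the function
# names and dictionary layout are illustrative only.

def propagate_conditions(omds, orcon):
    """Conditions data from the detector front-end: OMDS -> ORCON, automatic."""
    orcon["conditions"] = dict(omds["conditions"])

def upload_configuration(orcon, omds):
    """Configuration data: prepared in ORCON, copied manually to OMDS."""
    omds["configuration"] = dict(orcon["configuration"])

def replicate_offline(orcon, orcoff, tier_centres):
    """ORCON is replicated to ORCOFF at the Tier-0; the relevant parts are then
    replicated to the Tier centres (e.g. via the 3D technologies mentioned above)."""
    orcoff.update(orcon)
    for centre in tier_centres:
        centre["conditions"] = dict(orcoff.get("conditions", {}))

# Toy usage of the three steps.
omds   = {"conditions": {"HV": 1}, "configuration": {}}
orcon  = {"conditions": {}, "configuration": {"pedestals": 2}}
orcoff = {}
tiers  = [{}, {}]
propagate_conditions(omds, orcon)        # automatic
upload_configuration(orcon, omds)        # manual step
replicate_offline(orcon, orcoff, tiers)
```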
There are currently no quantitative estimates for the data volumes of the non-event data. This
will be addressed in the first volume of CMS Physics TDR. The total data volume is however
considered negligible compared to event data and will not have a major impact on the
hardware resources needed.
2.1.4 LHCb
2.1.4.1 RAW data
The LHCb events can be thought of as being classified into 4 categories: exclusive b sample,
inclusive b sample, dimuon sample and D* sample14. The expected trigger rate after the HLT
is 2 kHz. The b-exclusive sample will be fully reconstructed on the online farm in real time,
and it is expected that two streams will be transferred to the CERN computing centre: a
reconstructed b-exclusive sample at 200 Hz (RAW+rDST) and the RAW data sample at 2 kHz.
The RAW event size is 25 kB, corresponding to the current measured value, whilst there is
an additional 25 kB associated with the rDST. LHCb expect to accumulate 2×10^10 events per
year, corresponding to 500 TB of RAW data.


14 It is appreciated that there will be events that satisfy more than 1 selection criteria; for the sake of
simplicity this overlap is assumed negligible.
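
The quoted yearly RAW volume is consistent with the 2 kHz trigger rate, the nominal 10^7 s of data taking per year and the 25 kB event size, as the short check below shows.

```python
# Cross-check of the LHCb yearly RAW-data volume quoted above.
hlt_rate_hz      = 2_000    # HLT output rate (2 kHz)
seconds_per_year = 1e7      # nominal LHC data-taking time per year
raw_event_kb     = 25       # measured RAW event size

events_per_year = hlt_rate_hz * seconds_per_year         # 2e10 events
raw_volume_tb   = events_per_year * raw_event_kb / 1e9   # kB -> TB

print(f"{events_per_year:.1e} events/year, {raw_volume_tb:.0f} TB of RAW data")
```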




2.1.4.2 Simulated data
The LHCb simulation strategy is to concentrate on the particular needs that require an
inclusive b-sample and the generation of particular decay modes for each channel
under study. The inclusive sample numbers are based on the need for the statistics to be
sufficient that the total error is not dominated by the Monte Carlo statistical error. As such,
these requirements can only be best-guess estimates.
It is anticipated that 2×10^9 signal events will be generated plus an additional 2×10^9 inclusive
events every year. Of these 4×10^9 simulated events, it is estimated that 4×10^8 events will pass
the trigger simulation and will be reconstructed and stored on MSS.
The current event size of the Monte Carlo DST (with truth information) is approximately
500 kB/event. LHCb are confident that this can be decreased to 400 kB/event. Again TAG data
will be produced to allow quick analysis of the simulated data, with ~1 kB/event.
2.1.4.3 Reconstruction
LHCb plan to reprocess the data of a given year once, after the end of data taking for that
year, and then periodically as required. The reconstruction step will be repeated to
accommodate improvements in the algorithms and also to make use of improved
determinations of the calibration and alignment of the detector in order to regenerate new
improved rDST information. Since the LHCC review of the computing model a prototype
rDST has been implemented that meets the 25 kB/event estimate.


2.1.4.4 Data stripping
The rDST is analysed in a production-type mode in order to select event streams for
further individual analysis. The events that pass the selection criteria will be fully
re-reconstructed, recreating the full information associated with an event. The output of the
stripping stage will be referred to as the (full) DST and contains more information than the
rDST.
LHCb plan to run this production-analysis phase (stripping) 4 times per year: once with the
original data reconstruction; once with the re-processing of the RAW data, and twice more, as
the selection cuts and analysis algorithms evolve.
It is expected that user physics analysis will primarily be performed on the output of this stage
of data processing (DST+RAW and TAG). During first data taking it is foreseen to have at
least 4 output streams from this stripping processing: two associated directly with physics (b-
exclusive and b-inclusive selections) and two associated with “calibration” (dimuon and D*
selections). For the b-exclusive and b-inclusive events, the full information of the DST and
RAW will be written out and it is expected to need 100 kB/event. For the dimuon and D*
streams only the rDST information will be written out, with the RAW information added; this
is estimated to be 50 kB/event.
2.1.4.5 Analysis
Finally LHCb physicists will run their Physics Analysis jobs, processing the DST output of
the stripping on events with physics analysis event tags of interest and run algorithms to
reconstruct the B decay channel being studied. Therefore it is important that the output of the
stripping process is self-contained. This analysis step generates quasi-private data (e.g.
Ntuples or personal DSTs), which are analysed further to produce the final physics results.
Since the number of channels to be studied is very large, we can assume that each physicist
(or small group of physicists) is performing a separate analysis on a specific channel. These
“Ntuples” could be shared by physicists collaborating across institutes and countries, and
therefore should be publicly accessible.





2.1.4.6 LHCb Computing Model
The baseline LHCb computing model is based on a distributed multi-tier regional centre
model. It attempts to build in flexibility that will allow effective analysis of the data whether
or not the Grid middleware meets expectations; of course, this flexibility comes at the cost of a
modest requirement overhead associated with pre-distributing data to the regional centres. In
this section we will describe a baseline model but we will comment on possible variations
where we believe this could introduce additional flexibility.
CERN is the central production centre and will be responsible for distributing the RAW data
in quasi-real time to the Tier-1 centres. CERN will also take on a role of a Tier-1 centre. An
additional six Tier-1 centres have been identified: CNAF (Italy), FZK (Germany),
IN2P3 (France), NIKHEF (The Netherlands), PIC (Spain) and RAL (United Kingdom), together with an
estimated 14 Tier-2 centres. CERN and the Tier-1 centres will be responsible for all the
production-processing phases associated with the real data. The RAW data will be stored in
its entirety at CERN, with another copy distributed across the 6 Tier-1’s. The 2nd pass of the
full reconstruction of the RAW data will also use the resources of the LHCb online farm. As
the production of the stripped DSTs will occur at these computing centres, it is envisaged that
the majority of the distributed analysis of the physicists will be performed at CERN and at the
Tier-1’s. The current year’s stripped DST will be distributed to all centres to ensure load
balancing. To meet these requirements there must be adequate networking not only between
CERN and the Tier-1’s but also between Tier-1’s; quantitative estimates will be given later.
The Tier-2 centres will be primarily Monte Carlo production centres, with both CERN and the
Tier-1’s acting as the central repositories for the simulated data. It should be noted that
although LHCb do not envisage any analysis at the Tier-2’s in the baseline model, it should
not be proscribed, particularly for the larger Tier-2 centres.


2.2     Resource Expectations
For the purpose of this document, 2008 is assumed to be the first full year of data taking,
corresponding to 10^7 seconds of data taking. The first year of heavy-ion running is assumed
to be 2009.
2.2.1 ALICE
2.2.2 ATLAS
Clearly, the required system will not be constructed in its entirety by the start of data-taking in
2007/2008. From a cost point-of-view, the best way to purchase the required resources would
be ‘just in time’. However, the early period of data-taking will doubtless require much more
reprocessing and less compact data representations than in the mature steady-state, as
discussed in the section on Commissioning above. There is therefore a requirement for early
installation of both CPU and disk capacity.
It is therefore proposed that by the end of 2006 a capacity sufficient to handle the data from
first running needs to be in place, with a similar ramping of the Tier-1 facilities. During 2007,
the additional capacity required for 2008 should be bought. In 2008, an additional full year of
capacity should be bought, including the additional archive storage medium/tape required to
cope with the growing dataset. This would lead to a capacity, installed by the start of 2008,
capable of storing the 2007 and 2008 data as shown in Table 1; the table assumes that only
20% of the data rate is fully simulated.



                                  CPU (MSI2k)     Tape (PB)     Disk (PB)
              CERN Tier-0              4.1           6.2           0.35
              CERN AF                  2.8           0.6           1.8
              Sum of Tier-1's         26.5          10.1          15.5
              Sum of Tier-2's         21.1           0.0          10.1
              Total                   54.5          16.9          27.8


Table 1: The projected total resources required at the start of 2008 for the case when 20% of the
data rate is fully simulated.


For the Tier-2s, a slightly later growth in capacity, following the integrated luminosity, is
conceivable provided that the resource-hungry learning-phase is mainly consuming resources
in Tiers 0 and 1. However, algorithmic improvements and calibration activity will also require
considerable resources early in the project. As a consequence, we have assumed the same
ramp-up for the Tier-2s as for the higher Tiers.
Once the initial system is built, there will for several years be a linear growth in the CPU
required for processing, as the initial datasets will require reprocessing as algorithms and
calibration techniques improve. In later years, subsets of useful data may be identified to be
retained/reprocessed, and some data may be rendered obsolete. However, for the near future,
the assumption of linear growth is reasonable. For storage, the situation is more complex. The
requirement exceeds a linear growth if old processing versions are not to be overwritten. On
the other hand, as the experiment matures, increases in compression and selectivity over the
stored data may reduce the storage requirements.
The projections do not include the replacement of resources, as this depends crucially on the
history of the sites at the start of the project.



                             2007        2008        2009        2010        2011        2012
     Total Disk (TB)      164.692    354.1764    354.1764     495.847     660.539    850.0234
     Total Tape (TB)     1956.684    6164.608    10372.53    16263.62    22154.72    30002.49
     Total CPU (kSI2k)       1826        4058        4058        8239       10471       10471

Figure 1: The projected growth in ATLAS Tier-0 resources with time.








                             2007        2008        2009        2010        2011        2012
     Total Disk (TB)     751.1699    1812.943    2342.153     3463.75    4955.813     6758.48
     Total Tape (TB)     208.0896    567.2796    824.8605    1261.443    1622.057     2190.76
     Total CPU (kSI2k)        974        2822        4286        8117       12279       16055

Figure 2: The projected growth in the ATLAS CERN Analysis Facility.

                             2007        2008        2009        2010        2011        2012
     Total Disk (TB)     5541.362    15464.46     23093.6    41872.46    56997.26    72122.06
     Total Tape (TB)     3015.246    10114.47    18535.89    30873.28    45061.74    61101.28
     Total CPU (kSI2k)       7899       26502       47600       81332      123827      172427

Figure 3: The projected growth in the capacity of the combined ATLAS Tier-1 facilities.







                             2007        2008        2009        2010        2011        2012
     Disk (TB)          3213.2069   10103.135   16990.023   26620.766   36251.509   45888.115
     CPU (kSI2k)             7306       21108       31932       52174       69277       86380

Figure 4: The projected growth of the combined ATLAS Tier-2 facilities. No repurchase effects
are included.


2.2.3 CMS
CMS resource requirements are summarized in Table 2. The calculations assume the 2007 run
to be half of the 2008 run. In 2009 the CMS RAW event size is expected to be reduced to 1 MB
thanks to a better understanding of the detector. In 2010, running at high luminosity increases
the processing required by a factor of 5, but the event size stays at 1 MB due to likely further
improvements in understanding the detector.








Table 2: CMS computing time profile


2.2.4 LHCb
It is anticipated that the 2008 requirements to deliver the computing for LHCb are 13.0
MSI2k.years of processing, 3.3 PB of disk and 3.4 PB of storage in the MSS. The CPU
requirements will increase by 11% in 2009 and 35% in 2010. Similarly the disk requirements
will increase by 22% in 2009 and 45% in 2010. The largest increase in requirements is
associated with the MSS where a factor 2.1 is anticipated in 2009 and a factor 3.4 for 2010,
compared to 2008. The requirements are summarised in Table 3. The estimates given for 2006
and 2007 reflect the anticipated ramp-up of the computing resources to meet the computing
requirements needed in 2008; this is currently 30% of the 2008 needs in 2006 and 60% in 2007.
This ramp-up profile should cover the requirements of any data taken in 2007.


 CPU (MSI2k.yr)          2006         2007         2008         2009         2010
     CERN                0.27         0.54         0.90         1.25         1.88
     Tier-1s             1.33         2.65         4.42         5.55         8.35
     Tier-2s             2.29         4.59         7.65         7.65         7.65
     Total               3.89         7.78        12.97        14.45        17.88

 Disk (TB)
     CERN                 248          496          826         1095         1363
     Tier-1s              730         1459         2432         2897         3363
     Tier-2s                7           14           23           23           23
     Total                984         1969         3281         4015         4749

 MSS (TB)
     CERN                 408          825         1359         2857         4566
     Tier-1s              622         1244         2074         4285         7066
     Total               1030         2069         3433         7144        11632

Table 3: LHCb computing resource estimates


2.3      Baseline Requirements
2.3.1 ALICE
2.3.2 ATLAS
2.3.3 CMS
The CMS computing system is geographically distributed. Data are spread over a number of
centres following the physics criteria implied by their classification into primary datasets.
Replication of data is driven more by the need to optimize access to the most commonly
accessed data than by the need to have data "close to home". Furthermore, Tier-2 centres
support users not only on a geographical basis but mainly on a physics-interest basis.
CMS intends to exploit, as much as possible, solutions in common with other experiments for
access to distributed CPU and storage resources.
2.3.3.1 Compute and Storage Elements
The Compute Element (CE) interface should allow access to batch queues at all CMS
centres independently of the User Interface (UI) from which the job is submitted. Mechanisms
should be available for installing, configuring and verifying CMS software at remote sites. At
a few selected centres CMS may require direct access to the system in order to configure
software and data for specific, highly demanding processes such as digitization with pile-up
of simulated data. This procedure does not alter the resource access mechanisms.
The Storage Element (SE) interface should hide the complexity and the peculiarities of the
underlying storage system, ideally presenting to the user a single logical file namespace in
which CMS data may be stored. While exceptions to this will be supported, they are not
expected to be the default mode of operation.
2.3.3.2 Data management
This section deals only with the management of event data, since non-event data will be
discussed in detail in the CMS Physics TDR, as anticipated in the previous section.
CMS data are indexed not as single files but as Event-Collections, which may contain one or
more files. Event-Collections are the lowest granularity elements that may be addressed by a
process that needs to access them. An Event-Collection may represent for instance a given
data-tier (i.e. RAW or RECO or AOD, etc.) for a given primary dataset and for a given LHC
fill. Their composition is defined by CMS and the information is kept in a central service
provided and implemented by CMS: the Dataset Bookkeeping System (DBS). The DBS




behaves like a Dataset Metadata Catalogue in HEPCAL and allows all possible operations to
manage CMS data from the logical point of view. All or part of the DBS may be replicated in
read-only copies. Copies may use different back-ends depending on the local environment.
Light-weight solutions such as flat files may be appropriate to enable analysis on personal
computers. In the baseline solution the master copy at the Tier-0 is the only one where
updates may be made; we do not exclude that this may change in the future. Information is
entered in the DBS by the data-production system. As soon as a new Event-Collection is first
made known to the DBS, a new entry is created. Some information about the production of the
Event-Collection (e.g. the file composition, including the files' Globally Unique IDentifiers
(GUIDs), sizes and checksums) may only be known at the end of its production.
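As an illustration of this two-step registration, the following minimal Python sketch models an
Event-Collection entry that is created first and completed later; all class and field names are
invented for the example and do not reflect the actual DBS schema.

# Minimal sketch of the two-step DBS registration described above.
# All class and field names are illustrative; they are not the real DBS schema.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileInfo:
    guid: str       # Globally Unique IDentifier of the file
    size: int       # bytes
    checksum: str


@dataclass
class EventCollection:
    name: str                    # e.g. "RECO/MuonPrimaryDataset/fill-1234"
    data_tier: str               # RAW, RECO, AOD, ...
    primary_dataset: str
    files: List[FileInfo] = field(default_factory=list)  # filled at end of production
    complete: bool = False


class DatasetBookkeepingSystem:
    """Master copy at the Tier-0; read-only replicas may use other back-ends."""

    def __init__(self) -> None:
        self._collections: Dict[str, EventCollection] = {}

    def register(self, name: str, data_tier: str, primary_dataset: str) -> None:
        # Step 1: entry created as soon as the Event-Collection is first known.
        self._collections[name] = EventCollection(name, data_tier, primary_dataset)

    def close(self, name: str, files: List[FileInfo]) -> None:
        # Step 2: file composition (GUIDs, sizes, checksums) is only known at
        # the end of production.
        coll = self._collections[name]
        coll.files = files
        coll.complete = True

    def lookup(self, name: str) -> Optional[EventCollection]:
        return self._collections.get(name)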
A separate Data Location System (DLS) tracks the location of the data. The DLS is indexed
by file-blocks, which are in general composed of many Event-Collections. The primary
source of data location information is a local index of the file-blocks available at each site. A
global data location index maintains an aggregate of this information for all sites, such that it
can answer queries about which file-blocks exist where. Our baseline is that information is
propagated from the local indices to the global one asynchronously. Queries against the
global index are answered directly by the global index without being forwarded to the local
indices, and likewise queries against a local index are not forwarded to the global one.
Information is entered into the DLS at the local index where the data reside, either by the
production system after creating a file-block or by the data transfer system (see below) after a
transfer. In both cases only complete file-blocks are published. Site manager operations may
also result in modifications of the local index, for instance in case of data loss or deletion.
Once the baseline DLS has proven sufficient, we expect the DLS model to evolve.
Access to local data never implies access to the global catalogue; if data are found to be
present locally (e.g. on a personal computer), they are directly accessible.
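The two-level structure described above can be sketched as follows; the class and method names
are purely illustrative and are not the actual DLS interface. Each site publishes complete
file-blocks into its local index, the global index aggregates this information asynchronously, and
queries on either level are answered without contacting the other.

# Sketch of the two-level Data Location System (illustrative names only).
from collections import defaultdict
from typing import Dict, List, Set


class LocalDLSIndex:
    def __init__(self, site: str) -> None:
        self.site = site
        self.file_blocks: Set[str] = set()

    def publish(self, file_block: str) -> None:
        # Only complete file-blocks are published (by production, the data
        # transfer system, or the local site manager).
        self.file_blocks.add(file_block)

    def remove(self, file_block: str) -> None:
        # E.g. after data loss or deletion by the site manager.
        self.file_blocks.discard(file_block)


class GlobalDLSIndex:
    def __init__(self) -> None:
        self._locations: Dict[str, Set[str]] = defaultdict(set)

    def synchronize(self, local: LocalDLSIndex) -> None:
        # Asynchronous propagation from a local index to the global one.
        for block in local.file_blocks:
            self._locations[block].add(local.site)

    def sites_for(self, file_block: str) -> List[str]:
        # Answered directly by the global index; local indices are not queried.
        return sorted(self._locations.get(file_block, set()))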
Note that the DLS only provides the names of the sites hosting the data, not the physical location
of the constituent files at the sites, nor the composition of file-blocks. The actual location of files
is only known within the site itself, through a Local File Catalogue. This catalogue has an
interface (POOL) which returns the physical location of a logical file, known either through
its logical file name (which is defined by CMS) or through its GUID. CMS applications only know
about logical files and rely on this local service for access to the physical files.
Information is entered in the local file catalogue in a similar way to the local index of the
DLS, i.e. by the production system, by the data transfer agents or by the local site manager. Note
that if the local SE can be seen as a single logical file namespace, the functionality of the
catalogue may be implemented by a simple algorithm that attaches the logical file name, as
known by the CMS application, to a site-dependent prefix provided by the local
configuration. In this case no information needs to be entered when file-blocks are added or
removed. This is the case, for instance, when data are copied to a personal computer (e.g. a
laptop) for iterative analysis.
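A minimal sketch of such a trivial catalogue, assuming the site exposes a single namespace under
a configurable prefix (the prefix values and protocol below are examples only):

# Trivial "prefix" file catalogue: the physical file name is obtained by
# attaching a site-dependent prefix to the CMS logical file name.
def logical_to_physical(lfn: str,
                        site_prefix: str = "rfio://se.example.site/cms/store") -> str:
    """Return the physical location of a logical file at this site."""
    return f"{site_prefix}/{lfn.lstrip('/')}"


# A laptop copy could simply use a local path as prefix instead.
print(logical_to_physical("/RECO/MuonPrimaryDataset/fill-1234/file-0001.root",
                          site_prefix="file:///data/cms/store"))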
CMS will use a suitable DLS implementation able to co-operate with the workload
management system (LCG WMS), if one exists. Failing that, a CMS implementation will be
used, with certain consequences for the job submission system (see the analysis section
below). An instance of the local index must operate on a server at each site hosting data; the
management of such a server will be the responsibility of CMS personnel at the site. There may
be a need to contact a local DLS from outside the site; the local file catalogue conforming to the
POOL API, however, only needs to be accessible from within the site.
2.3.3.3 Data transfer
Data transfers are never performed as direct file copies by individual users. The data transfer
system, Physics Experiment Data Export (PhEDEx), consists of the following components:
• A transfer management database (TMDB), where transfer requests and subscriptions are kept.
• Transfer agents, which manage the movement of files between sites. These include agents to
migrate files to mass storage, to manage local mass-storage staging pools, to stage files
efficiently based on transfer demand, and to calculate file checksums when necessary before
transfers.
• Management agents, in particular the allocator agent, which assigns files to destinations based
on site data subscriptions, and agents that maintain file-transfer topology routing information.
• Tools to manage transfer requests, including interaction with local file and dataset catalogues
as well as with the DBS when needed.
• Local agents for managing files locally, for instance as files arrive from a transfer request or a
production farm, including any processing that needs to be done before they can be made
available for transfer: processing information, merging files, registering files into the
catalogues, and injecting them into the TMDB.
• Web-accessible monitoring tools.
Note that every data transfer operation includes a validation step that verifies the integrity of
the transferred files.
In the baseline system a TMDB instance is shared by the Tier-0 and the Tier-1s. Tier-2s and
lower tiers may share the same TMDB instance or have site-local or geographically shared
databases; the exact details of this partitioning will evolve over time. All local agents
needed at sites hosting CMS data are managed by CMS personnel and run on normal LCG
User Interfaces. The database requirements and the CPU capacity needed for the agents are not
expected to be significant. Between sites the agents communicate directly with each other and
through a shared database. The amount of this traffic is negligible.
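As a rough illustration of how these agents cooperate through the shared database, the Python
sketch below shows a polling transfer agent; the tmdb interface, its field names and the helper
functions are hypothetical and do not reflect the actual PhEDEx implementation.

# Illustrative polling loop for a transfer agent: it reads pending transfer
# requests for its site from the shared TMDB, copies the files, verifies their
# integrity and publishes the completed file-block in the local DLS index.
import hashlib
import shutil
import time


def checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def transfer_agent(tmdb, site: str, local_dls_index, poll_interval: float = 60.0) -> None:
    while True:
        for request in tmdb.pending_requests(destination=site):
            shutil.copy(request.source_path, request.destination_path)
            # Validation step: every transfer verifies the integrity of the file.
            if checksum(request.destination_path) != request.expected_checksum:
                tmdb.mark_failed(request)
                continue
            tmdb.mark_done(request)
            if tmdb.file_block_complete(request.file_block, site):
                # Only complete file-blocks are published locally.
                local_dls_index.publish(request.file_block)
        time.sleep(poll_interval)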
2.3.3.5 Analysis
While interactive analysis is foreseen to happen mainly locally at Tier-2/3 centres or on personal
computers, batch processing of data in general happens on the distributed system. Users
operate on a standard LCG UI. The mechanism that CMS foresees using is similar to the one
described as "Distributed Execution with no special analysis facility support" in the
HEPCAL-II document.
A user provides one or more executables with a set of libraries, configuration parameters for
the executables (either via arguments or input files) and the description of the data to be
analyzed. Additional information may be passed to optimize job splitting, for example an
estimate of the processing time per event. A dedicated tool running on the User Interface
(CMS Remote Analysis Builder, CRAB) queries the DBS and produces the set of jobs to be
submitted. In the baseline solution an additional query to the DLS selects the sites hosting the
needed data. This translates into an explicit requirement on the possible set of sites in the job
description (JDL file) passed to the WMS. In the future the query to the DLS may be performed
by the WMS itself, if a compatible interface between the DLS and the WMS is provided. Jobs are
built in a site-independent way and may run on any site hosting the input data. CRAB takes care
of defining the list of local files that need to be made available on the execution host (input
sandbox) and those that have to be returned to the user at the end of execution (output
sandbox). The user also has the possibility of specifying that the data are local and that the
job has to be submitted to a local batch scheduler or even forked on the current machine; in
this case CRAB has the responsibility of building the jobs with the appropriate structure. The
job cluster is submitted to the LCG WMS either as a single entity or job-by-job, depending on the
functionality provided by the WMS. The WMS selects the site on which to run each job
based on load balancing only. As anticipated in the Data Management section, the
translation of logical file names to physical file names happens through a POOL catalogue
interface on the Worker Node (WN).
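The following Python sketch illustrates this job-preparation flow; the dbs and dls interfaces and
the job-description fields are hypothetical, and in the real system the candidate-site list would be
expressed as a Requirements clause in the JDL.

# CRAB-style job preparation (all interfaces are invented for the sketch):
# query the DBS for the Event-Collections of the requested dataset, query the
# DLS for the sites hosting the corresponding file-blocks, then split into jobs.
from typing import Dict, List


def build_jobs(dbs, dls, dataset: str, data_tier: str, events_per_job: int) -> List[Dict]:
    jobs = []
    for collection in dbs.event_collections(dataset, data_tier):
        sites = dls.sites_for(collection.file_block)  # candidate execution sites
        for first in range(0, collection.n_events, events_per_job):
            last = min(first + events_per_job, collection.n_events) - 1
            jobs.append({
                "executable": "cmsRun.sh",
                "arguments": [collection.name, first, last],
                "input_sandbox": ["cmsRun.sh", "user_cfg.py", "user_libs.tgz"],
                "output_sandbox": ["analysis.root", "job.log"],
                "candidate_sites": sites,  # becomes the site requirement in the JDL
            })
    return jobs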
Job cluster submission and all interactions with the cluster or with its constituent jobs happen
through an interface (Batch Object Submission System, BOSS) which hides the complexity of
the underlying batch scheduler, in particular whether it is local or Grid-based. This layer
allows jobs and clusters to be submitted and cancelled, their output to be retrieved
automatically, and information about their status and history to be obtained. Furthermore, it
logs all information, whether related to running conditions or specific to the tasks performed,
in a local database. The bookkeeping database back-end may vary depending on the
environment (e.g. a performant RDBMS such as Oracle for production systems, SQLite for
personal computers or laptops). If outbound connectivity is available on the WNs, or if a
suitable tunnelling mechanism (e.g. an HTTP proxy, R-GMA servlets, etc.) is provided on the CE,
a job submitted through BOSS may send information to a monitoring service in real time and
make it available to the BOSS system. Otherwise, logging information is only available at the
end of job execution (through the job output sandbox). Note that BOSS does not require any
dedicated service at the sites where the jobs run.
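A minimal sketch of such local bookkeeping, assuming an SQLite back-end as mentioned above
(the table layout is invented for the example and is not the actual BOSS schema):

# BOSS-style local bookkeeping: every submitted job gets a row in a local
# database, updated as status information arrives (in real time or at job end).
import sqlite3
import time


class JobBookkeeper:
    def __init__(self, db_path: str = "boss_local.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " job_id TEXT PRIMARY KEY, scheduler TEXT,"
            " status TEXT, last_update REAL)"
        )

    def register(self, job_id: str, scheduler: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO jobs VALUES (?, ?, 'submitted', ?)",
            (job_id, scheduler, time.time()),
        )
        self.conn.commit()

    def update(self, job_id: str, status: str) -> None:
        self.conn.execute(
            "UPDATE jobs SET status = ?, last_update = ? WHERE job_id = ?",
            (status, time.time(), job_id),
        )
        self.conn.commit()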
2.3.3.6 Production
Physics groups submit data production requests to a central system (RefDB), which behaves
like a virtual data catalogue, since it keeps all the information needed to produce the data. RefDB
also holds information about the individual jobs that produced the data. Most of the
information currently in RefDB will be moved to the DBS, leaving in RefDB only the
management of information specific to the control of the production system and to data
quality.
Data productions may happen on distributed or on local (e.g. Tier-1) resources. Once
production assignments are defined by the CMS production manager, the corresponding jobs
are created at the appropriate site, according to the information stored in RefDB. The tool that
performs this operation is McRunJob, but CMS is currently evaluating the possibility of using
the same tool for data production and data analysis. Detailed job monitoring is provided by
BOSS at the site where the jobs are created. A summary of the logged information is also
stored in RefDB.
Publication of produced data implies interaction with the DBS and with the local components
of the DLS and the file catalogue at the site where the data are stored. Note that for Grid
production this implies running the publication procedure at the site where the data are stored,
not in the job that performed the production. Part of the publication procedure is the
validation of the produced data, which is performed by the production system itself.


2.3.4 LHCb


3    LHC COMPUTING GRID ARCHITECTURE
Kors Bos



3.1    Grid architecture in general – dataflow – aspect of several grids,
functionality and services
Ian Bird
Abstract
This section will cover the basic grid services to be provided LCG-wide. These should be
described in terms of their essential functionality and interfaces. There should be a discussion
of essential basic-level services needed by all experiments, and higher level services which
might be used in different ways by each experiment. It is anticipated that the core of this
section will be defined by the ongoing baseline services working group.




3.2     Network
David Foster
"The contents of this chapter are currently under discussion in a GDB working group. This is
a work in progress and will change substantially. These are, however, the current thoughts."
3.2.1 Introduction
The Large Hadron Collider (LHC) is being built at CERN in Geneva, Switzerland. The large
amounts of data produced by the LHC will be sent to data processing and storage
centres around the world. The data source is called "Tier 0" or T0, and the first level of
processing and storage is called "Tier 1" or T1. The "Tier 2" or T2 sites are typically
universities and other scientific institutes that connect to one or more T1 sites.


The entire large-scale scientific instrument can be depicted as the collection of:
• The LHC and its data collection systems;
• The data processing and storage units at CERN, called T0;
• The data processing and storage sites called T1;
• The data processing and storage sites called T2;
• The associated networking between all T0, T1, and T2 sites.


The following picture shows this in more detail:
The T1 and T0 sites, together with the links between them and the attached equipment, form
the LHC network. This document proposes the high-level architecture for the LHC network.
The aim of this architecture is to be as inclusive of technologies as possible, while still
proposing concrete directions for further planning and implementation. With respect to T0-T1
networking this document proposes a detailed architecture based on 10 Gbit/s light paths.
For T2 networking this document does not propose a detailed solution. However, if T2s are
able to match the recommendations proposed in the "IP addressing" section, it should be
relatively easy to extend the architecture to them.







Version 1.0 of this document is the result of the work of a small subgroup of the T0/T1-networking
meeting and is the subject of discussion on the mailing list and/or in the larger meeting. It is
envisaged that version 2.0 of this document will be the resulting and final document, after
incorporating all results agreed upon in the extended T0/T1 meeting that will take place on 8
April 2005.


This document contains contributions of Erik-Jan Bos (SURFnet), Kors Bos (NIKHEF), Hans
Döbbeling (DANTE), David Foster (CERN), Bill Johnston (ESnet), Donna Lamore (FNAL),
Edoardo Martelli (CERN), Paolo Moroni (CERN), Don Petrovick (FNAL), Roberto Sabatino
(DANTE), Karin Schauerhammer (DFN), Klaus Ullmann (DFN) and others.
3.2.2 Tiers
T0:
- CERN - Switzerland
T1s:
 T1            Location                  AS number (if used)   NRNs involved
 ASCC          Taipei - Taiwan                                 ASnet
 Brookhaven    Upton - NY - USA                                ESnet - LHCnet (1)
 CERN          Geneva - Switzerland      513
 CNAF          Bologna - Italy                                 Geant2 - GARR
 Fermilab      Batavia - Ill - USA                             ESnet - LHCnet (1)
 IN2P3         Lyon - France                                   Renater
 GridKa        Karlsruhe - Germany                             Geant2 - DFN
 SARA          Amsterdam - NL                                  Geant2 - SURFnet
 NorduGrid     Scandinavia                                     Geant2 - Nordunet
 PIC           Barcelona - Spain                               Geant2 - RedIRIS
 RAL           Didcot - UK                                     Geant2 - Ukerna
 TRIUMF        Vancouver - Canada                              CA*Net4

(1) CALTECH-CERN transatlantic links


3.2.3 LHC network traffic
The LHC network is designed to move data and control information in the context of the
LHC experiments. This traffic will consist of the raw and derived data and the control
information exchanged among the machines connected to the LHC network that have
visibility outside their local Tier. Based on the traffic estimates received from the LHC
experiments, it is assumed that every T1 will provision an end-to-end T0-T1 link.
The resources available at the T1s will not all be the same, and therefore the average network
load can be expected to vary. In addition, the anticipated peak load is an important factor, as
it is this peak load that the network should be capable of sustaining. As the computing models
continue to be refined this should become clearer, but for the moment a generally agreed
starting point is the provisioning of a 10 Gbit/s lambda per T1-T0 link.
This data traffic will be called "LHC network traffic".
A lightpath is (i) a point-to-point circuit based on WDM technology, or (ii) a circuit-switched
channel between two end points with deterministic behaviour based on TDM technology, or
(iii) a concatenation of (i) and (ii).





Examples of (i) are:
- an STM-64 circuit;
- a 10GE LAN PHY circuit.
Examples of (ii) are:
- a GE or 10GE channel carried over an SDH/SONET infrastructure with GFP-F encapsulation;
- an STM-64/OC-192 channel between two points carried over an SDH/SONET infrastructure.
3.2.4 Provisioning
The responsibility for providing network equipment, physical connectivity and manpower is
distributed among the cooperating parties.
- The planned starting date for the production traffic is June 2007, but T1s are encouraged to
proceed with the provisioning well before that date, ideally already within 2005.
Nevertheless, they must be ready at full bandwidth not later than Q1 2006. This is important
as the service challenges now under way need to build up towards the full-capacity production
environment, exercising each element of the system from the network to the applications. It is
essential that the full network infrastructure is in place in time for testing the complete
environment.
- Every T1 will be responsible for organising the physical connectivity from the T1's premises
to the T0's computer centre, according to the MoU (not yet finalized at the time of writing)
between the T0 and the T1s.
- Every T1 will make available in due course the network equipment necessary for the
termination point of the corresponding T1-T0 line on the T1 side.
- The T0 will provide the interfaces to be connected to each T1 link termination point at CERN.
- CERN is available to host a T1's equipment for T0-T1 link termination at CERN, if requested
and within reasonable limits. In this case the T1 will provide CERN with a description of the
physical dimensions and the power requirements of the equipment to be hosted.
- T1s are encouraged to provision direct T1-T1 connectivity whenever possible and
appropriate.
- T1s are encouraged to provision backup T0-T1 links on alternate physical routes with
adequate capacity.
3.2.5 Physical connectivity (layer1)
While the T0 does not make any recommendation on the technology or the provider selected by
the T1s to connect, for practical reasons it must set some restrictions regarding the physical
interfaces on its side.
T0 interfaces:
The T0's preferred interface is 10 Gbit/s Ethernet LAN-PHY. WAN-PHY and STM-64/OC-192
can be negotiated with individual T1s on request.
T1 interfaces:
If a T1 cannot directly connect to any of the interfaces provided by the T0, it will be
responsible for providing a suitable media converter to make the connection possible.


3.2.6 Logical connectivity (layer 2 and 3)
IPv4 is the network protocol chosen to provide communication among the upper-layer
applications in the first stage; other network protocols such as IPv6 can be considered in the
future. Every Tier is encouraged to support an MTU of at least 9000 bytes on the entire path
between the T0 and the T1. Routed (Layer 3) and non-routed (Layer 2) approaches are both
acceptable.
T0 logical connectivity
The T0's equipment for the T1 access links will be an IP router.
T1s have two options for their connectivity:
Routed connection (Routed-T1)
In this case the termination point of the T0-T1 link is a BGP speaker, either managed directly
by the T1 or by an upstream entity, normally an NRN.
For each Routed-T1 a peering will be established between the T0's router and the T1's router
using external BGP (eBGP) and the following parameters: (1) the T0 will use the CERN
ASN 513; (2) for a T1 site that has its own ASN, this ASN will be used in the peering; for a
T1 site that has no ASN, the ASN of the intermediate NRN will be used instead.
The T1 will announce its own prefixes (see IP addressing below) and possibly any of the prefixes
of the T1s or T2s directly connected to it (see below). From the architectural point of view,
every T0-T1 link should carry only production LHC data. If, due to the particular situation of
a T1, it is useful to exchange more traffic on some of these links, this can be discussed
separately, independently of this document.
T1s preferring this option are referred to as Routed-T1 in what follows.
Non-routed connection (Lightpath-T1)
In this case the T1 will configure a non-routed connection (Layer 2) up to the T0 interface.
The T1 will use a single CIDR address block on this interface and will assign one IP address
of this network to the T0 interface.
T1s preferring this option are referred to as Lightpath-T1.
The following picture depicts an example of the Lightpath-T1 and Routed-T1 architectures. It
also includes the backup connectivity described later.








Please note that the T1 back-up connection can very well run through another T1, as some
T1s will have good connectivity between them. Examples could be:
•       Fermilab and Brookhaven through ESnet.
•       GridKa and CNAF.
•       SARA and NorduGrid.
•       Etc.


3.2.7 IP addressing
In order to manage minimal network security and routing effectively, it is essential to
aggregate as much as possible the IP addresses used in the context of LHC network traffic.
Every T1 and the T0 must allocate publicly routable IP address space to the machines that need
to be reached over the T0-T1 links. In what follows, this address space will be referred to as the
"LHC prefixes".
LHC prefixes should be aggregated into a single CIDR block for every T1; if this is not
possible, only a very small number of CIDR blocks per T1 would still be accepted.
LHC prefixes should preferably be dedicated to the LHC network traffic.
LHC prefixes can be carved as CIDR blocks from T1s' existing allocations or obtained as new
allocations from the appropriate RIR.
Every Routed-T1 will announce only the LHC prefixes on the T0-T1 links.
LHC prefixes cannot be RFC1918 (or related, e.g. RFC3330) addresses, mainly because of
the backup requirements (see later).
RFC1918 addresses may be used for internal Tier traffic.




The T0 can allocate /30 prefixes for the addressing of the T0-T1 links (Routed-T1 only).
Every T1 (and every T2 interested in exchanging traffic directly with the T0) is required to
provide the T0 with the list of its LHC prefixes. The T0 will maintain a global list of all LHC
prefixes and inform T1s and T2s about any changes.


The following picture depicts an example of T1 structure:
[Figure: example of a T1 structure. Data movers within the T1's LHC prefix
(e.g. 128.142.1.128/25) handle the data flows to and from the T0, while the server farm behind
them may use any IP address (LHC prefix, RFC1918, ..., e.g. 128.142.0.0/16 or 10.1.0.0/16).]

3.2.8 BGP Routing (Routed-T1)
External BGP peerings will be established between the T0 and each Routed-T1. More precisely,
the Routed-T1 is the BGP speaker directly connected to the T0 on behalf of a specific T1 (e.g.
an NRN connecting a T1 not owning an AS number).
The T0 will use the CERN Autonomous System number (AS513).
Routed-T1s will use the AS number of the entity that provides the LHC prefixes to them, or
the AS number of their standard upstream NRN.
The T0 will re-announce all the learned LHC prefixes to all the peering T1s. Nevertheless, since
T1s are encouraged to establish direct connectivity among themselves, they will filter out
unnecessary LHC prefixes according to each individual T1-T1 routing policy.
A T1 will accept the T0's prefixes, plus, if desired, some selected T1 prefixes (see above).
The T0 and the T1s should announce their LHC prefixes to their upstream continental research
networks (Geant2, Abilene, ...) in order to establish a backup path (in case of T1-T0 link
failure).
The T0 will accept all and only the LHC prefixes related to a specific Routed-T1 (i.e. the T1's own
LHC prefixes, plus the LHC prefixes of any T1 or T2 for which that T1 is willing to provide
transit).
The usage of static routes is generally discouraged.






No default route must be used in T1-T0 routing. In particular, every Tier will make sure that
suitable access to the DNS system is possible from any machine within the LHC prefix
ranges.
3.2.9 Lightpath (Lightpath-T1)
A Layer 2 connection will be established from the T0 router interface to the T1's computer
centre.
- The Lightpath-T1 will provide the CIDR block for the link, and will assign an address for the
T0 router's interface.
- A Lightpath-T1 will have to make sure that all its LHC-related machines are reachable in
this CIDR block (either directly or via proxy-arp).
- The T0 will redistribute the Lightpath-T1's LHC prefix into its IGP in order to provide
reachability.
- The T0 will not re-announce the Lightpath-T1's LHC prefix to the other Routed-T1s, because
the prefix does not belong to the T0's AS.
- Thus, the T0 will not be responsible for providing routing between a Lightpath-T1 and the
other T1s. Transit via the T0 can still be achieved if T1s configure adequate routing (e.g. static
routes) on their sides; however, this is not a recommended practice.
- Traffic between T2s with their own LHC prefixes and Lightpath-T1s cannot be exchanged
via the T0.
3.2.10 T1 to T1 transit
T1 to T1 connectivity is needed, and the bandwidth required may be as large as that of the T0-T1
data traffic. T1-T1 data traffic can flow via the T0 in order to save provisioning costs, but the
bandwidth requirement must then take this into account.
- The T0 will give transit to every T1 in order to reach another T1, if needed. Nevertheless,
direct T1-T1 traffic exchange is recommended, if possible.
[- T0-T1 traffic will be prioritized over T1-T1 traffic on direct T1-T0 links.]
3.2.11 Backup connectivity
Backup paths are necessary in case of failure of the primary links.
- The recommended solution is to have two paths which are physically distinct, using different
interfaces on the T0-T1 equipment.
- Backup connectivity can also be provided at Layer 3 via NRNs and research backbones
(like Geant2, ESnet, etc.) if the T1s and the T0 can announce their LHC prefixes to them in BGP
(in order not to disrupt the T0's production traffic, the T0-T1 backup traffic will be penalized
with respect to production traffic using policy queuing on the T0 side). Nevertheless, this
optional backup approach cannot guarantee sufficient performance and is not recommended
because of its potential impact on non-LHC production traffic.
- For the implementation to be reliable, it is required that Routed-T1s' LHC prefixes are
announced as originating from the same AS number both on the T1-T0 links and on the
backup paths.
- T1s must agree with their NRNs on how to provide this backup service: the required bandwidth
can disrupt normal production traffic.
- Every T1 is responsible for monitoring its backup implementation.






3.2.12 Security
It is important to address security concerns already in the design phase. The fundamental
remark behind the security set-up proposed below is that, because of the expected network
traffic, it is not possible to rely on firewalls. It is therefore assumed that the overall number of
systems exchanging LHC traffic is relatively low and that such systems can be trusted. It
would be desirable to restrict the number of applications allowed for LHC traffic.
While ACL-based network security is not sufficient to guarantee adequate protection for the
end-user applications, it can considerably reduce the risks involved with unrestricted internet
reachability, at relatively low cost.
The architecture will be kept as protected as possible from external access, while, at least in
the beginning, access from trusted sources (i.e. the LHC prefixes) will not be restricted.
- Incoming traffic from T1s will be filtered using ACLs on the T0's interfaces connected to
the T1s. Only packets with LHC prefixes in both the source and the destination will be allowed;
the default behaviour will be to discard packets.
- T1s are encouraged to apply equivalent ACLs on their side. Otherwise, outgoing filters at
the T0's level can be considered.
- At least initially, filtering will be at the IP level (permit IP or deny IP). Later, restrictions to
allow only some specific ports may be considered, in cooperation with the application managers.
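The filtering rule can be illustrated with the short Python sketch below; the prefix values are
invented examples, not the actual LHC prefixes.

# A packet is accepted only if both its source and its destination fall within
# registered LHC prefixes; everything else is discarded by default.
import ipaddress

LHC_PREFIXES = [ipaddress.ip_network(p) for p in (
    "128.142.128.0/17",    # example T0 LHC prefix (hypothetical)
    "131.225.204.0/22",    # example T1 LHC prefix (hypothetical)
)]


def in_lhc_prefixes(address: str) -> bool:
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in LHC_PREFIXES)


def accept(src: str, dst: str) -> bool:
    # Default behaviour is to discard.
    return in_lhc_prefixes(src) and in_lhc_prefixes(dst)


print(accept("128.142.200.10", "131.225.205.3"))   # True  - both within LHC prefixes
print(accept("128.142.200.10", "192.0.2.7"))       # False - destination not an LHC prefix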
3.2.13 Operations
An operational model is still under consideration; an initial plan will be presented at the
April 8, 2005 meeting.
3.2.14 Glossary
ACL                   Access Control List
ASN                   Autonomous System Number
BGP                   Border Gateway Protocol
CIDR                  Classless Inter-Domain Routing
Geant2                European overlay platform of the NRNs
IGP                   Interior Gateway Protocol
Lightpath-T1          T1 with a Layer 2 connection up to the T0
LHC Network           Network connecting the T0 and the T1s for the LHC experiments
LHC network traffic   Data exchanged among the data centres over the LHC network
MoU                   Memorandum of Understanding
NOC                   Network Operation Centre
NRN                   National Research Network: NREN (National Research and
                      Education Network, EU term) or R&E network (Research and
                      Education network, US term)
RIR                   Regional Internet Registry
Routed-T1             T1 with a routed (Layer 3) connection
T0                    Tier 0: the data source at CERN
T1                    Tier 1: a first-level data processing and storage site


3.2.15 Bandwidth requirements
         LHC-T0 T0-T1 T1-T0 T1-T1 T1-T2                                T2-T1       T0-T2       T2-T1
 ATLAS           3.5G          2.5G 750M
 ALICE           1-5G     1.6G 0.3G 0.1G


 CMS        2.5G
 LHCb 0.8G  3.6G                                       2G
This table is to be discussed at the April 8, 2005 meeting.


3.3     Tier-0 Architecture
Bernd Panzer, David Foster
There are several key points to be considered when designing the architecture of the system.
From experience (LEP, fixed-target experiments, CDF, D0, BaBar) we know that the crucial
period of an experiment is the first two years. Only when the data-taking period has started
will the final usage models, and consequently the final architecture, become clear. While the
computing models describe in some detail the data flow inside the experiment concerning raw
data, ESD and AOD data for analysis, this is actually only true once the whole system has
stabilized and the detectors are fully understood. There will, for example, be much more
random access to the raw data in the first few years than later, which of course heavily
affects the data flow performance. Thus we have to be prepared to adapt to possibly major
changes in 2007 and 2008.
On the other hand, it is important to have stability in the computing fabric during the first two
years, so that the physicists can concentrate on debugging and analysis: e.g. no change of
Linux version during that time, a stable network infrastructure, etc.
But as we have to be prepared for possibly major changes, a parallel and independent
test/R&D facility must be available. This must be integrated into the fabric so that the move
from test to production is smooth.


The following diagram shows a schematic view of the data flow in the T0 system.
More details can be found in a paper on the sizing and costing of the T0. From our current
estimates 2/3 of the costs and resources will be needed for the installation of the T0 center at
CERN.








All this requires a flexible and scalable architecture which can be evolved according to
changing requirements.


The general architecture is based on three functional units providing processing (CPU)
resources, disk storage and tape storage. Each of these units contains many independent nodes
which are connected at the physical layer by a hierarchical, tree-structured Ethernet network.
The applications gain access to the resources via software interfaces to three major software
packages which provide the logical connection of all nodes and functional units in the system:
- a batch system (LSF) to distribute and load-balance the CPU resources;
- a medium-sized distributed global shared file system (AFS) giving transparent access to a
variety of repositories (user space, programs, calibration, etc.);
- a disk pool manager emulating a distributed global shared file system for the bulk data, with
an associated large tape storage system (CASTOR).
In addition, at the lowest level, there is the node management system (ELFms).


The basic computing resource elements (CPU, disk and tape servers) are connected by
hardware components (network) and a small but sophisticated set of software components
(batch system, mass storage system, management system).
The following schematic picture shows the dependency between the different items.








The next picture shows the structure of the hierarchically organized Ethernet network
infrastructure. The heart of this setup is a set of highly redundant, high-throughput routers
connected by a mesh of multiple 10 Gbit/s connections. From the computing models and the
cost extrapolation for the years 2006-2010, the number of nodes (CPU, disk, tape, service) to be
connected to this system can be estimated at about 5000-8000.


The following picture shows the schematic layout of the new network:








The system provides full connectivity and bandwidth between any two nodes in the tree
structure, but not full bandwidth between any set of nodes. Today, for example, we have 96
batch nodes on Fast Ethernet (100 Mbit/s) connected to one Gigabit (1000 Mbit/s) uplink to
the backbone, i.e. a ratio of 10 to 1 for the CPU servers. The ratio is about 8 to 1 for the disk
servers. We have never seen a bottleneck in the network so far.

The expected ratios for 2008 are 9 to 1 for CPU servers and 3 to 1 for disk servers.


This is a proposed configuration based on experience and predictions (References!). It is
anticipated that this configuration offers the flexibility to adjust critical parameters such as the
bandwidth ratios as we learn more about the analysis models.


It is assumed in this model that the data are evenly distributed across the available disk space
and that the access patterns are randomly distributed. More investigations need to take place
to understand how locality of access and re-distribution of data could be achieved to maintain
overall performance. However, the hardware infrastructure does not impose any particular
model.
The following picture shows the aggregate network utilization of the Lxbatch cluster over the
last 10 months. The system is mainly used by the LHC experiments and the running fixed-target
experiments. The jobs are mainly reconstruction of real or Monte Carlo data, plus
quite a bit of analysis work on extracted data sets. Lxbatch has grown from about 1100
nodes at the beginning of the year to about 1400 nodes today, containing about 4 different
generations of CPU server. A very rough calculation using 600 high-end nodes and a peak
data rate of 300 MB/s gives an average speed per node of 0.5 MB/s. This number is very
low.

However, if we take some of the numbers given in the latest versions of the computing
models of the LHC experiments, one arrives at very similar numbers for 2008. Today
our dual-processor CPU servers have a total performance of ~2000 SI2000 and, with the
expected processor performance improvements (a factor of 4 in 3 years), we will have 8000
SI2000 per node in 2008, probably dual-CPU with 4-8 cores per CPU. Note that the performance
per core is not the same as the performance per traditional CPU. (Reference!)

Reconstruction of raw data, producing ESD and AOD:








                  Raw data event     CPU resource for one    IO value for an 8000 SI2000
                  size [MB]          event [SI2000.s]        CPU server [MB/s]
 ALICE pp              1                   5400                      1.5
 ALICE HI             12.5               675000                      0.1
 ATLAS                 1.6                15000                      0.9
 CMS                   1.5                25000                      0.5
 LHCb                  0.025               2400                      0.1

Analysis of AOD data:

                  AOD event           CPU resource for one    IO value for an 8000 SI2000
                  size [MB]           event [SI2000.s]        CPU server [MB/s]
 ALICE pp              0.05                 3000                      0.1
 ALICE HI              0.25               350000                      0.01
 ATLAS                 0.1                   500                      1.6
 CMS                   0.05                  250                      1.6
 LHCb                  0.025                 200                      1.0
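The IO values in both tables follow directly from the event size and the per-event CPU cost: on a
server of capacity 8000 SI2000 the time spent per event is (CPU resource per event)/8000
seconds, so that

    IO [MB/s] = event size [MB] × 8000 / CPU resource per event [SI2000.s]

For example, for ATLAS reconstruction this gives 1.6 × 8000 / 15000 ≈ 0.9 MB/s, in agreement
with the table.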

The expected IO performance numbers per CPU server node are actually very similar to what
we observe already today.


The CPU servers are connected at 1 Gbit/s and aggregated at 9:1; in other words, 90 servers
are connected to one 10 Gbit/s uplink. Each server can therefore talk at approximately
100 Mbit/s before saturating the uplink. The data rates above are more in the range of
10-15 Mbit/s, so there should be plenty of margin.

The disk servers are connected at 1 Gbit/s and aggregated at 3:1; in other words, 30 servers
are connected to one 10 Gbit/s uplink. Each disk server can therefore talk at approximately
300 Mbit/s before saturating the uplink.

With 4000 CPU servers and 1000 disk servers the ratio is, on average, 4:1, which would mean
an average load of 60 Mbit/s on any disk server capable of running at 300 Mbit/s.
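The aggregation arithmetic above can be checked with a few lines of Python (the numbers are
taken from the text; the script itself is only illustrative):

# Quick check of the uplink-sharing arithmetic described above.
UPLINK_MBPS = 10000            # one 10 Gbit/s uplink

cpu_servers_per_uplink = 90    # 9:1 aggregation of 1 Gbit/s CPU servers
disk_servers_per_uplink = 30   # 3:1 aggregation of 1 Gbit/s disk servers

print(UPLINK_MBPS / cpu_servers_per_uplink)    # ~111 Mbit/s available per CPU server
print(UPLINK_MBPS / disk_servers_per_uplink)   # ~333 Mbit/s available per disk server

cpu_servers, disk_servers = 4000, 1000
per_cpu_server_rate = 15                       # Mbit/s, upper end of the expected range
print(cpu_servers / disk_servers * per_cpu_server_rate)   # ~60 Mbit/s average load per disk server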


In the case of "hot spots" (i.e. many more than 4 CPU servers accessing the same disk server)
the CASTOR disk pool manager will replicate the data across more disk servers, but the existing
CPU servers will compete for access.
Efficient data layout, and strategies for submitting jobs that use the system efficiently given
these constraints, have yet to be studied in detail.






In the disk storage area we have both a physical and a logical connection architecture.
On the physical side we will follow a simple integration of NAS (disk server) boxes
into the hierarchical Ethernet network with single or multiple (probably 3 at most) Gigabit
interconnects.

The basic disk storage model for the large disk infrastructure (probably 2 PB in 2008) is
assumed to be NAS with up to 1000 disk servers and locally attached disks. This amount of
disk space is assumed to grow considerably between 2008 and 2012, whereas the number of
servers could decrease substantially.

However, the overall structure also permits the connection of different implementations of
disk storage, which may be used to provide different levels of caching if needed (e.g. as a front
end to the tape servers).


The following list shows some examples, starting with the simple NAS storage solution. We
are evaluating the other solutions to understand their benefits versus the simple NAS solution.

- simple Network Attached Storage boxes connected via Gigabit Ethernet (one or several) and
10 Gigabit Ethernet uplinks (one or two);

- high-end multi-processor servers (>= 4 CPUs) with large amounts of space per box,
connected to 10 Gigabit Ethernet switches (???);

- separation of the CPU part and the disk space itself: CPU servers with Fibre Channel
attached SATA disk arrays;

- small Storage Area Network setups linked with front-end nodes into the Gigabit network;

- a combination of the SAN setup with tape servers, exploiting the locality of disk storage to
tape drives.

On the logical level the requirement is that all disk storage systems (independent of their
detailed physical implementation) present file systems as the basic unit to higher-level
applications (e.g. the Mass Storage System).

The proposed architecture is essentially independent of the different tasks which are foreseen
for the CERN T0 fabric:
- storage and distribution of data;
- first-pass reconstruction of the raw data, producing ESD and AOD data;
- calibration, i.e. processing of raw data and special calibration data;
- processing of ESD data to derive AOD data;
- analysis of AOD data.


The only 'special' case would be the real-time requirements for detailed and frequent analysis
of 'ntuple' data by many users. While all the previously described setups have the goal of
optimizing the global overall throughput for many people, the real-time aspect of
histogram/ntuple analysis requires facilities with special characteristics that have not yet been
studied in detail.

However, the growth in capability of the notebook class of computer, including CPU
performance and disk space, may generate the requirement to load significant amounts of
data regularly onto such devices. This would imply a considerable increase in the campus
networking requirement and a corresponding increase in access capability to the disk servers.



3.4      Tier-1 Architecture(s), distributed Tier-1
Holger Marten, Bruce Gibbard, Farid Ould-Saada




Abstract
The basic functional elements of an LCG Tier-1 centre are: 1) an online storage service, 2) an
archival storage service, 3) a compute service, 4) a fabric infrastructure, and 5) a Grid
infrastructure/interface. Each of these elements consists of hardware and a stack of software
layers extending up to the experiment-specific infrastructure which supplies the experiment's
services and within which its applications run. The functional requirements of each of these
elements will be discussed and examples described. The importance of presenting the LCG
agreed standard interface to both production and individual users will be discussed. A
distributed Tier-1 centre must in aggregate contain all of the elements of a monolithic Tier-1
centre; the primary difference is that some elements of the fabric and Grid infrastructure
will have to be duplicated at multiple sites, and the infrastructure components tying elements
together will include some wide-area networking. Such a distributed Tier-1 centre must
present the same standard LCG interface to users, that is to say users should not be able to tell
that a distributed Tier-1 centre is actually distributed.
A Nordic Data Grid Facility (NDGF) is expected to be funded by the four Nordic countries in
2006. It will serve all sciences and will have a Director acting as manager and contact person.
The Nordic Tier-1 for ATLAS (and probably ALICE) will have a single contact point through
the NDGF management. This contact point will provide the 24/7 services required by the
LHC experiments from a Tier-1. The data transfer rate inside and between the Nordic countries
is already approaching 10 Gbit/s everywhere. The links from Copenhagen to Stockholm and
Oslo have been upgraded to 10 Gbit/s, which completes the NorduNet upgrade. The information
available in the Report on Nordic Regional Centres will be kept up to date.
On the basis of our very positive experience with operating NorduGrid as a distributed grid,
we plan that the compute and storage resources behind the Nordic Tier-1 single contact point
be organized as a distributed grid in the Nordic countries. This distributed grid will consist of
the order of 12 national computer centres which together will deliver the services needed. The
same national computer centres will serve as Tier-2s in their local regions. As NorduGrid is
already operating as a coherent production grid in ATLAS, currently providing a substantial
fraction of the capacity for ATLAS DC2, we are confident that we will be in a position to
demonstrate in due time that the specified Tier-1 services can be provided by a distributed
grid.


3.5     Tier-2 – simulation, function of end-user analysis
Kors Bos
Abstract
3.5.1 Tier-1 Services for Tier-2 Regional Centres
The LHC Computing MoU is currently being elaborated by a dedicated Task Force. This will
cover at least the services that Tier-0 (T0) and Tier-1 centres (T1) must provide to the LHC
experiments. At the same time, the services that T1s should provide to Tier-2 centres (T2)
should start to be identified and described. This note has been written by a small team
appointed by the LCG PEB with the objective of producing a description of the T1 services
required by T2 centres. The members of the team are:
Gonzalo Merino / PIC (convener)
Slava Ilyin / SINP MSU
Milos Lokajicek / FZU
Klaus-Peter Mickel / FZK
Mike Vetterli / Simon Fraser University and TRIUMF





The T2 requirements on T1 centres identified in this note have been mostly extracted from the
current versions of the computing models of the LHC experiments. These are still in active
development within each of the experiments. Therefore, these requirements will need to be
revised as the computing models evolve.
Experiments’ computing models plan for T1 and T2 centres:
Tier-1:
To keep certain portions of RAW, ESD, simulated ESD data and full copies of AOD and
TAG data, calibration data.
Data processing and further reprocessing passes.
Official physics group large scale data analysis (collaboration endorsed massive processing).
ALICE and LHCb – contribution to simulations.
Tier-2:
To keep certain portions of AOD and full copies of TAG for both real and simulated data
(LHCb – store only simulated data at T2s).
To keep small selected samples of ESD.
Produce simulated data.
General end-user analysis.


The T2 requirements on T1s identified in this document emerge from these roles and their
interplay. They have been categorized into five groups, each of which is described in one of
the following sections.
3.5.1.1 Storage Requirements
There is a wide variation in the size of T2 centres. Some will have a significant fraction of the
resources of a T1 centre, while others will simply be shared university computing facilities.
The role of the T2s even varies from experiment to experiment. This makes it somewhat
difficult to define a standard set of requirements for T2s. Nevertheless, the following
describes the services that T2s will require from T1s with regard to storage. These are listed
in no particular order of importance.
          1) Some analyses based on AODs will be done at the T2s. The T1s will therefore
              need to supply the AODs to the T2s. This should be done within 1-2 days for the
              initial mass distribution, but the timescale should be minutes for requests of
              single files in the case that the T2 centre does not have the AOD file required by
              the user. In the latter case, the missing AOD file could also be downloaded from
              another T2 center.
          2) During the analysis of AODs, it is possible that the T2 process will need to refer
              back to the ESDs. A subset of the ESDs will be stored at the T2s but it is likely
              that the particular data needed for analysis will be at the T1. Access to single
              ESD files at the T1s from the T2s should be on the timescale of minutes.
These first two points will require that access to the data files stored at the T1s be Grid-
enabled so that the process of location and retrieval of data will be transparent to the user.
          3) The T2s will need to store a subset of the raw data and the ESDs for algorithm and
              code development. They will get these files from the T1s.
          4) One of the identifiable roles of the T2s is Monte Carlo production. While T2
              centres are likely to have the CPU power necessary for this task, it is unlikely that
              sufficient storage will be available. The T1s should therefore be prepared to store
              the raw data, ESDs, and AODs from the Monte Carlo production. For ATLAS,
            this corresponds to 200 TBytes for the raw data, 50 TBytes for the ESDs, and 10
            TBytes for AODs per year. Since the ESDs will be replicated twice across all
            T1s and each T1 will store the full AOD, this leads to a total of 360 TB per year
            spread across all T1 centres for ATLAS Monte Carlo. This requirement will be
            even larger if multiple versions of the ESDs and AODs are produced each year.
            CMS plans to produce an equivalent amount of Monte Carlo data to real data so
            that CMS T2s will require as much storage at their corresponding T1s as for real
            data. The number for LHCb is 413 TB of Monte Carlo data per year, augmented
            by whatever replication factor is applicable for LHCb. The total storage for
            Monte Carlo data at ALICE is 750 TB/year, but this will be split equally between
            the T1 and T2 centers (with a small amount, 8%, at CERN).
The large file transfers of Monte Carlo data from the T2s to the T1 mass storage systems
(MSS) should be made as efficient as possible. This requires that, for example, the MSS
should have an SRM interface[1].
        5) The T2 centres will also need to get the calibration and slow controls databases
           from the T1s.
          6) ALICE: The computing model of ALICE is somewhat different from that of
             ATLAS and CMS. T1 and T2 centres play essentially the same role in the
             analysis of the data. The main difference between the two is that T1s have
             significant mass storage and will therefore be responsible for archiving the
             data. ESD and AOD analysis will be spread over all T1 and T2 centres, with
             2.5 copies of the ESDs and 3 copies of the AODs replicated over all T1 and
             T2 centres.
        7) The T2 centres will be heavily used for physics analyses based on AODs. The
           results of these analyses (e.g. ntuples) will need to be stored somewhere. Those
           T2s with mass storage can do this for themselves. However, many T2s, especially
           those in university computer centres, will have mass storage only for backup of
           user home areas, not for data or large results files such as ntuples. In these cases,
           it will be necessary for the T1s to store the results of user analyses on tape. This
           could amount to about 40 TB per year per experiment; the numbers in the current
           models for CMS and ATLAS are 40 TB and 36 TB respectively.
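The ATLAS storage total quoted in point 4 can be checked with a short calculation. The sketch below is illustrative only: the per-sample volumes are those given above, the number of ATLAS T1s is taken from Table 1, and the replication assumptions are the ones stated in the text.

    # Worked version of the ATLAS Monte Carlo storage arithmetic from point 4.
    # Inputs: per-year sample sizes quoted above; number of ATLAS T1 centres from Table 1.
    raw_tb = 200        # raw Monte Carlo data, stored once across the T1s
    esd_tb = 50         # ESDs, replicated twice across all T1s
    aod_tb = 10         # AODs, with a full copy kept at every T1
    n_atlas_t1 = 6      # ATLAS T1 centres (Table 1)

    total_tb = raw_tb + 2 * esd_tb + n_atlas_t1 * aod_tb
    print(total_tb)     # 360 TB per year spread across all T1 centres
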
3.5.1.2 Computing Power Requirements
The T2 centres will have no special requirements for the usage of T1 CPU resources. How this
CPU is used will primarily be a decision of the experiments on resource allocation for specific
tasks, and the Data Challenge results will influence the experiments' computing models
regarding the usage of T1 and T2 centres. ALICE keeps its model flexible with respect to the
load distribution between the centres. According to the current computing models, the T1
centres (excluding CERN resources) deliver 50% of the computing power for ATLAS and
CMS, 40% for ALICE and 30% for LHCb. The general assumption is that data should
preferentially be processed at the centres where they reside. A certain amount of T1 CPU
capacity will be needed for data transfers (both remote data access and file transfer) to and
from T2 centres. These transfers should not reduce the power of the T1 computing elements,
as they will be handled by the storage elements, which will have to be balanced with respect to
CPU power, disk space, data transfer requests and transfer rates. One exception would be a
user analysis task processing an AOD file that requires access to missing information located
in ESD/RAW data files. The requested file might be accessed remotely, transferred to the T2,
or a remote task could be initiated on the T1 to process the requested information. In the last,
probably rare, case T1 CPU cycles would be required for T2 analysis tasks, but no estimates
are available. An alternative solution is to process the tasks requiring ESD/RAW data at the
T1, possibly as part of the large-scale data analysis already mentioned.
The computing models can allow tasks normally processed at T2 centres, such as simulation or
physics analysis, to use free capacity at the T1 centres. For such usage the T1 centres should
provide:
Grid-enabled CPU cycles: the experiments' resource brokers must be able to send jobs to the
T1 resources and, from these jobs, access any Grid-enabled file;
possibly, advance reservation of CPU resources.
The computing models anticipate the task distribution between the T1s and T2s and thus the
usage of the available CPU power. No substantial hidden need of T1 CPU power for the T2s
was found once the computing models had covered the cases of T2 user analysis tasks needing
ESD information and of the usage of free T1 CPU cycles.
3.5.1.3 Network Requirements
The activities foreseen in the T2 centres are mainly Monte Carlo production and end-user data
analysis. Therefore, in order to estimate the network bandwidth needed between a given T2
and its reference T1 centre, the following data transfer categories have been considered:
Real data:
From T1 into T2: distribution of selected samples of RAW, ESD, AOD and TAG to T2s for
further analysis.
Monte Carlo:
From T2 into T1: copy of simulated data produced at the T2 into T1 for permanent storage
there.
From T1 into T2: copy from the T1 of the share of simulated data generated at other centres
that should be available at the T2 for analysis there.
The numbers assumed here are those in the experiment computing models presented in the
context of the group set up inside the LCG project to provide answers to questions posed by
the Computing MoU Task Force[2]. A summary of the data in those models that is relevant for
the T1 to T2 network services is presented in Table 1.
The bandwidth estimates have been computed assuming the data are transferred at a constant
rate during the whole year. They are therefore very rough estimates that should be considered
lower limits on the required bandwidth. To obtain more realistic numbers, the time pattern of
the transfers should be considered, but this is still very difficult to estimate today in a realistic
manner. Furthermore, it is also very difficult to estimate the efficiency with which a given
end-to-end network link can be used, given the number of factors that can affect it
(fault-tolerance capacity, routing efficiency along individual paths, etc.). In order to account
for all these effects, some safety factors have been included. The numbers have been scaled
up, first by a 50% factor to account for differences between “peak” and “sustained” data
transfers, and second by a 100% factor on the assumption that network links should never run
above 50% of their capacity. The former would account, for instance, for the use case
discussed in previous sections in which data replication from T1 to T2 is triggered by user
analysis running at the T2 requiring access to some AOD/ESD that is not available at the T2;
a substantial bandwidth should be “reserved” so that this replication can take place on a
timescale of minutes.
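As an illustration of the conversion just described, the sketch below turns an annual data volume into the average rates quoted in Table 1. The conventions of 1 TB = 10^12 bytes and roughly 3x10^7 seconds per year are assumptions of this sketch, chosen because they reproduce the table values; they are not figures taken from the computing models themselves.

    # Rough T1-T2 bandwidth estimate from an annual data volume, following the
    # recipe described above. Assumptions of this sketch: 1 TB = 1e12 bytes and
    # one year ~ 3e7 seconds (these reproduce the values in Table 1).
    SECONDS_PER_YEAR = 3.0e7      # approximate effective year length
    PEAK_FACTOR = 1.5             # "peak" versus "sustained" transfers (+50%)
    HEADROOM_FACTOR = 2.0         # links kept below 50% occupancy (+100%)

    def bandwidth_mbit_per_s(tb_per_year, with_safety_factors=True):
        """Convert an annual volume in TB/yr into an average rate in Mbit/s."""
        bits_per_year = tb_per_year * 1.0e12 * 8.0
        rate = bits_per_year / SECONDS_PER_YEAR / 1.0e6
        if with_safety_factors:
            rate *= PEAK_FACTOR * HEADROOM_FACTOR
        return rate

    # Example: CMS real data into a T2 (257 TB/yr in Table 1)
    print(round(bandwidth_mbit_per_s(257, with_safety_factors=False), 1))  # 68.5
    print(round(bandwidth_mbit_per_s(257), 1))  # 205.6, cf. 205.5 in Table 1 (rounding)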



                                    ALICE     ATLAS     CMS       LHCb
Parameters:
  Number of Tier-1s                 4         6         6         5
  Number of Tier-2s                 20        24        25        15
Real data “in-T2”:
  TB/yr                             120       124       257       0
  Mbit/sec (rough)                  31.9      32.9      68.5      0.0
  Mbit/sec (w. safety factors)      95.8      98.6      205.5     0.0
MC “out-T2”:
  TB/yr                             14        13        136       19
  Mbit/sec (rough)                  3.7       3.4       36.3      5.1
  Mbit/sec (w. safety factors)      11.2      10.2      108.9     15.3
MC “in-T2”:
  TB/yr                             28        18        0         0
  Mbit/sec (rough)                  7.5       4.9       0         0.0
  Mbit/sec (w. safety factors)      22.5      14.7      0.0       0.0
              Table 1 - Bandwidth estimation for the T1 to T2 network links.


The numbers that result from the computing models divide the experiments into two groups.
On one side there is LHCb, with the smallest bandwidth need, estimated to be of the order of
15 Mbit/sec. This is in part due to the fact that LHCb does not foresee replicating to T2s any
real data or Monte Carlo produced in other centres. On the other side, the estimated bandwidth
needs for T2 centres in ATLAS, CMS and ALICE are of the order of 100-200 Mbit/sec.
We want to stress that the uncertainty in the safety factors assumed in this note is very large at
this moment. For this reason, the numbers before and after applying these factors are both
quoted in the table. Specific tests addressing this issue should be performed during the
experiments' Data Challenges. For instance, some recent results from the current ALICE Data
Challenge indicate that the network bandwidth needed between T1 and T2 centres could be as
high as 100 MB/sec.
The T1 and T2 centres located in Europe will be computing facilities connected to the
National Research and Education Networks (NRENs), which are in turn interconnected
through GÉANT. Today, this infrastructure already provides connectivity at the Gbit/sec level
to most of the European T1 centres. By the time the LHC starts, this network
infrastructure should be providing this level of connectivity between T1 and T2 centres in
Europe with no major problems.
For some sites in America and Asia the situation might be different, since the trans-Atlantic
link will always be “thin” in terms of bandwidth as compared to the intra-continental
connectivity. T1 centres in these countries might need to foresee increasing their storage
capacity so that they can cache a larger share of the data, hence reducing their dependency on
the inter-continental link. T2 centres will in general depend on a T1 in the same continent, so
their interconnection should also be at the Gbit/sec level by the time the LHC starts, with no
major problems.
According to the above numbers, this should be enough to cope with the data movement of
ATLAS, CMS and LHCb T2 centres. On the other hand, those T2 centres supporting ALICE
will need access to substantially larger bandwidth connections, since the estimated
100 MB/sec would already fill most of a 1 Gbit/sec link.
It is worth noting as well that the impact of the network traffic with T2 centres will not be
negligible for T1s as compared to the traffic between the T1 and the T0. The latter was
recently estimated in a report from the LCG project to the MoU task force [3]. The numbers
presented in this note indicate that, for a given T1, the traffic with a T2 could amount to ~10%
of that with the T0. Taking into account the average number of T2 centres that will depend on
a given T1 for each experiment, the overall traffic with T2s associated with a given T1 could
reach about half of that with the T0. On the other hand, it should also be noted that the data
traffic from T1 into T2 quoted here represents an upper limit for the data volume that a T1 has
to deliver into a given T2, since most probably there will be T2-to-T2 replications that will
lower the load on the T1.
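The statement above can be cross-checked against the centre counts of Table 1. The sketch below is purely illustrative: the 10% per-T2 figure is the estimate quoted in this section, not a new measurement.

    # Back-of-the-envelope aggregate of T1-T2 traffic relative to T1-T0 traffic,
    # using the Tier-1/Tier-2 counts from Table 1 and the ~10% per-T2 estimate
    # quoted in the text.
    tier1s = {"ALICE": 4, "ATLAS": 6, "CMS": 6, "LHCb": 5}
    tier2s = {"ALICE": 20, "ATLAS": 24, "CMS": 25, "LHCb": 15}
    PER_T2_FRACTION = 0.10        # traffic with one T2 ~ 10% of the T0 traffic

    for exp in tier1s:
        t2_per_t1 = tier2s[exp] / tier1s[exp]    # average number of hosted T2s
        aggregate = t2_per_t1 * PER_T2_FRACTION
        print(f"{exp}: ~{t2_per_t1:.1f} T2s per T1 -> T2 traffic ~ {aggregate:.0%} of T0 traffic")
    # ALICE reaches ~50%, ATLAS and CMS ~40%, LHCb ~30% of the T0 traffic.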
3.5.1.4   Grid Services Requirements
The computing models of all four LHC experiments assume that T2 centres will operate as
GRID sites in one of the (multi-regional) infrastructures – EGEE in Europe [4], GRID3/OSG in
the USA (the current prototype being GRID3 [5]), <Asia-LCG>, etc. – following the
GRID-federated approach to the global structure of LCG. The main functions to be provided
by T2s (simulation and user analysis) identify them as resource centres, in EGEE terminology.
Core and operation services will therefore be provided for them by corresponding GRID
service centres, e.g. ROCs and CICs in EGEE (Regional Operations Centres and Core
Infrastructure Centres, respectively). In the following we use the term GRID Operations
Centre (GOC) as a generic name, e.g. for the ROCs and CICs in EGEE.
In many cases GOCs will be hosted at T1s. Note, however, that in some regions GOC
functions are distributed over several laboratories, most of which are T2s. In these cases a
single representative body should be defined for such a distributed ROC or CIC. The T2
centre is treated by the “MoU for Collaboration in the Deployment and Exploitation of the T1
centres of the LCG” document (still under preparation) as “a regional centre, the LCG-related
work of which is coordinated by a defined T1 centre”. In the following we refer to this defined
T1 as the hosting-T1, while a T2 centre under this coordination is referred to as a hosted-T2.
Moreover, in the basic case, the relations of a T2 with LCG as a whole will be regulated by a
special agreement with the hosting-T1. This status implies that the hosting-T1 should
coordinate the elaboration of the GRID service requirements of the hosted-T2. This is
reinforced by the requirements on the storage and CPU resources allocated at the hosting-T1
for the needs of the hosted-T2, because these resources should be operated as part of the
GRID.
As a result, when discussing the GRID services to be provided for T2s, one should consider
both the hierarchical structure of the LHC regional centres, T1-T2-T3-…, and the GRID
infrastructure (sometimes referred to as the GRID cloud).
A number of GOCs are planned to be created around the world (some have already started
operation). At CERN the EGEE Operations Management Centre will be created as part of the
EGEE infrastructure.
According to the EGEE plan there will then be nine ROCs in Europe, located in each of the
national or regional EGEE federations: at CCLRC-RAL (UK), at CC-IN2P3-Lyon (France), a
distributed ROC in Italy (INFN and some universities), a distributed ROC in Sweden (SweGrid
centres) and The Netherlands (FOM), a distributed ROC in Spain (IFAE and CSIC) and
Portugal (LIP), a distributed ROC in Germany (FZK and GSI), a distributed ROC in South
East Europe (in Greece, Israel and Cyprus), at CYFRONET (Poland), and a distributed ROC in
Russia. In addition, five CICs are under creation in Europe: at CERN, CCLRC-RAL (UK),
CC-IN2P3-Lyon (France), INFN-CNAF (Italy), and SINP MSU (Russia).
In the USA, Indiana University is currently operating as a GOC for GRID3, and a distributed
model is under discussion for the future GOC of OSG (Open Science Grid).
In Asia there is a plan to create a GOC, probably in Taiwan.
One should add that the LHC experiments could request GRID services from different T1s or
even from T2s.
The services to be provided by GOCs for resource centres, and thus for T2s, can be briefly
described by referring to the EGEE formulations for ROCs and CICs:
“the ROCs must assist resources in the transition to GRID participation, through the
deployment of GRID middleware and the development of procedures and capabilities to
operate those resources as part of the GRID. Once connected to the GRID, the organizations
will need further support from the ROCs to resolve operational problems as they arise.”
“Core infrastructure manages the day-to-day operation of the GRID, including the active
monitoring of the infrastructure and the resource centers, and takes appropriate action to
protect the GRID from the effect of failing components and to recover from operational
problems. The primary responsibility of EGEE CICs is to operate essential GRID services,
such as databases used for replica metadata catalogues and VO administration, resource
brokers (workload managers), information services, resource and usage monitoring.”
Thus, the following scenarios can occur for a T2 centre:
the hosting-T1 is one of the GOCs and provides all necessary services (examples are
CCLRC-RAL in the UK and CC-IN2P3-Lyon in France);
the GRID services are provided by GOCs hosted at different T1s, including the hosting-T1;
the hosting-T1 has no function to provide core or operation GRID services for some
experiments;
some of the GRID services are provided by a GOC team at the T2 centre itself;
some of the GRID services are provided by GOC teams at sibling T2s (T2s hosted by the
same T1).

The possible scenarios described above lead to the following requirements on the hosting-T1:
The hosting-T1, together with the hosted T2, should define the map of GOCs that will provide
the necessary GRID services for this T2.
The hosting-T1 should participate in preparing the corresponding agreements with the defined
GOCs, based on current SLAs, if necessary including the specifics of the particular T2 and T1.
The hosting-T1 should also help the hosted-T2 to update these agreements when this is
required by new GRID releases.
Finally, we give the list of basic GRID services to be provided for a T2 centre by the defined
GOCs:
Distribution, installation and maintenance of GRID middleware and other special system
software and tools, including validation and certification of the installed middleware;
Monitoring and publication of the GRID-enabled resources (disk/tape, CPU and networking)
at the hosted T2, including resources allocated at the hosting-T1 for the hosted-T2 needs;
Accounting of GRID-enabled resource usage;
Monitoring of the performance of GRID services for the hosted T2, to ensure the agreed QoS;
Provision of core GRID services, such as databases for replica metadata catalogues, resource
brokers, information services, support of the experiment VO services, basic GRID security
services, etc.;
Support of GRID-specialized networking solutions for effective point-to-point (or few-to-few)
connectivity.
3.5.1.5 Support Requirements
As the LCG environment develops, it is recognized that a variety of support functions,
analogous to those found in computer centre helpdesks, software support organizations and
application development services, will be needed to support the globally distributed users.
The main support functions include well-defined Help Desk Processes, User Information
and Tools, Service Level Agreements, User Accounts and Allocation Procedures,
Education, Training and Documentation, Support Staff Information and Tools, and
Measuring Success. This section defines which of these support functions should be
provided by the T1 centres and which by the Global Grid User Support team(s) (GUS).
Help Desk Processes (GUS): As mentioned earlier, every T2 should be hosted by a T1. The
hosting T1 has to support the operations, system administrators and users of its hosted T2s,
because most T2s are quite small and cannot provide all these services themselves. However,
most support functions should be provided mainly by the Grid User Support centres (GUS)
and also by the Grid Operations Centres (GOC), of which there will be three of each
distributed around the globe, taking on duties for a larger group or even for all T1s and T2s.
The GUSs will provide a single point of contact for all LCG users, via web, mail or phone.
The task of the GUSs will be to collect the various user problems and to respond to them as a
globally organized user help desk, using a centralized ticketing system. Normally the end
users will first call the experiment user support (ESUS), the experiments being responsible for
providing help desk facilities as a first-level user support service for their collaborators. The
ESUS staff will filter out the experiment-specific problems and forward the remaining
problems to the GUS staff. At the GUSs all problems will be recorded in a problem database,
together with the corresponding solutions; in this way the problem database becomes a
knowledge database containing all known problems. This knowledge database should be
accessible not only to T1 and T2 staff but also to all end users. Concerning experiment
software installation and support, it is the responsibility of the experiments to ensure that they
have sufficient support to cover the T0, T1, and T2 centres.
It is agreed that experiment software installation, and its validation, is a responsibility of the
experiment, not of the T1. The T1 could act as a “contact point” or “link” between the
experiment and the lower Tiers/RCs, so that the experiments do not have to deal with hundreds
of sites.
Following the report of the GDB working group concerning GUS, the globally distributed
GUS will provide support for the users on a 24/7 basis; it is therefore not necessary for the T1s
also to provide around-the-clock availability of their specialists [6].
User Information and Tools: This is the provision of important information resources and
tools to enable the use of the LCG environment, ranging from basic online documentation,
through information about the current status of resources in the LCG environment and of the
LCG infrastructure itself, to debugging and performance-analysis tools. It also includes the
methods of delivery of these information resources and tools. This service should be provided
by the GUSs.
Service Level Agreements: It is important for the LCG, as provider of the grid environment,
to set appropriately the shared expectations of the users of this environment and of those
providing support. A clear statement that accurately delineates these expectations, for both the
users and the support operations in a grid computing environment, is therefore critical and
should be elaborated.
User Accounts and Allocation Procedures: All LCG users need to obtain some type of
account and some form of authorization to use specific resources within the LCG
environment. Accounts should be given by the T1s for their own users and also for the users
of their hosted T2s. The rules for establishing accounts are elaborated by a specific GDB
working group.[7]
Education, Training, and Documentation: The LCG users need to be educated and trained
in its use. Ideally, if a user is trained how to work in the LCG environment, the user will not
have to learn the individual nuances of using all of the various resources within the LCG. In
practice, this goal may be difficult to achieve, so instruction on some “local” issues for
resources on the LCG will likely still be needed. Nonetheless, what is new to the majority of
users is the distributed grid environment and, just as documentation of this is needed, training
is required to develop a user community fluent in the use of the environment. This includes
both on-line and in-person training activities. Nevertheless, basic training and tutorials should
normally be provided in a centralized manner by CERN. Following this, the T1s should
provide support and documentation for the deployment and maintenance of the grid software
packages for “their” T2 people.
Support Staff Information and Tools: The support staff must have at their disposal a
number of “tools of the trade” and information resources to provide support effectively to the
user community. This includes such things as the GUS knowledge base to draw upon,
information about the status and scheduling of resources and grid services, tools to assist in
the diagnosis of reported problems, and appropriate levels of access to resources to operate
effectively. It is not yet entirely clear which part of this work has to be done by the GUSs and
which by the GOCs; this service should be provided in a shared manner by the GUSs and the
T1s.
Measuring Success: The support groups at the Grid User Support and in the T1s need some
way to determine the success or failure of problem solving and support methods. This is
seldom an easy task because it is largely subjective. While qualitative information is a more
useful indicator of the success of the support organization, it is more difficult to obtain;
frequently it can be gathered from various forms of user feedback. One possible approach is
to collect quantitative metrics, which are fairly easy to gather. Effective measures must be in
place to advance the support functions, and it seems necessary to seek ever more effective and
accurate indicators of the performance of the GUS and GOC support groups.




[1] http://sdm.lbl.gov/srm-wg/
[2] http://lcg.web.cern.ch/LCG/peb/mou
[3] http://lcg.web.cern.ch/LCG/PEB/MoU/Report_to_the_MoU_Task_Force.doc
[4] http://www.eu-egee.org/
[5] http://www.ivdgl.org/grid2003/
[6] http://lcg.web.cern.ch/LCG/PEB/gdb/WG5/WG5-Report-V2.0.doc
[7] http://agenda.cern.ch/askArchive.php?base=agenda&categ=a04113&id=a04113s1t1/document


3.6     Security
Dave Kelsey
Abstract


This section will start with a statement of the importance of security both in terms of enabling
the Grid to operate in today's hostile internet, thereby meeting the scientific aims of LCG, and
in aiming to limit and contain the effects of security incidents when they happen (as they
surely will). The setting of priorities should be informed by ongoing risk analysis and risk
management. The project needs to strive for the most appropriate balance between
functionality and security.


This section will contain a brief description of the technical security model. Authentication is
based on X.509 certificates (single sign-on) available from any of the accredited Certification
Authorities (accredited by EUGridPMA, TAGPMA and AP PMA). Authorization takes the
form of role-based access control via membership of a VO, with authorization attributes
(groups and roles) managed by VOMS.
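
As an illustration of the role-based model (and not of the actual VOMS or site implementation), the sketch below shows how a service might map the VOMS group/role attributes (FQANs) carried in a user proxy onto local capabilities. The attribute strings and the policy table are hypothetical examples.

    # Illustrative only: maps VOMS FQANs (group/role attribute strings) to local
    # capabilities. The FQANs and the policy table are invented for this sketch;
    # real sites derive the mapping from their own configuration.
    POLICY = {
        "/atlas/Role=production": {"write_to_mss", "run_production_jobs"},
        "/atlas":                 {"read_data", "run_analysis_jobs"},
    }

    def allowed_actions(fqans):
        """Return the union of the capabilities granted by the presented FQANs."""
        granted = set()
        for fqan in fqans:
            granted |= POLICY.get(fqan, set())
        return granted

    # A user whose proxy carries both plain VO membership and a production role:
    print(allowed_actions(["/atlas", "/atlas/Role=production"]))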


The section will also describe the work of the Joint (LCG/EGEE) Security Policy Group
(JSPG) on security policy and procedures, an important requirement for an operational Grid.
It will briefly describe the currently agreed set of documents and the process by which these
are approved by LCG (LCG GDB, EGEE ROC managers and approval by project
management).


Full details of Operational Security are given later in Chapter 4 (GRID MANAGEMENT).
Note here, however, that it will be essential to monitor grid operations carefully so as to
identify intrusions in a timely manner.
Speedy forensic analysis will be required to understand the impact of an incident quickly, and
procedures will be needed to apply fixes and regain control as quickly as possible. Special
attention needs to be paid to the Tier-0, the Tier-1s and their network connections, to maintain
these essential services during or after an attack and so reduce the effect on LHC data taking.


Mention should also be made of work (EGEE, GridPP) to identify and manage security
vulnerabilities both in terms of grid middleware and deployment problems.



4     GRID MANAGEMENT



4.1     Network
David Foster






4.2     Operations & Centre SLAs (Merged)
Ian Bird
Abstract
This section will describe the operations infrastructures put in place to support the grid. This
must include EGEE in Europe and Open Science Grid in the US, and must propose models
for operations support in the Asia-Pacific region and in the Nordic countries (part of EGEE in
principle). The need for a hierarchical support organisation and the proposed interactions
between the different grid infrastructures’ support organisations will be described.
The current grid projects have shown the need for SLAs (site charter, service level definitions,
etc) to be drawn up - covering basic things such as levels of support, resources provided,
support response times, etc. This must be done in coordination with the MoU - although here
we see the need for these basic SLAs with all sites, while the MoU covers mainly Tier 1 sites.
The other aspect of operations that will be covered is user support. There are two different
activities: user support, with helpdesk/call-centre functionality, and VO support - teams of
grid experts working directly with the experiments to support their integration into the grid
environment.
4.2.1 Security Operations – Draft (Dave)


This section will describe the infrastructure, currently under discussion, to address the need
for distributed Security Operations. In Europe, this is likely to be based on a hierarchical
structure built around the EGEE ROCs Operational Security Coordination Team.


The section will also give an overview of the policy and procedures on Security Incident
Response, it being essential for LCG to respond quickly to any security incident. Speedy
analysis of any incident will be required, and the efficient deployment of urgent security
patches or configuration fixes will need to be achieved.




4.3     User Registration and VO management - Security (abstract, Dave)
This section will describe the agreed model for LCG user registration and VO management.
This is based on a registration interface and database (VOMRS, developed by FNAL) which
is linked to the CERN HR/experiment databases. Users will need to register with their
experiment at CERN in the usual way before being able to register in the experiment Virtual
Organisation. Registering with the VO will require the user to acknowledge acceptance of the
AUP and the VO membership policy. Once the request to join has been accepted by the
appropriate VO manager as coming from a bona fide Grid user, the user will be added to the
VOMS/authorization database, thereby granting them access to resources.
Site managers will have read access to the user registration database for purposes of audit
tracking. The requirements for regular renewal and the processes for removing expired users
will be described.
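
The registration flow described above can be summarized schematically as follows. This is a sketch using plain data structures; all names are invented for illustration, the real system being VOMRS linked to the CERN HR database and the experiment VOMS servers.

    # Schematic of the registration flow described above (illustrative names only).
    hr_database = {"ATLAS": {"jdoe"}}   # users registered with their experiment at CERN
    voms_database = {}                  # VO membership with a validity period (days)
    pending_requests = []

    def request_vo_membership(user, vo, accepted_aup):
        # Users must already be known to the experiment (CERN HR database)
        # and must accept the AUP and VO membership policy.
        if user not in hr_database.get(vo.upper(), set()):
            raise PermissionError("register with the experiment at CERN first")
        if not accepted_aup:
            raise PermissionError("the AUP and VO membership policy must be accepted")
        pending_requests.append((user, vo))

    def approve_request(user, vo, bona_fide=True, validity_days=365):
        # The VO manager accepts bona fide requests; the user is then added to the
        # VOMS/authorization database, which grants access to resources. Entries
        # expire and must be renewed regularly.
        if bona_fide and (user, vo) in pending_requests:
            pending_requests.remove((user, vo))
            voms_database[(user, vo)] = validity_days

    request_vo_membership("jdoe", "atlas", accepted_aup=True)
    approve_request("jdoe", "atlas")
    print(voms_database)                # {('jdoe', 'atlas'): 365}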




5     SOFTWARE ASPECTS
Jamie Shiers






5.1     Operating systems




5.2     Middleware, interoperability & standards
Frédéric Hemmer, Farid Ould-Saada, + US
Abstract (Farid Ould-Saada)
I will mainly describe the NorduGrid middleware (or Advanced Resource Connector, ARC).
ARC is an open source software solution distributed under the GPL license, enabling
production-quality computational and data Grids. Since the first release (May 2002) the
middleware has been deployed and used in production environments in high energy physics
and other fields. Emphasis is put on scalability, stability, reliability and performance of the
middleware. A growing number of grid projects, like Swegrid, DCGC, NDGF, NorGrid and
others, are running on the ARC middleware.
ARC provides a reliable implementation of the fundamental grid services, such as information
services, resource discovery and monitoring, job submission and management, brokering and
data management and resource management. Most of these services are provided through the
security layer of the GSI. The middleware builds upon standard open source solutions like the
OpenLDAP, OpenSSL, SASL and Globus Toolkit 2 (GT2) libraries. NorduGrid provides
innovative solutions essential for a production quality middleware: the Grid Manager,
gridftpd (the ARC/NorduGrid GridFTP server), the information model and providers
(NorduGrid schema), User Interface and broker (a “personal” broker integrated into the user
interface), extended Resource Specification Language (xRSL), and the monitoring system.

Interoperability issues are being addressed within and outside LCG. Together with Frédéric
and others, we will summarise the current status and plans.
The coming “Compute Resource Management Interfaces” workshop, to be held in Rome on
February 17-18 (http://www.pd.infn.it/grid/crm/), will be the first technical-level workshop
where interoperability of the major grid middlewares is discussed. There will also be a
Glue-schema-dedicated meeting at RAL (February 25) and a Data Management technical
meeting at CERN (end of February/beginning of March) initiated by NorduGrid.



5.3     NorduGrid
The NorduGrid middleware (or Advanced Resource Connector, ARC) is an open source
software solution distributed under the GPL license, enabling production-quality
computational and data Grids. Since the first release (May 2002) the middleware has been
deployed and used in production environments, such as the ATLAS data challenges. Emphasis
is put on scalability, stability, reliability and performance of the middleware. A growing
number of grid projects, like Swegrid, DCGC, NDGF and others, are running on the ARC
middleware.

5.3.1.1 Middleware description

ARC provides a reliable implementation of the fundamental grid services, such as information
services, resource discovery and monitoring, job submission and management, brokering and
data management and resource management. Most of these services are provided through the
security layer of the GSI. The middleware builds upon standard open source solutions like the
OpenLDAP, OpenSSL, SASL and Globus Toolkit 2 (GT2) libraries. All the external sofware
is provided in the download area. ARC will soon be built against GT4. NorduGrid provides
innovative solutions essential for a production quality middleware: the Grid Manager,
gridftpd (the ARC/NorduGrid GridFTP server), the information model and providers
(NorduGrid schema), User Interface and broker (a "personal" broker integrated
into the user interface), extended Resource Specification Language (xRSL), and the
monitoring system.
The listed solutions are used as replacements for and extensions of the original GT2 services.
ARC does not use most of the GT2 services, such as GRAM, the job submission commands,
the WUftp-based gridftp server, the gatekeeper, the gram job-manager scripts, and the MDS
information providers and schemas. Moreover, ARC extended the RSL and made the Globus
MDS functional. ARC is thus much more than GT2: it offers its own services built upon the
GT2 libraries.
The NorduGrid middleware integrates computing resources (commodity computing clusters
managed by a batch system or standalone workstations) and Storage Elements, making them
available via a secure common grid layer.
The main ARC components are:
Grid services running on the resources: the Grid Manager, gridftpd and the information
services. Grid jobs are submitted to the cluster through gridftpd and a separate session
directory is created for each job. The grid session directories are made available through the
gridftpd during and after job execution. The Grid Manager is a service running on a resource
taking care of jobs, session directories and the cache area. Information services are
implemented as efficient scripts populating the NorduGrid information database stored in the
Globus-modified OpenLDAP backends.






Indexing services for the resources and data: A special simplified usage of the GT2 GIIS
OpenLDAP backend allows a hierarchical mesh of grid-connected sites to be built. Both the
GT2 Replica Catalog and the GT2 RLS service can be used as metadata catalogues by the
ARC middleware. ARC client tools and the Grid Manager daemon are capable of interfacing
to these catalogues.
Clients making intelligent use of the distributed information and data available on the grid.
ARC comes with a light-weight client, the User Interface. The ARC User Interface is a set of
command line tools to submit, monitor and manage jobs on the grid, move data around and
query resource information. The User Interface comes with a built-in broker, which is able to
select the best matching resource for a job. The grid job requirements are expressed in xRSL.
Another special client is the Grid Monitor, which uses any Web browser as an agent to
periodically query the distributed information system and present the results as a set of inter-
linked Web pages.
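To make the job-description step concrete, the sketch below assembles a minimal xRSL request in Python and writes it to a file for submission with the User Interface tools. The attribute names are meant to follow the xRSL conventions referred to above, but the exact attribute set and the submission command should be taken from the NorduGrid/ARC documentation; this is an indicative sketch only.

    # Minimal, illustrative xRSL job description assembled as a Python string.
    # The attributes shown (executable, arguments, inputFiles, stdout, stderr,
    # cpuTime, jobName) are indicative; consult the xRSL specification for the
    # authoritative list and syntax.
    job_description = """&
     (executable="run_analysis.sh")
     (arguments="events.root")
     (inputFiles=("events.root" "gsiftp://storage.example.org/data/events.root"))
     (stdout="analysis.out")
     (stderr="analysis.err")
     (cpuTime="120")
     (jobName="aod-analysis-example")"""

    with open("job.xrsl", "w") as f:
        f.write(job_description)
    # The resulting job.xrsl would then be submitted with the ARC User Interface
    # command-line tools, whose built-in broker selects a matching resource.
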
Components still under development include:
The Smart Storage Element (SSE) is a replacement of the current ARC gridftpd-based simple
storage element. SSE was designed to try to overcome problems related to the previous SE by
combining the most desirable features into one service. SSE will provide flexible access
control, data integrity between resources and support for autonomous and reliable data
replication. It uses HTTPS/G for secure data transfer, Web Services (WS) for control (through
the same HTTPS/G channel) and can provide information to indexing services used in
middlewares based on the Globus Toolkit. At the moment, those include the Replica Catalog
and the Replica Location Service. The modular internal design of the SSE and the
power of C++ object oriented programming allows one to add support for other indexing
services in an easy way. There are plans to complement the SSE with a Smart Indexing
Service capable of resolving inconsistencies hence creating a robust distributed data
management system.
Logging service: A Logger service is one of the Web services implemented by NorduGrid,
based on gSOAP and the Globus IO API. It provides a frontend to the underlying MySQL
database to store and retrieve information about the usage of computing resources (jobs).
This database can be accessed through a graphical Web interface implemented using PHP4,
JavaScript and JPGraph, based on the GD library. The main goals of the Logger are to
provide: (i) information about the development and usage of NorduGrid over time and (ii)
statistics for different clusters, time intervals, applications and users.
ARC is designed to be a scalable, non-intrusive and portable solution. The development is
user- and application-driven, with the main requirements being performance, stability,
usability and portability. As a result of this approach, the standalone client is available for a
dozen platforms and can be installed in a few minutes. The server installation does not
require a full site reconfiguration. The middleware can be built on any platform where the
external software packages (like the GT2, or soon GT4, libraries) are available. ARC has been
deployed at a number of computing resources around the world. These resources run various
Linux distributions and use several different local resource management systems (LRMS).
Although various flavours of PBS are most common, there are sites running SGE, Easy and
Condor as well. Using different LRMS-specific information providers, the different sites can
present information about their available resources in a uniform way in ARC's information
system.








5.4     Grid Standards and Interoperability

5.4.1 Overview

During the past years, numerous Grid and Grid-like middleware products have emerged, to
list some: Unicore, ARC, EDG/LCG/gLite, Globus, Condor, SRB. They are capable of
providing (some of) the fundamental Grid services, such as Grid job submission and
management, Grid data management and Grid information services. The emergence and
broad deployment of the different middlewares brought up the problem of interoperability.
Unfortunately, the Grid community has so far not met the expectation of delivering widely
accepted, usable and implemented standards. Nevertheless, some promising developments
have started recently.
We believe in the coexistence of interoperable Grid middlewares and in the diversity of Grid
solutions. We do not think that a single Grid middleware is the solution, nor do we think that
it is achievable. We would like to see well-defined, broadly accepted open interfaces for the
various Grid middleware components. Interoperability should be achieved by establishing
these interfaces based upon community standards. Interoperability is understood at the service
level, at the level of the fundamental Grid services and their interfaces.





The Rome CRM initiative, "Compute Resource Management Interfaces", was the first
technical-level workshop where interoperability of the major grid middlewares has ever been
discussed. It was followed by the Glue-schema-dedicated meeting at RAL on February 25.



5.4.2 ARC and interoperability

NorduGrid intends to play an active role in several standardization processes and is willing to
invest effort in the implementation and support of emerging standards. It contributes to the
CRM initiative, wants to contribute to the Glue-2.0 re-design, follows the GGF developments,
and cooperates with the major middleware development projects.
An interoperability snapshot of the NorduGrid/ARC middleware is presented below,
organized by middleware components.

5.4.2.1 Security system

The security infrastructure of ARC fully complies with and relies on the Grid Security
Infrastructure (GSI). GSI is a de facto community standard. Authorization within the different
components currently uses the GACL framework and there are plans to support XACML
systems too.

5.4.2.2 Job Description

Currently ARC uses the extended Resource Specification Language (xRSL) for describing
Grid job requests. The NorduGrid team, as a partner in the Rome CRM initiative, has agreed
to compare xRSL to the JSDL being developed within the GGF and to move gradually towards
the Global Grid Forum-backed JSDL.

5.4.2.3 Data Management

ARC data management components support and are compatible with the most common
solutions, such as the GridFTP protocol and storage based on traditional FTP and HTTP
servers. ARC is also capable of interfacing to the most commonly accepted open data indexing
catalogues, such as the Globus Replica Catalogue and the Globus RLS. Work has also been
launched to interface to the EGEE/gLite Fireman catalogue. SRB systems are not supported
due to the restrictive license. ARC data management solutions will be compatible with the
SRM standards.

5.4.2.4 Information Services

A community-accepted information model and representation of Grid resources and entities is
a cornerstone of interoperability. The major middlewares make use of different, incompatible
information models. ARC implements and relies on its own model, while other large
deployments use variants of the Glue model. The GGF is drafting a CIM-based model, which
unfortunately seems to lack community support and acceptance. The current Glue model
(version 1.2) was created by a small group and is known to be rather limited in some areas. A
major re-design of Glue is expected to start in the third quarter of 2005, and the NorduGrid
Collaboration intends to be an active and significant player in that process.

5.4.2.5 Job submission interface

There is no standard job submission interface commonly accepted by the Grid community. In
order to make progress in this area, the Rome CRM initiative was launched in February this
year. The NorduGrid Collaboration is committed to accepting and implementing the results of
this working group. Current Grid systems use very different solutions for job submission:
some of them rely on a particular GRAM implementation from Globus, others make use of
Condor functionality, or have their own proprietary protocol. The current NorduGrid/ARC
implements job submission via a GridFTP channel. It is foreseen that a standard job
submission service will be implemented in a WS-RF framework, and NorduGrid/ARC plans to
redesign and reimplement its job submission system making use of WS-RF.

5.4.2.6 Usage statistics & accounting

ARC collects usage information via the experimental ARC logger service. Each Grid job run
in the ARC system is described by a Usage Record. The current ARC Usage Record is rather
preliminary, and a radical re-design is planned. NorduGrid plans to use an extension of the
GGF usage record, which is unfortunately rather limited in its current form.



5.5     Common applications and tools
Pere Mato
Abstract
Common physics applications software is developed in the context of the Applications Area
(AA) of the LCG. This area is responsible for the development and maintenance of that part
of the physics applications software and associated infrastructure that is shared among the
LHC experiments. The scope includes common applications software infrastructure,
frameworks, libraries, and tools; common applications such as simulation and analysis
toolkits; grid interfaces to the experiments; and assisting the integration and adaptation of
physics applications software in the grid environment. AA subprojects include software
process and infrastructure (SPI), persistency framework (POOL and conditions database),
core libraries and services (SEAL), physicist interface (PI), and simulation. The experiments
and the AA also make extensive use of the ROOT data analysis framework. A brief overview
of these areas follows.
5.5.1 Persistency Framework (POOL and Conditions Database)
The purpose of POOL is to develop the persistency framework for physics applications at
LHC, including persistency for arbitrary transient C++ objects, transparent navigation to
single objects integrated with a grid aware file catalog as well as object collections and
iterators including associated metadata. The Conditions Database project is developing a
common framework for the storage, access and management of time-varying experimental
data such as detector calibrations, utilizing relational databases through a vendor-neutral
interface.
5.5.2 SEAL and PI
The Common Libraries and Components (SEAL) project provides the software
infrastructure, basic frameworks, libraries and tools that are common among the LHC
experiments. The project addresses the selection, integration, development and support of
foundation and utility class libraries. Components include a C++ component model
infrastructure, a plug-in manager, C++ dictionary services, math libraries and Python
scripting services. The Physicist Interface (PI) subproject has developed tools for interactive
analysis which extend and provide a ROOT implementation for the AIDA analysis tool
interfaces. SEAL integrates these tools and provides additional support for interactive analysis
in a scripting environment.





Simulation
This subproject encompasses common work on modelling detector systems and simulating
the propagation and physics interactions of particles passing through them. This project
includes work on the development of generic simulation infrastructure supporting the Geant4
and FLUKA simulation engines, CERN and LHC participation in Geant4, integration of
FLUKA into the generic infrastructure, physics validation of the simulation, and Monte Carlo
physics generator services.
5.5.3 Software Process and Infrastructure (SPI)
This subproject provides services supporting the development of all LCG software. The
services are also used by the LHC experiments, the EGEE project and external projects such
as Castor. Key services are the Savannah software development web portal; software librarian
services supporting build, release and version management; the external software service
supporting distribution of software tools and libraries; QA/testing services; and
documentation tools.
5.5.4 ROOT
ROOT is an object-oriented data analysis framework used by all the LHC experiments and is
widely used in the HEP community. It includes facilities for statistical analysis and
visualization of data, storage of complex C++ object data structures, and distributed analysis.
While not part of the applications area, AA software makes use of ROOT in several areas and
the AA participates in the development and support of ROOT components.
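
As a minimal illustration of this functionality, the sketch below uses ROOT through its Python bindings (PyROOT); it assumes a local ROOT installation and is not tied to any experiment framework.

    # Minimal PyROOT sketch: fill a histogram with toy data, inspect it, and
    # persist it to a ROOT file (illustrating the statistical-analysis and
    # object-storage facilities mentioned above). Requires a ROOT installation.
    import ROOT

    f = ROOT.TFile("example.root", "RECREATE")   # output file for the objects
    h = ROOT.TH1F("mass", "Invariant mass;m [GeV];entries", 100, 0.0, 10.0)
    h.FillRandom("gaus", 10000)                  # fill with toy (Gaussian) data
    print("mean =", h.GetMean(), " rms =", h.GetRMS())
    h.Write()                                    # store the histogram in the file
    f.Close()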



5.6     Analysis support
Massimo Lamanna et al.
Analysis numbers
      From the different reviews
      From running experiments (Babar, FNAL, ...)
What is new with the Grid (and what is not)
      Batch-oriented analysis
            Streamlining of the calibration procedures
            Publishing/comparison/validation of calibration results
      Interactive analysis
Will the LHC analyses be “just” the same as the LEP ones (just more
data, more users…)?
      Access to official well-controlled resources
      Access to available resources (stability, OS version control,
heterogeneity...)



Analysis numbers (from the different reviews?/running expt)


*** Still missing (not all numbers + not made up my mind on how to present them)


What is new with the Grid (and what is not)






The LCG GAG group has already provided an extensive discussion of the definition of
analysis on the grid. This activity has been summarized in the HEPCAL2 document [1].
The GAG distinguishes analysis from batch production by taking into account not only the
response time (the total latency to see the results of the action triggered by the user) but also
the influence the user retains over the submitted command.


As pointed out in the HEPCAL2 document, there are several scenarios relevant for analysis:
Analysis with a fast response time and a high level of user influence;
Analysis with an intermediate response time and influence;
Analysis with long response times and a low level of user influence.


The first scenario is important for interactive access to event data, for event displays and other
“debugging” tools. In these cases the user can effectively interact with the system due to the
fact that the size of the relevant data is minimal and all the computation can be performed
locally (as in the case of object rendering and related manipulation).


The last scenario is the well-known batch system model. Note that in this case the response
time is given by three terms: the initial submission latency (issuing single “submit”
commands to fill the batch facility with the required number of tasks), the queuing time and
the actual job execution. The initial latency should not play an important role, provided that it
does not dominate the total time (i.e. submission time << actual execution time), to allow an
effective use of the resources.


As discussed in [1], the most interesting scenarios lie in the transition area between these
extremes. One can assume this will cover most analysis activity.


In the context of HEPCAL2, the GAG considered resource consumption as the result of a
(large) set of users independently accessing the data. This is clearly only a first starting point
for these considerations.


In our considerations we should also take into account the typical organization of an HEP
experiment, with physics analysis teams or analysis groups (a “working group”). A
preliminary list of issues relevant for such working groups follows:
What is a “working group”? A short-lived, highly dynamic VO? A set of users having a
special role (like role==”Searches” or role==”ECAL-calibration”)?
Can a “working group” be managed by advanced users, or should the grid-operations group
always be involved?
How are data shared within the “working group”?
How are data made visible to other “working groups” (not necessarily within the same VO) or
to the entire VO?
How are the resources for a working group identified, made available and guaranteed?
Can “working groups” across different VOs be allowed?






More detailed (“microscopic”) use cases can be extracted from the PPDG CS11 document
“Grid Service Requirements for Interactive Analysis” [2], which covers a number of detailed
use cases. In particular, the authors discuss calibration scenarios (alignment, etc.). The
calibration activities are rather particular. On the one hand they share with production
(simulation and event reconstruction) the fact that they are best done by a small task force of
experts, with the results shared across the whole collaboration. On the other hand, especially
soon after data taking starts, these activities have to deliver their results with minimal latency
(which is part of the incompressible initial latency to reconstruct events when the first data
arrive). This requires fast access to data with an iterative and interactive access pattern
(multiple passes, frequent code/parameter changes).


A common view is that the analysis will be performed at the highest Tier hosting the data of
interest to the physicist. For example, an analysis requiring extensive RAW access – e.g.
detailed detector recalibration studies – will have to be performed at one or more Tier-1s,
while user-created ntuple-like microDSTs (skims of AOD) can be analysed in smaller
“private” facilities. Since there is agreement that the total computing power located in the
higher-Tier centres will be very significant (probably dominant over the Tier-0 plus Tier-1
capacity), inter-Tier-2/3 analysis will become important: these centres will be used by
physicists with an intermittent load, making the case for exploiting spare (off-peak) capacity.
Event simulation might be run in the background to profit from spare CPU cycles.


If one follows the idea that the location of the data (datasets) will determine where the
analysis is performed, it will be relatively simple to organize the overall analysis activities of
a single experiment by placing the data according to the plans of the experiments. Without
cooperation between Tier-2 centres, the problem then reduces to the normal fair-sharing of a
batch system (which is by no means a trivial task if multiple users/working groups are active
at the same time).


It should be noted that even in this minimalist scenario the experiments should be allowed to
place data sets on a given set of Tiers (if allowed by the local policies) and, conversely, users
at a given site should be able to stage in data from other facilities prior to starting important
analysis efforts.


Another aspect is that users have to be provided with mechanisms for “random access” to
relatively rare events out of a large number of files (event directories, tags). These schemes
will allow fast pre-filtering based on selected quantities (trigger and selection bits,
reconstructed quantities). Technology choices have not yet been finalized, but concrete
solutions exist (e.g. POOL collections).


The existing experiments’ frameworks (e.g. CMS Cobra) allow users to navigate across the
different event components (AOD → RECO → RAW). It should be possible to implement
control mechanisms to prevent inappropriate usage (typically the recall of large data sets)
while still permitting it for the activities that do require it (debugging, event display,
calibration verification, algorithm comparison, etc.).


*** Data format: POOL, ROOT






Finally, we note that the coexistence of multiple computing infrastructures (multiple Grid
infrastructures, dedicated facilities, laptops for personal use) is a fact the experiments must
take into account. The experiments are providing solutions to handle such heterogeneous
environments (e.g. ATLAS Don Quijote). In the case of analysis, it will be critical that the end
users are shielded from the underlying infrastructures.


Batch-oriented analysis


All experiments will need batch-oriented analysis. In general this will be made possible via
experiment-specific tools, which simplify the task of submitting multiple jobs over a large set
of files (a data set). The executable will be based on the experiment framework.


As an example, GANGA (ATLAS and LHCb) provides this functionality by allowing the user
to prepare and run programs via a convenient user interface. The batch job can first be tested
on local resources. On request, through the same interface, the user can take advantage of the
available Grid resources (data storage and CPU power): GANGA provides seamless access to
the Grid, identifies all necessary input files, submits the corresponding jobs to run in parallel,
offers an easy way to control the full set of jobs, and merges the outputs at the end.


Such tools are necessary if only because of the very large number of input files required by
even “simple” studies (with a RAW data rate in the 100 MB/s range and assuming 1 GB files,
15 minutes of data taking correspond to O(100) files: even taking streams etc. into account,
modest data sets correspond to very large numbers of files and jobs).
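
A minimal sketch of the file-count arithmetic and of the split/merge bookkeeping that such
tools automate is given below (plain Python; the file names and job parameters are invented,
and real tools such as GANGA add Grid submission, monitoring and output merging on top of
this).

# How many files does a "simple" study touch?  With RAW data arriving at
# ~100 MB/s and 1 GB files, 15 minutes of data taking already correspond to
# O(100) files, each of which becomes (part of) a batch job.
raw_rate_mb_s = 100.0
file_size_mb = 1000.0
seconds = 15 * 60
n_files = int(raw_rate_mb_s * seconds / file_size_mb)
print("files for 15 minutes of RAW data: ~%d" % n_files)

# Minimal split/merge bookkeeping: group the input files of a data set into
# jobs of at most `files_per_job` files, and collect one output per job to
# be merged at the end.
def split_dataset(files, files_per_job):
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

dataset = ["raw_%05d.dat" % i for i in range(n_files)]
jobs = split_dataset(dataset, files_per_job=10)
print("%d files -> %d jobs" % (len(dataset), len(jobs)))

# Each job would run the experiment framework over its file list and write
# one partial output; the merge step combines them once all jobs are done.
outputs = ["job_%03d.out" % i for i, _ in enumerate(jobs)]
print("outputs to merge:", len(outputs))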


In principle most analyses could be done within this model, which therefore provides a
baseline solution for all experiments. Some specific analysis tasks will in fact require this
approach because of the resources involved: examples are detector studies on RAW data or
systematic studies over long periods of time.


A different analysis scenario where this model will be relevant is the preparation of large
skims, in particular when the new data have to be shared across large working groups. In these
cases, provisions have to be made to enable users or working groups to publish their data
without interfering with the “production” catalogues holding the repository of all “official”
data (RAW, multiple reconstructed sets, analysis objects (AOD), and the corresponding
provenance/bookkeeping information). Batch systems are appropriate here because they allow
the control mechanisms needed to keep this additional bookkeeping system manageable.


*** Do we have solid numbers for calibration CPU, estimated elapsed time (multiple passes
on the same data, etc.), calibration frequency, and other constraints (time-ordered calibration
procedures, i.e. calib(t+dt) needs calib(t) as input)?


Interactive analysis






The use of interactive tools for analysis proved to be very powerful and popular within the
user community already in the LEP era (PAW being the best example). A similar approach
will be in place from the start for selected data sets (a handful of files, very compact
ntuple-like skims, etc.). All experiments already use this approach today for the analysis of
simulated samples and test-beam data.


On the other hand, the LEP model cannot be extended simply by increasing the available
computing capacity (both CPU and disk). More advanced tools have to be developed and
deployed.


A first level is the availability of analysis facilities based on a master-slave architecture (the
best example being PROOF). These facilities, made available at selected sites, will allow
parallel interactive access to large samples hosted at the same site. Prototypes exist (FNAL,
MIT, CERN, …). While there is general agreement on the interest of these systems, the LHC
community is only in the initial stage of considering and using them.
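
The master-slave pattern underlying such facilities can be illustrated with a few lines of
standard Python multiprocessing (a sketch of the pattern only: PROOF itself is a C++/ROOT
system with its own protocol, dynamic load balancing and result merging; the file names and
the toy “analysis” below are invented).

import multiprocessing

def analyse_file(file_name):
    """Worker: run the analysis over one input file and return a partial
    result (here a toy histogram represented as a dict of bin -> count)."""
    histogram = {}
    # ... a real worker would open file_name with the experiment framework
    # and loop over events; here we just fabricate a partial result per file.
    for event in range(1000):
        bin_index = hash((file_name, event)) % 20
        histogram[bin_index] = histogram.get(bin_index, 0) + 1
    return histogram

def merge(partials):
    """Master: merge the partial histograms returned by the workers."""
    total = {}
    for partial in partials:
        for bin_index, count in partial.items():
            total[bin_index] = total.get(bin_index, 0) + count
    return total

if __name__ == "__main__":
    files = ["skim_%03d.root" % i for i in range(40)]
    with multiprocessing.Pool(processes=8) as pool:     # the "slaves"
        partials = pool.map(analyse_file, files)        # one task per file
    print("total entries:", sum(merge(partials).values()))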


Several of these systems have been demonstrated on various occasions, but some limitations
still prevent adoption on a large scale. The relevant problems range from the resilience and
robustness needed before non-experts can use these tools, to the sharing of resources among
multiple concurrent users of a given facility. In the case of PROOF a significant effort is under
way to develop a new version addressing the known limitations.


Other projects, such as the high-level analysis service DIAL (ATLAS), have also been
developed. DIAL relies on fast batch systems to provide interactive response. The use of such
tools in Grid environments has yet to be demonstrated.


At the second level, the analysis system would allow users to profit from multiple analysis
farms. This has been prototyped within ALICE (using AliEn as the Grid middleware, in 2003)
and within ALICE/ARDA (using the gLite prototype, in 2004). In both cases, the Grid
middleware was used to instantiate PROOF slaves for the user. Some of the basic tools needed
to provide efficient access to Grid middleware services and to work around some of the
limitations of running in a Grid environment (security issues, connectivity issues, etc.) are
being addressed within the ARDA project.


In general one can predict that this scenario will be of fundamental importance for the
experiments' analysis. It allows the individual physicist to access Grid resources without the
overhead of the production systems. Current experience with such systems is very promising,
but one has to keep in mind that a significant development effort will be necessary to put
effective systems in place.


Will the LHC analyses be “just” the same as the LEP analyses, with more data and more
users?


As already stated, the LEP analyses stimulated and demonstrated the power and flexibility of
interactive data manipulation with tools like PAW.






A first area which could be of significant interest at the LHC is collaborative tools that are
integrated to some degree with the analysis tools. Some of them could be based on flexible
database services (to share “private” data in a reproducible way, with provenance
information). Some of the tools developed for detector operation could also be of interest for
physics analysis (for example tools developed in the context of WP5 of the GRIDCC project,
or other online log-book facilities). Although analysis is likely to remain an individualistic
activity, tools will be needed to make detailed comparisons possible within working groups
(in general made up of people in different geographical locations).


The second area could be workflow systems. Every experiment has by now developed
complex systems to steer large simulation campaigns, and these are used in the operation of
the Data Challenges. Effectively these systems are built on top of a database, and control the
status of the assignments and the associated complex workflows needed to perform the
simulation campaigns (similar systems will be in place for handling and distributing the
experimental data, RAW as well as the reconstructed sets). The experience gained in building
and operating these systems (e.g. CMS RefDB), together with existing systems from non-HEP
sources, should be used to provide end users with handy tools to perform complex operations.
Today, in many cases, even basic workflows like split/merge for identical jobs are done “by
hand” and in a “static” way (results are summed up only when all jobs have finished). In the
case of large-scale batch analysis, many users would benefit from a framework providing
simple error recovery and handling, dynamic/on-demand inspection of the results, and robust
procedures for iterative operations such as certain calibrations.
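
The core of such a workflow system is a small amount of state kept in a database: which jobs
exist, which succeeded, which must be resubmitted. A minimal sketch of this bookkeeping is
shown below (invented schema and state names, using SQLite purely for illustration; this is
not the RefDB or production-system layout).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job (
                    id INTEGER PRIMARY KEY,
                    input TEXT, state TEXT, attempts INTEGER)""")
conn.executemany("INSERT INTO job (input, state, attempts) VALUES (?, 'defined', 0)",
                 [("file_%03d.dat" % i,) for i in range(20)])

MAX_ATTEMPTS = 3

def submit_pending():
    """Pick up defined or failed-but-retryable jobs and (re)submit them."""
    rows = conn.execute("""SELECT id FROM job
                           WHERE state IN ('defined', 'failed')
                             AND attempts < ?""", (MAX_ATTEMPTS,)).fetchall()
    for (job_id,) in rows:
        conn.execute("""UPDATE job SET state = 'running',
                        attempts = attempts + 1 WHERE id = ?""", (job_id,))
        # ... here the real system would hand the job to the batch/Grid layer

def record_result(job_id, ok):
    """Callback when a job finishes: mark it done or eligible for retry."""
    conn.execute("UPDATE job SET state = ? WHERE id = ?",
                 ("done" if ok else "failed", job_id))

submit_pending()
record_result(1, ok=False)       # job 1 failed -> will be picked up again
submit_pending()
print(conn.execute("SELECT state, COUNT(*) FROM job GROUP BY state").fetchall())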


The final area is access to resources on the Grid without strong requirements on the
installation itself (operating system and architecture, pre-installed libraries, minimum disk
space, etc.). Although the main Tier-0/Tier-1 activities (like RAW event reconstruction) will
run on controlled infrastructures, users might benefit from resources made available by
relatively small but numerous resource centres (Tier-2 and below). An example here is the
NorduGrid infrastructure, which is composed of many different versions of the Linux
operating system. Flexible mechanisms to run software on heterogeneous systems (perhaps
using tools like VMware or User Mode Linux installations) and to validate the results could
provide interesting opportunities for end users.


References:


[1] LCG-GAG, “HEPCAL2”, http://project-lcg-gag.web.cern.ch/project-lcg-gag/LCG_GAG_Docs_Public.htm
[2] D. Olson and J. Perl, “Grid Service Requirements for Interactive Analysis”, PPDG CS11, September 2002
[3] The LHC experiments’ Computing Models:
    The ALICE Computing Model;
    The ATLAS Computing Model;
    The CMS Computing Model, CERN-LHCC-2004-035/G-083, CMS NOTE/2004-031, December 2004;
    The LHCb Computing Model






5.7   Databases – distributed deployment
Dirk Duellmann and Maria Girone
{early draft which needs additional work. We left for now the abstract for comparison.}
Abstract
LCG user applications and middleware services rely on the availability of relational database
functionality as part of the deployment infrastructure. Database applications such as the
conditions database, production workflow, detector geometry, and file, dataset and event-level
metadata catalogues will move into Grid production. Besides database connectivity at the
CERN Tier-0, several of these applications also need a reliable and (Grid) location-independent
service at Tier-1 and Tier-2 sites to achieve the required availability and scalability.
In the first part of this chapter we summarize the architecture of the database infrastructure at
the CERN Tier-0, including the database cluster node and storage set-up. Scalability and
flexibility are the key elements needed to cope with the uncertainties of experiment requests
and changing access patterns during the ramp-up phase of the LHC.
In the second part we describe how the Tier-0 service ties in with a distributed database
service, which is being discussed between the LCG tiers within the LCG 3D project. Here the
main emphasis lies on a well-defined, layered deployment infrastructure defining different
levels of service quality for each database tier, taking into account the available personnel
resources and existing production experience at the LCG sites.
5.7.1 Database Services at CERN T0
The database services for LCG at CERN T0 are currently going through a major preparation
phase to be ready for the LCG startup. The main challenges are the significant increase in
database service requests from the application side together with the remaining uncertainties
of the experiment computing models in this area. To be able to cope with the needs at LHC
startup, the database infrastructure needs to be scalable not only in terms of data volume (the
storage system) but also in server performance (database server clusters) available to the
applications. During the early ramp-up phase (2005/2006) with several newly developed
database applications, a significant effort in application optimisation and service integration
will be required from the LCG database and development teams. Given the limited available
manpower this can only be achieved by consistent planning of the application lifecycles and
adhering to a strict application validation procedure.
Based on these requirements, a homogeneous database service at T0 based on the existing
Oracle service is proposed. In contrast to traditional database deployments for relatively stable
administrative applications, the database deployment for LCG will face significant changes of
access patterns and will typically operate close to the resource limits. For this reason,
automated resource monitoring and the provision of guaranteed resource shares (in CPU, I/O
and network connections) to high-priority database applications will be an essential part of the
database service to ensure stable production conditions. The recent Oracle 10g release offers
several important components: {the following points will be described in more detail}
Oracle 10g RAC as the building block for extensible database clusters;
RAC on Linux for cost efficiency and integration into the Linux-based fabric infrastructure of
the LCG;
a shared storage system (Storage Area Network) based on fibre-channel-attached disk arrays;
the 10g service concept to structure (potentially large) clusters into well-defined application
services which isolate key applications from lower-priority tasks.


{Figure showing the RAC cluster set-up and the connection to SAN-based storage}







5.7.2 Distributed Services at tier 1 and higher
Building on the database services at the CERN T0, the LCG 3D project has been set up to
propose an architecture for the distributed database services at the higher LCG tiers. The goals
of this infrastructure include:
location-independent database access for grid user jobs and other grid services;
increased service availability and scalability for grid applications via distribution of the
application data;
reduced service costs, by sharing the service administration between several database teams
in different time zones.
This service will handle the most common database requirements for site-local or distributed
relational data. Given the wide-area nature of the LCG, this cannot be achieved with a single
distributed database with tight transactional coupling between the participating sites. The
approach chosen here is to loosely couple otherwise independent database servers (and
services) via asynchronous replication mechanisms (currently only between databases of the
same vendor). For several reasons, including the avoidance of early database vendor binding
and adaptation to the available database services at the different tiers, a multi-vendor database
infrastructure is proposed. To allow the limited existing database administration resources to
be focused on only one main database vendor per site, it is proposed to deploy Oracle at
Tiers 0 and 1 and MySQL at the higher tiers.
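
A minimal sketch of such loosely coupled, one-way asynchronous replication is shown below
(SQLite stands in for both "servers" purely for illustration and the table is invented; the
production service would rely on the vendor's own replication tools).

import sqlite3

source = sqlite3.connect(":memory:")    # stands in for the tier-0 database
replica = sqlite3.connect(":memory:")   # stands in for a tier-1 read-only copy

for db in (source, replica):
    db.execute("CREATE TABLE conditions (id INTEGER PRIMARY KEY, payload TEXT)")

source.executemany("INSERT INTO conditions (payload) VALUES (?)",
                   [("calib run %d" % i,) for i in range(5)])

def replicate(last_applied_id):
    """Asynchronously pull rows the replica has not seen yet.

    The replica only ever reads; new data always originate at the source,
    so a monotonically increasing key is enough to know what is missing."""
    rows = source.execute(
        "SELECT id, payload FROM conditions WHERE id > ? ORDER BY id",
        (last_applied_id,)).fetchall()
    replica.executemany("INSERT INTO conditions (id, payload) VALUES (?, ?)", rows)
    return rows[-1][0] if rows else last_applied_id

last_id = replicate(0)
source.execute("INSERT INTO conditions (payload) VALUES ('calib run 5')")
last_id = replicate(last_id)            # a later pass picks up the new row
print(replica.execute("SELECT COUNT(*) FROM conditions").fetchone()[0])
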
5.7.2.1 Requirement Summary
The 3D project has based its proposal on service requirements from the participating
experiments (ATLAS, CMS, LHCb) and software providers (ARDA, EGEE, LCG-GD). The
ALICE experiment has been contacted, but does not plan any deployment of databases for its
applications outside Tier 0; ALICE has therefore only been taken into account for the
calculation of the Tier-0 requirements. The other experiments have typically submitted a list
of 2-5 candidate database applications which are planned for deployment from LCG worker
nodes. Several of these applications are still in the development phase and their volume and
distribution requirements are expected to become concrete only after first deployment this
year. The requirements for the first year of deployment range from 50-500 GB at Tier 0/1 and
are compatible with a simple replication scheme originating from Tier 0. As the data at Tier 1
are considered to be read-only (at least initially), the deployment of more complex
multi-master replication can be avoided.


{Table summarising the main requirements from the experiments/projects}
Currently the distributed database infrastructure is in the prototyping phase and is expected to
move into first production in autumn 2005. Based on the requirements from the experiments
and grid projects, and on the available experience and manpower at the different tiers, the 3D
project has proposed to structure the deployment into two different levels of service:
Database backbone
    Read/write access at T0 and T1 sites
    Reliable database service, including media recovery and backup services, based on a
    homogeneous Oracle environment
    Consistent asynchronous replication of database data
Database cache
    Read-only database access at T2 and higher
    Low-latency access to read-only database data, either through database copies or proxy caches
    Local write access for temporary data will be provided, but cannot be relied on for critical data


5.7.2.2 Database service lookup
In order to achieve location-independent access to local database services, a database location
catalogue (DLS) is proposed, similar to the (file) replica location service. This service will
map a logical database name into a physical database connection string and so avoid the
hard-coding of connection information in user applications.
As this service is in most respects very similar to the file cataloguing service, it can probably
reuse the same service implementation and administration tools. A prototype catalogue has
been integrated into POOL/RAL, which allows the use of any file catalogue supported by POOL.
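
Functionally, the lookup only has to translate a logical database name, plus the client's
location, into a physical connection string. A schematic sketch follows (all names and
connection strings below are invented for illustration).

# Toy database location catalogue: logical name -> per-site physical replicas.
# A real DLS would live in a (file-)catalogue service queried by POOL/RAL.
DLS = {
    "conditions_db": {
        "CERN":  "oracle://pdb01.cern.ch/conditions",
        "FZK":   "oracle://ora1.gridka.de/conditions",
        "tier2": "mysql://dbcache.example.org/conditions_ro",
    },
}

def lookup(logical_name, site):
    """Return the connection string closest to `site`, falling back to the
    generic tier-2 cache and finally to the CERN master copy."""
    replicas = DLS[logical_name]
    for candidate in (site, "tier2", "CERN"):
        if candidate in replicas:
            return replicas[candidate]
    raise KeyError(logical_name)

print(lookup("conditions_db", "FZK"))      # local tier-1 replica
print(lookup("conditions_db", "Oslo"))     # falls back to the tier-2 cache
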
5.7.2.3 Database Authentication and Authorisation
To allow secure access to the database service together with scalable administration, database
access needs to be integrated with LCG certificates for authentication and with the role
definitions in the Virtual Organisation Membership Service (VOMS). This will provide a
consistent grid identity, based on LCG certificates, for file and database data, and a single VO
role administration system which also controls the grid user rights for database access and
data modification.
Oracle provides for this purpose an external authentication mechanism between the database
and an LDAP-based authentication server. This protocol can also be used to determine which
database roles a particular user may obtain. For MySQL, X.509 certificate-based
authentication methods also exist, but for both database vendors the complete end-to-end
integration of authentication and authorisation still needs to be proven. The performance
impact of SSL-based network connections on bulk data transfers also still needs to be evaluated.
5.7.2.4 Database Network Connectivity
One implication of the proposed database service architecture is that a single grid program
may need to access databases both at T2 (for reading) and at Tier 1 or Tier 0 (for writing).
This implies that appropriate connectivity to individual TCP ports of the database servers at
T1 (and T0) must be provided for worker nodes at T2. In addition, the database servers at T0
and T1 need to be able to connect to each other in order to allow the directed database
replication to function. This will require some firewall configuration at all tiers, but as the
number of individual firewall holes in this structure is small (O(10) over all tiers) and involves
only well-defined point-to-point connections, this is currently not seen as a major security risk
or deployment problem.
5.7.2.5 Integration with application software
A reference integration between the distributed database service and the application software
will be provided in the LCG POOL/RAL relational abstraction layer. This will include the use
of the logical service lookup and support for certificate-based authentication, at least for
Oracle and MySQL.
{this section needs to be extended}




5.8     Lifecycle support – management of deployment and versioning
Ian Bird





Abstract
This section will cover the grid service middleware lifecycle. It has been shown over the past
two years that, in order to be able to deploy something close to a production-quality service, it
is essential to have in place a managed set of processes. These include:
The certification and testing cycle and the interactions with the middleware suppliers, the sites
at which the middleware is deployed, and the users. It is important to point out the need to
negotiate support and maintenance agreements with the middleware suppliers and to agree
support cycle times for problems (of varying urgency) and feature requests. Discuss how
security patches and other urgent fixes can be rapidly deployed.
The deployment process, change management, upgrade cycles, etc. This should include a
discussion of the backward compatibility of new versions, migration strategies if essential new
features are not backward compatible, and so on. Discussion of packaging and deployment
tools, based on experience. Feedback to the deployment teams and middleware support teams.
Operation. Experience from operating and using the middleware and services should be
coordinated and fed back to the relevant teams, either as deployment considerations or as
problems/feature requests for the middleware itself.
Propose a layered model of services in order to cleanly separate the issues related to user
interfaces and libraries, which require rapid update cycles, from the core services, which
require coordinated deployment throughout the infrastructure.

6     TECHNOLOGY
Sverre Jarp and experts
Abstract



6.1     Status and expected evolution
– processors – compilers – storage – mass-storage - networking
Sverre Jarp, Andreas Heiss, Ulrich Schwickerath, Ian Fisk, David Foster
This sub-chapter will describe the current status and expected evolution of the hardware
resources that will make up the LCG environment from 2005 until the start of high luminosity
running. The current LCG-2 environment as well as additional resources available at CERN
and the Tier-1 computing facilities will be described to outline the current status. The
preliminary results of the 2005 PASTA report, the experiment computing model documents,
and current estimates and proposals for networking requirements will all be utilized to attempt
to predict the expected technology evolution over the next several years. This sub-chapter
should cover processing, storage, and networking technology.
In the area of processing, the current set of accepted processing architectures will be
described. The ramifications and expected performance improvements of the transition to
64-bit architectures now under way will be addressed. The potential performance
improvements and licensing complications associated with alternative compilers should be
discussed. For the immediate future, potential issues associated with new multi-core
architectures should be described. Looking further ahead, it is interesting to compile a list of
the most attractive additional architectures that might be supported, from the standpoint of
gaining significant processing resources.
In the area of storage, the current status of mass-storage and disk storage will be described. A
summary of the technology currently used in Tier-1 and Tier-0 mass storage systems will be
made. For the evolution of the mass storage area into the future, a definition of the mass
storage system from the standpoint of data loss, data serving capacity, and data storage
capacity should be provided. This would be useful for sites preparing to deploy mass storage
systems that rely more heavily on large numbers of disk arrays and less on traditional tape
silos. In the area of disk storage, a brief technology report of the currently deployed
technologies will be provided. A description of the expected evolution in capacity and
performance will be given. Current and future options for binding physical disk systems into
file systems and storage networks should also be discussed.
The current status of networking technology should be described from the local- and
wide-area standpoints. This includes the required worker node, server, and disk server
connectivity, as well as the regional centre connectivity. For the future, network management
and network allocation technology possibilities should be described.


6.1.1 Processors

6.1.1.1 The microprocessor market
Although the microprocessor market has reached a certain maturity, it continues to be worth
XX billion USD in annual sales.
Ignoring the embedded and low-end segment, this market is dominated by x86 processors,
running mainly Windows and Linux. In this segment competition is cut-throat, as Transmeta
has just demonstrated by exiting the market. On the other hand, Intel (as the majority supplier)
continues to profit from the generous margins that seem to be available only to those who
manage to dominate the market. AMD, for instance, in spite of several efforts to lead the
market into new avenues, best exemplified by its push of the 64-bit extensions to x86 (x86-64
or AMD64), has a hard time breaking even.


6.1.1.2 The process technology
Our community got heavily into PCs in the late nineties, which coincided with a “golden”
expansion period when the manufacturers were able to introduce new processes every two
years. The increased transistor budget allowed more and more functionality to be provided,
and the shrink itself (plus shortened pipeline stages) allowed a spectacular increase in
frequency; the 200 MHz Pentium Pro of yesteryear now looks rather ridiculous compared to
today’s processors at 3 GHz or more.
Nevertheless, the industry has now been caught by a problem that was almost completely
ignored ten years ago, namely heat generation from leakage currents. As the feature size
decreased from hundreds of nanometres to today’s 90 nm (and tomorrow’s 65 nm), the
gate-oxide layer became only a few atomic layers thick, with the result that leakage currents
grew exponentially.
Moore’s law, which only states that the transistor budget grows from one generation to the
next, will continue to come true, but both the problems with basic physics and the longer
verification time needed by more and more complex designs may start to delay the
introduction of new process technologies. The good news for HEP is that the transistor budget
will from now on mainly be used to produce microprocessors with multiple cores, and already
this year we should start seeing the first implementations (more about this later).


6.1.1.3 64-bit capable processors
64-bit microprocessors have been around for a long time as exemplified by, for instance, the
Alpha processor family which was 64-bit enabled from the start in the early nineties. Most
RISC processors, such as PA-RISC, SPARC and Power were extended to handle 64-bit
addressing, usually in a backwards compatible way by allowing 32-bit operating systems and
32-bit applications to continue to run natively.
When Intel came out with IA-64, now called the Itanium Processor Family (IPF), they
deviated from this practice. Although the new processors could execute x86 binaries, this
capability was not part of the native instruction set and the 32-bit performance was rather
inadequate.
AMD spotted the weakness in this strategy and announced an alternative plan to extend x86
with native 64-bit capabilities. This proved to be to the liking of the market at large, especially
since the revision of the architecture brought other improvements as well, such as a doubling
of the number of general-purpose registers. This architectural “clean-up” gives a nice
performance boost for most applications (see, for instance, the CMS benchmark paper).
After the introduction of the first 64-bit Opterons, Intel was quick to realize that this was more
than a “fad” and, today, only a year after the first introduction of 64-bit-capable Intel
processors, we are being told that almost all x86 processors are likely to add this capability in
the near future. During the transition period it is unavoidable that our computer centres will
have a mixture of 32-bit and 32/64-bit hardware, but we should aim at a transition that is as
rapid as possible by acquiring only 64-bit-enabled hardware from now on.


All LHC experiments must make a real effort to ensure that all of their offline
software is “64-bit clean”. This should be done in such a way that one can, at any
moment, create either a 32-bit or a 64-bit version of the software.
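
As a trivial illustration of what “clean” means here: the sizes of long and of pointers change
between 32-bit and 64-bit Linux builds, so code that stores pointers in 32-bit integers breaks
silently. The snippet below (Python ctypes, purely illustrative) simply reports the sizes of the
underlying C types on the platform it runs on.

import ctypes

# On a 32-bit build all three values are 4 bytes; on a 64-bit Linux (LP64)
# build, long and void* grow to 8 bytes while int stays at 4.  Code that
# assumes sizeof(long) == sizeof(int), or that stores pointers in ints, is
# therefore not "64-bit clean".
for name, ctype in [("int", ctypes.c_int),
                    ("long", ctypes.c_long),
                    ("void*", ctypes.c_void_p)]:
    print("%-6s %d bytes" % (name, ctypes.sizeof(ctype)))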

6.1.1.4 Current processors and performance
Today, AMD offers Opteron server processors at 2.6 GHz whereas Intel offers Pentium 4
Xeon processors at 3.6 GHz. Both are produced in 90 nm process technology and, as far as
SPEC performance measurements are concerned, both offer SPECint2000 results of about
1600-1700 (depending on which compiler and which addressing mode are being used).
The current 1.6 GHz Itanium processor, although it has an impressive L3 cache of nine MB,
offers ~1440 SI2K under Linux with the Intel C/C++ compiler (see SGI result 2004Q4). It
should be kept in mind that this is a processor produced in 130 nm and that results from the
forthcoming 90 nm Montecito processor are not yet available.
The current 90 nm 1.9 GHz Power-5 processor from IBM (with a 36MB L3 off-chip cache!)
offers ~1380 SI2K when measured under AIX. (There does not seem to be a Linux result
available).
SPARC results are still below 1000 SI2K and are not discussed further.
The results just mentioned, combined with the pricing structures in today’s market, leave little
opportunity for non-x86 contenders to make themselves interesting. The situation is not likely
to change in the near term, since both AMD and Intel have announced that they will come out
with dual-core x86 processors in the very near future. Although both are likely to reduce the
peak frequency by about 400 MHz (according to current rumours), these offerings are going to
be extremely interesting for our community since they are likely to offer a throughput increase
of 1.6-1.9 over single-core processors at equivalent frequencies (when equipped with adequate
memory).
AMD is rumored to provide dual core Opterons already in Q2 this year, whereas dual-core DP
Xeons may only be available by the end of the year. On the other hand, Intel is likely to be
able to profit from its 65 nm technology by then, so the race for market leadership is likely to
continue (to our great benefit).






6.1.1.5 Multicore future
For the HEP community it would be great if the semiconductor world would agree to push a
geometric expansion of the number of cores. Why could we not reach 8, 16, or even 32 cores
in the near future and run our beloved event-level parallelism across all of them?
The main problem is going to be the “mass market acceptance” of such a new paradigm and
some skeptics believe that large-scale multicore will gradually limit itself to the “server
niche” which may not be dominated by commodity pricing in the same way as today’s x86
market with its basic “one size fits all” mentality.


Form factors
Several form factors are available in the PC market, the most common being desk-side towers
or 1U/2U rack-mounted systems. Blade systems are gradually becoming more and more
popular, but for the time being there is a price premium associated with such systems.
There seems to be no reason to recommend a particular form factor and LCG centres are
likely to choose systems based on local criteria, such as space availability, cooling
requirements, and so on.


6.1.1.6 Overall conclusions
To the great advantage of HEP computing, the x86 market is still flourishing and the initial
LCG acquisitions should be able to profit from another round of price/performance
improvements thanks to the race to produce multicore systems.
Should IPF and Power-based systems become attractive some years from now our best
position is to ensure that our programs are 64-bit clean under Linux.


The LCG sites should concentrate their purchases on the x86-64 architecture. The
32-bit-only x86 variant should be avoided, since it would act as a roadblock to the quick
adoption of a 64-bit operating system and application environment inside the LHC
Computing Grid.

6.1.2 Infiniband
Infiniband (IBA) is a channel-based, switched fabric which can be used for inter-process
communication as well as network and storage I/O. The basic link speed is 2.5 Gb/s signalling
rate, before the 8b/10b encoding overhead. Today, the common link width is 4X (10 Gb/s)
bidirectional. 12X (30 Gb/s) hardware and 4X DDR technology, which doubles the bandwidth,
are already available; 12X DDR and 12X QDR (delivering up to 120 Gb/s) are foreseen.
Copper cables can be used for distances up to ≈15 m. Fibre-optic cables are available for
long-distance connections, but prices are still high.
IBA silicon is mainly produced by one company, Mellanox Technologies, although other
companies have recently announced products. IBA HCAs (host channel adapters) are available
in PCI-X and PCI-Express versions, with one or two 4X ports, SDR or DDR. Different
companies offer modular switch systems from 12 up to 288 4X ports, as well as 12X uplink
modules and FC and GE gateway modules to provide connectivity to other networks. With its
RDMA (Remote Direct Memory Access) capabilities, current 4X IBA hardware allows data
transfer rates up to ≈900 MB/s and latencies of 5 μs and below.
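
A quick check of the quoted figures, assuming the usual 8b/10b link encoding (a
back-of-the-envelope sketch, not a measurement):

# 4X link: 4 lanes of 2.5 Gb/s signalling rate each.
lanes = 4
signalling_gb_s = 2.5
raw_gb_s = lanes * signalling_gb_s                 # 10 Gb/s "4X"
data_gb_s = raw_gb_s * 8.0 / 10.0                  # 8b/10b encoding overhead
data_mb_s = data_gb_s * 1000.0 / 8.0               # convert to MB/s
print("theoretical 4X data rate: %.0f MB/s" % data_mb_s)   # 1000 MB/s
# The ~900 MB/s observed with RDMA is thus close to the theoretical maximum.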




62
LHC COMPUTING GRID                                                Technical Design Report


Several upper-layer protocols are available for IPC (inter-process communication) as well as
for network and storage I/O:


MPI     : Message Passing Interface (several implementations, open source and proprietary)
IPoIB   : IP tunnelling over IBA; does not utilize RDMA
SDP     : Sockets Direct Protocol; provides support for socket-based applications and
          utilizes the RDMA features of InfiniBand
iSCSI   : from the iSCSI Linux open source project
iSER    : iSCSI RDMA extension (from OpenIB, see below)
SRP     : SCSI RDMA Protocol, for block-oriented I/O
uDAPL   : user Direct Access Programming Library (e.g. used by databases)


Also, a prototype implementation of RFIO (as used by CASTOR) is available which allows
the transfer of files at high speed and with very low CPU consumption.


Infiniband drivers are available for Linux, Windows and some commercial UNIX systems.
Based on a reference implementation from Mellanox, other vendors deliver 'improved'
versions of the IBA software stack which sometimes cause incompatibilities, especially
concerning the high-level protocols such as SRP. However, the OpenIB.org initiative was
recently formed with the goal of providing a unified software stack working with the hardware
of all vendors. All major IBA vendors have joined this organization. The low-level drivers of
OpenIB have recently been accepted for inclusion in the Linux kernel, starting with version 2.6.11.


IBA prices have been dropping rapidly over the last years. 4X switches can be purchased for
≈300$/port and less, dual-4X HCAs are ≈500$, and cables are available for ≈50-150$. The
street price of Mellanox's new single-port 4X HCAs will certainly be below 300$. The latest
HCA chip from Mellanox is available well below 100$, and the first manufacturers have
announced implementations of IBA on the mainboard, directly connected to the PCI-Express
bus. Other developments with a direct IBA-memory connection are under way. These
developments will not only ensure further dropping prices and a wider market penetration of
IBA, but will also enable lower latencies, making it more suitable for very latency-sensitive
applications.


6.2     Choice of initial solutions
Bernd Panzer-Steindel, Sverre Jarp


Version 0.2, 10.04.2005
6.2.1 Software : Batch Systems


We have been using the LSF batch scheduler from Platform Computing very successfully in
the CERN computing farm for about five years. It has evolved considerably over the years and
copes easily with our workload, i.e. detailed resource scheduling for more than 100
experiments/groups/sub-groups, growth to up to 3000 concurrently executing user jobs, and
more than 50000 jobs in the queues. So far no bottleneck has been seen. Our support
relationship is very good and feedback from our experts is taken into account (new versions
contain our requested modifications). The license and support costs are well below the cost of
one FTE.


There are currently no reasons for a change in the foreseeable future (3 years).


6.2.2 Software : Mass Storage System


The mass storage system has two major components: a disk space management system and a
tape storage system. We have developed the CASTOR Mass Storage System at CERN; at the
end of 2004 the system contained about 30 million files with an associated 4 PB of data. The
system uses files and file systems as its basic units of operation.
The new, improved and rewritten CASTOR software is in its final testing phase and will start
to be deployed during the coming months. The CASTOR MSS software is our choice for the
foreseeable future.


6.2.3 Software : Management System


The Extremely Large Fabric management system ELFms was developed at CERN, based on
software from the EU DataGrid project. It contains three components:
1. quattor : for the configuration, installation and management of nodes
2. lemon :   a service and system monitoring system
3. leaf   : a hardware and state management system




The system now handles more than 2500 nodes in the centre, with varying functionality
(disk, CPU, tape and service nodes) and multiple operating systems (RH7, SLC3, RHE3,
IA32 & IA64).
It has been in full production for a year now and provides us with consistent full-lifecycle
management and a high level of automation. This is our choice for the foreseeable future.






6.2.4 Software : File System


The AFS (Andrew File System) is currently an integral part of the user environment at
CERN. It is used as a:


repository for personal files and programs
repository for the experiment software
repository for some calibration data
repository for some analysis data
common shared environment for applications


AFS provides worldwide accessibility for some 14000 registered users. The system has a
constant growth rate of more than 20% per year. The current installation (end 2004) hosts
113 million files on 27 servers with 12 TB of space. The access data rate is ~40 MB/s during
daytime, with ~660 million block transfers per day and a total availability of 99.8%.
During 2004 (and ongoing) an evaluation of several new file systems took place, to judge
whether they could replace AFS or even provide additional functionality in the area of an
analysis facility.
Missing redundancy/error recovery and weaker security were the main problems with the
investigated candidates (reference to the report).
So far the conclusion is that the required functionality and performance for the next ~3 years
can only be provided by keeping the AFS file system.
6.2.5 Software : Operating System


All computing components (CPU, disk, tape and service nodes) use the Linux operating
system. For the last couple of years our version has been based on the Red Hat Linux
distribution. Red Hat changed its licensing policy in 2003 and has since been selling its
various RH Enterprise Linux versions in a (for them) profitable manner. After long
negotiations in 2003/2004 we decided to follow a four-way strategy:
collaboration with Fermilab on Scientific Linux, a HEP Linux distribution based on the
recompiled RH Enterprise source code, which RH has to provide freely due to its GPL
obligations;
buying RH Enterprise licenses for our Oracle-on-Linux service;
a support contract with RH;
further negotiations with RH about possible HEP-wide agreements.


An investigation of alternative Linux distributions came to the conclusion that there is no
advantage in using SUSE, Debian or others. SUSE, for example, follows the same commercial
strategy as RH, while Debian is still free but sufficiently different in implementation that
adapting our management tools to the specific Debian environment would imply a large
(manpower) cost, with additional question marks over the community support.
We will continue with our RH Linux strategy for the next couple of years.





6.2.6 Hardware : CPU Server


We have been purchasing ‘white boxes’ from resellers in Europe for more than five years now.
The ‘sweet spot’ is dual-processor nodes with the last-but-one (or two) generation of
processors. So far we have used exclusively Intel processors from the IA32 production line.
The issues for 2005/2006 are the integration of the 64-bit processor architecture and the move
away from higher frequencies towards multi-core processors.
The road to 64-bit is easier now that Intel too has come up with an intermediate processor
family (EM64T, Nocona) which can run both 32-bit and 64-bit code; AMD has offered this for
more than a year with its Opteron processor line.
The AMD processors currently have a price/performance advantage of up to 40%, but the
TCO calculation needed to decide whether it would really be an advantage to include AMD in
the purchasing process is more complicated (e.g. code stability between platforms, compiler
effects, performance benchmarks, etc.). A common project between IT and the experiments
has been created to clarify this before the next major purchasing round.


More details about expected cost developments and the influence of multi-core processors on
some fabric issues can be found in another paper.


The stability of the current hardware is good enough to continue the white-box approach.

The average uptime of a single node in the Lxbatch cluster is about 116 days (~4 months).
With about 1200 nodes (the 2004 average) that leads to about 10 node reboots per day. Only
about one out of these 10 is due to a real hardware fault; the rest are due to the operating
system or an application (e.g. swap space exceeded). The effects on the efficiencies are the
following:


     1. Resource availability
             Each reboot takes at most 30 minutes: 10 * 0.5 h = 5 node-hours lost per day,
             out of 1200 nodes * 24 h = 28800 node-hours per day, i.e. an availability of
             100 % - 5/28800, which is better than 99.9 %.
             There are 322 hardware problems per year, each taking about 14 days to
             repair, so one loses 14 * 322 = 4508 node-days per year out of a total of
             1200 * 365 = 438000 node-days per year.
             Thus the resource availability is about 99 %.

     2. Loss of application time
            There are on average 2.5 jobs running per dual-processor node, each
            executing on average for 8 h, so each reboot causes a loss of roughly 20 h
            of application time (pessimistically calculated).
            The loss per day is 200 hours out of 1200 * 24 h * 2.5 = 72000 h,
            i.e. 0.3 % inefficiency (99.7 % efficiency).
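
The same estimates can be reproduced with a few lines of arithmetic (the input values are
those quoted above):

# Figures quoted above for the Lxbatch CPU farm (averages for 2004).
nodes = 1200
reboots_per_day = 10            # ~1200 nodes / 116 days mean uptime
reboot_downtime_h = 0.5
hw_failures_per_year = 322
repair_days = 14
jobs_per_node = 2.5
avg_job_length_h = 8.0

# 1. resource availability
node_hours_per_day = nodes * 24.0
reboot_loss = reboots_per_day * reboot_downtime_h            # 5 node-hours/day
print("availability (reboots): %.2f %%"
      % (100.0 * (1 - reboot_loss / node_hours_per_day)))    # ~99.98 %

node_days_per_year = nodes * 365.0
repair_loss = hw_failures_per_year * repair_days             # ~4500 node-days
print("availability (repairs): %.1f %%"
      % (100.0 * (1 - repair_loss / node_days_per_year)))    # ~99 %

# 2. loss of application time: each reboot kills ~2.5 jobs of ~8 h each
app_loss_h = reboots_per_day * jobs_per_node * avg_job_length_h   # ~200 h/day
app_hours_per_day = nodes * 24.0 * jobs_per_node                  # 72000 h/day
print("application efficiency: %.1f %%"
      % (100.0 * (1 - app_loss_h / app_hours_per_day)))           # ~99.7 %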

We will continue with our current strategy of buying white boxes from reseller companies,
and will probably include AMD in the new purchases (if the advantages are proven).






6.2.7 Hardware : Disk Storage
Today we are using the NAS disk server model for the installed 400 TB of disk space in the
centre. In addition, some R&D activities and evaluations are ongoing for a variety of
alternative solutions such as iSCSI servers (in connection with file systems),
fibre-channel-attached SATA disk arrays, large multi-processor systems, and USB/FireWire
disk systems.


Besides the performance of the nodes it is important to understand the reliability of the
systems. In the following, the failure rates of disk servers and their components are described,
together with their effect on the service.


From our monitoring of failures in the centre we know that the MTBF (Mean Time Between
Failures) of the ATA disks used in the disk servers is in the range of 150000 hours. This means
there is about one genuine disk failure per day with the currently installed 6000 disks. The
vendors quote figures in the range of 300000 to 500000 hours, but these apply under certain
conditions (usage patterns common on home PCs), whereas we run the disks in 24*7 mode.
These differences are also recognized by industry (see the IBM talk at CHEP 2004). We also
have statistics for the SCSI and fibre channel disks used in our AFS installation; although the
statistical evidence is lower (300 disks), we see similar MTBF figures for these types of disk.
Disk errors are ‘protected’ against by using mirrored (RAID1) or RAID5 configurations.
One has to consider the rebuild of the underlying system, which has severe effects on the
performance of that file system and of others on the same disk server (controller, network card
layout, internal bus system): at 20 MB/s, a 200 GB file system takes about 3 hours to rebuild,
with degraded disk performance during that time.
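
The failure-rate and rebuild-time figures quoted above follow directly from the MTBF, the
number of installed disks and the rebuild bandwidth (a small numerical check):

# Observed MTBF of the ATA disks vs. installed disk count.
mtbf_hours = 150000.0
installed_disks = 6000
failures_per_day = installed_disks * 24.0 / mtbf_hours
print("expected disk failures per day: %.1f" % failures_per_day)   # ~1

# RAID1/RAID5 rebuild of one 200 GB file system at ~20 MB/s of spare
# bandwidth (the array stays online but with degraded performance).
filesystem_gb = 200.0
rebuild_rate_mb_s = 20.0
rebuild_hours = filesystem_gb * 1024.0 / rebuild_rate_mb_s / 3600.0
print("rebuild time: %.1f hours" % rebuild_hours)                   # ~3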


We currently have an uptime of 131 days for a single disk server node (~4.5 months). This
leads to about 2.5 reboots per day (330 server nodes), with about 1 h (pessimistic) downtime
during a reboot.


The sharing of disk space and CPU resources is very similar. With about 3000 batch jobs
spread reading/writing over 330 disk servers, there are on average 9 jobs per server.
The crash/reboot of a server has several side effects:
    1. Loss of applications.
       Without more sophisticated redundancy in the application I/O, all jobs attached
       to a crashed disk server will die:
       9 jobs * 8 h = 72 h lost out of 1200 * 24 h * 2.5 = 72000 h, i.e. a 0.1 % loss.
       New jobs starting after the crash will be redirected to newly staged data or
       will wait for the server to come back.
    2. Data are unavailable during the reboot/repair time:
       2.5 * 1 h = 2.5 h out of 330 * 24 h = 720 h, i.e. 0.4 %.
       In the worst case a server is dead for a longer time; then the corresponding
       data sets on this server need to be restaged onto a brand-new disk server
       (~2 TB per server). A server can safely be loaded at 60 MB/s, which would
       fill it in about 9 h. Such a severe incident happens about once per month for
       the 330 servers. If it happened once per day, the 0.4 % would move to 1.5 %.


To cope with these negative effects one has to rely on the good redundancy and error recovery
of the CASTOR Mass Storage System, and on similar efforts in the applications themselves.


Simple NAS servers still deliver the best price/performance and have an acceptable error rate.
We will continue with this strategy, but will nonetheless make some effort to work with the
vendors on improving the hardware quality.


6.2.8 Hardware : Tape Storage
We already have an STK installation of 10 robotic silos with 5000 cartridges each and 50 tape
drives from STK (9940B). This has been in production for several years. The plan is to move
to the next generation of drives in the middle of 2006. This is a very complicated area, because
the equipment is expensive and not commodity. Details and considerations can be found in the
following documents and talks (talk1, talk2, document1).
There are currently three tape drive technologies of interest to us: IBM, STK and LTO.
However, their latest drives are not yet available, and will only appear towards the end of
2005. Thus today we cannot say in detail which technology we will choose for 2006 and
onwards. From the tests and information obtained so far on the existing models we can
preliminarily conclude that they are all valid candidates. The real issue here is to minimize the
risk and the total cost, which has several ingredients:
the cost of the drives, which is very much linked to the expected efficiencies that we are
currently evaluating (these depend, for example, on the computing models);
the cost of the silos and robots, which are highly specialized installations whose prices depend
heavily on the package and the negotiations;
the cost of the cartridges.
Each of these items represents about one third of the cost over 4 years, but with large error
bars attached, and the support costs for the hardware and software need to be included.
We will start negotiations with IBM and StorageTek about solutions and fall-backs at the
beginning of 2005.


6.2.9 Hardware :         Network


Our current network is based on standard Ethernet technology, where 24-port Fast Ethernet
switches and 12-port Gigabit Ethernet switches are connected to backbone routers with
multiple Gigabit ports (3Com and Enterasys). The new network system needed for 2008 will
improve the two layers involved by a factor of 10, and its implementation will start in the
middle of 2005. It will consist of:
a high-end backbone router mesh, for redundancy and performance, based on routers with 24
or more 10-Gigabit ports;
a distribution layer based on multi-port Gigabit switches with one or two 10-Gigabit uplinks.




The availability of high-end 10 Gigabit routers from different companies has improved
considerably over the last 12 months. A tender was sent out in the middle of January 2005
and we expect the result at the end of February. The tender is split into two parts, one for the
backbone and one for the distribution layer. There could be one supplier for both, or two
independent ones.


For high performance (throughput and latency) Infiniband has become very popular, because
of its rapidly improving performance/cost ratio. It could in the future replace parts of the
distribution layer. The available Infiniband switches can be fitted with conversion modules for
Fibre Channel and, later this year, also for 10 Gbit Ethernet. Tests in this area are ongoing.



6.3     Hardware lifecycle
Bernd Panzer-Steindel, Sverre Jarp
Draft v0.2 10.04.2005




The strategy for the hardware lifecycle of the different equipment types (CPU servers, disk
servers, tape drives, network switches) is relatively straightforward. All equipment is bought
with a certain warranty; e.g. our disk, CPU and tape servers are bought with a 3-year
guarantee. During that time we use the vendors for the repair of the equipment, and in
principle the equipment is completely replaced by new purchases when the warranty has
ended. From experience at CERN we have adopted a general lifetime of 3 years for standard
PC equipment and about 5 years for tape drives and network switches.
The cost of these replacements in the 4th year has to be incorporated in the overall costing
model over the years (see the costing chapter).
At CERN we do not strictly replace the equipment at the end of the warranty period, but
rather leave it in the production environment until:
the failure rate increases;
there are physical space problems;
the PCs cannot run jobs any more, e.g. because of too little memory or disk space or too low
a CPU speed;
the effort to handle this equipment exceeds the ‘norm’.
This ‘relaxed’ replacement model has so far been successful. The retained machines are then
extra resources, but they are not accounted for in the resource planning, because one cannot
rely on their availability.

7     PROTOTYPES AND EVALUATIONS -
Laura Perini
Abstract
The LCG system has to be ready for use with full functionality and reliability from the start of
LHC data taking. In order to ensure this readiness the system is planned to evolve through a
set of steps involving the hardware infrastructure as well as the services to be delivered.




At each step the prototype LCG is planned to be used for extensive testing:
By the experiments, which perform their Data Challenges, progressively increasing in scope
and scale. The aim is to stress the LCG system with activities that are more and more similar
to those that will be performed when the real experiments are running. The community of
physicists is also increasingly involved and gives the feedback necessary to steer the LCG
evolution according to the needs of the users.
By the service providers themselves, at CERN and at the external Tier-1s, which perform
Service Challenges aimed at stressing specific services. For those services, the Service
Challenges involve a scale, a complexity and a degree of site coordination higher than those
needed at the same time by the Data Challenges of the experiments.
Part of the Challenge programme has already been executed and has provided useful
feedback. The evaluation of the results of the Challenges, and the implementation of the
suggestions coming from this evaluation, will make a crucial contribution to reaching full
readiness of the LCG system on schedule.


7.1       Data challenges
Yves Schutz, Farid Ould-Saada

7.1.1 Data Challenges (Farid)

7.1.1.1 Abstract

In a first step, I will mainly summarise the past (DC1) and current (DC2) ATLAS Data
Challenges. DC1 was conducted during 2002-03; the main goals achieved were the setting up
of the production infrastructure in a real worldwide collaborative effort and gaining experience
in exercising an ATLAS-wide production model. DC2 (from May until December 2004) is
divided into three phases: (i) Monte Carlo data are produced using GEANT4 on three different
Grids: LCG, Grid3 and NorduGrid; (ii) the first-pass reconstruction of the real data expected in
2007 is simulated; and (iii) the Distributed Analysis model is tested. Experience with the use
of the system in the worldwide DC2 production of ten million events will be presented,
together with how the three Grid flavours are operated. The ATLAS collaboration decided to
perform these DCs in the context of the LHC Computing Grid project, LCG, as well as to use
both the middleware and the resources of two other Grid projects, Grid3 and NorduGrid.


The LHC Computing Review in 2001 recommended that the LHC experiments should carry
out Data Challenges (DCs) of increasing size and complexity. A Data Challenge comprises, in
essence, the simulation, done as realistically as possible, of data (events) from the detector,
followed by the processing of those data using the software and computing infrastructure that
will, with further development, be used for the real data when the LHC starts operating.



7.1.2 ATLAS Data Challenges

The goals of the ATLAS Data Challenges are the validation of the ATLAS Computing
Model, of the complete software suite and of the data model, and to ensure the correctness of
the technical computing choices to be made.

A major feature of the first Data Challenge (DC1) was the development and deployment of the
software required for the production of the large event samples needed by the High Level
Trigger and physics communities, and the production of those large data samples involving
institutions worldwide.





ATLAS intended to perform its Data Challenges using as far as possible the Grid tools
provided by the LHC Computing Grid project (EDG middleware), NorduGrid and Grid3. DC1
saw the first use of these technologies in ATLAS; NorduGrid, for example, relied entirely on
the Grid. Forty institutes from 19 countries participated in DC1, which ran from spring 2002
to spring 2003. It was divided into three phases: (1) event generation and detector simulation,
(2) pile-up production, (3) reconstruction. The computing power required was 21 MSI2k-days;
70 TB of data were produced in 100000 partitions.
In order to handle the task of ATLAS DC2 an automated production system was designed.
This production system consists of several parts: a database for defining and keeping track of
the computing tasks to be done, the Don Quijote data management system for handling the
input and output data of the computations, the Windmill supervisor program that was in
charge of distributing the tasks between various computing resources and a set of executors
responsible for carrying out the tasks. By writing various executors the supervisor could be
presented with a common interface to each type of computing resource available to ATLAS.
Executors were written to handle resources on the LHC Computing Grid [5], Grid3 [6, 7],
NorduGrid's ARC and various legacy batch systems [8]. During ATLAS DC2 the three Grid
flavours each carried out about one third of the total computational task. The executor written
for NorduGrid's ARC, called Dulcinea, and the part of DC2 that was carried out with it, are
described in more detail below.




7.1.2.1 The DC2 Production System

Within this production system, all jobs are defined and stored in a central database. A
supervisor agent (Windmill) picks them up and sends their definitions as XML messages to
the various executors via a Jabber server. Executors are specialised agents, able to convert the
XML job description into a Grid-specific language (e.g. JDL, the Job Description Language,
for LCG, and XRSL, the Extended Resource Specification Language, for NorduGrid). Four
executors have been developed, for LCG (Lexor), NorduGrid (Dulcinea), Grid3 (Capone) and
legacy systems, allowing the Data Challenge to be run on the different Grids.
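As an illustration of the translation step performed by the executors, the following minimal sketch shows how a job definition parsed from the supervisor's XML message could be rendered either as JDL for LCG or as XRSL for NorduGrid. The XML tags, JDL attributes and XRSL fields used here are simplified assumptions, not the actual Windmill/Lexor/Dulcinea schema.

```python
# Sketch of the executor translation step (hypothetical, not the production
# Windmill/Lexor/Dulcinea code): a job definition parsed from the supervisor's
# XML message is rendered either as JDL (LCG) or XRSL (NorduGrid).
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<job>
  <name>dc2.simul._00042</name>
  <executable>atlas_simul.sh</executable>
  <inputfile>dc2.evgen._00042.pool.root</inputfile>
  <outputfile>dc2.simul._00042.pool.root</outputfile>
  <cputime>1500</cputime>
</job>
"""

def parse_job(xml_text):
    root = ET.fromstring(xml_text)
    return {tag: root.findtext(tag) for tag in
            ("name", "executable", "inputfile", "outputfile", "cputime")}

def to_jdl(job):
    """Render the job as (simplified) JDL for an LCG Resource Broker."""
    return "\n".join([
        "[",
        '  Executable = "%s";' % job["executable"],
        '  InputSandbox = {"%s"};' % job["inputfile"],
        '  OutputSandbox = {"%s"};' % job["outputfile"],
        "  Requirements = other.GlueCEPolicyMaxCPUTime > %s;" % job["cputime"],
        "]",
    ])

def to_xrsl(job):
    """Render the same job as (simplified) XRSL for an ARC (NorduGrid) site."""
    return "".join([
        '&(executable="%s")' % job["executable"],
        '(inputFiles=("%s" ""))' % job["inputfile"],
        '(outputFiles=("%s" ""))' % job["outputfile"],
        '(cpuTime="%s")' % job["cputime"],
        '(jobName="%s")' % job["name"],
    ])

if __name__ == "__main__":
    job = parse_job(SAMPLE_XML)
    print(to_jdl(job))
    print(to_xrsl(job))
```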
For data management, a central server, Don Quijote (DQ), offers a uniform layer over the
different replica catalogues of the three Grid flavours, so that all copy and registration
operations are performed through calls to DQ. The automatic production system submitted
about 235,000 jobs belonging to 158,000 job definitions in the database, producing around
250,000 logical files and reaching approximately 2500 to 3500 jobs per day, evenly
distributed over the three Grid flavours. Overall these jobs consumed approximately
1.5 million SI2k-months of CPU (equivalent to about 5000 present-day CPUs) and produced
more than 30 TB of physics data.
When an LCG job is received by Lexor, it builds the corresponding JDL description, creates
some scripts for data staging, and sends everything to a dedicated, standard Resource Broker
(RB) through a Python module built on top of the workload management system (WMS) API.
The requirements specified in the JDL let the RB choose a site where the ATLAS software is
present and the requested amount of computation (expressed in SpecInt2000 × time) is
available. An additional requirement is good outbound connectivity, which is necessary for
data staging.
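These JDL requirements can be pictured with the following hedged sketch. The Glue attribute names are taken from the standard information schema, but the exact expressions and thresholds generated by Lexor are assumptions for illustration only.

```python
# Sketch of how a Lexor-like executor might compose the JDL Requirements
# expression described above (illustrative attribute names and values, not the
# exact strings produced by Lexor).
def build_requirements(atlas_release, si2k_seconds):
    clauses = [
        # The site must advertise the required ATLAS software installation.
        'Member("VO-atlas-release-%s", '
        'other.GlueHostApplicationSoftwareRunTimeEnvironment)' % atlas_release,
        # Enough CPU capacity: normalise the requested SI2k*time by the
        # per-CPU SpecInt2000 rating advertised by the Computing Element.
        "(other.GlueCEPolicyMaxCPUTime * other.GlueHostBenchmarkSI00 * 60) "
        ">= %d" % si2k_seconds,
        # Good outbound connectivity, needed for data staging.
        "other.GlueHostNetworkAdapterOutboundIP == true",
    ]
    return "Requirements = " + " && ".join(clauses) + ";"

print(build_requirements("10.0.1", 1500 * 400))
```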
Dulcinea was implemented as a C++ shared library, which was then imported into the
production system's Python framework. The executor calls the ARC user interface API and
the Globus RLS API to perform its tasks. The job description received from the Windmill
supervisor in the form of an XML message is translated by the Dulcinea executor into an
extended resource specification language (XRSL) [15] job description. This job description is
then sent to one of the ARC-enabled sites, a suitable site being selected using the resource-
brokering capabilities of the ARC user interface API. The brokering takes into account,
among other things, the availability of free CPUs and the amount of data that needs to be
staged in at each site to perform a specific task. The lookup of input data files in the RLS
catalogue and the stage-in of these files to the site are done automatically by the ARC Grid
Manager. The same is true for the stage-out of output data to a storage element and the
registration of these files in the RLS catalogue. The Dulcinea executor only has to add the
additional RLS attributes needed by the Don Quijote data management system to the existing
file registrations.
In other respects, too, the Dulcinea executor takes advantage of the capabilities of the ARC
middleware: it does not have to keep any local information about the jobs it is handling, but
can rely on the job information provided by the Grid information system.
Grid3 involved 27 sites with a peak of 2800 processors.
The 82 deployed LCG sites in 22 countries contributed a peak of 7269 processors and a total
storage capacity of 6558 TB. In addition to problems related to the Globus Replica Location
Service (RLS), the Resource Broker and the information system were unstable in the initial
phase. It was not only the Grid software that needed many bug fixes; another common source
of failure was the misconfiguration of sites.
In total 22 sites in 7 countries participated in DC2 through NorduGrid/ARC, with 700 of their
3,000 CPUs dedicated to ATLAS. The number of middleware-related problems was
negligible, apart from the initial instability of the RLS server; most job failures were due to
site-specific hardware problems.
7.1.3 CMS
All CMS computing data challenges are constructed to prepare for LHC running and include
the definition of the computing infrastructure, the definition and set-up of the analysis
infrastructure, and the validation of the computing model. By design they entail each year a
factor of 2 increase in complexity over the previous year, leading to a full-scale test in 2006.
Even though their primary goal is to gradually build the CMS computing system in time for
the start of LHC, they are tightly linked to other CMS activities and provide computing
support for production and analysis of the simulated data needed for studies on detector,
trigger and DAQ design and validation, and for physics system setup.
The purpose of the 2004 Data Challenge (DC04) was to demonstrate the ability of the CMS
computing system to cope with a sustained data-taking rate equivalent to 25 Hz at a luminosity
of 0.2 × 10³⁴ cm⁻² s⁻¹ for a period of one month. This corresponds to 25% of the LHC start-up
rate (or 5% of the full-scale LHC system).
The CMS Data Challenge in 2004 (DC04) had the following phases:
Reconstruction of data on the CERN Tier-0 farm for a sustained period at 25Hz.
Data distribution to Tier-1 and Tier-2 sites.
Prompt data analysis at remote sites on arrival of data.
Monitoring and archiving of resource and process information.
The aim of the challenge was to demonstrate the feasibility of operating this full processing
chain.
PCP04 Data Production
About 50 million events were required to match the 25 Hz rate for a month; in fact, more
than 70 million events were requested by the CMS physicists. These were simulated during
2003 and the first months of 2004, and about 35 million of them were digitized in time for the
start of DC04. This task is known as the Pre-Challenge Production for DC04 (PCP04).
Simulation of further events and digitization of the whole sample continued after the end of
DC04. All events are being used by CMS physicists for the analyses needed for the Physics
Technical Design Report.
Data production runs in a heterogeneous environment in which some of the computing centres
do not make use of Grid tools while the others use two different Grid systems: LCG in Europe
and Grid3 in the USA. A set of tools, OCTOPUS, provides the needed functionality.
The workload management is done in two steps. The first step assigns production slots to
regional centres; the brokering is done by the production manager, who knows which
validated sites are ready to take work. The second step assigns the actual production jobs to
CPU resources; this brokering is performed either by the local resource manager or by a Grid
scheduler, which in the case of LCG is the Resource Broker and in the case of Grid3 is the
match-making procedure within Condor. RefDB is a database located at CERN where all the
information needed to produce and analyze data is kept. It allows the submission of
processing requests by the physicists, the assignment of work to the distributed production
centres and the browsing of the status of the requests. Production assignments are created by
the production team and assigned to centres that have demonstrated the ability to produce data
properly (via the execution of a validation assignment). At each site, McRunJob is used to
create the actual jobs that produce or analyze the data following the directives stored in
RefDB. Jobs are prepared and eventually submitted to local or distributed resources. Each job
is instrumented to send information about its running status to a dedicated database (BOSS)
and to update RefDB if the job finishes successfully. Information sent to RefDB by a given
job is processed by a validation script implementing the necessary checks, after which RefDB
is updated with information about the produced data. The RLS catalogue, also located at
CERN, was used during PCP as a file catalogue by the LCG Grid jobs.
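The two-step workload management described above can be pictured with the following conceptual sketch. The data structures, centre names and event counts are hypothetical stand-ins, not the real RefDB or McRunJob code.

```python
# Conceptual sketch of the two-step PCP04 workload management (hypothetical
# structures): step 1 assigns a production request to a validated regional
# centre, step 2 splits the assignment into jobs handed to the local or Grid
# scheduler.
VALIDATED_CENTRES = {"CNAF": "LCG", "FNAL": "Grid3", "PIC": "LCG", "Lyon": "local"}

def assign_request(request, centre):
    """Step 1: the production manager brokers a request to a validated centre."""
    if centre not in VALIDATED_CENTRES:
        raise ValueError("%s has not passed a validation assignment" % centre)
    return {"request": request, "centre": centre,
            "scheduler": VALIDATED_CENTRES[centre]}

def create_jobs(assignment, events_per_job=250):
    """Step 2: a McRunJob-like step turns the assignment into individual jobs."""
    total = assignment["request"]["events"]
    jobs = []
    for first in range(0, total, events_per_job):
        jobs.append({
            "dataset": assignment["request"]["dataset"],
            "first_event": first,
            "n_events": min(events_per_job, total - first),
            "submit_via": assignment["scheduler"],  # RB, Condor matchmaking or local batch
        })
    return jobs

request = {"dataset": "mu03_DY2mu", "events": 1000}
jobs = create_jobs(assign_request(request, "CNAF"))
print(len(jobs), "jobs prepared for submission via", jobs[0]["submit_via"])
```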
SRB (Storage Resource Broker) was used for moving data among the regional centres and
eventually to CERN, where the data were used as input to the subsequent steps of the data
challenge.
DC04 Reconstruction
Digitized data were stored in the CASTOR Mass Storage System at CERN. A fake on-line
process made these data available as input for the reconstruction at a rate of 40 MB/s.
Reconstruction jobs were submitted to a computer farm of about 500 CPUs at the CERN
Tier-0. The produced data (4 MB/s) were stored in a CASTOR stage area, so that files were
automatically archived to tape. Some limitations in the use of CASTOR at CERN, due to
overload of the central tape stager, were found during DC04 operations.





DC04 Data Distribution
For DC04 CMS developed a data distribution system over available Grid point-to-point file
transfer tools, to form a scheduled large-scale replica management system. The distribution
system was based on a structure of semiautonomous software agents collaborating by sharing
state information through a Transfer Management DataBase (TMDB). A distribution network
with a star topology was used to propagate replicas from CERN to 6 Tier-1s and multiple
associated Tier-2s in the USA, France, UK, Germany, Spain and Italy. Several data transfer
tools were supported: the LCG Replica Manager tools, Storage Resource Manager (SRM)
specific transfer tools, and the Storage Resource Broker (SRB). A series of “export buffers” at
CERN were used as staging posts to inject data into the domain of each transfer tool.
Software agents at Tier-1 sites replicated files, migrated them to tape, and made them
available to associated Tier-2s. The final number of file-replicas at the end of the two months
of DC04 was ~3.5 million. The data transfer (~6TB of data) to Tier-1s was able to keep up
with the rate of data coming from the reconstruction at Tier-0. The total network throughput
was limited by the small size of the files being pushed through the system.
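The agent/TMDB pattern described above can be sketched as follows. The table schema, state names and transfer tool are illustrative assumptions rather than the actual DC04 TMDB layout; the point is that agents coordinate only by claiming and updating state in the shared database.

```python
# Sketch of semiautonomous transfer agents sharing state through a TMDB-like
# database (illustrative schema and states, not the real DC04 implementation).
import sqlite3

def setup(db):
    db.execute("CREATE TABLE transfers (file TEXT, dest TEXT, state TEXT)")
    db.executemany("INSERT INTO transfers VALUES (?, ?, ?)",
                   [("run1_evt.root", "T1_CNAF", "at_export_buffer"),
                    ("run2_evt.root", "T1_FNAL", "at_export_buffer")])

def transfer_agent(db, tool):
    """One pass of a Tier-1 transfer agent: claim files, copy them, update state."""
    rows = db.execute("SELECT rowid, file, dest FROM transfers "
                      "WHERE state = 'at_export_buffer'").fetchall()
    for rowid, fname, dest in rows:
        ok = tool(fname, dest)          # stand-in for an SRM, SRB or Replica Manager copy
        new_state = "at_tier1_disk" if ok else "transfer_failed"
        db.execute("UPDATE transfers SET state = ? WHERE rowid = ?", (new_state, rowid))

db = sqlite3.connect(":memory:")
setup(db)
transfer_agent(db, tool=lambda f, d: True)   # dummy transfer tool for the sketch
print(db.execute("SELECT file, state FROM transfers").fetchall())
```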
A single Local Replica Catalog (LRC) instance of the LCG Replica Location Service (RLS)
was deployed at CERN to locate all the replicas. Transfer tools relied on the LRC component
of the RLS as a global file catalogue to store physical file locations.
The Replica Metadata Catalog (RMC) component of the RLS was used as global metadata
catalogue, registering the files attributes of the reconstructed data; typically the metadata
stored in the RMC was the primary source of information used to identify logical file
collections. Roughly 570k files were registered in the RLS during DC04, each with 5 to 10
replicas and 9 metadata attributes per file (up to ~1 KB metadata per file). Some performance
issues were found when inserting and querying information; the RMC was identified as the
main source of these issues. The time to insert files with their attributes in the RLS (about
3 s/file in optimal conditions) was at the limit of acceptability; moreover, service quality
degraded significantly during extended periods of constant load at the required data rate.
Metadata queries were generally too slow, sometimes requiring several hours to find all the
files belonging to a given “dataset” collection. Several workarounds were provided to speed
up the access to data in the RLS during DC04. However serious performance issues and
missing functionality, like a robust transaction model, still need to be addressed.
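A toy model of the catalogue split described above is sketched below. It is purely illustrative (in-memory dictionaries, invented file names): the LRC maps each file to its physical replicas, while the RMC holds per-file metadata attributes, so that a "dataset" query has to scan the metadata of every registered file, which is why such lookups scaled poorly at the ~570k-file level of DC04.

```python
# Toy model of the LRC/RMC split (illustrative only).
lrc = {   # logical file name -> list of physical replicas
    "evt_0001.root": ["srm://cern.example.org/evt_0001", "srm://cnaf.example.org/evt_0001"],
    "evt_0002.root": ["srm://cern.example.org/evt_0002", "srm://pic.example.org/evt_0002"],
}
rmc = {   # logical file name -> metadata attributes (up to ~1 KB per file in DC04)
    "evt_0001.root": {"dataset": "bt03_ttbb", "owner": "DC04", "run": 1},
    "evt_0002.root": {"dataset": "mu03_DY2mu", "owner": "DC04", "run": 2},
}

def files_in_dataset(dataset):
    # A dataset query scans the metadata of every registered file; with ~570k
    # entries this access pattern was a major source of the slow queries.
    return [f for f, attrs in rmc.items() if attrs["dataset"] == dataset]

for f in files_in_dataset("bt03_ttbb"):
    print(f, "->", lrc[f])
```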
DC04 Data Analysis
Prompt analysis of reconstructed data on arrival at a site was performed in quasi real time at
the Italian and Spanish Tier-1 and Tier-2 centres using a combination of CMS-specific
triggering scripts coupled to the data distribution system and the LCG infrastructure. A set of
software agents and automatic procedures were developed to allow analysis-job preparation
and submission as data files were replicated to Tier-1s. The data arriving at the Tier-1
CASTOR data server (Storage Element) were replicated to disk Storage Elements at Tier-1
and Tier-2 sites by a Replica agent. Whenever new files were available on disk the Replica
agent was also responsible for notifying an Analysis agent, which in turn triggered job
preparation when all files of a given file set (run) were available. The jobs were submitted to
an LCG-2 Resource Broker, which selected the appropriate site to run the jobs.
The official release of the CMS software required for analysis (ORCA) was pre-installed on
LCG-2 sites by the CMS software manager by running installation Grid jobs. The ORCA
analysis executable and libraries for specific analyses were sent with the job.
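The coupling between the Replica agent and the Analysis agent described above can be pictured with the following sketch. The class, run and file names are hypothetical; the real DC04 agents were separate processes communicating through notifications, whereas here the mechanism is collapsed into a single in-memory object for clarity.

```python
# Sketch of the Replica/Analysis agent coupling (hypothetical structures):
# the Replica agent notifies the Analysis agent of each new file on disk, and
# an analysis job is prepared only once every file of the file set (run) has
# arrived.
from collections import defaultdict

EXPECTED = {"run_17": {"run17_f1.root", "run17_f2.root", "run17_f3.root"}}

class AnalysisAgent:
    def __init__(self):
        self.on_disk = defaultdict(set)

    def notify(self, run, filename):
        """Called by the Replica agent after a file is replicated to a disk SE."""
        self.on_disk[run].add(filename)
        if self.on_disk[run] == EXPECTED[run]:
            self.submit_job(run)

    def submit_job(self, run):
        # In DC04 this step built the job and sent it to the LCG-2 Resource Broker.
        print("preparing analysis job for", run, "with", len(EXPECTED[run]), "files")

agent = AnalysisAgent()
for f in sorted(EXPECTED["run_17"]):
    agent.notify("run_17", f)   # the job is submitted when the last file arrives
```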
The analysis job was submitted from the User Interface (UI) to the Resource Broker (RB),
which interpreted the user requirements specified using the job description language (JDL).
The Resource Broker queried the RLS to discover the location of the input files needed by the
job and selected the Computing Element (CE) hosting those data. The LCG information
system was used by the Resource Broker to find the information about the available Grid
resources (Computing Elements and Storage Elements). A Resource Broker and an
Information System reserved for CMS were set up at CERN.
CMS could dynamically add or remove resources as needed. The jobs ran on Worker Nodes,
performing the following operations: establish a CMS environment, including access to the
pre-installed ORCA; read the input data from a Storage Element (using the rfio protocol
whenever possible, otherwise via LCG Replica Manager commands); execute the user-
provided executable; store the job output on a data server; and register it in the RLS to make it
available to the whole collaboration.
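The sequence executed by each job on a Worker Node can be sketched as follows. The command names, paths and environment variable are placeholders for the real CMS DC04 scripts, and the copy/registration commands are shown only as generic stand-ins for the rfio and Replica Manager tools mentioned above.

```python
# Sketch of a worker-node job wrapper following the steps listed above
# (placeholder commands and paths, not the actual CMS scripts).
import os
import subprocess

def _call(cmd):
    """Run a command, treating a missing executable as a failure."""
    try:
        return subprocess.call(cmd)
    except OSError:
        return 1

def stage_in(lfn, local):
    # Prefer direct rfio access; fall back to an LCG Replica Manager copy.
    return any(_call(cmd) == 0 for cmd in
               (["rfcp", lfn, local], ["lcg-cp", "lfn:" + lfn, "file:" + local]))

def run_analysis(lfn, executable="./user_analysis"):
    os.environ["ORCA_DIR"] = "/opt/cms/orca"   # placeholder for the pre-installed ORCA setup
    if not stage_in(lfn, "input.root"):
        return False
    if _call([executable, "input.root", "output.root"]) != 0:
        return False
    # Store the output on a data server and register it in the RLS so that it
    # becomes available to the whole collaboration (placeholder command).
    return _call(["lcg-cr", "-d", "se.example.org", "file:output.root"]) == 0
```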
The automated analysis ran quasi-continuously for two weeks, submitting a total of more than
15000 jobs, with a job completion efficiency of 90-95%. Taking into account that the number
of events per job varied from 250 to 1000, the maximum rate of jobs, ~260 jobs/hour,
translated into a rate of analyzed events of about 40 Hz. The LCG submission system could
cope very well with this maximum rate of data coming from CERN. The Grid overhead for
each job, defined as the difference between the job submission time and the time at which
execution started, was on average around 2 minutes.
appearance of the file at CERN and the start of the analysis job at the remote sites was
measured during the last days of DC04 running.
DC04 Monitoring
MonALISA and GridICE were used to monitor the distributed analysis infrastructure, collecting
detailed information about nodes and service machines (the Resource Broker, and the
Computing and Storage Elements), and were able to notify the operators in the event of
problems. CMS-specific job monitoring was managed using BOSS. BOSS extracts the specific
job information to be monitored from the standard output and error of the job itself and stores
it in a dedicated MySQL database. The job submission time, the start and end times of
execution and the executing host are monitored by default. The user can also provide BOSS
with a description of the parameters to be monitored and of the way to access them, by
registering a job-type. An analysis-specific job-type was defined to collect information such
as the number of analyzed events and the datasets being analyzed.
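The job-type mechanism can be pictured as a small filter over the job's standard output, as sketched below. The patterns and parameter names are illustrative assumptions; the real BOSS stores the registered job-types and the extracted values in its MySQL database.

```python
# Sketch of a BOSS-style job-type: scan the job's standard output for
# user-declared patterns and collect the extracted values (illustrative
# patterns, not the real analysis job-type).
import re

JOBTYPE_PATTERNS = {
    "events_analyzed": re.compile(r"^Processed\s+(\d+)\s+events"),
    "dataset":         re.compile(r"^Opening dataset\s+(\S+)"),
}

def monitor(stdout_lines):
    values = {}
    for line in stdout_lines:
        for key, pattern in JOBTYPE_PATTERNS.items():
            m = pattern.match(line)
            if m:
                values[key] = m.group(1)   # in BOSS this would be written to the MySQL DB
    return values

sample = ["Opening dataset bt03_ttbb", "Processed 250 events"]
print(monitor(sample))   # {'dataset': 'bt03_ttbb', 'events_analyzed': '250'}
```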
DC04 Summary
About 100 TB of simulated data in more than 700,000 files were produced during the
pre-production phase, corresponding to more than 400 kSI2k-years of CPU. Data were
reconstructed at the Tier-0, distributed to all Tier-1 centres and re-processed at those sites at a
peak rate of 25 Hz (4 MB/s output rate). This rate was sustained only for a limited amount of
time (one full day); nevertheless the functionality of the full chain was demonstrated.
The main outcomes of the challenge were:
        the production system was able to cope with a heterogeneous environment (local, Grid3
        and LCG) with high efficiency in the use of resources;
        local reconstruction at the Tier-0 could cope well with the planned rate; some overload of
        the CERN CASTOR stager was observed;
        a central catalogue implemented using the LCG RLS, managing at the same time the
        location of files and their attributes, was not able to cope with the foreseen rate;
        the data transfer system was able to cope with the planned rate and to deal with multiple
        point-to-point transfer systems;
        the use of the network bandwidth was not optimal, due to the small size of the files;
        the use of MSS at the Tier-1 centres was limited by the large number of small files it had
        to deal with; only about one third of the transferred data was safely stored on the Tier-1 MSS;
        quasi-real-time analysis at the Tier-1 centres could cope well with the planned rate; a
        median latency of ~20 minutes was measured between the appearance of a file at CERN
        and the start of the analysis job at the remote sites.




The main issues addressed after the end of DC04 are the optimization of file sizes and the re-
design of the data catalogues.

7.1.4 LHCb & LCG

7.1.4.1 Introduction
In this section the LHCb use of the LCG Grid during Data Challenge 2004 (DC'04) is
described. The limitations of the LCG at the time and the lessons learnt are highlighted. We
also summarise the baseline services that LHCb needs from the LCG in order for its data to be
processed and analysed in the Grid environment in 2007. The detailed implementation of
these services within the LHCb environment is described earlier in this document.
7.1.4.2 Use of LCG Grid
The results described in this section reflect the experiences and the status of the LCG during
the LHCb data challenge in 2004 and early 2005. The data challenge was divided into three
phases:
        Production: Monte Carlo simulation
        Stripping: Event pre-selection
        Analysis
The main goal of the Data Challenge was to stress test the LHCb production system and to
perform distributed analysis of the simulated data. The production phase was carried out with
a mixture of LHCb-dedicated resources and LCG resources. LHCb achieved its goal of using
LCG to provide at least 50% of the total production capacity.
7.1.4.3 Production
The DC04 production used the Distributed Infrastructure with Remote Agent Control
(DIRAC) system. DIRAC was used to control resources both at DIRAC dedicated sites and
those available within the LCG environment.
A number of central services were deployed to serve the Data Challenge. The key services
are:
A production database where all prepared jobs to be run are stored
A Workload Management System that dispatches jobs to all the sites according to a “pull”
paradigm
Monitoring and accounting services that are necessary to follow the progress of the Data
Challenge and allow the breakdown of resources used
A Bookkeeping service and the AliEn File Catalog to keep track of all datasets produced
during the Data Challenge.
Before the production can commence, the production application software must be prepared
for shipping. It is an important requirement for the DIRAC system to be able to install new
versions of the production software soon after their release by the production manager. All
the information describing the production tasks is stored in the production database; in
principle the only human intervention during the production by the central manager is to
prepare the production tasks for DIRAC. The first step of production is the preparation of a
workflow, which describes the sequence of applications to be executed together with the
necessary application parameters. Once the workflow is defined, a production run can be
instantiated. The production run determines a set of data to be produced under the same
conditions. The production run is split into jobs, which are the units of the scheduling
procedure. Each DIRAC production agent request from a worker node is served with a single
job. When new datasets are produced on the worker nodes they are registered by sending an
XML dataset description to the bookkeeping service. The output datasets are then transferred
to the associated Tier-1 and the replicas are registered in the bookkeeping service.
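The structure described above, workflow to production run to jobs, together with the "pull" paradigm in which each agent request is served a single job, can be sketched as follows. The data layout and step names are hypothetical, not the real DIRAC schema.

```python
# Conceptual sketch of the DIRAC production structure (hypothetical layout):
# a workflow is instantiated as a production run, the run is split into jobs,
# and each agent pull request from a worker node is served exactly one job.
workflow = {"steps": ["Gauss", "Boole", "Brunel"],   # generation, digitisation, reconstruction
            "options": "DC04-settings.opts"}

def instantiate_run(workflow, run_id, total_events, events_per_job):
    """Split a production run into jobs, the units of the scheduling procedure."""
    return [{"run": run_id, "job": j, "workflow": workflow, "n_events": events_per_job}
            for j in range(total_events // events_per_job)]

job_queue = iter(instantiate_run(workflow, run_id=1234,
                                 total_events=5000, events_per_job=500))

def serve_agent_request(queue):
    """The WMS answers each agent's pull request with a single job (or None)."""
    return next(queue, None)

print(serve_agent_request(job_queue))   # first job handed to the first requesting agent
```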
The technologies used in this production are based on C++ (LHCb software), Python (DIRAC
tools) and XML-RPC (the protocol used to communicate between jobs and the central
services). Oracle and MySQL are the two database systems behind the services: Oracle is
used for the production and bookkeeping databases, and MySQL for the workload
management system.
On the LCG, “agent installation” jobs were submitted continuously. These jobs check whether
the Worker Node where the LCG job has been placed is configured to run an LHCb job. If
these checks are positive, the job installs the DIRAC agent, which then executes as on a
DIRAC site within the time limit allowed for the job. This mode of operation on the LCG
allowed the deployment of the DIRAC infrastructure on LCG resources and their use, together
with the other LHCb Data Challenge resources, in a consistent way.
A cron script submits DIRAC agents to a number of LCG Resource Brokers (RB). When such
a job is scheduled to a Worker Node (WN) at an LCG site, it first downloads (using HTTP) a
DIRAC tarball and deploys a DIRAC agent on the WN. The DIRAC agent is configured and
executed, and requests a workflow to execute from the DIRAC WMS. If a job is matched, the
workflow is downloaded to the WN and executed. The software is normally pre-installed with
the standard LCG software installation procedures. If the job is dispatched to a site where the
software is not installed, the installation is performed in the current working directory for the
duration of the job. All data files as well as the log files of the job are produced in the current
working directory of the job. Typically the amount of space needed is around
2 GB plus an additional 500 MB if the software needs to be installed. The bookkeeping
information (file “metadata”) for all produced files is uploaded for insertion into the LHCb
Bookkeeping Database (BKDB). At the end of the reconstruction, the DST file(s) are
transferred by GridFTP to the SEs specified for this site, usually an associated Tier-1 centre.
Once the transfer is successful, the replicas of the DST file(s) are registered into the LHCb-
AliEn file catalogue and into the replica table of BKDB. Both catalogues were accessed via
the same DIRAC interface and can be used interchangeably.
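The checks performed by such an agent installation job on the Worker Node can be sketched as follows. The thresholds, tarball URL and check functions are illustrative assumptions; in the real system the check for Python was meaningful because the installation job itself was not a Python program.

```python
# Sketch of the checks made by an "agent installation" job on an LCG Worker
# Node (illustrative thresholds and placeholder URL, not the real DIRAC code).
import os
import shutil
import sys
import urllib.request

DIRAC_TARBALL = "http://lhcb.example.org/dirac/DIRAC-agent.tar.gz"   # placeholder URL

def wn_is_suitable(workdir=".", needed_gb=2.5):
    """Check that the WN can run an LHCb job: Python present, enough scratch space."""
    has_python = sys.version_info[0] >= 2          # trivially true here; a stand-in check
    free_gb = shutil.disk_usage(workdir).free / 1e9
    return has_python and free_gb >= needed_gb

def deploy_agent(workdir="."):
    """Download the DIRAC tarball over HTTP and unpack it in the job directory."""
    target = os.path.join(workdir, "dirac-agent.tar.gz")
    urllib.request.urlretrieve(DIRAC_TARBALL, target)
    # The agent would then be unpacked, configured and started, pulling real
    # work from the DIRAC WMS for as long as the LCG job's time limit allows.

if wn_is_suitable():
    print("WN passes the checks; a DIRAC agent would now be deployed")
else:
    print("WN unsuitable; the installation job exits without pulling work")
```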
By the end of the production phase, up to 3000 jobs were executed concurrently on LCG sites.
A total of 43 LCG sites executed at least one LHCb job, with major contributions from CERN
and the LHCb proto-Tier-1 centres. In total 211k jobs were submitted to the LCG; LHCb
cancelled 26k of them after 24-36 hours in order to avoid the expiration of the proxy. Of the
remaining 185k, 113k were regarded as successful by the LCG, an efficiency of ~61%. A
breakdown of the performance is given in Table 4, and a further breakdown of these 113k jobs
is summarised in Table 5. The initialisation errors included missing Python on the worker
node, failure of the DIRAC installation, failure to connect to the DIRAC server and failed
software installation. If there were no workflows waiting to be processed in the DIRAC WMS
that matched the criteria requested by the agent, the agent would simply terminate. The
application-error category is a misnomer, as it includes not only errors of the LHCb software
but also hardware and system problems during the running of the application. The errors
encountered while transferring or registering the output data were usually recoverable. In
summary, the LCG registered that 69k jobs produced useful output datasets for LHCb, but
according to the LHCb accounting system there were 81k successful LCG jobs that produced
useful data. The interpretation is that some of the jobs recorded by the LCG as aborted did run
to completion, and that some jobs marked as not running did actually run, unbeknown to the
LCG system.


                                         Jobs (k)                     % remaining

       Submitted                           211
       Cancelled                            26
       Remaining                           185                           100.0
       Aborted (not run)                    37                            20.1
       Running                             148                            79.7
       Aborted (run)                        34                            18.5
       Done                                113                            61.2
       Retrieved                           113                            61.2
              TABLE 4: LCG EFFICIENCY DURING LHCB DC'04 PRODUCTION PHASE


                                Jobs(k)                          % retrieved

Retrieved                       113                              100.0

Initialisation error            17                               14.9

No job in DIRAC                 15                               13.1

Application error               2                                1.8

Other error                     10                               9.0

Success                         69                               61.2

Transfer error                  2                                1.8

Registration error              1                                0.6

            TABLE 5: OUTPUT SANDBOX ANALYSIS OF JOBS IN STATUS “DONE” FOR LCG
The Data Challenge demonstrated that the concept of lightweight, customizable and
simple-to-deploy DIRAC agents is very effective. Once the agent is installed, it can
effectively run autonomously. The procedure to update the DIRAC tools or to propagate bug
fixes for them is quick and easy, as long as care is taken to ensure compatibility between
DIRAC releases and ongoing operations.
To distribute the LHCb software, the installation of the software is triggered by a running job;
the distribution contains all the binaries and is independent of the Linux flavour.
Nevertheless, new services to keep track of available and obsolete packages, and a tool to
remove software packages, should be developed.
The DIRAC system relies on a set of central services. Most of these services were running on
the same machine, which ended up with a high load and too many processes. With thousands
of concurrent jobs running in normal operation, the services approach a denial-of-service
regime, with slow responses and services becoming stalled.
In the future release of the DIRAC system, the approach to error handling and reporting to the
different services should be improved.
As LCG resources were used for the first time, several areas were identified where
improvements should be made. The mechanism for uploading and retrieving the
OutputSandbox should be improved, in particular to provide information about Failed or
Aborted jobs. The management of each site should be reviewed in order to detect, and avoid,
a misconfigured site becoming a “black hole”. Information about site interventions should
also be published to the Resource Broker or to the Computing Element. In particular, both
DIRAC and the LCG need extra protection against external failures, e.g. network problems or
unexpected shutdowns.


7.1.4.4 Organised analysis
The stripping process consists of running a DaVinci program that either executes the physics
selection for a number of channels or selects events that pass the first two levels of trigger
(L0+L1). The former will be run on all signal and background events, while the latter will be
run on minimum-bias events.
The DaVinci application (including the JobOptions files) was packaged as a standard
production application so that it can be deployed through the standard DIRAC or LCG
software installation procedures. For the handling of the stripping, a database separate from
the LHCb Bookkeeping Database (BKDB), called the Processing Database (PDB), was used.
Information was extracted from the BKDB based on queries on the type of data. New files
were incrementally added to the PDB, upon the production manager's request, and initially
marked as “created”. The database is scanned for event types with enough data to be stripped;
the corresponding files are marked as “grouped” and assigned a Group tag. Jobs are then
prepared to run on all files with the same Group tag, and the files are then marked as
“prepared”. The JDL of the job contains the LFNs of all selected files; from this list of files a
GaudiCatalog corresponding to those files is created and shipped in the job's InputSandbox.
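The PDB bookkeeping can be pictured as the simple state machine sketched below, following the states quoted above ("created", "grouped", "prepared"). The record layout, thresholds and tag names are illustrative assumptions rather than the real PDB schema.

```python
# Sketch of the Processing Database (PDB) state machine for stripping
# (illustrative records and thresholds).
files = [
    {"lfn": "lfn:/lhcb/dc04/dst_%04d.dst" % i, "evtype": 11102003, "state": "created"}
    for i in range(6)
]

def group_files(files, evtype, min_files=4, group_tag="G001"):
    """Mark 'created' files of one event type as 'grouped' once enough have accumulated."""
    candidates = [f for f in files if f["state"] == "created" and f["evtype"] == evtype]
    if len(candidates) < min_files:
        return None
    for f in candidates:
        f["state"], f["group"] = "grouped", group_tag
    return group_tag

def prepare_job(files, group_tag):
    """Build the stripping job for a group: its JDL lists the LFNs of all selected files."""
    selected = [f for f in files if f.get("group") == group_tag]
    for f in selected:
        f["state"] = "prepared"
    return {"InputData": [f["lfn"] for f in selected], "GroupTag": group_tag}

tag = group_files(files, evtype=11102003)
if tag:
    print(prepare_job(files, tag))
```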








      Figure 5: Workflow diagram for the staging, stripping and merging process

The stripping process performs the following steps and the workflow is illustrated in Figure 5.
As the jobs run on a large number of files, a pre-staging takes place in order to take advantage
of the optimisation of the underlying staging process. The staging was performed using the
technology-neutral SRM interface, and the files should be pinned in the disk pool (see
Figure 5). The staging happens asynchronously. The checking and stripping steps loop,
waiting for input files to become available on the staging disk. A DaVinci application is run
on the first available file, processing a single file at a time. Depending on the outcome of
DaVinci, the file will
be declared “Stripped”, “Bad Replica” or “Problematic.” The output of the stripping will be a
stripped DST file and Event Tag Collection (ETC), all kept on the local disk. A Gaudi job is
then run using all stripped DSTs as input and producing a merged stripped DST. This step
prepares all necessary BKDB updates as well as PDB updates. It takes care of saving the files
on an SE and registering them as replicas into the file catalog(s). The ETCs are also merged,
stored and registered.
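The stage-and-process loop shown in Figure 5 can be sketched as follows. The function names stand in for the SRM/GFAL calls and the DaVinci run; they are placeholders, and the file-state handling is deliberately simplified.

```python
# Sketch of the asynchronous stage-and-process loop of the stripping workflow
# (placeholder functions, not the real GFAL/SRM or DaVinci interfaces).
import time

def stage_request(surls):
    """Issue an asynchronous SRM stage (get) request for all input files."""
    return {s: "staging" for s in surls}

def files_to_process(status):
    # The real job re-issued the SRM request and inspected the returned file
    # states; here we simply treat every pending file as available on disk.
    return [s for s in status if status[s] != "done"]

def run_davinci(surl):
    """Run DaVinci on a single file; returns 'Stripped', 'Bad Replica' or 'Problematic'."""
    return "Stripped"

def strip(surls):
    status, stripped = stage_request(surls), []
    while any(v != "done" for v in status.values()):
        for surl in files_to_process(status):      # process files as they become available
            outcome = run_davinci(surl)
            status[surl] = "done"
            if outcome == "Stripped":
                stripped.append(surl.replace(".dst", ".stripped.dst"))
        time.sleep(0)                              # a real job waits here for the stager
    return stripped                                # later merged into a single stripped DST

print(strip(["run17_001.dst", "run17_002.dst"]))
```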
SRM was used as a technology-neutral interface to the mass storage system during this phase
of the LHCb data challenge. The original plan was to commence at CERN, CNAF and PIC
(CASTOR-based sites) before moving to non-CASTOR technologies at the other LHCb
proto-Tier-1 centres, such as FZK, IN2P3, NIKHEF/SARA and RAL. The SRM interface was
installed at CNAF and PIC at the request of LHCb, and LHCb was active in helping to debug
these implementations.
The GFAL APIs were modified for LHCb to make some of the functionality requirements
described above available. The motivation for using GFAL was to hide any SRM
implementation dependencies, such as the version installed at a site. From these APIs LHCb
developed a number of simple command-line interfaces. In principle the majority of the
functionality required by LHCb was described in the SRM (version 1.0) documentation;
unfortunately, the implementations of the basic SRM interfaces did not match the functional
design. Below we describe the missing functionality and the ad hoc solutions that were used.
The inability to pin/unpin files, or to mark them for garbage collection, means that it is
possible for files belonging to an SRM request to be removed from the disk pool before being
processed. A number of temporary solutions were considered:
        Throttle the rate at which jobs are submitted to a site. This would be a large overhead for
        the production manager and needs detailed knowledge of the implementation of the disk
        pools at all sites. It also assumes that the pool in use is available only to the production
        manager; this is not the case, since SRM used the default pool assigned to the mapped
        user in the SRM server.
        Issue a new SRM request each time a file status is checked. This protects against a file
        being “removed” from the disk pool before being processed, but it was not clear what
        effect this had on the staging optimisation. This was the solution adopted.
        Use technology-specific commands to pin and later remove the processed file from disk.
        This assumes that such commands are available on the worker nodes (not always the
        case) and that an information service exists that maps a site to a storage technology.
A problem was found when SRM requested a corrupted (or non-existent) file. Although the
stage request was accepted, none of the files was returned in a “ready” status, and no error
was returned by GFAL/SRM to inform the user that there was a problem with the original
stage request. This was an implementation problem associated with CASTOR. The only ways
to avoid this problem are to remove manually every corrupted file as it comes to light, or to
issue a new SRM request each time a file status is checked.
Originally there was no control over the stage pool being used. It is highly desirable to have
separate pools for production activities and user analysis jobs to remove any destructive
interference. Mapping the production users in a VO to a particular user account solved this
problem but this required intervention at the LCG system level.
7.1.4.5 End user analysis




7.2     Service challenges
So as to be ready to fully exploit the scientific potential of the LHC, significant resources
needed to be allocated to a series of Service Challenges. These challenges should be seen as
an essential on-going and long-term commitment to achieving the goal of a production quality
world-wide Grid at a scale beyond what has previously been achieved.


Whilst many of the individual components that make up the overall system are understood or
even deployed and tested, much work remains to be done to reach the required level of
capacity, reliability and ease-of-use. These problems are compounded not only by the
inherently distributed nature of the Grid, but also by the need to get large numbers of
institutes and individuals, all with existing, concurrent and sometimes conflicting
commitments, to work together on an incredibly aggressive timescale.


The service challenges must be run in an environment that is as realistic as possible, which
includes end-to-end testing of all key experiment use-cases over an extended period,
demonstrating that the inevitable glitches and longer-term failures can be handled gracefully
and recovered from automatically. In addition, as the service level is built up by subsequent
challenges, they must be maintained as stable production services on which the experiments
test their computing models.
7.2.1 Summary of Tier0/1/2 Roles


Whilst there are differences between the roles assigned to the tiers for the various
experiments, the primary functions are as follows:
Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution
of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-
times;
Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale
reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s
and safe keeping of a share of simulated data produced at these Tier2s;
Tier2: Handling analysis requirements and proportional share of simulated event production
and reconstruction.
7.2.2 Overall Workplan

In order to ramp up the services that are part of LCG phase 2, a series of Service Challenges
are being carried out. These start with the basic infrastructure, including reliable file transfer
services, and gradually increase from a subset of the Tier1 centres together with CERN to
finally include all Tier1s, the main Tier2s and the full functionality required by the LHC
experiments’ offline processing, including analysis.


The first two challenges – December 2004 and March 2005 – focused on the basic
infrastructure and involved neither the experiments nor Tier2 sites. Nevertheless, the
experience from these challenges proved extremely useful in building up the services and in
understanding the issues involved in offering stable production services around the clock for
extended periods.


During the remainder of 2005, the Service Challenges will expand to include all the main
offline Use Cases of the experiments apart from analysis and will begin to include selected
Tier2 sites. Additional components over the basic infrastructure will be added step by step,
including experiment-specific solutions. It is important to stress that each challenge includes a
setup period, during which residual problems are ironed out, followed by a period that
involves the experiments but during which the focus is on the “service”, rather than any data
that may be generated and/or transferred (that is, the data are not necessarily preserved and the
storage media may be periodically recycled). Finally, there is an extended service phase
designed to allow the experiments to exercise their computing models and software chains.


Given the significant complexity of the complete task, we break down the overall workplan as
below.




7.2.3 CERN / Tier0 Workplan
The workplan for the Tier0 and for CERN in general covers not only ramping up the basic
services to meet the data transfer needs, but also includes playing an active coordination role
– the activity itself being reviewed and monitored by the Grid Deployment Board. This
coordination effort involves interactions with the experiments, with the Tier1 sites and,
through these and appropriate regional bodies such as ROCs and national Grid projects, with
the Tier2s. In conjunction with other activities of the LCG and related projects, it also involves
making available the necessary software components – such as the Reliable File Transfer
software and the Disk Pool Manager, together with associated documentation and installation
guides. This activity is clearly strongly linked to the overall Grid Deployment plan for CERN,
described in more detail elsewhere in this document.
7.2.4 Tier1 Workplan
The basic goals of 2005 and early 2006 are to add the remaining Tier1 sites to the challenges,
whilst progressively building up the data rates and adding additional components to the
services. The responsibility for planning and executing this build up lies with the Tier1s
themselves, including the acquisition of the necessary hardware, the setting up of services,
together with adequate manpower to maintain them at the required operational level 24 hours
a day, 365 days a year. This requires not only managed disk and tape storage with an agreed
SRM interface, together with the necessary file transfer services and network infrastructure,
but also sufficient CPU resources to process (and where appropriate generate) the data that
will be produced in service challenges 3 and 4. The data rates that each Tier1 is expected to
support in service challenge 3 are 150 MB/s to managed disk and 60 MB/s to managed tape.
By the end of service challenge 4, these need to be increased to the full nominal operational
rates and increased again by an additional factor of 2 by the time that the LHC enters
operation.


The following table gives the Tier-1 centres that have been identified at present, with an
indication of the experiments that will be served by each centre. Many of these sites offer
services for multiple LHC experiments and will hence have to satisfy the integrated rather
than individual needs of the experiments concerned.


    Centre                                   ALICE          ATLAS          CMS          LHCb
    ASCC, Taipei                                               X             X
    CNAF, Italy                                  X             X             X            X
    PIC, Spain                                                 X             X            X
    IN2P3, Lyon                                  X             X             X            X
    GridKA, Germany                              X             X             X            X
    RAL, UK                                      X             X             X            X
    BNL, USA                                                   X
    FNAL, USA                                                                X
    TRIUMF, Canada                                             X
    NIKHEF/SARA, Netherlands                     X             X                          X
    Nordic Centre                                X             X




CHEP (Korea) has also indicated that it might become a Tier1 centre for CMS.
A Tier1 site for ALICE in the US is also expected.


7.2.5 Tier2 Workplan
The role that the Tier2 sites will play varies between the experiments, but globally speaking
they are expected to contribute significantly to Monte Carlo production and processing, the
production of calibration constants and in most cases also analysis. In general, however, they
will not offer guaranteed long-term storage and will hence require such services from Tier1
sites, from which they will typically download data subsets for analysis and to which they will
upload Monte Carlo data. This implies that they will need to offer some level of reliable file
transfer service, as well as provide managed storage, typically disk-based. On the other hand, they are not
expected to offer as high a level of service as the Tier0 or Tier1 sites. Over one hundred Tier2
sites have currently been identified and we outline below the plan for ramping up the required
services, with a focus on those required for the service challenges.


In the interests of simplicity, it is proposed that Tier2 sites are normally configured to upload
Monte Carlo data to a given Tier1 (which can if necessary be dynamically redefined) and that
the default behaviour, should the “link” to this Tier1 become unavailable (e.g. if the Tier1 is
down for air-conditioning maintenance), be to stop and wait. On the other hand, any Tier2
must be able to access data at or from any other site (some of the data being split across sites),
so as not to limit a physicist’s ability to perform analysis by her/his geographic location. This
logical view should, however, not constrain the physical network topology.


A small number of Tier2 sites have been identified to take part in service challenge 3, where
the focus is on upload of Monte Carlo datasets to the relevant Tier1 site, together with the
setup of the managed storage and file transfer services. These sites have been selected in
conjunction with the experiments, giving precedence to sites with the relevant local expertise
and manpower. We note that both US-ATLAS and US-CMS are actively involved with Tier2
sites in the US for their respective experiments.


     Site                             Tier1                             Experiment
     Bari, Italy                      CNAF, Italy                       CMS
Turin, Italy                      CNAF, Italy                       ALICE
     DESY, Germany                    FZK, Germany                      ATLAS, CMS
     Lancaster, UK                    RAL, UK                           ATLAS
     London, UK                       RAL, UK                           CMS
     ScotGrid, UK                     RAL, UK                           LHCb
     US Tier2s                        BNL / FNAL                        ATLAS / CMS
              Table 6 – Partial List of Candidate Tier2 sites for Service Challenge 3


In addition to the above, both Budapest and Prague have expressed their interest in early
participation in the Service Challenges, and this list is expected to grow.


As a longer term goal, the issue of involving all Tier2 sites is being addressed initially
through national and regional bodies such as GridPP in the UK, INFN in Italy and US-
ATLAS / US-CMS. These bodies are expected to coordinate the work in the respective
region, provide guidance on setting up and running the required services, give input regarding
the networking requirements and participate in setting the goals and milestones. The initial
target is to have these sites set up by the end of 2005 and to use the experience to address all
remaining sites, including via workshops and training, during the first half of 2006.


    Tier2 Region       Coordinating Body          Comments
    Italy              INFN                       A workshop is foreseen for May, during which hands-on
                                                  training on the Disk Pool Manager and File Transfer
                                                  components will be held.
    UK                 GridPP                     A coordinated effort to set up managed storage and File
                                                  Transfer services is being managed through GridPP and
                                                  monitored via the GridPP T2 deployment board.
    Asia-Pacific       ASCC Taipei                The services offered by and to Tier2 sites, together with
                                                  a basic model for Tier2 sites, will be exposed at the
                                                  Service Challenge meeting held at ASCC in April 2005.
    Europe             HEPiX                      A similar activity will take place at HEPiX at FZK in
                                                  May 2005, together with detailed technical presentations
                                                  on the relevant software components.
    US                 US-ATLAS and US-CMS        Tier2 activities in the US are being coordinated through
                                                  the corresponding experiment bodies.
    Canada             TRIUMF                     A Tier2 workshop will be held around the time of the
                                                  Service Challenge meeting to be held at TRIUMF in
                                                  November 2005.
    Other sites        CERN                       One or more workshops will be held to cover those Tier2
                                                  sites with no obvious regional or other coordinating
                                                  body, most likely end 2005 / early 2006.
                            Table 7 - Initial Tier2 Activities by Region
7.2.6 Network Workplan
The network workplan is described elsewhere in this document. As far as the service
challenges are concerned, the principal requirement is that the bandwidth and connectivity
between the various sites should be consistent with the schedule and goals of the service
challenges. Only modest connectivity is required between Tier2 sites and Tier1s during 2005,
as the primary focus during this period is on functionality and reliability. However,
connections of 10Gb/s are required from CERN to each Tier1 no later than end 2005.
Similarly, connectivity between the Tier1s at 10Gb/s is also required by summer 2006 to
allow the analysis models to be fully tested. Tier1-Tier2 connectivity of at least 1Gb/s is also
required on this timescale, to allow both Monte Carlo upload and analysis data download.
7.2.7 Experiment Workplan
The experiment-specific workplans and deliverables are still in the process of being defined.
However, at the highest level, the overall goals for service challenge 3 are to test all aspects of
their offline computing models except for the analysis phase, which in turn will be included in
service challenge 4. It is expected that the data access and movement patterns that
characterize the individual computing models will initially be exercised by some scripts, then
by running the offline software without preserving the output data beyond what is required to
verify the network and/or disk-to-tape transfers, and finally by a full production phase that is
used to validate their computing models and offline software on the basis of the service that
has been established during the initial stages. The experiment-specific components and
services need to be identified by early April, so that component testing can commence in May
followed by integration testing in June. An important issue will be the identification and
provisioning of the resources required for running the production chains and for storing the
resultant data.


It is currently expected that ATLAS and LHCb will become actively involved in the service
challenges in the October 2005 timeframe, although work has already started on identifying
the components that will be required in addition to the reliable file transfer service and on
establishing a detailed workplan.
Both ALICE and CMS expect to be ready to participate as early as the targeted start date of
SC3 – namely July 2005 – and would be interested in using some of the basic components,
such as the reliable file transfer service, even earlier.




A regular series of meetings will commence with the experiments (one by one) to identify the
various experiment-specific components that need to be put in place and to elaborate a
detailed work plan, both for the SC3 setup and pre-production phases and for the various
phases of the challenge itself, including the Service phase. It is expected that the issue of analysis will
also be raised, even if not formally a goal of SC3 (the data produced by the experiments
during the service phase will clearly be analysed by the physicists involved in the respective
collaborations).
7.2.8 Selection of Software Components
Given the focus of the initial challenges on the reliable file transfer service, it is natural that
this component was the first to be selected. This has been done on the basis of an extensive
list of requirements together with stress testing of the candidate software. This software – the
gLite File Transfer Service (FTS) component – was required to meet the full list of
requirements as well as run reliably for a week in an environment as close as possible to
production prior to the March 2005 Service Challenge meeting. The gLite File Transfer
Service is the lowest-level data movement service defined in the gLite software architecture.
It is responsible for reliably moving sets of files from one site to another, allowing the
participating sites to control the network resource usage. It provides an agent-based
mechanism for higher-level services (such as cataloguing and VO-specific retry policies) to
plug in.
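The idea of site-controlled network resource usage can be illustrated with the purely conceptual sketch below. This is not the gLite FTS implementation or API; it only shows the pattern of grouping transfers into jobs and scheduling them on site-to-site "channels", each with its own concurrency limit.

```python
# Conceptual sketch (not the gLite FTS code) of transfer jobs scheduled on
# site-to-site channels with a per-channel concurrency limit.
from collections import deque

class Channel:
    def __init__(self, source, dest, max_active):
        self.name = "%s-%s" % (source, dest)
        self.max_active = max_active         # knob controlled by the participating sites
        self.queue, self.active = deque(), []

    def submit(self, transfer_job):
        self.queue.extend(transfer_job)      # a transfer job is a set of files

    def schedule(self):
        while self.queue and len(self.active) < self.max_active:
            self.active.append(self.queue.popleft())
        return list(self.active)

cern_to_cnaf = Channel("CERN", "CNAF", max_active=2)
cern_to_cnaf.submit(["fileA", "fileB", "fileC"])
print(cern_to_cnaf.schedule())   # only two transfers run concurrently on this channel
```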


Similar acceptance criteria are foreseen for all other components, the list of which is in the
process of being defined, together with the experiments and the LCG Baseline Services
Working Group as appropriate.
7.2.9 Service Workplan – Coordination with Data Challenges
From Service Challenge 3 onwards, significant resources are required to generate, process and
store the data used in the challenge (except during the initial setup phases). Whilst it is clearly
important to separate the goals of testing the infrastructure and services from those of testing
the experiments' computing models and offline software, it would be highly preferable if the
"service phase" of the Service Challenge could map more or less completely onto an
experiment's Data Challenge. This will require agreement on the goals and durations of the
various phases as well as negotiation with the resource providers and all sites involved.
However, the benefits to all parties if such agreement can be reached are clear.
7.2.10 Current Production Setup
In parallel to the various Service Challenge setups, the primary WAN data transfer service out
of CERN is currently offered by "CASTORGRID". This is a load-balanced service with
special high-throughput network routing, so as not to overload the firewall. It runs both
GridFTP and SRM. At the time of writing it consists of 8 nodes, each with 2 GB of RAM,
with 2 x 1 Gbit/s of connectivity per 4 nodes.
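
As a rough cross-check (our own arithmetic, assuming the quoted connectivity is the limiting
factor and ignoring protocol and disk overheads), the aggregate wide-area bandwidth of this
setup can be estimated as follows.

# Back-of-the-envelope estimate of the CASTORGRID aggregate WAN bandwidth,
# assuming the quoted connectivity (2 x 1 Gbit/s per group of 4 nodes) is
# the limiting factor and ignoring protocol and disk overheads.
nodes = 8
nodes_per_group = 4
gbit_per_group = 2 * 1                        # 2 x 1 Gbit/s per group

groups = nodes / nodes_per_group
aggregate_gbit_s = groups * gbit_per_group    # Gbit/s
aggregate_mb_s = aggregate_gbit_s * 1000 / 8  # MB/s, decimal units

print(f"~{aggregate_gbit_s:.0f} Gbit/s, i.e. ~{aggregate_mb_s:.0f} MB/s aggregate")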


The current topology is shown below.








                           Figure 6 - CASTORGRID setup at CERN
The current network usage is relatively low, as shown in the following plot.




             Figure 7 - Network Traffic Through CASTORGRID in January 2005


7.2.11 Security Service Challenges (Draft, Dave)


This section describes the plans for a number of security service challenges during the
preparation for LHC startup. These will test the various operational procedures, e.g. incident
response, and also verify that the deployed grid middleware is producing audit logs with
appropriate detail. An important aim of the challenges is to ensure that site managers and
security officers understand their responsibilities and that audit logs are being collected and
maintained according to the agreed procedures. Experience from the service challenges will
be used to improve the audit logs and associated procedures.




7.3     Results of Service Challenge 1 & 2


Service Challenge 1 was scheduled to complete in December 2004, demonstrating a sustained
aggregate of 500 MB/sec, mass store to mass store, between CERN and three Tier-1 sites.
500 MB/sec was sustained between FNAL and CERN during three days in November. The
sustained data rate to SARA (NIKHEF) in December was only 54 MB/sec, but this had been
pushed up to 200 MB/sec by the start of SC2 in mid-March. 500 MB/sec was achieved in
January with FZK. Although the SC1 goals were not achieved, a great deal was learned at
CERN and the other sites, and we are reasonably confident that the SC2 goals will be achieved.




             Figure 8 - Data Transfer Rate between CERN and FNAL Prior to SC1




                          Figure 9 - Service Challenge 1 Setup at CERN


Service Challenge 2 started on 14 March. The goal is to demonstrate 100 MB/sec reliable file
transfer between CERN and 7 Tier-1s (BNL, CNAF, FNAL, FZK, IN2P3, NIKHEF and
RAL), with one week at a sustained aggregate throughput of 500 MB/sec at CERN. At the
time of writing this report NIKHEF, FNAL, IN2P3 and CNAF had started. The service
challenge is scheduled to finish on 8 April.




                         Figure 10 - Service Challenge 2 Setup at CERN



7.3.1 Goals of Service Challenge 3

In terms of file transfer services and data rates, the goals of Service Challenge 3, due to start
in July 2005, are to demonstrate reliable transfers at rates of 150 MB/s per Tier-1 from
managed disk to managed disk and 60 MB/s to managed tape. The total aggregate data rate
out of CERN that should be achieved is 1 GB/s. It is foreseen that all Tier-1 sites will
participate in this challenge, with the exception of PIC, the Nordic Tier-1 and any that still
have to be identified. A small number of Tier-2 sites will also be involved (see the table
above), focusing on those with good local support, both at the level of the required
infrastructure services and from the relevant experiment. In addition to building up the data
rates that can be supported at both CERN and outside sites, this challenge will include
additional components, such as catalogues, support for multiple VOs, as well as
experiment-specific solutions. It is foreseen that the challenge will start with a phase that
demonstrates the basic infrastructure, albeit with higher data rates and more sites, including
selected Tier-2s. Subsequently, the data flows and access patterns of the experiments will be
tested, initially by emulating the models described in the Computing Model documents and
subsequently by running the offline frameworks themselves. During both of these phases the
emphasis will be on the service rather than on the data, which will not normally be preserved.
Finally, an extended Service Phase is entered, currently foreseen to run from September 2005
until the end of the year, during which the experiments validate their computing models using
the facilities that have been built up during the earlier phases.
7.3.2 Goals of Service Challenge 4
Service Challenge 4 needs to demonstrate that all of the offline data processing requirements
expressed in the experiments' Computing Models, from raw data taking through to analysis,
can be handled by the Grid at the full nominal data rate of the LHC. All Tier-1 sites need to be
involved, together with the majority of the Tier-2s. The challenge needs to complete
successfully at least 6 months prior to data taking. The service that results from this challenge
becomes the production service for the LHC and is made available to the experiments for
final testing, commissioning and processing of cosmic ray data. In parallel, the various centres
need to ramp up their capacity to twice the nominal data rates expected during the production
phase of the LHC, to cater for backlogs, peaks and so forth. The analysis involved is assumed
to be batch-style analysis rather than interactive analysis, the latter being expected to be
performed primarily "off the Grid". The total aggregate data rate out of CERN that needs to
be supported is double that of Service Challenge 3, namely 2 GB/s.
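
To give a feel for the scale of these targets, the short calculation below (our own conversion,
assuming the rates are sustained around the clock and using decimal units) translates the SC3
and SC4 rate goals into approximate daily volumes.

# Convert the SC3/SC4 sustained-rate targets into approximate daily volumes.
# Assumes the rates are sustained 24 hours a day and uses decimal units
# (1 GB = 1000 MB, 1 TB = 1000 GB).
SECONDS_PER_DAY = 24 * 3600

targets_mb_per_s = {
    "SC3 per Tier-1, disk-to-disk": 150,
    "SC3 per Tier-1, disk-to-tape": 60,
    "SC3 aggregate out of CERN": 1000,
    "SC4 aggregate out of CERN": 2000,
}

for name, rate in targets_mb_per_s.items():
    tb_per_day = rate * SECONDS_PER_DAY / 1e6   # MB/day -> TB/day
    print(f"{name:32s} {rate:5d} MB/s  ~ {tb_per_day:6.1f} TB/day")
# e.g. the SC4 aggregate of 2 GB/s corresponds to roughly 170 TB/day.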
7.3.3 Timeline and Deliverables

     Due Date            Milestone                                              Responsible
     April SC meeting    Produce updated "How to join the Service               CERN
                         Challenges as a T1" document
     April SC meeting    Produce the corresponding document for T2 sites        DESY + FZK + CERN
     April SC meeting    Detailed SC3 plan for ALICE                            ALICE + SC + Tier-n sites
     April SC meeting    Detailed SC3 plan for CMS                              CMS + SC + Tier-n sites

                    Table 8 - Summary of Main Milestones and Deliverables






[Timeline: Technical Design Report, June 2005; SC3 Service Phase, September 2005; SC4
Service Phase, May 2006; initial LHC Service in stable operation, September 2006. SC2, SC3
and SC4 lead into LHC Service Operation over 2005-2006, followed by cosmics, first beams,
first physics and the full physics run during 2007-2008.]


7.3.4 Summary
The service challenges are a key element of the strategy for building up the LCG
services to the level required to fully exploit the physics potential of the LHC machine
and the detectors. Starting with the basic infrastructure, the challenges will be used to
identify and iron out problems in the various services in a full production
environment. They represent a continuous on-going activity, increasing step-wise in
complexity and scale. The final goal is to deliver a production system capable of
meeting the full requirements of the LHC experiments at least 6 months prior to first
data taking. Whilst much work remains to be done, a number of parallel activities
have been started addressing variously the Tier1/2 issues, networking requirements
and the specific needs of the experiments. Whilst it is clear that strong support from
all partners is required to ensure success, the experience from the initial service
challenges suggests that the importance of the challenges is well understood and that
future challenges will be handled with appropriate priority.
7.3.5 References

[EGEE DJRA1.1] EGEE Middleware Architecture, https://edms.cern.ch/document/476451/

8    START-UP SCENARIO
Laura Perini
The data processing in the very early phase of data taking will only slowly approach the
steady-state model. While the distribution of and access to the data should be well prepared
and debugged by the various data challenges, there will still be a requirement for heightened
access to raw data, in order to produce the primary calibrations and to optimise the
reconstruction algorithms in the light of the inevitable surprises thrown up by real data.
Access to raw data is envisaged in two formats: RAW files and (if sensible) DRD.


The steady-state model has considerable capacity for analysis and detector/physics group
files. There is also considerable planned capacity for analysis and optimisation work in the
CERN analysis facility. It is envisaged that in the early stages of data taking, much of this is
taken up with a deep copy of the express and calibration stream data. For the initial weeks, the
express data rate may be upwards of 20 Hz, but it is clear that, averaged over the first year, it
must be less than this. If this averages 10 Hz over the full year, and we assume that two
processing versions are to be retained at any time at the CERN analysis facility, this
translates to 620 TB of disk.
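
The arithmetic behind the 620 TB figure can be reconstructed along the following lines; the
10 Hz rate and the two retained processing versions come from the text above, whereas the
live time and the volume retained per event are our own assumptions, chosen purely to show
how a number of this order arises.

# Rough reconstruction of the disk estimate for the express/calibration
# streams.  The 10 Hz average rate and the two retained processing versions
# come from the text; the live time and the volume retained per event are
# our assumptions, chosen to illustrate how a figure of order 620 TB arises.
rate_hz = 10                 # average express-stream rate over the year (text)
n_versions = 2               # processing versions retained at the CAF (text)
live_seconds = 1.0e7         # assumed live seconds in the first year
mb_per_event = 3.1           # assumed deep-copy volume retained per event (MB)

events = rate_hz * live_seconds
disk_tb = events * mb_per_event * n_versions / 1e6   # MB -> TB (decimal)
print(f"{events:.1e} events -> ~{disk_tb:.0f} TB of disk")   # ~620 TB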


It is also assumed that there will be considerable reprocessing of these special streams. The
CPU capacity involved must not be underestimated. For example, to process the sample 10
times in 6 months would require a CPU capacity of 1.1 MSI2k (approximately 1000 current
processors). This is before any real analysis is considered. Given the resource requirements,
even reprocessing of this complete, albeit smaller, sample will have to be scheduled and
organised through the physics/computing management. Groups must therefore assess
carefully the required sample sizes for a given task. If these are small enough, they can be
replicated to Tier-2 sites and processed in a more ad hoc manner there. Some level of ad hoc
reprocessing will of course be possible on the CERN Analysis Facility.
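
Similarly, the 1.1 MSI2k figure can be reproduced approximately as follows; the per-event
reconstruction cost, the CPU efficiency and the per-processor capacity are our own
assumptions, used only to illustrate the scale.

# Rough reconstruction of the reprocessing CPU estimate.  The sample size
# (one year of express data at 10 Hz), the 10 passes and the 6-month window
# follow the text; the per-event reconstruction cost, the CPU efficiency and
# the per-processor capacity are our assumptions for illustration only.
events = 1.0e8                # ~10 Hz over ~1e7 live seconds (see above)
passes = 10                   # reprocess the sample 10 times (text)
ksi2k_s_per_event = 15.0      # assumed reconstruction cost per event
efficiency = 0.85             # assumed average CPU utilisation
period_s = 182.5 * 24 * 3600  # 6 months

capacity_si2k = events * passes * ksi2k_s_per_event * 1e3 / (period_s * efficiency)
processors = capacity_si2k / 1100   # assuming ~1.1 kSI2k per current CPU
print(f"~{capacity_si2k / 1e6:.1f} MSI2k, i.e. roughly {processors:.0f} processors")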


The CERN Analysis Facility resources are determined in the computing model by a steady-
state mixture of activities that includes AOD-based and ESD-based analysis and steady-state
calibration and algorithmic development activities. This gives 1.1 PB of disk, 0.58 PB of tape
and 1.7 MSI2k of processing power for the initial year of data taking. This resource will
initially be used far more for the sort of RAW-data-based activity described above, but must
make a planned transition to the steady state through the first year. If the RAW data activities
continue on a large scale for longer, the work must move to be shared by other facilities. The
Tier-1 facilities will also provide calibration and algorithmic development facilities
throughout, but these will be limited by the high demands placed on the available CPU by
reprocessing and ESD analysis.


There is considerable flexibility in the software chain in the format and storage mode of the
output datasets. For example, in the unlikely event of navigation between ESD and RAW
proving problematic when they are stored in separate files, they could be written to the same
file. As this would have major resource implications if adopted as a general practice, it would
have to be done for a finite time and on a subset of the data. Another option that may help the
initial commissioning process is to produce DRD, which is essentially RAW data plus
selected ESD objects. This data format could be used for the commissioning of some
detectors where the overhead of repeatedly producing ESD from RAW is high and the cost of
storing copies of RAW+ESD would be prohibitive. In general, the aim is to retain flexibility
for the early stage of data taking, both in the software and processing chain and in the use of
the resources available.
In order that the required flexibility be achievable, it is essential that the resources be in place
in a timely fashion, both in 2007 and in 2008. The estimated hardware resources required at
the start of 2007 and 2008 are given in Table 9 and Table 10.

                            CPU (MSI2k)   Tape (PB)   Disk (PB)
        CERN Tier-0             1.8          2.0         0.2
        CERN AF                 1.0          0.2         0.8
        Sum of Tier-1's         8.1          3.0         5.7
        Sum of Tier-2's         7.3          0.0         3.2
        Total                  18.2          5.2         9.9
Table 9: The projected total resources required at the start of 2007 for the case when 20% of
the data rate is fully simulated.


                            CPU (MSI2k)   Tape (PB)   Disk (PB)
        CERN Tier-0             4.1          6.2         0.35
        CERN AF                 2.8          0.6         1.8
        Sum of Tier-1's        26.5         10.1        15.5
        Sum of Tier-2's        21.1          0.0        10.1
        Total                  54.5         16.9        27.8
Table 10: The projected total resources required at the start of 2008 for the case when 20% of
the data rate is fully simulated.




8.1      Pilot run
Roger Jones
Abstract
The current LHC planning is for a physics run in the second half of 2007. Based on the
current best estimates from the accelerator community, this is expected to be at a lower
luminosity (nominally 5x10^32 cm^-2 s^-1). For planning purposes, this run is assumed to
last 100 days, with a 50% live-time. The computing models in this run will be very far from the steady
state. There will be requirements for far higher access to raw data and ESD as the experiments
begin to understand their detectors, the calibration and alignment, the backgrounds and the
reconstruction algorithms. There will also be a particularly acute analysis demand, which will
itself make higher demands on the early data formats. While flexibility will be essential,
preliminary plans for the early running will be presented.


8.2      Initial running
Yves Schutz, Claudio Grandi



9     RESOURCES
Chris Eck
 1.1    Member Institutions of the LHC Computing Grid Collaboration pledge Computing
Resource Levels to one or more of the LHC Experiments and Service Levels to the
Collaboration, having in both cases secured the necessary funding. Institutions may clearly
have other resources that they do not pledge in this way. The Institutions shall pledge
“Resources” and “Services” separately, specifying all of the parameters relevant to each
element (e.g. size, speed, number, effort, as the case may be). As far as possible they shall
associate with each element key qualitative measures such as reliability, availability and
responsiveness to problems. Tier1 Centres shall also pledge (separately) the consolidated
Computing Resource and Service Levels of other Tier Centres (if any), for which the Tier1
has responsibility:
1.1.1 Resources. These shall be pledged separately (as applicable) for Tier1 services and
Tier 2 services (defined in Section x)





•       Processing capacity (expressed in commonly agreed units).
•       Networking. Due to the distributed nature of the LHC Computing Grid, it is
particularly important that each Institution provides appropriate network capacity with which
to exchange data with the others. The associated Computing Resource Levels shall include
I/O throughput and average availability.
•        Access to data (capacity and access performance parameters of the various kinds of
storage, making clear which figures refer to archival storage).
1.1.2   Services
Grid Operations Services spanning all or part of the LHC Computing Grid are described in
Section x.y. For considerations of efficiency, it is vital that they be pledged on a long-term
basis and not just from year to year.
1.2     If, for whatever reason, the Computing Resource Levels pledged by an Institution to a
particular LHC Experiment are not being fully used, the Institution concerned shall consult
with the managements of the LHC Experiments it supports and with that of the Collaboration.
In such situations it is encouraged to make available some part or all of the Computing
Resource Levels in question to one or more of the other LHC Experiments it supports and/or
to the Collaboration management for sharing amongst the LHC Experiments as it sees fit.
1.3    Sections 10.2 to 10.4 show, for each Institution, the Computing Resource and Service
Levels pledged in the next year and planned to be pledged in each of the four subsequent
years.
1.4      The Institutions, supported by their Funding Agencies, shall make their best efforts to
provide the Computing Resource and Service Levels listed in Sections 10.2 to 10.4. In
particular, in order to protect the accumulated data and the Grid Operations services, any
Institution planning to reduce its pledged storage and/or Grid Operations services shall take
the measures necessary to move to other Institutions the affected data (belonging to the LHC
Computing Grid and/or LHC Experiments) of which it has the unique copy (or unique
permanent backup copy) and/or Grid Operations services that it has been providing, before
closing access to the data and/or provision of the Grid Operations services. Such moving of
data and/or Grid Operations services shall be negotiated with the managements of the LHC
Experiment(s) concerned and of the Collaboration.
1.5      It is a fundamental principle of the Collaboration that each Institution shall be
responsible for ensuring the funding required to provide its pledged Computing Resource and
Service Levels, including storage, manpower and other resources. The funding thus provided
will naturally be recognised as a contribution of the Funding Agency or Agencies concerned
to the operation of the LHC Experiments.
1.6     Institutions may clearly have computing resources that are earmarked for purposes
unrelated to the LHC Computing Grid and are not pledged to LHC Experiments as
Computing Resource Levels. These resources are neither monitored centrally by the
management of the Collaboration nor accounted as contributions to LHC computing. Any
such resources that are nevertheless subsequently made available to the LHC Experiments
(and used by them) will be accounted in the normal way as contributions to LHC computing.
1.7     The users of the pledged Computing Resource Levels are the LHC Experiments,
represented in their relations with the Collaboration by their managements.
1.8     The Computing Resources Review Board (“C-RRB”) shall approve annually, at its
autumn meeting, on the advice of an independent, impartial and expert review body - the
Resources Scrutiny Group (“RSG”), which shall operate according to the procedures set out
in Section n.m, the overall refereed resource requests of each LHC Experiment for the
following year. At the same meeting it shall take note of the Computing Resource Levels
pledged for the same year to each Experiment by the Institutions. If it emerges that the
pledged Computing Resource Levels are inadequate to satisfy the refereed requests of one or
more Experiments, the C-RRB shall seek further contributions of Computing Resource Levels.
Should a shortfall persist, the C-RRB shall refer the matter to the LHCC, which may require a
scaling down and/or prioritisation of requests in order to fit the available Computing Resource
Levels.
1.       The Computing Resources Review Board (C-RRB) shall appoint a Resources
Scrutiny Group (“RSG”) to assist it in exercising its duty with respect to the oversight of the
provision of computing for the LHC Experiments and in particular the independent scrutiny
of the resource requests from the Experiments for the coming year. The RSG has a technical
role and shall be composed of ten persons chosen appropriately by the C-RRB. The RSG
shall perform its duties for all of the LHC Experiments. The members chosen by the C-RRB
shall normally include at least one person from each of CERN, a large Member State, a small
Member State, a large non-Member State and a small non-Member State.
2.       The members of the RSG are appointed with renewable mandates of 3 years provided
that, in the interest of continuity, half of the first members shall be appointed for a 2-year
period.
3.     The CERN Chief Scientific Officer shall select the Chair of the RSG from amongst
the members chosen by the C-RRB.
4.     At his or her discretion, the Chair of the RSG shall accept that, in exceptional
circumstances, a member is replaced at an individual meeting by a named proxy.
5.      Annually (year n), at the spring meeting of the C-RRB, three pieces of information
are presented:
i.      the LHC Computing Grid management reports the resource accounting figures for the
preceding year (n-1);
ii.     the LHC Experiments explain the use they made of these resources;
iii.     the LHC Experiments submit justified overall requests for resources in the following
year (n+1) and forecasts of needs for the subsequent two years (n+2, n+3). Although the
justification will necessarily require an explanation of the proposed usage to a sufficient level
of detail, the RSG will only advise on the overall level of requested resources. It shall be for
the managements of each LHC Experiment then to control the sharing within their
Experiment.
The C-RRB passes this information to the RSG for scrutiny.
6.       Over the summer, the RSG shall examine all the requests made by the Experiments in
the light of the previous year's usage and of any guidance received from the C-RRB. In doing
so it shall interact as necessary with the Experiments and in particular with representatives
who are knowledgeable about their Experiment’s computing models/needs. It shall also
examine the match between the refereed requests and the pledges of Computing Resource
Levels from the Institutions, and shall make recommendations concerning any apparent
under-funding for the coming years. It is not the task of the RSG to negotiate Computing
Resource Levels with the Institutions.
7.       The RSG shall present the results of its deliberations to the autumn meeting of the
C-RRB. In particular it shall present, for approval, the refereed sharing of resources for the next
year (n+1) and shall make any comments thought relevant on the previous year’s (n-1) usage.
It shall also draw attention, for action, to any mismatch (including mismatch due to lack of
manpower) with the planned pledges of Computing Resource Levels for the next year (n+1)
and the subsequent year (n+2).
8.     In order to ensure efficient use of the pledged Computing Resource Levels, to adapt to
changing needs and to respond to emergency situations, the RSG may convene at other times
throughout the year, at the request of the LHC Computing Grid or LHC Experiment
managements, to advise on any resource-sharing adjustments that seem desirable. Such
adjustments would then be applied by common consent of those concerned.




9.1 Minimal Computing Resource and Service Levels to qualify for
membership of the LHC Computing Grid Collaboration
This Section describes the qualitative aspects of the Computing Resource and Service Levels
to be provided by the Host Laboratory (CERN), Tier1 Centres and Tier2 Centres in order to
fulfil their obligations as Parties to this MoU. Also described are the qualitative aspects of
Grid Operations Services that some of the Parties will provide. The quantitative aspects of all
of these services are described for each Party in Sections n to m. Only the fundamental
aspects of Computing Resource and Service Levels are defined here. Detailed service
definitions with key metrics will be elaborated and maintained by the operational boards of
the Collaboration. All centres shall provide & support the Grid services, and associated
software, as requested by the experiments and agreed by the LHC Computing Grid
Collaboration. A centre may also support additional Grid services as requested by an
experiment but is not obliged to do so.
Annex 1.1. Host Laboratory Services
The Host Laboratory shall supply the following services in support of the offline computing
systems of all of the LHC Experiments according to their computing models.
i.      Operation of the Tier0 facility providing:
1.     high bandwidth network connectivity from the experimental area to the offline
computing facility (the networking within the experimental area shall be the responsibility of
each Experiment);
2.      recording and permanent storage in a mass storage system of one copy of the raw data
maintained throughout the lifetime of the Experiment;
3.       distribution of an agreed share of the raw data to each Tier1 Centre, in-line with data
acquisition;
4.      first pass calibration and alignment processing, including sufficient buffer storage of
the associated calibration samples for up to 24 hours;
5.      event reconstruction according to policies agreed with the Experiments and approved
by the C-RRB (in the case of pp data, in-line with the data acquisition);
6.      storage of the reconstructed data on disk and in a mass storage system;
7.      distribution of an agreed share of the reconstructed data to each Tier1 Centre;
8.      services for the storage and distribution of current versions of data that are central to
the offline operation of the Experiments, according to policies to be agreed with the
Experiments.
ii.     Operation of a high performance, data-intensive analysis facility with the
functionality of a combined Tier1 and Tier2 Centre, except that it does not offer permanent
storage of back-up copies of raw data. In particular, its services include:
1.     data-intensive analysis, including high performance access to the current versions of
the Experiments’ real and simulated datasets;
2.      end-user analysis.






iii.    Support of the termination of high speed network connections by all Tier1 and Tier2
Centres as requested.
iv.     Coordination of the overall design of the network between the Host Laboratory, Tier1
and Tier2 Centres, in collaboration with national research networks and international research
networking organisations.
v.     Tools, libraries and infrastructure in support of application program development and
maintenance.
vi.      Basic services for the support of standard physics “desktop” systems used by
members of the LHC Collaborations resident at CERN (e.g. mail services, home directory
servers, web servers, help desk).
vii.    Administration of databases used to store physics data and associated meta-data.
viii.  Infrastructure for the administration of the Virtual Organisation (VO) associated with
each Experiment.
ix.     Provision of the following services for Grid Coordination and Operation:
1.     Overall management and coordination of the LHC grid - ensuring an effective
management structure for grid coordination and operation (e.g. policy and strategy
coordination, security, resource planning, daily operation,...);
2.      The fundamental mechanism for integration, certification and distribution of software
required for grid operation;
3.     Organisation of adequate support for this software, generally by negotiating
agreements with other organisations;
4.      Participation in the grid operations management by providing an engineer in charge
of daily operation one week in four (this service is shared with three or more other institutes
providing amongst them 52-week coverage).
The following parameters define the minimum levels of service. The first three columns give
the maximum delay in responding to operational problems (service interruption; degradation
of the capacity of the service by more than 50%; degradation by more than 20%); the last two
columns give the average availability measured on an annual basis, during accelerator
operation and at all other times.

Service                                              Interruption   Degr. >50%   Degr. >20%   During accel.   Other times
Raw data recording                                   4 hours        6 hours      6 hours      99%             n/a
Event reconstruction or distribution of data to
  Tier-1 Centres during accelerator operation        6 hours        6 hours      12 hours     99%             n/a
Networking service to Tier-1 Centres during
  accelerator operation                              6 hours        6 hours      12 hours     99%             n/a
All other Tier-0 services                            12 hours       24 hours     48 hours     98%             98%
All other services - prime service hours             1 hour         1 hour       4 hours      98%             98%
All other services - outwith prime service hours     12 hours       24 hours     48 hours     97%             97%
Annex 1.2. Tier-1 Services
Each Tier1 Centre forms an integral part of the central data handling service of the LHC
Experiments. It is thus essential that each such centre undertakes to provide its services on a
long-term basis (initially at least 5 years) and to make its best efforts to upgrade its
installations steadily in order to keep pace with the expected growth of LHC data volumes
and analysis activities.
Tier1 services must be provided with excellent reliability, a high level of availability and
rapid responsiveness to problems, since the LHC Experiments depend on them in these
respects.
The following services shall be provided by each of the Tier1 Centres in respect of the LHC
Experiments that they serve, according to policies agreed with these Experiments. With the
exception of items i, ii, iv and x, these services also apply to the CERN analysis facility:
i.      acceptance of an agreed share of raw data from the Tier0 Centre, keeping up with
data acquisition;
ii.     acceptance of an agreed share of first-pass reconstructed data from the Tier0 Centre;
iii.  acceptance of processed and simulated data from other centres of the LHC
Computing Grid;
iv.     recording and archival storage of the accepted share of raw data (distributed back-up);
v.       recording and maintenance of processed and simulated data on permanent mass
storage;
vi.       provision of managed disk storage providing permanent and temporary data storage
for files and databases;
vii.    provision of access to the stored data by other centres of the LHC Computing Grid
and by named AF’s as defined in paragraph 1.13 of this MoU;
viii.   operation of a data-intensive analysis facility;
ix.     provision of other services according to agreed Experiment requirements;
x.      ensure high-capacity network bandwidth and services for data exchange with the
Tier0 Centre, as part of an overall plan agreed amongst the Experiments, Tier1 and Tier0
Centres;
xi.     ensure network bandwidth and services for data exchange with Tier1 and Tier2
Centres, as part of an overall plan agreed amongst the Experiments, Tier1 and Tier2 Centres;
xii.    administration of databases required by Experiments at Tier1 Centres.
All storage and computational services shall be “grid enabled” according to standards agreed
between the LHC Experiments and the regional centres.
The following parameters define the minimum levels of service. As above, the first three
columns give the maximum delay in responding to operational problems (service interruption;
degradation of the capacity of the service by more than 50%; degradation by more than 20%);
the last two columns give the average availability measured on an annual basis, during
accelerator operation and at all other times.

Service                                                    Interruption   Degr. >50%   Degr. >20%   During accel.   Other times
Acceptance of data from the Tier-0 Centre during
  accelerator operation                                    12 hours       12 hours     24 hours     99%             n/a
Networking service to the Tier-0 Centre during
  accelerator operation                                    12 hours       24 hours     48 hours     98%             n/a
Data-intensive analysis services, including networking
  to Tier-0 and Tier-1 Centres, outwith accelerator
  operation                                                24 hours       48 hours     48 hours     n/a             98%
All other services - prime service hours                   2 hours        2 hours      4 hours      98%             98%
All other services - outwith prime service hours           24 hours       48 hours     48 hours     97%             97%
  The response times in the above table refer only to the maximum delay before action is taken
  to repair the problem. The mean time to repair is also a very important factor that is only
  covered in this table indirectly through the availability targets. All of these parameters will
  require an adequate level of staffing of the services, including on-call coverage outside of
  prime shift.

  9.2        Tier-2 Services
  Tier2 services shall be provided by centres or federations of centres as provided for in this
  MoU. In this Annex the term Tier2 Centre refers to a single centre or to the federation of
  centres forming the distributed Tier2 facility. As a guideline, individual Tier2 Centres or
  federations are each expected to be capable of fulfilling at least a few percent of the resource
  requirements of the LHC Experiments that they serve.
  The following services shall be provided by each of the Tier2 Centres in respect of the LHC
  Experiments that they serve, according to policies agreed with these Experiments. These
  services also apply to the CERN analysis facility:
  i.      provision of managed disk storage providing permanent and/or temporary data
  storage for files and databases;
  ii.     provision of access to the stored data by other centres of the LHC Computing Grid
  and by named AF’s as defined in paragraph 1.13 of this MoU;
  iii.       operation of an end-user analysis facility;
  iv.     provision of other services, e.g. simulation, according to agreed Experiment
  requirements;
  v.      ensure network bandwidth and services for data exchange with Tier1 Centres, as part
  of an overall plan agreed between the Experiments and the Tier1 Centres concerned.
  All storage and computational services shall be “grid enabled” according to standards agreed
  between the LHC Experiments and the regional centres.
The following parameters define the minimum levels of service. The first three columns give
the maximum delay in responding to operational problems (service interruption; degradation
of the capacity of the service by more than 50%; degradation by more than 20%); the last two
columns give the average availability [15] measured on an annual basis, during accelerator
operation and at all other times.

Service                                                    Interruption   Degr. >50%   Degr. >20%   During accel.   Other times
Raw data recording                                         4 hours        6 hours      6 hours      99%             n/a
Event reconstruction or distribution of data to
  Tier-1 Centres during accelerator operation              6 hours        6 hours      12 hours     99%             n/a
Networking service to Tier-1 Centres during
  accelerator operation                                    6 hours        6 hours      12 hours     99%             n/a
All other Tier-0 services                                  12 hours       24 hours     48 hours     98%             98%
All other services [16] - prime service hours [17]         1 hour         1 hour       4 hours      98%             98%
All other services [16] - outwith prime service hours [17] 12 hours       24 hours     48 hours     97%             97%

[15] (time running)/(scheduled up-time)




  9.3           Grid Operations Services
  This section lists services required for the operation and management of the grid for LHC
  computing. These will be provided by the Parties to this MoU.
  This section reflects the current (3/2005) state of experience with operating grids for high
  energy physics. It will be refined as experience is gained.


  •       Grid Operations Centres – Responsible for maintaining configuration databases,
  operating the monitoring infrastructure, pro-active fault and performance monitoring,
  provision of accounting information, and other services that may be agreed. Each Grid
  Operations Centre shall be responsible for providing a defined sub-set of services, agreed by
  the Collaboration. Some of these services may be limited to a specific region or period (e.g.
  prime shift support in the country where the centre is located). Centres may share
  responsibility for operations as agreed from time to time by the Collaboration.




  •             User Support for grid and computing service operations:
  o       First level (end-user) helpdesks are assumed to be provided by LHC Experiments
  and/or national or regional centres, and are not covered by this MoU.



[16] Services essential to the running of the Centre and to those who are using it.
[17] Prime service hours for the Host Laboratory: 08:00-18:00 in the time zone of the Host Laboratory,
Monday-Friday, except public holidays and scheduled laboratory closures.




o       Grid Call Centres – Provide second level support for grid-related problems, including
pro-active problem management. These centres would normally support only service staff
from other centres and expert users. Each call centre shall be responsible for the support of a
defined set of users and regional centres and shall provide coverage during specific hours.




Here will come just three sections with resource tables: T0+CAF, T1s and T2s. Until June we
may not have the full extent of the last table.


9.4        Costing
Bernd Panzer
Abstract
The costing of the CERN fabric is calculated with a detailed sub-structure: separation by
experiment, by functional structure (Tier-0, Tier-1, Tier-2) and by detailed hardware units
(CPU, disk, tape, network, sysadmin), and combinations of these three, to answer questions
like: what is the processing cost for experiment X in the Tier-1 part? or how much does the
disk cache for experiment Y cost during Tier-0 data taking?
The costing exercise for the CERN fabric uses the following input parameters to calculate the
full cost of the set-up during the years 2006-2010:

    1.   the base computing resource requirements from the experiments (CPU, disk and tape);
    2.   derived resources (tape access speed, networking, sysadmin) obtained from the combination
         of the base resources and the computing models;
    3.   the reference points of the equipment costs;
    4.   the cost evolution over time of the different resources.

The detailed list of base resource requirements has already been given in Chapter xxxx, including part
(or all) of the derived resources.



Reference points

The previous cost calculations used the following model to estimate the cost of the computing equipment:
     •   Moore's Law is used as the underlying formula to estimate the cost decrease over time of a
         given capacity, i.e. the price for the same amount of capacity (CPU, disk) is reduced by a
         factor of 2 in 18 months.
     •   The reference point is taken in the middle of the year in which the required capacity needs to
         be available.
     •   The granularity of these calculations was one year, and it was assumed that the price
         reductions arrive in a smooth and linear way.



As the start of LHC is getting closer, we need to look at the more fine-grained purchasing logistics
in order to get a more precise estimate of the cost evolution.
There are two ways to upgrade the computing capacity:

     1.   a fixed amount of capacity per year: everything needed for year 200x is installed and in
          production in February of 200x, just before the accelerator starts;
     2.   the capacity is upgraded in a constant manner: every 3 months new equipment is added,
          independent of the accelerator timing.



The first approach is the one currently implemented at CERN.
Restrictions come from the way the current purchasing procedures are implemented at CERN.
The timing is dictated by two points: we need to align all procedures to the dates of the meetings of
the Finance Committee, and, because of our limited influence on company selection and the 'cheapest-
wins' criterion, we have to foresee enough slack to cope with problems and delays. One can therefore
work backwards from February 200x:
     •   at the end of February everything has to be installed and working, so that there is enough time
         until April to exchange equipment in case something goes wrong;
     •   as there are 6 weeks of delivery plus 4 weeks of testing plus the Christmas period, one has to
         target the Finance Committee meeting in Q4 of 200x-1; the choice would be the November
         meeting, as this would leave enough time to correct mistakes and review the tender again at
         the December meeting;
     •   the outcome of the tender has to be analysed and a paper prepared for the Finance Committee
         in November, which takes 3-4 weeks; thus the tender must be opened by the end of September
         or the first week of October at the latest;
     •   the tender process is fixed at 6 weeks, and one has to allow for the 'dead' time of August, so
         everything needs to start at the end of July 200x-1.

The price of the equipment is set by the vendors at the start of the tendering, but there is the
possibility to re-negotiate the price before the order is placed; assuming no hiccups, this happens in
the middle of November. There are thus 8 months between fixing the price and the reference point in
July of 200x. In principle the re-negotiation of prices in November should take into account the
Moore's-Law price drop expected between August and November (17.5%), but experience suggests
that 10% is a more realistic value, if any reduction is obtained at all, which adds roughly another
month of effective difference. The total difference between fixing the price and the reference point in
the middle of the year would then be 9 months, or half of the expected Moore's-Law price development.

A further disadvantage is that the price evolution of computing equipment does not follow a smooth
curve but rather exhibits occasional larger step changes. In the Appendix a few examples of price
curves over the last 18 months can be found for CPU, disk and memory items. Today the processors
contribute at the 30% level to the cost of a CPU node. Memory is of the order of 20%, but this will
rise to 30-40% in the future with the introduction of multi-core processors and our rising memory
requirements per job.

Until a new purchasing logistics is in place (blanket orders plus agreement with the experiments on a
fine-grained increase of capacity over the year), the new cost calculations will use, as the cost index
for capacity in the middle of year 200x, a value determined 9 months earlier. This leads to a cost
increase of about 50% compared to the ideal case of following Moore's Law on the spot.
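
The size of this penalty can be sketched as follows; the exact figure depends on how the price
curve is interpolated between the purchase date and the reference point. The 50% quoted above
corresponds roughly to a linear reading of the factor-2 drop over 18 months, while a purely
geometric interpolation gives a somewhat smaller penalty. This is our own illustration of the
arithmetic, not part of the costing model itself.

# Sketch of the cost penalty incurred by fixing prices 'lag_months' before
# the mid-year reference point.  Two interpolations of the price curve are
# shown; the ~50% figure quoted in the text corresponds roughly to the
# linear reading of a factor-2 drop over 18 months.
def geometric_penalty(lag_months: float, halving_months: float = 18.0) -> float:
    """Price ratio (early purchase / ideal) for a geometric price decay."""
    return 2.0 ** (lag_months / halving_months)

def linear_penalty(lag_months: float, halving_months: float = 18.0) -> float:
    """One reading of the text's arithmetic: the price falls linearly from
    1.0 to 0.5 over one halving period ending at the reference point, and
    buying 'lag_months' early means paying the earlier price on that line."""
    price_at_reference = 0.5
    price_at_purchase = 1.0 - 0.5 * (halving_months - lag_months) / halving_months
    return price_at_purchase / price_at_reference

lag = 9.0  # months between fixing the price and the mid-year reference point
print(f"geometric interpolation: +{(geometric_penalty(lag) - 1) * 100:.0f}%")  # ~ +41%
print(f"linear interpolation   : +{(linear_penalty(lag) - 1) * 100:.0f}%")     # ~ +50%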



Equipment cost evolution


The following table shows the anticipated development of the price for a delivered CPU capacity of
one SI2000 unit.






    Year                 2004         2005        2006         2007         2008        2009         2010
    CHF/SI2000           1.89         1.25        0.84         0.55         0.37        0.24         0.18

The calculation is based on several assumptions and measurements:
    1. these are dual-CPU nodes;
    2. the overall assumed efficiency is 0.8;
    3. the scaling factor for 2004-2010 is a constant 1.5 per year; this factor and the starting value
       in 2003 were based on a cost evaluation of CPUs and nodes in the years 2001-2003;
    4. a factor 2 is included for the difference between low-end nodes and high-end nodes; we are
       buying in the lower medium range, but this is not a large market and we might be forced into
       the higher-end market;
    5. there is a 10% cost increase for the node infrastructure (racks, cables, consoles, etc.).

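The price points in this table, and in the disk table further below, can be reproduced to within
rounding from the 2004 reference values and the constant yearly factors (1.5 for CPU, 1.6 for
disk); the sketch below does exactly that and can be used to extend the projection, with the
caveat that real prices move in steps rather than smoothly. (Note that the straight extrapolation
gives 0.17 CHF/SI2000 for 2010, marginally below the 0.18 quoted in the table.)

# Reproduce the CPU and disk price projections from the 2004 reference
# points and the constant yearly improvement factors quoted in the text
# (1.5 per year for CPU capacity, 1.6 per year for disk space).
def project(ref_year: int, ref_price: float, factor: float, years: range) -> dict:
    """Price per unit in each year, assuming a constant yearly reduction factor."""
    return {y: ref_price / factor ** (y - ref_year) for y in years}

cpu_chf_per_si2000 = project(2004, 1.89, 1.5, range(2004, 2011))
disk_chf_per_gbyte = project(2004, 8.94, 1.6, range(2004, 2011))

for year in range(2004, 2011):
    print(f"{year}: {cpu_chf_per_si2000[year]:5.2f} CHF/SI2000   "
          f"{disk_chf_per_gbyte[year]:5.2f} CHF/GByte")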


The replacement period is assumed (from experience) to be 3 years, i.e. equipment bought in the first
year will be replaced in the 4th year. The original motivation was the saving in system administration
costs, as these were initially directly proportional to the number of boxes. Our new scheme scales
much better, so boxes can in principle run until they fail (or cause trouble), provided they fit within
the space and power consumption envelope.

The important point is that roughly one third of the cost of a node (box, power supply, motherboard,
local disk) is essentially constant, and the processors contribute only at the 30% level to the box
cost.

Other factors to be tracked are the memory requirements, the price uncertainty on the box costs, and
the power consumption, which remains an issue to be constantly checked.




The following table shows the anticipated development of disk space costs:

    Year                 2004        2005         2006         2007        2008         2009         2010
    CHF/GByte            8.94        5.59         3.49         2.18        1.36         0.85         0.53

The following assumptions are included:

    1.   the disk space is mirrored (which reduces the raw capacity by a factor of 2);
    2.   it is usable capacity (after mirroring this reduces capacity by a further ~5%);
    3.   it uses consumer-market disks (ATA, SATA), not SCSI or Fibre Channel disks;
    4.   there is a 10% increase per tray/box (10-20 disks) for the infrastructure;
    5.   the scaling factor for 2004-2010 is a constant 1.6 per year; this factor and the starting value
         in 2003 were based on a cost evaluation of disks and disk servers in the years 2001-2003.

The replacement period is assumed to be 3 years, i.e. equipment bought in the first year will be
replaced in the 4th year.

The space requirements from the experiments do NOT take into account any performance
requirements. The trend in the disk market of fast-growing size per disk, moderately growing
performance for sequential access and, most importantly, minimal progress in random access times
will increase the overall costs.
For example: the size of an event sample is 2 TB, but there are 50 users who want to access this data
in a random manner. In 2008 one would need only 2 disks to fulfil the space requirement, but
probably 5 times that number of disks to fulfil the access requirements.
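
This example can be made concrete with the small sketch below; the per-disk capacity assumed
for 2008 and the random-access throughput figures are our own illustrative assumptions, chosen
only to show why access, rather than capacity, drives the disk count.

import math

# Illustration of capacity-driven vs access-driven disk counts for the
# 2 TB sample / 50 concurrent users example.  The per-disk capacity in 2008
# and the random-access throughput figures are assumptions for illustration.
sample_tb = 2.0
users = 50

disk_capacity_tb = 1.0        # assumed usable capacity per disk in 2008
disk_random_mb_s = 10.0       # assumed sustained random-read rate per disk
user_demand_mb_s = 2.0        # assumed average rate required per user

disks_for_capacity = math.ceil(sample_tb / disk_capacity_tb)
disks_for_access = math.ceil(users * user_demand_mb_s / disk_random_mb_s)

print(f"capacity alone : {disks_for_capacity} disks")   # 2
print(f"random access  : {disks_for_access} disks")     # 10, i.e. ~5x more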



The cost of the network is not included, neither for the CPU servers nor for the disk servers.



The cost of the network equipment does not follow this kind of clear year-by-year decrease;
fortunately there was a major downward step in 2004.



The cost of tape storage is also essentially flat; only with the introduction of new tape-drive
technology can one expect a decrease in the cost of the tape media.




Cost tables




10 INTERACTIONS/DEPENDENCIES/BOUNDARIES
The LCG Collaboration will depend on close cooperation with several major publicly funded
projects and organisations, for the provision of network services, specialised grid software,
and the management and operation of grid infrastructure. These three areas are considered
separately in this section. In the case of grid software and infrastructure it is expected that the
situation will evolve rapidly during the period of construction and commissioning of the LHC
computing facility, and so the LCG collaboration will have to remain flexible and review
support and collaboration agreements at frequent intervals.


10.1     Network Services
{Maybe this should be included in the section on Network Management?}


In most cases the network services used to interconnect the regional computing centres
participating in the LHC Computing Grid will be provided by the national research networks
with which the centres are affiliated and, in the case of European sites, the pan-European
backbone network, GÉANT. The architecture of these services is described elsewhere in the
TDR. While LCG is one of the many application domains served by these general purpose
research networks it will, during the early years of LHC, be one of the most demanding
applications, particularly between CERN, the Tier-1 and major Tier-2 centres. The formal
service agreements will be made directly between the computing centres and the national
research network organisations. However, in order to ensure that the individual service
agreements will provide a coherent infrastructure to satisfy the LHC experiments' computing
models and requirements, and that there is a credible solution for the management of the end-
to-end network services, an informal relationship has been established between the major
centres and research networks through the Tier-0/1/2 Networking Group, a working group of
the Grid Deployment Board. It is expected that this group will persist throughout 2006 while
the various components of the high-bandwidth infrastructure are brought into full operation.
At this stage it is not clear what, if any, special relationship will be required between LCG
and the research networks after this point.


10.2    Grid Software
The grid software foreseen to be used to provide the grid infrastructure for the initial LCG
service has been developed by a number of different projects. Some of these are no longer in
operation, some have funding for only a limited period, while others have longer term plans.
In the case of software developed by individual institutes, or by projects that have ceased
operation, bilateral support agreements have generally been made between LCG and the
developers, with different levels of formality according to the complexity of the software
involved. There are several cases, however, where it is necessary to have more complex
relationships.



10.3    Globus, Condor and the Virtual Data Toolkit
Key components of the middleware package used at the majority of the sites taking part in
LCG have been developed by the Globus and Condor projects. These are long-term
projects that continue to evolve their software packages, providing support for a broad range
of user communities. It is important that LCG maintains a good working relationship with
these projects to ensure that LHC requirements and constraints are understood by the projects
and that LCG has timely information on the evolution of their products. At present there are
two main channels for this: key members of Globus and Condor take part in the Open Science
Grid and in the middleware development activity of the EGEE project. Both of these projects
and their relationships to LCG are described below.
The Virtual Data Toolkit (VDT) group at the University of Wisconsin acts as a delivery and
primary support channel for Globus, Condor and some other components developed by
projects in the US and Europe. At present VDT is funded by the US National Science
Foundation to provide these services for LCG. It is expected that this or a similar formal
relationship will be continued.



10.4    The gLite Toolkit of the EGEE Project
The EGEE project (Enabling Grids for E-sciencE) is funded on a 50% basis by the
European Union to operate a multi-science grid built on infrastructure developed by the LCG
project and an earlier EU project called DataGrid. EGEE includes a substantial middleware
development and delivery activity with the goal of providing tools aimed at the High Energy
Physics and Bio-medical applications. This activity builds on earlier work of the
DataGrid and AliEn projects and includes participation of the Globus and Condor projects.
The EGEE project and the LCG Collaboration are closely linked at a management level: the
middleware activity manager and the technical director of EGEE are members of the LCG
Project Execution Board; the EGEE project director is a member of the LCG Project
Oversight Board; the LCG project leader is a member of the EGEE project management
board. The EGEE project also provides some funding for the support of applications using the
EGEE developed software.





10.5    The Nordugrid Project


The NorduGrid Project develops and maintains a set of grid middleware, based on the Globus
toolkit, that is used at some sites providing LHC capacity, particularly in the Nordic
Countries. At present there is no formal relationship between LCG and NorduGrid. The sites
and experiments using the NorduGrid software make their own arrangements for support.
Possibilities are being explored for providing a forum where compatibility and inter-working
of the NorduGrid middleware with other systems used in LCG can be discussed.


10.6    Operation of Grid Infrastructure


There are three major operational groupings of centres that will provide capacity for LHC
computing: EGEE/LCG, Open Science Grid, and the Nordic sites. Each of these groups uses a
specific base set of middleware tools (as explained above) and has its own grid
operations infrastructure. The body governing the overall operational policy and strategy for
the LHC collaboration is the Grid Deployment Board. This has national representation,
usually from the major centre(s) in each country.


10.7    EGEE/LCG


This group is an evolution of the centres that took part in the DataGrid project, expanded
during 2003-04 to include other centres involved in the LCG project outside of the United
States and the Nordic countries, and centres receiving funding from or associated with the
EGEE project. The EGEE/LCG grid has at present over 120 centres, including all of the
centres serving LCG in the countries concerned. This grid includes many national grid
organisations with their own administrative structure, but all of the entities involved agree to
install the same base middleware and cooperate in grid operations. In Europe, the operational
infrastructure currently receives important support from the EGEE project, through its Core
Infrastructure Centres and Regional Operations Centres; it is also supported by significant
national contributions in Europe, Asia and Canada. The centres
involved in EGEE have contracts with the EU to provide these infrastructure and operations
services. The centres involved in LCG will commit to provide the services through the LCG
MoU.
The operation is managed at present by the LCG Grid Deployment Area manager, who also
holds the position of operations manager of the equivalent activity in EGEE. This of course
risks causing some confusion, especially at sites that are not members of both projects, and
could lead to conflicts, as LCG and EGEE have different, though not incompatible, goals. The
LCG Grid Deployment Board is the effective body for operations policy and strategy in this
overlapping LCG/EGEE environment, which so far has been able to cover non-physics centres
through its national representation. The long-term idea is that EGEE will evolve into an
organisation providing core operation for a science grid in Europe and perhaps further afield,
rather akin to the role of GÉANT in research networking. However, the EGEE project is at
present funded only until March 2006. It is expected that the project will be extended for a
further two years, which means that it would end at the beginning of the first full year of
LHC operation. It is therefore important that LCG maintains its role in the core operation,
and that the LCG collaboration prepares a fall-back plan in case the EU-subsidised evolution
beyond EGEE does not materialise or does not fulfil LCG's requirements. This is clearly a
difficult strategy, with significant risk, but the long-term advantages of a multi-science
grid infrastructure receiving significant non-HEP funding must be taken into account.


10.8    Open Science Grid
This section should cover:
a short description of OSG, how the infrastructure is managed, and whether there are formal
contracts or MoUs;
the relationship between the OSG governing bodies and the LCG and EGEE projects;
agreements on inter-operation with EGEE/LCG, both at the level of operational management and
infrastructure and at the level of resource sharing;
the commitment of the Tier-1 and Tier-2 sites taking part in LCG to provide resources through
the LCG MoU.


10.9    The Nordic Data Grid Facility


This section should include:
a short description of the Nordic Data Grid Facility (NDGF), how the infrastructure is
managed, and whether there are formal contracts or MoUs;
the relationship between the NDGF governing bodies and the LCG and EGEE projects;
agreements on inter-operation with EGEE/LCG, both at the level of operational management and
infrastructure and at the level of resource sharing;
the commitment of the Tier-1 and Tier-2 sites taking part in NDGF to provide resources through
the LCG MoU.
New section to be added under Chapter 10: Interactions/Dependencies/Boundaries.


10.10   Security Policy and Procedures (draft)
Dave Kelsey


It is not yet clear whether this section is needed, as many of the issues will already have
been addressed in other sections, but it is included for now, given the need to achieve
interoperable security policy and procedures across the various grids (EGEE, OSG, Nordic,
Asia Pacific).


The Joint Security Policy Group (JSPG) is working to achieve commonly agreed security policy
and procedures. This will enable users to obtain a single X.509 certificate from any of the
accredited Certification Authorities and then register once with one of the four LHC
experiment VOs, signing a single Acceptable Use Policy (AUP) as part of the process. These
policies and procedures need to be accepted by all grid projects and participating sites for
this to be possible.


Security operational procedures also need to be agreed jointly to allow interworking between
grids for important procedures such as incident response.






10.11   DAQ Systems
Nick Brook



11 MILESTONES
Jürgen Knobloch
Abstract
A summary table of milestones leading to the implementation of the system described in the
TDR.

12 RESPONSIBILITIES – ORGANISATION, PARTICIPATING
   INSTITUTIONS
Chris Eck
1.10    Members of the LCG Collaboration shall be CERN, as the Host Laboratory, the provider
of the Tier0 Centre and the CERN Analysis Facility, and the coordinator of the LCG project, on
the one hand, and, on the other hand, all the Institutions participating in the provision of
the LHC Computing Grid with a Tier1 (listed in Section x.y) and/or Tier2 (listed in Section
z.m) Computing Centre (including federations of such Institutions with computer centres that
together form a Tier1 or Tier2 Centre).
1.11    The Parties together constitute the LHC Computing Grid Collaboration (hereinafter
“Collaboration”), of which they are termed Members. Each federation of Institutions
constituted in accordance with paragraph 1.10 above shall count as a single Member of the
Collaboration. For each Member, the tables of Tier0 and Tier1 Centres later in this chapter
show the duly authorised representative to the Collaboration. Collaboration Members will
receive appropriate credit in the scientific papers of the LHC Experiments that they serve.
1.12     An Institution may have one or several Funding Agencies, which are established
bodies controlling all or part of the Institution’s funding. In the execution of this MoU, an
Institution, depending on its situation, may be represented in funding matters by its Funding
Agency or Agencies, or it may have the authority to represent itself in some or all matters.
1.13      The LHC Experiments will have available to them Additional Facilities (hereinafter
“AF’s”) that access the services of the LHC Computing Grid or expose resources to it,
without themselves being Collaboration Members. These AF’s are thus not Parties to this
MoU. To such AF’s as are named by the LHC Experiments, the Members of the
Collaboration shall give access to the necessary software and to the LHC Computing Grid
itself, for purposes related to the execution of the LHC Experiments. In order to ensure the
smooth functioning of the LHC Computing Grid for its users, such access will be subject to
written acceptance of such conditions as the Collaboration shall from time to time decide but
which shall in any event include the conditions set out in Error! Reference source not found.
and paragraph Error! Reference source not found. of this MoU. It shall be the duty of the
LHC Experiments to ensure that these AF’s receive and install the necessary software and are
competent in its use, and that they comply with the conditions for access to the LHC
Computing Grid.
Annex 1.5. The Organizational Structure of the Collaboration
1.      High-level Committees:
1.1.    Concerning its main technical directions, the Collaboration shall be governed by the
LHC Computing Grid Collaboration Board (CB). The CB shall be composed of a
representative of each Institution or federation of Institutions that is a Member of the
Collaboration, the LCG Project Leader and the Spokespersons of each LHC Experiment, with
voting rights; and the CERN Chief Scientific Officer (CSO) and the CERN/IT and CERN/PH
Department Heads, as ex-officio members without voting rights, as well as a Scientific
Secretary. The CB elects its Chairperson from among its Members. The CB meets annually
and at other times as required.
1.2.     A standing committee of the CB, the Overview Board (OB), has the role of
overseeing the functioning of the Collaboration and of this MoU in particular. It also acts as a
clearing-house for conflicts that may arise within the Collaboration. The OB shall be chaired
by the CERN CSO. Its other members comprise one person appointed by the
agency/agencies that funds/fund each of the Tier-1 Centres, the Spokespersons of the LHC
Experiments, the LCG Project Leader, the CERN/IT and CERN/PH Department Heads, and a
Scientific Secretary. It meets about four times per year.
Both the CB and the OB may co-opt additional non-voting members as they deem necessary.
These non-voting members complement the regular members by advising on, for example,
matters concerning the environment in which the Collaboration operates or on specialist
aspects within their areas of expertise.
2.      The LHC Computing Grid Management Board (MB) supervises the work of the
Collaboration. It is chaired by the LCG Project Leader and reports to the OB. The MB
organises the work of the Collaboration as a set of formal activities and projects. It maintains
the overall programme of work and all other planning data necessary to ensure the smooth
execution of the work of the Collaboration. It provides quarterly progress and status reports
to the OB. The MB endeavours to work by consensus but, if this is not achieved, the LCG
Project Leader shall make decisions taking account of the advice of the Board. The MB
membership includes the LCG Project Leader, the Technical Heads of the Tier-1 Centres, the
leaders of the major activities and projects managed by the Board, the Computing Coordinator
of each LHC Experiment, the Chair of the Grid Deployment Board (GDB), a Scientific
Secretary and other members as decided from time to time by the Board.
3.      The Grid Deployment Board (GDB) is the forum within the Collaboration where the
computing managements of the experiments and the regional computing centres discuss and
take, or prepare, the decisions necessary for planning, deploying and operating the LHC
Computing Grid. Its membership includes, as voting members, one person from each country
with a regional computing centre providing resources to an LHC experiment (usually a senior
manager from the largest such centre in the country) and a representative of each of the
experiments; and, as non-voting members, the Computing Coordinators of the experiments, the
LCG Project Leader, and the leaders of formal activities and projects of the Collaboration.
The Chair of the GDB is elected by the voting members of the board from amongst their number
for a two-year term. The GDB may co-opt additional non-voting members as it deems
necessary.
4.      Concerning all technical matters, the Collaboration shall be subject to review by the
Large Hadron Collider experiments Committee (LHCC), which makes recommendations to
the Research Board (RB).
5.       Concerning all resource and legal matters, the Collaboration shall be subject to the
Computing Resource Review Board (C-RRB). The C-RRB is chaired by CERN's Chief
Scientific Officer. The C-RRB membership comprises a representative of each Funding
Agency, with voting rights, and (ex-officio) members of the LHC Computing Grid
Management and CERN Management, without voting rights.
6.      The LCG Project Leader represents the Collaboration to the outside and leads it in all
day-to-day matters. He/she shall be appointed by the CERN Director General in consultation
with the CB.


LHC Computing Grid Tier0 and Tier1 Centres, and the CERN Analysis Facility





   Tier0 and the CERN Analysis Facility

                               Experiments served with priority
   Centre                      ALICE  ATLAS  CMS   LHCb   Representative to LHC
                                                          Computing Grid Collaboration
   CERN                          X      X     X     X     W. von Rüden

   Tier1

                               Experiments served with priority
   Centre                      ALICE  ATLAS  CMS   LHCb   Representative to LHC          Funding
                                                          Computing Grid Collaboration
   TRIUMF, Canada                       X                                                NSERC
   GridKA, Germany               X      X     X     X                                    BMBF
   CC_IN2P3, France              X      X     X     X                                    IN2P3
   CNAF, Italy                   X      X     X     X                                    INFN
   NIKHEF/SARA, NL               X      X           X                                    NIKHEF
   Nordic Data Grid Facility     X      X
   ASCC, Taipei                         X     X                                          NSC/Academia
   RAL, UK                       X      X     X     X     J. Gordon                      PPARC
   BNL, US                              X                                                DOE
   FNAL, US                                   X                                          DOE
   PIC, Spain                           X     X     X                                    MEC-CSIC
   Notes
   1) Entries in italics have still to be confirmed
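For orientation, the Tier1 table above implies the following numbers of Tier-1 centres per
experiment: 6 for ALICE, 10 for ATLAS, 7 for CMS and 6 for LHCb (indicative only, since some
entries are still to be confirmed). The short Python sketch below tallies these numbers from a
hand transcription of the table; the dictionary simply mirrors the table above and carries no
additional authority.

    # Illustrative sketch only: tally Tier-1 centres per experiment from the
    # table above (hand-transcribed; some entries are still to be confirmed).
    tier1_experiments = {
        "TRIUMF, Canada":            {"ATLAS"},
        "GridKA, Germany":           {"ALICE", "ATLAS", "CMS", "LHCb"},
        "CC_IN2P3, France":          {"ALICE", "ATLAS", "CMS", "LHCb"},
        "CNAF, Italy":               {"ALICE", "ATLAS", "CMS", "LHCb"},
        "NIKHEF/SARA, NL":           {"ALICE", "ATLAS", "LHCb"},
        "Nordic Data Grid Facility": {"ALICE", "ATLAS"},
        "ASCC, Taipei":              {"ATLAS", "CMS"},
        "RAL, UK":                   {"ALICE", "ATLAS", "CMS", "LHCb"},
        "BNL, US":                   {"ATLAS"},
        "FNAL, US":                  {"CMS"},
        "PIC, Spain":                {"ATLAS", "CMS", "LHCb"},
    }

    counts = {}
    for experiments in tier1_experiments.values():
        for experiment in experiments:
            counts[experiment] = counts.get(experiment, 0) + 1

    for experiment in ("ALICE", "ATLAS", "CMS", "LHCb"):
        # Expected output: ALICE 6, ATLAS 10, CMS 7, LHCb 6
        print(experiment, counts.get(experiment, 0))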


   13 GLOSSARY – ACRONYMS – DEFINITIONS
   Jürgen Knobloch, François Grey
   Abstract
Several attempts have been made to explain the three- and four-letter acronyms used in the
world of grids, for example:
http://public.eu-egee.org/faq/acronyms.html
http://www.gridpp.ac.uk/docs/GAS.html
http://lcg.web.cern.ch/LCG/lcg-acronyms.html
and many more can be found with a web search for “grid” and “acronyms”.
This chapter should contain a consolidated summary.



