U.S. ATLAS Computing Facilities
(Overview)
Bruce G. Gibbard
Brookhaven National Laboratory
Review of U.S. LHC Software and Computing Projects
Fermi National Accelerator Laboratory
November 27-30, 2001
Outline
US ATLAS Computing Facilities Definition
Mission
Architecture & Elements
Motivation for Revision of the Computing Facilities Plan
Schedule
Computing Model & Associated Requirements
Technology Evolution
Tier 1 Budgetary Guidance
Tier 1 Personnel, Capacity, & Cost Profiles for New
Facilities Plan
B. Gibbard Review of US LHC Software and Computing Projects 27-30 November, 2001 2
US ATLAS Computing Facilities Mission
Facilities procured, installed and operated
…to meet U.S. “MOU” obligations to ATLAS
Direct IT support (Monte Carlo generation, for example)
Support for detector construction, testing, and calibration
Support for software development and testing
…to enable effective participation by US physicists in the
ATLAS physics program!
Direct access to and analysis of physics data sets
Simulation, re-reconstruction, and reorganization of data as
required to complete such analyses
Elements of US ATLAS Computing Facilities
A Hierarchy of Grid Connected Distributed Resources Including:
Tier 1 Facility Located at Brookhaven – Rich Baker / Bruce Gibbard
Operational at < 0.5% level
5 Permanent Tier 2 Facilities (to be Selected in April ’03)
2 Prototype Tier 2’s selected earlier this year and now active
Indiana University – Rob Gardner
Boston University – Jim Shank
Tier 3 / Institutional Facilities
Several currently active; most candidate to become Tier 2’s
Univ. of California at Berkeley, Univ. of Michigan, Univ. of Oklahoma, Univ. of
Texas at Arlington, Argonne Nat. Lab.
Distributed IT Infrastructure – Rob Gardner
US ATLAS Persistent Grid Testbed – Ed May
HEP Networking – Shawn McKee
Coupled to Grid Projects with designated liaisons
PPDG – Torre Wenaus
GriPhyN – Rob Gardner
iVDGL – Rob Gardner
EU Data Grid – Craig Tull
Tier 2’s
Mission of Tier 2’s for US ATLAS
A primary resource for simulation
Empower individual institutions and small groups to do relatively
autonomous analysis using high performance regional networks
and more directly accessible and locally managed resources
Prototype Tier 2’s were selected based on their ability to
contribute rapidly to Grid architecture development
Goal in future Tier 2 selections will be to leverage
particularly strong institutional resources of value to ATLAS
Aggregate of the 5 Tier 2’s is expected to be comparable to
Tier 1 in CPU and disk capacity available for analysis
US ATLAS Persistent Grid Testbed
[Map of testbed sites and their network connections.
Sites: UC Berkeley / LBNL-NERSC, Argonne National Laboratory, Brookhaven National Laboratory, Boston University, Indiana University, U Michigan, Oklahoma University, University of Texas at Arlington.
Networks shown: ESnet, Abilene, MREN, CalREN, NTON, NPACI.
Legend: Prototype Tier 2s; HPSS sites.]
Evolution of US ATLAS Facilities Plan
In Response to Changes or Potential Changes in
Schedule
Computing Model & Requirements
Technology
Budgetary Guidance
Changes in Schedule
LHC start-up projected to be a year later
2005/2006 → 2006/2007
ATLAS Data Challenges (DC’s) have, so far, stayed fixed
DC0 – Nov/Dec 2001 – 10^5 events
Software continuity test
DC1 – Feb/Jul 2002 – 10^7 events
~1% scale test
DC2 – Jan/Sep 2003 – 10^8 events
~10% scale test
A serious functionality & capacity exercise
A high level of US ATLAS facilities participation is deemed very
important
Computing Model and Requirements
Nominal model was:
At Tier 0 (CERN)
Raw ESD/AOD/TAG pass done, result shipped to Tier 1’s
At Tier 1’s (six anticipated for ATLAS)
TAG/AOD/~25% of ESD on Disk, Tertiary storage for remainder of ESD
Selection passes through complete ESD ~monthly
Analysis of TAG/AOD/selected ESD/etc. (n-tuples) on disk for analysis pass by
~200 users within 4 hours
At Tier 2’s (five in U.S.)
Data access primarily via Tier 1 (to control load on CERN and transatlantic link)
Support ~50 users as above, but frequent access to ESD on disk at Tier 1 is likely
Serious limitations are
A month is a long time to wait for the next selection pass
Only 25% of ESD is available for event navigating from TAG/AOD during analysis
The 25% of ESD on disk will rarely have been consistently selected (once a
month) and will be continuously rotating, altering the accessible subset of data
Changes in Computing Model and
Requirements (2)
Underlying problem:
Selection pass and analysis event navigation access to ESD is sparse
Estimated to be ~1 out of 100 events per analysis
ESD is on tape rather than on disk
Tape is a sequential medium
Must access 100 times more data than needed
Tape is expensive per unit of I/O bandwidth
As much as 10 times that of disk
Thus penalty in access cost relative to disk may be a factor of ~1000
Solution:
Get all ESD on disk
Methods for accomplishing this are:
Buy more disk at Tier 1 – most straightforward
Unify/coordinate use of existing disk across multiple Tier 1’s – more economical
Some combination of above – compromise as necessitated by available funding
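The ~1000 penalty quoted under "Underlying problem" is the product of two factors from that list. A minimal Python sketch of the arithmetic, treating the slide's approximate figures as point estimates (the variable names are illustrative, not from the original):

```python
# Figures quoted on the slide, treated here as point estimates.
sparse_fraction = 1 / 100    # ~1 out of 100 ESD events is needed per analysis
tape_vs_disk_io_cost = 10    # tape costs up to ~10x disk per unit of I/O bandwidth

# Tape is sequential, so the unwanted events must be streamed past as well.
read_amplification = 1 / sparse_fraction

# Extra data read times extra cost per unit of bandwidth.
penalty = read_amplification * tape_vs_disk_io_cost

print(f"read amplification on tape: {read_amplification:.0f}x")
print(f"access-cost penalty relative to disk: ~{penalty:.0f}x")
```

Both inputs are upper-end estimates ("as much as 10 times"), so the result is best read as an order of magnitude.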
“2007” Capacities for U.S. Tier 1
Options
                      Tape Based    3 Tier 1      Standalone
                      Model         Disk Model    Disk Model
CPU (kSPECint95)      209           329           500
Disk (TBytes)         365           483           1000
Tape (PBytes)         1.85          1.85          1.85
Disk (GBytes/sec)     18.3          18.3          18.3
Tape (MBytes/sec)     802           185           185
WAN (Mbit/sec)        4610          9864          9864
Annotations: 1/3+1/6 of ESD on disk, ESD pass each month; add other 2/3 of ESD, ESD pass per group each day
“3 Tier 1” Model (Complete ESD found on disk of U.S. plus 2 other Tier 1’s)
Highly dependent on the performance of other Tier 1’s and the Grid middleware and
network (transatlantic) used to connect to them
“Standalone” Model (Complete ESD on disk of US Tier 1)
While avoiding above dependencies, is more expensive
Changes in Technology
No dramatic new technologies
Previously assumed technologies are tracking Moore’s Law well
Recent price performance points from RHIC Computing Facility
CPU: IBM procurement - $33/SPECint95
310 Dual 1 GHz Pentium III nodes @ 97.2 SPECint95/Node
Delivered Aug 2001
$1M fully racked including cluster management hardware & software
Disk: OSSI/LSI procurement - $27k/TByte
33 Usable TB of high availability Fibre Channel RAID 5 @ 1400 MBytes/sec
Delivered Sept 2001
$887k including SAN switch
Strategy is to project, somewhat conservatively, from these points for
facilities design and costing
Actually used a 20-month rather than the observed <18-month
price/performance halving time for disk and CPU
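The projection strategy above can be sketched in a few lines of Python. This is an illustration only: it assumes a simple exponential decline in unit cost with the 20-month halving time, and it places each delivery on the first of its month, which the slide does not state. The function and variable names are hypothetical.

```python
from datetime import date

HALVING_MONTHS = 20  # planning value from the slide (observed: < 18 months)

# Reference price points from the RHIC Computing Facility procurements.
# Day of month is an assumption; the slide gives only month and year.
cpu_ref = {"when": date(2001, 8, 1), "total_usd": 1_000_000, "units": 310 * 97.2}  # SPECint95
disk_ref = {"when": date(2001, 9, 1), "total_usd": 887_000, "units": 33}           # usable TB


def unit_cost(ref):
    """Cost per unit of capacity at the reference date."""
    return ref["total_usd"] / ref["units"]


def months_between(start, end):
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def projected_unit_cost(ref, when, halving_months=HALVING_MONTHS):
    """Unit cost at `when`, assuming cost halves every `halving_months`."""
    elapsed = months_between(ref["when"], when)
    return unit_cost(ref) * 0.5 ** (elapsed / halving_months)


print(f"CPU,  Aug 2001: ${unit_cost(cpu_ref):.1f}/SPECint95")
print(f"Disk, Sep 2001: ${unit_cost(disk_ref) / 1000:.1f}k/TByte")
print(f"CPU,  Apr 2003: ${projected_unit_cost(cpu_ref, date(2003, 4, 1)):.1f}/SPECint95")
```

The first two lines reproduce the quoted $33/SPECint95 and $27k/TByte; the third shows the unit cost halving one 20-month period later.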
Changes in Budgetary Assumptions (2)
Assumed Funding Profiles (At Year $K)
Planning Date   FY 01   FY 02   FY 03   FY 04   FY 05   FY 06   FY 07   FY 08
Nov-00           1411    1609    2398    3270    5074    8348
Nov-01            855     839    1600    2500    4600    7000   10600    8000
For the revised LHC start-up schedule, the new profile is better
For ATLAS DC 2 which stayed fixed in ’03, new profile is worse
Hardware capacity goals of DC 2 will not be met
Personnel intensive facility development may be as much as 1 year behind
Hope is that another DC will be added allowing validation of a more
nearly fully developed Tier 1 and US ATLAS facilities Grid
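The "better for start-up, worse for DC 2" comparison can be made concrete by accumulating the two funding profiles. The sketch below assumes the six Nov-00 values correspond to FY 01 through FY 06; the flattened table does not state this explicitly.

```python
from itertools import accumulate

fiscal_years = ["FY01", "FY02", "FY03", "FY04", "FY05", "FY06", "FY07", "FY08"]

# Assumed funding profiles in at-year $K.
# Assumption: the Nov-00 profile covers FY01-FY06 only.
nov00 = [1411, 1609, 2398, 3270, 5074, 8348]
nov01 = [855, 839, 1600, 2500, 4600, 7000, 10600, 8000]

cum00 = list(accumulate(nov00))
cum01 = list(accumulate(nov01))

for i, fy in enumerate(fiscal_years):
    old = f"{cum00[i]:>6}" if i < len(cum00) else "     -"
    print(f"{fy}  Nov-00 cumulative: {old}   Nov-01 cumulative: {cum01[i]:>6}")
```

Under that assumption, cumulative funding through FY 03 (the DC 2 year) falls from 5418 to 3294, while the total available before start-up rises from 22110 (through FY 06) to 27994 (through FY 07).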
Profiles for Standalone Disk Option
Much higher functionality (than other options) and, given new stretched out LHC
schedule, within budget guidance
Fractions in revised profiles in table below are of a final system which has
nearly 2.5 times the capacity of that discussed last year
Year                 2001    2002   2003*    2004    2005    2006    2007
Previous Profiles
  ATLAS                                5%     15%     40%    100%
  US ATLAS             1%      2%      5%     10%     20%    100%
Revised Profiles
  ATLAS **                                   "5%"     18%     45%    100%
  US ATLAS           0.1%    0.2%      1%      3%     10%     30%    100%
* DC2
** Converted from a funding profile
Legend: Region of strictly limited funding
Associated Labor Profile
                             FY '01  FY '02  FY '03  FY '04  FY '05  FY '06  FY '07  FY '08
11/00 Projection (FTE's)          5       7      10      15      25      25      25      25
11/01 Projection* (FTE's)       2.7     4.2     6.5      11      16      22      25      25
Labor Cost (@Yr $K)             419     677    1090    1918    2901    4149    4903    5099
Support Costs (@Yr $K)           50      66      91     141     199     271     313     322
Total Cost (@Yr $K)             469     743    1181    2058    3100    4420    5216    5421
* Not including .5 FTE of PPDG in FY '02-'04
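A quick way to read the labor table is to divide labor cost by the 11/01 FTE projection. The Python sketch below does this; the resulting year-over-year escalation of roughly 4% is an inference from the table, not a figure stated on the slide.

```python
fiscal_years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
ftes = [2.7, 4.2, 6.5, 11, 16, 22, 25, 25]                     # 11/01 projection
labor_cost_k = [419, 677, 1090, 1918, 2901, 4149, 4903, 5099]  # at-year $K

# Implied fully loaded labor cost per FTE, in at-year $K.
cost_per_fte = [cost / fte for cost, fte in zip(labor_cost_k, ftes)]

# Year-over-year growth of the per-FTE cost.
escalation = [later / earlier - 1
              for earlier, later in zip(cost_per_fte, cost_per_fte[1:])]

for fy, per_fte in zip(fiscal_years, cost_per_fte):
    print(f"FY {fy}: ${per_fte:.0f}K per FTE")
print("escalation per year:", ", ".join(f"{e:.1%}" for e in escalation))
```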
Summary Tier 1 Cost Profile
(At Year $K)
                          2001    2002    2003    2004    2005    2006    2007   TOTAL    2008
CPU                         30       -      59     117     305     565   1,316   2,392
Disk                       100       -     118     263     564   1,058   2,446   4,549
Tertiary Storage            55       6      45     140     120     225     305     896
LAN                         79       -      20      20      90     100     250     559
Other Infrastructure        40       -      11      26      53      90     207     427
Sftwr, Lic. & Maint.        50      89     128     165     215     307     443   1,398
Overhead                    35      19      47      80     136     228     455     999
Hardware                   389     114     428     811   1,484   2,573   5,422  11,220   2,572
Labor                      469     743   1,181   2,058   3,100   4,420   5,216  17,187   5,421
Total                      857     857   1,609   2,869   4,584   6,992  10,638  28,407   7,993
Guidance                   855     839   1,600   2,500   4,600   7,000  10,700  28,094   8,000
The current plan exceeds guidance by $370k in FY ’04, but this is a year of
some flexibility in guidance
Strict adherence to FY ’04 guidance would …
reduce facility capacity from 3% to 1.5% or staff by 2 FTE’s
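The FY ’04 overrun can be checked directly from the Total and Guidance rows of the cost profile. A small Python sketch (values in at-year $K, copied from the table; names are illustrative):

```python
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007]
total_k = [857, 857, 1609, 2869, 4584, 6992, 10638]     # "Total" row
guidance_k = [855, 839, 1600, 2500, 4600, 7000, 10700]  # "Guidance" row

# Positive values mean the plan exceeds guidance in that year.
overrun_k = {year: total - guide
             for year, total, guide in zip(years, total_k, guidance_k)}

for year, delta in overrun_k.items():
    print(f"{year}: {delta:+5d} $K relative to guidance")
print(f"2001-2007 net: {sum(overrun_k.values()):+d} $K")
```

FY ’04 stands out at +369 $K (the "$370k" quoted above); every other year is within a few tens of $K of guidance, and FY ’05 through FY ’07 each come in under it.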
Tier 1 Capacity Profile
                        2001    2002    2003    2004    2005    2006    2007
CPU (kSPECint95)           3       3       6      15      50     150     500
Disk (TBytes)              2       2       8      30     100     300   1,000
Disk (MBytes/sec)         40      40     200     600   2,000   6,000  20,000
Tape (PBytes)           0.01    0.02    0.05    0.09    0.15    0.65    1.85
Tape (MBytes/sec)         10      10      20      20      48     106     212
WAN (Mbits/sec)          155     155     622     622    2488    9952    9952
Tier 1 Cost Profiles
[Chart: Hardware and Labor cost by year, 2001-2007, in at-year $K; vertical axis from $0 to $6,000K]
Standalone Disk Model Benefits
All ESD, AOD, and TAG data on local disk
Enables analysis specific 24 hour selection passes (versus one month
aggregated passes) – faster, better tuned, more consistent selection
Allows navigation for individual events (to all processed, but not Raw,
data) without recourse to tape and associated delay – faster more
detailed analysis of larger consistently selected data sets
Avoids contention between analyses over ESD disk space and the
need for complex algorithms to optimize management of that space –
better result with less effort
While prepared to serve appropriate levels of data access to other Tier
1’s, US will not in general be unduly sensitive to the performance of
other Tier 1’s or intervening network (transatlantic) and middleware –
improved system reliability, availability, robustness and performance
Tier 2 Issues
The high availability of the complete ESD set on disk at the Tier 1 and
the associated increased frequency of ESD selection passes will, for
connected Tier 2’s (and Tier 3’s), lead to …
More analysis activity – (Increasing CPU & Disk utilization)
More frequent analysis passes on
More and larger usable TAG, AOD and ESD subsets
More network traffic into the site from the Tier 1 – (Increasing WAN utilization)
Selection results
Event navigation into the full disk resident ESD
As in the case of the Tier 1, an additional year of funding before turn-on
and the increased effectiveness of “year later” funding contribute to
satisfying these increased needs within or near the integrated out year
(’05-’07) budget guidance
The delay of some ’06 funding to ’07 is required for a better match of profiles
Tier 2 Distribution Of Hardware Cost
[Pie chart: Total Tier 2 Hardware Costs, by category: CPU, Disks, Interactive, FireWall, Spec. Purp, LAN, Desktop, Backup, SW, Travel, Videoconf, Supplies, Tapes, Tape HW, Tape LM.
Slice labels (not matched to categories): 32%, 24%, 10%, 8%, 5%, 4%, 4%, 3%, 2%, 2%, 2%, 2%, 1%, 1%, 0%]
Tier 1 Distribution Of Hardware Cost
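Total Procurements Through 2007
[Pie chart of hardware cost by category: Disk 45%, CPU 23%, Sftwr, Lic. & Maint. 14%, Tertiary Storage 9%, LAN 5%, Other Infrastructure 4%
(slice labels matched to categories using the TOTAL column of the Summary Tier 1 Cost Profile, excluding Overhead)]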
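The six percentage labels on this chart can be reproduced from the TOTAL column of the Summary Tier 1 Cost Profile. The Python sketch below assumes that Overhead is excluded from the chart; that is an inference from the six legend entries, not something the slide states.

```python
# TOTAL column (procurements through 2007, at-year $K) from the
# Summary Tier 1 Cost Profile. Overhead is left out: an assumption
# based on the chart showing only these six categories.
totals_k = {
    "CPU": 2392,
    "Disk": 4549,
    "Tertiary Storage": 896,
    "LAN": 559,
    "Other Infrastructure": 427,
    "Sftwr, Lic. & Maint.": 1398,
}

grand_total = sum(totals_k.values())
shares = {name: round(100 * cost / grand_total) for name, cost in totals_k.items()}

# Print the categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<22}{pct:>3}%")
```

Disk and CPU together account for about two thirds of the hardware procurement, which is why the choice of disk model dominates the cost comparison.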
FY 2007 Capacity Comparison of Models
                      Previous      New
Tier 1
  CPU (kSPECint95)         209      500
  Disk (TBytes)            365     1000
  Tape (TBytes)           2000   < 2000
Tier 2 a-e
  CPU (kSPECint95)         250      500
  Disk (TBytes)            375      500
  Tape (TBytes)           1000   < 1000
Total
  CPU (kSPECint95)         459     1000
  Disk (TBytes)            740     1500
  Tape (TBytes)           3000   < 3000
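To see how much each tier grows between the previous and new models, the CPU and disk figures above can be compared directly. A short Python sketch (tape is omitted because the new values are only upper bounds; dictionary keys are illustrative):

```python
# FY 2007 capacities: CPU in kSPECint95, disk in TBytes.
previous = {
    "Tier 1": {"cpu": 209, "disk": 365},
    "Tier 2 a-e": {"cpu": 250, "disk": 375},
}
new = {
    "Tier 1": {"cpu": 500, "disk": 1000},
    "Tier 2 a-e": {"cpu": 500, "disk": 500},
}


def totals(model):
    """Sum each resource over all tiers of a model."""
    return {key: sum(tier[key] for tier in model.values()) for key in ("cpu", "disk")}


# Growth factor per tier and resource, new model over previous model.
ratios = {
    tier: {key: new[tier][key] / previous[tier][key] for key in ("cpu", "disk")}
    for tier in previous
}

for tier, growth in ratios.items():
    print(f"{tier:<11} CPU x{growth['cpu']:.2f}   Disk x{growth['disk']:.2f}")
print("previous totals:", totals(previous))
print("new totals:     ", totals(new))
```

The Tier 1 factors (about 2.4 for CPU and 2.7 for disk) bracket the "nearly 2.5 times the capacity" quoted for the standalone disk option.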
Conclusions
Standalone disk model
A dramatic improvement over previous tape based model –
Functionality & Performance
A significant improvement over multi-Tier 1 disk model –
Performance, Reliability & Robustness
Respects funding guidance in model sensitive out-years
If costs are higher or funding lower than expected, a
graceful fallback is to access some of the data on disks at
other Tier 1’s
Adiabatically move toward multi-Tier 1 model