Data Management at CERN’s
Large Hadron Collider (LHC)
Dirk Düllmann
CERN IT/DB, Switzerland
http://cern.ch/db
http://pool.cern.ch
D. Duellmann, CERN Data Management at the LHC 1
Outline
• Short Introduction to CERN & LHC
• Data Management Challenges
• The LHC Computing Grid (LCG)
• LCG Data Management Components
• Object Persistency and the POOL Project
• Connecting to the GRID – LCG Replica Location Service
CERN - The European Organisation for Nuclear Research
The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 20 European countries (member states)
+ others (US, Canada, Russia, India, ….)
~€650M budget - operation + new accelerators
2000 staff + 6000 users (researchers) from all over the world
• Next Major Research Project - LHC start ~2007
• 4 LHC Experiments, each with
• 2000 physicists, 150 universities, apparatus costing ~€300M,
computing ~€250M to set up, ~€60M/year to run
• 10-15 year lifetime
[Figure labels: 27 km ring, Computer Centre, Geneva]
The LHC machine
Two counter-circulating proton beams
Collision energy 7+7 TeV
27 km of magnets with a field of 8.4 Tesla
Super-fluid helium cooled to 1.9 K
The world’s largest superconducting structure
Online system
• Multi-level trigger
• Filters out background
• Reduces data volume from 40 TB/s to 500 MB/s
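As a rough check of the reduction quoted above, the following sketch computes the factor by which the trigger has to shrink the data stream. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the variable names are illustrative and not taken from any experiment software.

```python
# Reduction factor of the multi-level trigger, using the two rates on the slide.
# Assumption: decimal (SI) units, i.e. 1 TB = 1e12 bytes and 1 MB = 1e6 bytes.
TB = 10**12
MB = 10**6

detector_rate = 40 * TB    # bytes/s entering the online system
storage_rate = 500 * MB    # bytes/s left after the trigger

reduction_factor = detector_rate // storage_rate
kept_fraction = storage_rate / detector_rate

print(f"reduction factor: {reduction_factor}")
print(f"fraction kept:    {kept_fraction:.2e}")
```

Under these assumptions only about one byte in 80,000 produced by the detector is written to storage.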
LHC Data Challenges
• 4 large experiments, 10-15 year lifetime
• Data rates: 500MB/s – 1.5GB/s
• Total data volume: 12-14PB / year
• Several hundred PB total !
• Analysed by thousands of users world-wide
• Data reduced from “raw data” to “analysis data” in
a small number of well-defined steps
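To relate a recording rate to a yearly volume, a small calculation helps. The sketch below assumes about 10^7 seconds of data taking per year; that figure is an assumption for illustration and does not come from the slides, so the result is an order-of-magnitude guide only. The quoted total of 12-14 PB/year depends on each experiment's actual rate and running time, which are not given here.

```python
# Yearly data volume produced by one stream recorded at a constant rate.
# Assumption (not from the slides): about 1e7 seconds of data taking per year.
MB = 10**6
GB = 10**9
PB = 10**15
LIVE_SECONDS_PER_YEAR = 10**7  # hypothetical value, for illustration only


def yearly_volume_pb(rate_bytes_per_s, live_seconds=LIVE_SECONDS_PER_YEAR):
    """Return the volume in petabytes written in one year of running."""
    return rate_bytes_per_s * live_seconds / PB


low = yearly_volume_pb(500 * MB)    # lower end of the quoted rate range
high = yearly_volume_pb(1.5 * GB)   # upper end of the quoted rate range
print(f"{low:.1f} PB/year to {high:.1f} PB/year per stream")
```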
Data Handling and Computation for Physics Analysis
[Figure: data flow from the detector through the event filter (selection & reconstruction) to raw data, event summary data, processed data and analysis objects (extracted by physics topic), with event reprocessing, event simulation, batch physics analysis and interactive physics analysis. Source: les.robertson@cern.ch]
Planned capacity evolution at CERN
[Figures: Estimated Mass Storage at CERN (PetaBytes, 1998-2010); Estimated Disk Capacity at CERN (TeraBytes, 1998-2010); Estimated CPU Capacity at CERN (K SI95, 1998-2010, with a Moore's law reference). Each plot separates LHC from other experiments.]
Multi Tiered Computing Models - Computing Grids
[Figure: the LHC Computing Centre at CERN linked to Tier 1 centres (UK, USA, France, Italy, Germany, ...), Tier 2 centres, Tier 3 physics departments and desktops; labs (Lab a, b, c, m) and universities (Uni a, b, n, x, y) form regional groups and physics groups. Source: les.robertson@cern.ch]
LHC Data Models
• LHC data models are complex!
• Typically hundreds (500-1000) of structure types (classes in OO)
• Many relations between them
• Different access patterns
• LHC experiments rely on OO technology
• OO applications deal with networks of objects
• Pointers (or references) are used to describe inter-object relations
• Need to support this navigational model in our data store
[Figure: example object network: an Event refers to Tracker and Calor. objects, which refer to a TrackList and a HitList, which in turn refer to Track and Hit objects]
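The navigational model can be illustrated with a toy object network. The class and attribute names below (Event, Track, Hit, momentum, energy, channel) are hypothetical stand-ins and not taken from any experiment's data model; the sketch only shows that objects are reached by following references and that one object may be shared by several others.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    channel: int
    energy: float


@dataclass
class Track:
    momentum: float
    hits: List[Hit] = field(default_factory=list)  # references, not copies


@dataclass
class Event:
    number: int
    tracks: List[Track] = field(default_factory=list)
    hits: List[Hit] = field(default_factory=list)


def build_event():
    """Build a small event in which two tracks share one hit."""
    hits = [Hit(channel=i, energy=0.5 * i) for i in range(4)]
    t1 = Track(momentum=10.0, hits=[hits[0], hits[1]])
    t2 = Track(momentum=20.0, hits=[hits[1], hits[2], hits[3]])
    return Event(number=1, tracks=[t1, t2], hits=hits)


event = build_event()
# Navigate Event -> Track -> Hit by following references.
energies = [sum(h.energy for h in t.hits) for t in event.tracks]
# The second hit is shared: both tracks point at the same object.
shared = event.tracks[0].hits[1] is event.tracks[1].hits[0]
print(energies, shared)
```

A data store that supports this model has to preserve such references, including shared ones, when objects are written and read back.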
What is POOL?
• POOL is the common persistency framework for physics applications at the LHC
• Pool Of persistent Objects for LHC
• Hybrid Store – Object Streaming & Relational Database
• e.g. ROOT I/O for object streaming
- complex data, simple consistency model (write once)
• e.g. RDBMS for consistent metadata handling
- simple data, transactional consistency
• Initiated in April 2002
• Ramping up over the last year from 1.5 FTE to ~10 FTE
• Common effort between LHC experiments and the CERN Database group
• Project scope, architecture and development
• => Rapid feedback cycles between project and its users
• First larger data productions starting now!
Component Architecture
• POOL (as most other LCG software) is based on a strict component
software approach
• Components provide technology neutral APIs
• Communicate with other components only via abstract component
interfaces
• Goal: Insulate the very large experiment software
systems from concrete implementation details and
technologies used today
• POOL user code is not dependent on any implementation libraries
• No link time dependency on any implementation packages
(e.g. MySQL, ROOT, Xerces-c, ...)
• Component implementations are loaded at runtime via a plug-in
infrastructure
• The POOL framework consists of three major, weakly coupled domains
POOL Components
POOL API
Storage Service: ROOT I/O Storage Svc, RDBMS Storage Svc
FileCatalog: XML Catalog, MySQL Catalog, EDG Replica Location Service
Collections: Explicit Collection, Implicit Collection
POOL Generic Storage Hierarchy
• An application may access databases (e.g. streaming files) from one or
more file catalogs
• Each database is structured into containers of one specific technology
(e.g. ROOT trees or RDBMS tables)
• POOL provides a “smart pointer” type, pool::Ref
• to transparently load objects from the back end into a client-side cache
• to define persistent inter-object associations across file or technology
boundaries
[Figure: storage hierarchy: POOL Context, FileCatalog, Database, Container, Object]
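The idea behind a smart pointer of this kind can be sketched as a lazy reference that loads its target on first use. The Ref class below is a toy and does not reproduce the pool::Ref interface; the token layout (database, container, object id) follows the hierarchy described above, while the dictionary back end and all names are assumptions made for illustration.

```python
class Ref:
    """Toy lazy reference: holds a token and loads the object on first use."""

    def __init__(self, store, cache, token):
        self._store = store      # stand-in for the storage back end
        self._cache = cache      # client-side cache shared by all references
        self.token = token       # (database, container, object id)

    def get(self):
        # Load from the back end only if the object is not cached yet.
        if self.token not in self._cache:
            db, container, oid = self.token
            self._cache[self.token] = self._store[db][container][oid]
        return self._cache[self.token]


# Hypothetical back end: database -> container -> object id -> object.
store = {
    "fileA": {"Tracks": {0: {"momentum": 10.0, "hit": ("fileB", "Hits", 7)}}},
    "fileB": {"Hits": {7: {"energy": 1.5}}},
}
cache = {}

track_ref = Ref(store, cache, ("fileA", "Tracks", 0))
loaded_before = len(cache)            # nothing is loaded until get() is called
track = track_ref.get()
hit = Ref(store, cache, track["hit"]).get()   # association crosses files
print(loaded_before, len(cache), hit["energy"])
```

The stored association is just a token, so it can point into another file or another storage technology without the client code changing.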
Data Dictionary & Storage
[Figure: dictionary generation (C++ header, abstract DDL, GCC-XML, code generator, LCG dictionary code) and use (LCG dictionary, gateway, CINT dictionary, reflection, data I/O, other clients); the I/O layer is technology dependent]
POOL File Catalog
• Files are referred to inside POOL via a unique and immutable file identifier,
which is system-generated at file creation time
• This provides stable inter-file references
• FileIDs are implemented as Globally Unique Identifiers (GUIDs)
• Allows consistent sets of files with internal references to be created
- without requiring a central ID allocation service
• Catalog fragments created independently can later be merged without
modification to the corresponding data files
[Figure: logical naming maps LFN1 ... LFNn to one FileID; object lookup maps the FileID to (PFN1, technology) ... (PFNn, technology); file identity and metadata are attached to the FileID]
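The catalog mapping and the merge property can be sketched with Python's uuid module. The Catalog class and its method names are hypothetical, not the POOL FileCatalog interface, and clashes between logical names in different fragments are not handled. The point illustrated is that identifiers generated locally stay valid after two independently created fragments are merged.

```python
import uuid


class Catalog:
    """Toy catalog: LFN -> FileID and FileID -> [(PFN, technology)]."""

    def __init__(self):
        self.lfn_to_id = {}
        self.replicas = {}

    def register(self, lfn, pfn, technology):
        file_id = str(uuid.uuid4())   # generated locally, no central service
        self.lfn_to_id[lfn] = file_id
        self.replicas[file_id] = [(pfn, technology)]
        return file_id

    def lookup(self, lfn):
        """Resolve a logical name to its physical replicas."""
        return self.replicas[self.lfn_to_id[lfn]]

    def merge(self, other):
        """Merge another fragment; existing FileIDs are left unchanged."""
        self.lfn_to_id.update(other.lfn_to_id)
        for file_id, pfns in other.replicas.items():
            self.replicas.setdefault(file_id, []).extend(pfns)


# Two fragments created independently, e.g. at two sites.
site_a = Catalog()
site_b = Catalog()
id_a = site_a.register("lfn:run1", "file:/site_a/run1.root", "ROOT")
id_b = site_b.register("lfn:run2", "file:/site_b/run2.root", "ROOT")

site_a.merge(site_b)
print(site_a.lookup("lfn:run2"))
```

Because references between files use the FileID rather than a path, the data files themselves need no modification when catalogs are merged.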
EDG Replica Location Services - Basic Functionality
• Each file has a unique GUID. Locations corresponding to the GUID are kept in
the Replica Location Service.
• Users may assign aliases to the GUIDs. These are kept in the Replica
Metadata Catalog.
• Files have replicas stored at many Grid sites on Storage Elements.
• The Replica Manager provides atomicity for file operations, assuring
consistency of SE and catalog contents.
[Figure: Replica Manager connected to the Replica Metadata Catalog, the Replica Location Service and two Storage Elements. Source: james.casey@cern.ch]
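The atomicity requirement on the Replica Manager can be illustrated with a toy copy-and-register operation that either completes both steps or leaves no trace. This is not the EDG Replica Manager interface: the class, the dictionaries standing in for Storage Elements and the catalog, the site names and the fail_on_register switch are all assumptions made for the sketch.

```python
class ReplicaManager:
    """Toy manager: copy to a storage element and register, or do neither."""

    def __init__(self, catalog, storage_elements):
        self.catalog = catalog            # guid -> set of site names
        self.storage = storage_elements   # site name -> {guid: data}

    def replicate(self, guid, source, target, fail_on_register=False):
        data = self.storage[source][guid]
        self.storage[target][guid] = data             # step 1: copy the file
        try:
            if fail_on_register:                      # simulated catalog failure
                raise RuntimeError("catalog unavailable")
            self.catalog.setdefault(guid, set()).add(target)  # step 2: register
        except Exception:
            del self.storage[target][guid]            # roll back the copy
            raise


storage = {"cern": {"g1": b"event data"}, "site_a": {}, "site_b": {}}
catalog = {"g1": {"cern"}}
rm = ReplicaManager(catalog, storage)

rm.replicate("g1", "cern", "site_a")                  # succeeds
try:
    rm.replicate("g1", "cern", "site_b", fail_on_register=True)
    failed = False
except RuntimeError:
    failed = True

print(sorted(catalog["g1"]), "g1" in storage["site_b"], failed)
```

After the failed call the catalog and the storage contents still agree, which is the consistency property the slide describes.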
Interactions with other Grid Middleware Components
Applications and users interface to data through the Replica Manager either
directly or through the Resource Broker.
[Figure: User Interface or Worker Node, Resource Broker, Virtual Organization Membership Service, Information Service, Replica Metadata Catalog, Replica Location Service, Replica Manager, Replica Optimization Service, Storage Elements, SE Monitor, Network Monitor. Source: james.casey@cern.ch]
RLS Service Goals
• To offer production quality services for LCG 1 to meet the
requirements of forthcoming (and current!) data challenges
• e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC’04
• To provide distribution kits, scripts and documentation to assist
other sites in offering production services
• To leverage the many years’ experience in running such services
at CERN and other institutes
• Monitoring, backup & recovery, tuning, capacity planning, …
• To understand the experiments’ requirements for how these services
should be established and extended, and to clarify current
limitations
• Not targeting small-medium scale DB apps that need to be run
and administered locally (to user)
Conclusions
• Data Management at LHC remains a significant challenge because of
data volume, project lifetime, complexity of S/W and H/W setups.
• The LHC Computing Grid (LCG) approach is based on, e.g., the EDG and
GLOBUS Middleware projects and uses a strict component approach
for physics application software
• The LCG-POOL project has developed a technology neutral
persistency framework which is currently being integrated into the
experiment production systems
• In conjunction with POOL, a data catalog production service is
provided to support several upcoming data productions in the
hundreds-of-terabytes range
LHC Software Challenges
• Experiment software systems are large and complex
• Developed by teams of expert developers
• Permanent evolution and improvement for years…
• Analysis is performed by many end user developers
• Often participating only for short time
• Usually without strong computer science background
• Need simple and stable software environment
• Need to manage change over a long project lifetime
• Migration to new software, implementation languages
• New computing platforms, storage media
• New computing paradigms ???
• The data management system needs to be designed so as to confine the
impact of unavoidable change during the project