Data Management at CERN’s Large Hadron Collider (LHC)



Dirk Düllmann

CERN IT/DB, Switzerland

http://cern.ch/db

http://pool.cern.ch









D. Duellmann, CERN Data Management at the LHC 1

Outline



• Short Introduction to CERN & LHC

• Data Management Challenges

• The LHC Computing Grid (LCG)



• LCG Data Management Components

• Object Persistency and the POOL Project

• Connecting to the GRID – LCG Replica Location Service










CERN - The European Organisation for Nuclear Research

The European Laboratory for Particle Physics









• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 20 European countries (member states) + others (US, Canada, Russia, India, ...)
- ~€650M budget - operation + new accelerators
- 2000 staff + 6000 users (researchers) from all over the world

• Next major research project - LHC start ~2007
• 4 LHC experiments, each with
• 2000 physicists, 150 universities, apparatus costing ~€300M, computing ~€250M to set up, ~€60M/year to run
• 10-15 year lifetime

[Figure: the 27 km LHC ring and the CERN Computer Centre, Geneva]


The LHC machine

Two counter-circulating proton beams

Collision energy 7+7 TeV

27 km of magnets with a field of 8.4 Tesla

Superfluid helium cooled to 1.9 K





The world’s largest superconducting structure




Online system

• Multi-level trigger
• Filters out background
• Reduces data volume from 40 TB/s to 500 MB/s
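As a quick check of the scale involved, the reduction quoted above can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), and the variable names are illustrative.

```python
# Rates quoted on the slide, in bytes per second (decimal units assumed).
TB = 10**12
MB = 10**6

detector_rate = 40 * TB    # data volume entering the multi-level trigger
recorded_rate = 500 * MB   # data volume left after background is filtered out

# How many bytes arrive for every byte that is kept.
reduction_factor = detector_rate // recorded_rate
# Fraction of the incoming data volume that survives the trigger.
kept_fraction = recorded_rate / detector_rate

print(f"reduction factor: {reduction_factor:,}")
print(f"fraction kept:    {kept_fraction:.2e}")
```

Under these assumptions the trigger keeps roughly one byte in 80,000.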










LHC Data Challenges



• 4 large experiments, 10-15 year lifetime
• Data rates: 500 MB/s – 1.5 GB/s
• Total data volume: 12-14 PB / year
• Several hundred PB total!
• Analysed by thousands of users world-wide

• Data reduced from “raw data” to “analysis data” in a small number of well-defined steps








Data Handling and Computation for Physics Analysis

[Figure (les.robertson@cern.ch): data flow at CERN between the detector, the event filter (selection & reconstruction), raw data, event reprocessing, event simulation, processed data, event summary data, batch physics analysis, analysis objects (extracted by physics topic) and interactive physics analysis]

[Figure: two charts covering 1998-2010. "Estimated Mass Storage at CERN" (PetaBytes, axis 0-140) and "Estimated DISK Capacity at CERN" (TeraBytes, axis 0-7000), each split into LHC and other experiments]









[Figure: "Estimated CPU Capacity at CERN", planned capacity evolution 1998-2010 in K SI95 (axis 0-6,000), split into LHC and other experiments and compared with Moore’s law]

Multi Tiered Computing Models - Computing Grids

[Figure (les.robertson@cern.ch): tiered grid centred on the LHC Computing Centre at CERN. Labels: Tier 1 (CERN, UK, USA, France, Italy, Germany, ...), Tier2 (Lab a, Lab b, Lab c, Lab m, Uni a, Uni b, Uni n, Uni x, Uni y), Tier3 (physics department), Desktop, regional group, physics group]

LHC Data Models

• LHC data models are complex!
• Typically hundreds (500-1000) of structure types (classes in OO)
• Many relations between them
• Different access patterns

• LHC experiments rely on OO technology
• OO applications deal with networks of objects
• Pointers (or references) are used to describe inter-object relations

• Need to support this navigational model in our data store

[Figure: example object network with Event, Tracker, Calor., TrackList, HitList, Track and Hit objects]
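The navigational model can be illustrated with a very small object network. The sketch below is only an illustration in Python: the class names follow the labels in the figure, while the fields (`channel`, `energy`, `momentum`) and the helper function are invented for the example and are not taken from any experiment's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hit:
    channel: int     # hypothetical field
    energy: float    # hypothetical field


@dataclass
class Track:
    momentum: float  # hypothetical field
    # A track refers to the hits it was built from; these are references
    # to the same Hit objects held by the event, not copies.
    hits: List[Hit] = field(default_factory=list)


@dataclass
class Event:
    number: int
    tracks: List[Track] = field(default_factory=list)
    hits: List[Hit] = field(default_factory=list)


def hit_energy_on_tracks(event: Event) -> float:
    """Navigate Event -> Track -> Hit and sum the energy of the hits used by tracks."""
    return sum(hit.energy for track in event.tracks for hit in track.hits)


hits = [Hit(1, 0.5), Hit(2, 1.5), Hit(3, 2.0)]
event = Event(number=1, hits=hits,
              tracks=[Track(10.0, hits[:2]), Track(20.0, hits[2:])])
total = hit_energy_on_tracks(event)
```

A data store that supports this model has to preserve the references between objects, so that following `track.hits` after reading the data back still reaches the right `Hit` objects.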


What is POOL?

• POOL is the common persistency framework for physics applications at the LHC

• Pool Of persistent Objects for LHC



• Hybrid Store – Object Streaming & Relational Database

• E.g. ROOT I/O for object streaming
- complex data, simple consistency model (write once)
• E.g. RDBMS for consistent metadata handling
- simple data, transactional consistency
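The split between the two halves of a hybrid store can be sketched with standard-library stand-ins. This is not POOL code: `pickle` stands in for an object streaming layer such as ROOT I/O, `sqlite3` stands in for the RDBMS, and the table layout and function names are assumptions made for the example.

```python
import os
import pickle
import sqlite3
import tempfile


def write_objects(path, objects):
    """Stream complex objects to a file exactly once (write-once model)."""
    with open(path, "xb") as f:   # mode "x" refuses to overwrite an existing file
        pickle.dump(objects, f)


def register_file(db, path, n_objects):
    """Record simple metadata inside a transaction: it commits or rolls back as a unit."""
    with db:                      # a sqlite3 connection used as a context manager wraps a transaction
        db.execute("INSERT INTO files(path, n_objects) VALUES (?, ?)",
                   (path, n_objects))


workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "events.pkl")
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files(path TEXT PRIMARY KEY, n_objects INTEGER)")

events = [{"event": i, "tracks": [i, i + 1]} for i in range(3)]
write_objects(data_path, events)
register_file(db, data_path, len(events))

# Reading back: consult the metadata store first, then stream the objects in.
(path, n_objects), = db.execute("SELECT path, n_objects FROM files").fetchall()
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The bulk data never needs locking or updates once written, while the small amount of bookkeeping data gets full transactional protection.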



• Initiated in April 2002

• Ramping up over the last year from 1.5 FTE to ~10 FTE



• Common effort between LHC experiments and the CERN Database group

• Covers project scope, architecture and development
• => Rapid feedback cycles between the project and its users



• First larger data productions starting now!










Component Architecture

• POOL (as most other LCG software) is based on a strict component

software approach

• Components provide technology neutral APIs

• Communicate with other components only via abstract component

interfaces

• Goal: Insulate the very large experiment software

systems from concrete implementation details and

technologies used today



• POOL user code is not dependent on any implementation libraries
• No link-time dependency on any implementation packages (e.g. MySQL, ROOT, Xerces-C, ...)
• Component implementations are loaded at runtime via a plug-in infrastructure

• The POOL framework consists of three major, weakly coupled domains


POOL Components





[Figure: the POOL API sits on top of three domains, each with several implementations]

• Storage Service: ROOT I/O Storage Svc, RDBMS Storage Svc
• FileCatalog: XML Catalog, MySQL Catalog, EDG Replica Location Service
• Collections: Explicit Collection, Implicit Collection










POOL Generic Storage Hierarchy

• An application may access databases (e.g. streaming files) from one or more file catalogs
• Each database is structured into containers of one specific technology (e.g. ROOT trees or RDBMS tables)

• POOL provides a “smart pointer” type, pool::Ref
• to transparently load objects from the back end into a client-side cache
• to define persistent inter-object associations across file or technology boundaries

[Figure: storage hierarchy, from top to bottom: POOL Context, FileCatalog, Database, Container, Object]
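The idea behind such a smart pointer can be shown with a small Python sketch. It is not the pool::Ref API: the `Token`, `Backend` and `Ref` classes and their methods are assumptions for illustration, with a token naming an object by database, container and position in line with the hierarchy above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Persistent address of an object: which database, which container, which entry."""
    database: str
    container: str
    index: int


class Backend:
    """Stand-in for the storage back end; counts reads so caching is visible."""

    def __init__(self, databases):
        self._databases = databases
        self.reads = 0

    def read(self, token):
        self.reads += 1
        return self._databases[token.database][token.container][token.index]


class Ref:
    """Loads the referenced object on first access and serves it from the cache afterwards."""

    def __init__(self, token, backend, cache):
        self._token = token
        self._backend = backend
        self._cache = cache

    def get(self):
        if self._token not in self._cache:
            self._cache[self._token] = self._backend.read(self._token)
        return self._cache[self._token]


# A track stored in one database refers to a hit stored in another one.
backend = Backend({
    "tracks.db": {"Tracks": [{"pt": 10.0, "hit": Token("hits.db", "Hits", 0)}]},
    "hits.db": {"Hits": [{"energy": 1.5}]},
})
cache = {}

track = Ref(Token("tracks.db", "Tracks", 0), backend, cache).get()
hit = Ref(track["hit"], backend, cache).get()        # crosses the file boundary
hit_again = Ref(track["hit"], backend, cache).get()  # served from the client-side cache
```

Client code only follows references; where the target lives and whether it is already in memory stay hidden behind `get()`.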






Data Dictionary & Storage

[Figure: dictionary generation turns C++ headers and abstract DDL into LCG dictionary code via GCC-XML and a code generator. On the I/O side, the LCG dictionary serves reflection and other clients, and a gateway connects it to the CINT dictionary used by the technology-dependent data I/O.]

POOL File Catalog

• Files are referred to inside POOL via a unique and immutable file identifier which is system generated at file creation time
• This provides stable inter-file references

• FileIDs are implemented as Globally Unique Identifiers (GUIDs)
• This allows consistent sets of files with internal references to be created
- without requiring a central ID allocation service
• Catalog fragments created independently can later be merged without modifying the corresponding data files
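The following sketch shows why GUID-based file identifiers make independent catalog fragments easy to merge. It is illustrative only: `uuid.uuid4` stands in for the system's GUID generation, and the `CatalogFragment` class and its methods are hypothetical, not the POOL file catalog interface.

```python
import uuid


class CatalogFragment:
    """Maps logical file names (LFNs) to a FileID and FileIDs to (PFN, technology) entries."""

    def __init__(self):
        self.lfns = {}       # LFN -> FileID
        self.replicas = {}   # FileID -> list of (PFN, technology)

    def register_file(self, pfn, technology, lfn=None):
        # The identifier is generated locally; no central allocation service is contacted.
        file_id = str(uuid.uuid4())
        self.replicas[file_id] = [(pfn, technology)]
        if lfn is not None:
            self.lfns[lfn] = file_id
        return file_id

    def lookup(self, lfn):
        return self.replicas[self.lfns[lfn]]

    def merge(self, other):
        """Fold another fragment into this one; the data files themselves are untouched."""
        for file_id, entries in other.replicas.items():
            known = self.replicas.setdefault(file_id, [])
            for entry in entries:
                if entry not in known:
                    known.append(entry)
        self.lfns.update(other.lfns)


# Two sites build their fragments independently and merge them later.
site_a, site_b = CatalogFragment(), CatalogFragment()
id_a = site_a.register_file("/a/run1.root", "ROOT", lfn="run1")
id_b = site_b.register_file("/b/run2.root", "ROOT", lfn="run2")
site_a.merge(site_b)
```

Since references between files are stored as FileIDs rather than paths, they stay valid after the merge and after files are moved or replicated.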





[Figure: logical naming maps LFN1 ... LFNn to a FileID, which carries file identity and metadata; object lookup maps the FileID to (PFN1, technology) ... (PFNn, technology)]




EDG Replica Location Services - Basic Functionality

• Each file has a unique GUID. Locations corresponding to the GUID are kept in the Replica Location Service.
• Users may assign aliases to the GUIDs. These are kept in the Replica Metadata Catalog.
• Files have replicas stored at many Grid sites on Storage Elements.
• The Replica Manager provides atomicity for file operations, assuring consistency of SE and catalog contents.

[Figure (james.casey@cern.ch): the Replica Manager in front of the Replica Metadata Catalog, the Replica Location Service and two Storage Elements]
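One way to picture the consistency guarantee is a manager that registers a replica only after the copy has succeeded, and undoes the copy if registration fails. The sketch below is an assumption-laden toy, not the EDG Replica Manager: all class and method names are invented, and storage elements and the catalog are plain in-memory objects.

```python
class StorageElement:
    """Toy storage element; can be told to fail in order to simulate an error."""

    def __init__(self, name, fail_on_store=False):
        self.name = name
        self.files = {}
        self.fail_on_store = fail_on_store

    def store(self, guid, data):
        if self.fail_on_store:
            raise OSError(f"{self.name}: store failed")
        self.files[guid] = data

    def delete(self, guid):
        self.files.pop(guid, None)


class ReplicaLocationService:
    """Toy catalog: GUID -> names of the storage elements holding a replica."""

    def __init__(self):
        self.locations = {}

    def add(self, guid, se_name):
        self.locations.setdefault(guid, set()).add(se_name)


class ReplicaManager:
    def __init__(self, rls):
        self.rls = rls

    def replicate(self, guid, data, target):
        """Copy first, then register; remove the copy again if registration fails."""
        target.store(guid, data)   # if this raises, the catalog is never touched
        try:
            self.rls.add(guid, target.name)
        except Exception:
            target.delete(guid)    # roll back so SE and catalog contents still agree
            raise


rls = ReplicaLocationService()
manager = ReplicaManager(rls)
cern = StorageElement("cern")
broken = StorageElement("broken", fail_on_store=True)

manager.replicate("guid-1", b"events", cern)
try:
    manager.replicate("guid-1", b"events", broken)
except OSError:
    pass   # the failed copy leaves no trace in the catalog
```

After both calls the catalog lists only the replica that actually exists, which is the property the slide describes.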




Interactions with other Grid Middleware Components

Applications and users interface to data through the Replica Manager, either directly or through the Resource Broker.

[Figure (james.casey@cern.ch): the Replica Manager, used from a User Interface or Worker Node and by the Resource Broker, shown together with the Virtual Organization Membership Service, Information Service, Replica Metadata Catalog, Replica Location Service, Replica Optimization Service, Storage Elements, SE Monitor and Network Monitor]


RLS Service Goals



• To offer production quality services for LCG 1 to meet the

requirements of forthcoming (and current!) data challenges

• e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC’04



• To provide distribution kits, scripts and documentation to assist

other sites in offering production services



• To leverage the many years’ experience in running such services

at CERN and other institutes

• Monitoring, backup & recovery, tuning, capacity planning, …



• To understand the experiments’ requirements for how these services should be established and extended, and to clarify current limitations



• Not targeting small-medium scale DB apps that need to be run

and administered locally (to user)




Conclusions

• Data Management at LHC remains a significant challenge because of

data volume, project lifetime, complexity of S/W and H/W setups.



• The LHC Computing Grid (LCG) approach is based on e.g. the EDG and GLOBUS middleware projects and uses a strict component approach for physics application software



• The LCG-POOL project has developed a technology neutral

persistency framework which is currently being integrated into the

experiment production systems



• In conjunction with POOL a data catalog production service is provided to support several upcoming data productions in the range of hundreds of terabytes








LHC Software Challenges

• Experiment software systems are large and complex

• Developed by teams of expert developers

• Permanent evolution and improvement for years…



• Analysis is performed by many end user developers

• Often participating only for short time

• Usually without strong computer science background

• Need simple and stable software environment



• Need to manage change over a long project lifetime

• Migration to new software, implementation languages

• New computing platforms, storage media

• New computing paradigms ???



• The data management system needs to be designed so as to confine the impact of unavoidable change during the project







