POOL LHC data Persistency

POOL Project Overview
Dirk Düllmann

CERN Openlab storage workshop
17th March 2003
What is POOL?
• POOL is the LCG Persistency Framework
  – Pool of persistent objects for LHC
• Started by LCG-SC2 in April ’02
  – Common effort in which the experiments take
    over a major share of the responsibility
     • for defining the system architecture
     • for development of POOL components
  – ramping up over the last year from 1.5 to
POOL and the LCG Architecture
• POOL is a component based system
   – A technology neutral API
       • Abstract C++ interfaces
   – Implemented reusing existing technology
       • ROOT I/O for object streaming
           – complex data, simple consistency model (write once)
       • RDBMS for consistent meta data handling
           – simple data, transactional consistency

• POOL does not replace any of its components
   – It integrates them to provide higher level services
   – Insulates physics applications from implementation details of
     components and technologies used today
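
To illustrate the technology neutral API idea on this slide, here is a minimal C++ sketch. All names (IObjectStore, WriteOnceStore, UpdatableStore, roundTrip) are hypothetical and are not part of the POOL API. The two in-memory backends only stand in for a write-once streaming layer and an updatable meta data store; they do not model real ROOT I/O or RDBMS transactions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical abstract interface: the only thing client code depends on.
class IObjectStore {
public:
    virtual ~IObjectStore() = default;
    virtual void put(const std::string& key, const std::string& payload) = 0;
    virtual std::string get(const std::string& key) const = 0;
};

// Stand-in for a streaming backend with a write-once consistency model.
class WriteOnceStore : public IObjectStore {
public:
    void put(const std::string& key, const std::string& payload) override {
        assert(!key.empty());
        data_.emplace(key, payload);  // emplace leaves an existing entry untouched
    }
    std::string get(const std::string& key) const override {
        return data_.at(key);  // throws std::out_of_range for unknown keys
    }
private:
    std::map<std::string, std::string> data_;
};

// Stand-in for a meta data backend where entries may be updated.
class UpdatableStore : public IObjectStore {
public:
    void put(const std::string& key, const std::string& payload) override {
        assert(!key.empty());
        data_[key] = payload;  // operator[] overwrites an existing entry
    }
    std::string get(const std::string& key) const override {
        return data_.at(key);
    }
private:
    std::map<std::string, std::string> data_;
};

// Client code written against the interface; it never names a backend.
std::string roundTrip(IObjectStore& store, const std::string& key,
                      const std::string& payload) {
    store.put(key, payload);
    return store.get(key);
}
```

Because roundTrip only sees IObjectStore, exchanging one backend for the other requires no change in the client code, which is the insulation the slide describes.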
POOL as an LCG component

• Persistency is just one of several projects in the LCG
  Applications Area
   – Sharing a common architecture and s/w process
       • as described in the Blueprint and Persistency RTAG documents
   – Persistency is important…
       • …but not important enough to allow for uncontrolled direct
         dependencies eg of experiment code on its implementation
• Common effort in which the experiments take over a
  major share of the responsibility
   – for defining the overall and detailed architecture
   – for development of POOL components
LCG Blueprint Software Decomposition

[Diagram: LCG Blueprint software decomposition. Recoverable box labels:
 EvtGen, Engine, Algorithms, Scripting, NTuple, Reconstruction, Analysis,
 Services, Event Generation, Detector Simulation, Geometry, Event Model,
 Calibration, StoreMgr, Scheduler, Dictionary, PluginMgr, Whiteboard,
 Monitor, Grid Services, Persistency, Core Services, Foundation and
 Utility Libraries. Bottom row: ROOT, GEANT4, FLUKA, MySQL, DataGrid,
 Python, Qt, ...]
POOL Work Package breakdown
• Based on outcome of SC2 persistency RTAG

• File Catalog
    –   keep track of files (and their physical and logical names) and their description
    –   resolve a logical file reference (FileID) into a physical file
    –   pool::IFileCatalog

• Collections
    –   keep track of (large) object collections and their description
    –   pool::Collection<T>

• Storage Service
    –   stream transient C++ objects into/from storage
    –   resolve a logical object reference into a physical object

• Object Cache (DataService)
    –   keep track of already read objects to speed up repeated access to the same data
    –   pool::IDataSvc and pool::Ref<T>
POOL Internal Organisation

                                                 POOL API

               Storage Service                  FileCatalog                Collections

  ROOT I/O                        XML                          Explicit
 Storage Svc                     Catalog                      Collection

   RDBMS                         MySQL                         Implicit
Storage Svc ?                    Catalog                      Collection

                              EDG Replica
                             Location Service
POOL and the GRID
• GRID mostly deals with data at file level
  – File Catalog connects POOL to Grid Resources
     • eg via our EDG-RLS backend
  – POOL Storage Service deals with intra file structure
     • need connection via standard Grid File access
• Both File and Object based Collections are seen
  as important End User concepts
  – POOL offers a consistent interface to both types
• Need to understand to what extent these can
  be provided in a Grid environment
How does POOL fit into the environment?

• POOL will be mainly used from experiment frameworks
    –   mostly as client library loaded from user application

• Production Manager
    –   Creates and maintains shared file catalogs and (event) collections
    –   eg add the catalog fragment for the new simulation data to the
        published analysis catalog

• End User
    –   Uses shared collections
    –   eg iterate over collection X

[Diagram: POOL client on a CPU Node (User Application, Experiment
 Framework, POOL), connected to Exp. DB Services (Book Keeping,
 Production Workflow), RDBMS Services (Collection Description?,
 Collection Location?, Collection Access) and Grid (File) Services
 (File Description, Replica Location, Remote File I/O?)]
POOL File Catalog
[Diagram: Logical Filenames 1..n on the left (Logical Naming) mapped to
 Physical Filenames 1..m on the right (Object Lookup)]

• POOL uses a GUID implementation for FileID
   – unique and immutable identifier for a file (generated at create time)
   – allows sets of files with internal references to be produced
     without requiring a central ID allocation service
   – catalog fragments created independently can later be merged
     without modification to data files.
• Object lookup is based only on right side box!
   – Logical filenames are supported but not required
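
The catalog behaviour described above (FileIDs generated locally, lookup by FileID only, fragments merged later) can be sketched as follows. This is an illustration under stated assumptions, not the pool::IFileCatalog interface: CatalogFragment, makeFileID and all member names are hypothetical, and makeFileID only imitates a GUID with 128 random bits where a real system would use a proper GUID library.

```cpp
#include <cassert>
#include <iomanip>
#include <map>
#include <random>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

using FileID = std::string;

// Stand-in for a GUID generator: 128 random bits rendered as 32 hex digits.
// No central allocation service is involved.
FileID makeFileID() {
    static std::mt19937_64 gen(std::random_device{}());
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (int i = 0; i < 2; ++i) out << std::setw(16) << gen();
    return out.str();
}

// Hypothetical catalog fragment.
struct CatalogFragment {
    std::map<FileID, std::set<std::string>> replicas;  // FileID -> physical filenames
    std::map<std::string, FileID> logicalNames;        // optional: logical filename -> FileID

    // Register a new file; the logical filename may be omitted.
    FileID registerFile(const std::string& pfn, const std::string& lfn = "") {
        assert(!pfn.empty());
        FileID id = makeFileID();
        replicas[id].insert(pfn);
        if (!lfn.empty()) logicalNames[lfn] = id;
        return id;
    }

    // Object lookup needs only the FileID -> physical filename mapping.
    std::string lookupPFN(const FileID& id) const {
        auto it = replicas.find(id);
        if (it == replicas.end() || it->second.empty())
            throw std::out_of_range("unknown FileID: " + id);
        return *it->second.begin();
    }

    // Merge a fragment created elsewhere; existing entries are kept as they are.
    void merge(const CatalogFragment& other) {
        for (const auto& kv : other.replicas)
            replicas[kv.first].insert(kv.second.begin(), kv.second.end());
        for (const auto& kv : other.logicalNames)
            logicalNames.emplace(kv.first, kv.second);
    }
};
```

Since each FileID is generated independently and is assumed never to collide, merge is a plain union of entries and the data files themselves are not touched.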
Use Case: Working in Isolation
 • The user extracts a set of interesting files and a catalog fragment
    describing them from a (central) grid based catalog into a local
    (eg XML based) catalog.
     – Selection is performed based on file or collection descriptions
 • After disconnecting from the grid the user executes some standard
    jobs navigating through the extracted data.
     – New output files are registered into the local catalog
 • Once the new data is ready for publishing and the user is connected
    the new catalog fragment is submitted to the grid based catalog.

[Diagram: Grid File Storage with File Catalog & Descr → Extraction →
 Local File Catalog and Local Files → Local Processing → New Files and
 New Catalog & Descr → Result Publishing back to the grid]
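
The extract, work offline, publish cycle of this use case can be sketched with three small functions. The types and names below (Entry, Catalog, extract, newEntries, publish) are hypothetical, and the catalogs are plain in-memory maps rather than a grid or XML catalog; FileIDs are assumed to be GUIDs assigned when each file was created.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical catalog entry: physical location plus a free-form description.
struct Entry {
    std::string pfn;
    std::string description;
};

// FileID -> entry.
using Catalog = std::map<std::string, Entry>;

// Step 1: copy the entries selected by their description into a local fragment.
Catalog extract(const Catalog& central,
                const std::function<bool(const Entry&)>& select) {
    Catalog local;
    for (const auto& kv : central)
        if (select(kv.second)) local.insert(kv);
    return local;
}

// Step 2 happens offline: new output files are added to the local catalog.

// Step 3a: the fragment to publish holds only entries the central catalog lacks.
Catalog newEntries(const Catalog& local, const Catalog& central) {
    Catalog fragment;
    for (const auto& kv : local)
        if (central.count(kv.first) == 0) fragment.insert(kv);
    return fragment;
}

// Step 3b: publishing adds the fragment; map::insert never overwrites
// an entry that already exists in the central catalog.
void publish(Catalog& central, const Catalog& fragment) {
    assert(&central != &fragment);
    central.insert(fragment.begin(), fragment.end());
}
```

A typical call sequence would be extract(...) while connected, local registration while disconnected, then publish(grid, newEntries(local, grid)) once the connection is back.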
Use Case: Farm Production
• Production manager may pre-register output files with the catalog
    (eg a “local” MySQL or XML catalog)
     – File ID, physical filename, job ID and optionally also
       logical filenames
• A production job runs and creates files and their catalog entries
    locally.
• During the production the catalog can be used to clean up files (and
    their registration) from unsuccessful jobs based on their
    associated job ID.
• Once the data quality checks have been passed the production manager
    decides to publish the production catalog fragment to the grid
    based catalog.

[Diagram: Production Nodes 1..n, each with a Local File Catalog and
 Local Files → Post Processing → New Files and New Catalog & Descr →
 Result Publishing → File Catalog & Descr, Grid File Storage and
 Grid Catalog]
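
The job ID based cleanup in this use case can be sketched as below. ProductionCatalog, ProductionEntry and their members are hypothetical names for illustration, not POOL classes, and the catalog is an in-memory map instead of a MySQL or XML catalog. The sketch removes only the registrations and returns the physical filenames so that the caller can delete the files.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical pre-registered entry: physical filename, job ID,
// and an optional logical filename (empty if unused).
struct ProductionEntry {
    std::string pfn;
    std::string jobId;
    std::string lfn;
};

class ProductionCatalog {
public:
    // Pre-register an output file under its FileID.
    void preRegister(const std::string& fileId, const ProductionEntry& entry) {
        assert(!fileId.empty());
        entries_[fileId] = entry;
    }

    // Drop every registration made by the given job and return the
    // physical filenames that should be deleted along with them.
    std::vector<std::string> cleanupJob(const std::string& jobId) {
        std::vector<std::string> stale;
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second.jobId == jobId) {
                stale.push_back(it->second.pfn);
                it = entries_.erase(it);  // erase returns the next valid iterator
            } else {
                ++it;
            }
        }
        return stale;
    }

    bool contains(const std::string& fileId) const {
        return entries_.count(fileId) != 0;
    }
    std::size_t size() const { return entries_.size(); }

private:
    std::map<std::string, ProductionEntry> entries_;  // FileID -> entry
};
```

Whatever remains in the catalog after cleanup is the fragment a production manager could publish once the quality checks have passed.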
POOL Storage Hierarchy
• An application may access databases (eg ROOT files)
  from a set of catalogs
• Each database has containers of one specific technology
  (eg ROOT trees)
• Smart Pointers are used
   – to transparently load objects into a client side cache
   – to define object associations across file or technology
     boundaries

[Diagram: storage hierarchy, top to bottom: POOL Context, FileCatalog,
 Database, Container, Object]
Client Data Access

[Diagram: a Client accessing data through Ref<T> smart pointers, each
 attached to a Data Service with an associated Data Cache]
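
The access path through a smart pointer, a data service and its cache can be sketched as follows. This is not the implementation of pool::Ref<T> or pool::IDataSvc: the class templates DataService and Ref, the string token and the Track type are simplified, hypothetical stand-ins, and the loader callback replaces the real storage service.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical data service: loads objects on demand and caches them by token.
template <typename T>
class DataService {
public:
    using Loader = std::function<std::shared_ptr<T>(const std::string&)>;
    explicit DataService(Loader loader) : loader_(std::move(loader)) {}

    std::shared_ptr<T> get(const std::string& token) {
        auto it = cache_.find(token);
        if (it != cache_.end()) return it->second;  // cache hit: no I/O
        std::shared_ptr<T> obj = loader_(token);    // cache miss: read from storage
        ++loads_;
        cache_[token] = obj;
        return obj;
    }
    int loads() const { return loads_; }  // number of cache misses so far

private:
    Loader loader_;
    std::map<std::string, std::shared_ptr<T>> cache_;
    int loads_ = 0;
};

// Hypothetical smart pointer: stores a token and resolves it on first use.
template <typename T>
class Ref {
public:
    Ref(DataService<T>& svc, std::string token)
        : svc_(&svc), token_(std::move(token)) {}
    T& operator*() { return *load(); }
    T* operator->() { return load().get(); }

private:
    const std::shared_ptr<T>& load() {
        if (!obj_) obj_ = svc_->get(token_);
        assert(obj_ && "loader returned no object for this token");
        return obj_;
    }
    DataService<T>* svc_;
    std::string token_;
    std::shared_ptr<T> obj_;
};

// Example payload type, purely illustrative.
struct Track {
    double pt;
};
```

Constructing a Ref<Track> costs no I/O; the loader runs only on the first dereference, and a second Ref with the same token is served from the cache.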
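[Diagram: dictionary generation. Header files (.h) are processed by
 ROOTCINT into CINT dictionary code, or by GCC-XML into .xml from which
 a Code Generator produces LCG dictionary code. Further labels:
 Dictionary, Data I/O, Reflection, Other Clients]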
Project Status & Plans
• First four POOL releases delivered planned functionality
  on time
   – Aggressive schedule so far focusing on adding functionality
   – no consistent attempt at performance optimisation yet
• Functionally complete (LCG-1 feature set) POOL V1.0
  release scheduled for April
   – several functional extensions compared to V0.4
   – automated system tests are being
• Bug fix and performance release POOL V1.1 in June
   – Aim to be ready for first deployment together with LCG-1
   – Will release
• Work on proof of concept storage service re-implementation
  based on an RDBMS back end starting
• The LCG POOL project provides a hybrid store integrating object
   streaming (eg ROOT I/O) with RDBMS technology (eg MySQL) for
   consistent meta data handling
    – Strong emphasis on component decoupling and well defined
    – Transparent cross-file and cross-technology object navigation via C++
      smart pointers
    – Integration with Grid technology (via EDG-RLS)
        • but preserving networked and grid-decoupled working modes

• Next two releases (V1.0-functionality and V1.1-reliability &
   performance) will be crucial for POOL acceptance
    – Need tight coupling with experiment development and production
      teams to validate the feature set
    – Assume tight integration with LCG deployment activities
How to find out more about POOL?

• POOL Home Page
• POOL savannah portal
