
ATLAS and the Grid

ACAT02, Moscow, June 2002
RWL Jones, Lancaster University

                 The ATLAS Computing Challenge

   Running conditions at startup:

   •  0.8×10^9 event sample → 1.3 PB/year, before data processing
   •  "Reconstructed" events and Monte Carlo data → ~10 PB/year (~3 PB on disk)
   •  CPU: ~1.6M SpecInt95, including analysis
      -  CERN alone can handle only a fraction of these resources
The Solution: The Grid

[Slide callouts:]
   •  Note: truly HPC, but requires more
   •  Not designed for tightly coupled problems, but many spin-offs
              ATLAS Needs Grid Applications

•  The ATLAS OO software framework is Athena, which co-evolves with the LHCb Gaudi framework
•  ATLAS is truly intercontinental
•  In particular, it is present on both sides of the Atlantic
   -  Opportunity: the practical convergence between US and European Grid projects will come through the transatlantic applications
   -  Threat: there is an inevitable tendency towards fragmentation/divergence of effort, which must be resisted
•  Other relevant talks:
   -  Nick Brook: co-development with LHCb, especially through the UK GridPP collaboration (or rather, I'll present this later)
   -  Alexandre Vaniachine, describing work for the ATLAS Data Challenges
              Data Challenges - Test Bench Challenges

•  Prototype I (May 2002)
   Performance and scalability testing of components of the computing fabric (clusters, disk storage, mass storage system, system installation, system monitoring) using straightforward physics applications. Test job scheduling and data replication software (DataGrid release 1.2).
•  Prototype II (Mar 2003)
   Prototyping of the integrated local computing fabric, with emphasis on scaling, reliability and resilience to errors. Performance testing of LHC applications. Distributed application models (DataGrid release 2).
•  Prototype III (Mar 2004)
   Full-scale testing of the LHC computing model with fabric management and Grid management software for Tier-0 and Tier-1 centres, with some Tier-2 components (DataGrid release 3).
                     The Hierarchical View

[Tier diagram. Scale: 1 TIPS = 25,000 SpecInt95; a PC (1999) = ~15 SpecInt95.
 •  One bunch crossing per 25 ns, 100 triggers per second, each event ~1 MByte
 •  Detector → Online System at ~PBytes/sec; Online System → Offline Farm (~20 TIPS) at ~100 MBytes/sec
 •  Tier 0: CERN Computer Centre (>20 TIPS), fed from the Offline Farm at ~100 MBytes/sec
 •  Tier 1: Regional Centres (US, Italy, France, UK (RAL)), connected at ~Gbits/sec or by air freight
 •  Tier 2: centres of ~1 TIPS each (e.g. a Northern Tier)
 •  Tier 3: institute servers at ~Gbits/sec (Lancaster, Liverpool, Manchester, Sheffield; ~0.25 TIPS), each holding a physics data cache
 •  Tier 4: workstations, connected at 100-1000 Mbits/sec]

   Physicists work on analysis "channels"
   Each institute has ~10 physicists working on one or more channels
   Data for these channels should be cached by the institute server
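As a quick cross-check of the numbers in the diagram: 100 triggers per second at ~1 MByte per event gives the quoted ~100 MBytes/sec into the offline chain. The yearly raw volume below assumes ~10^7 effective seconds of running per year, which is a conventional figure and an assumption not stated on the slide.

# Cross-check of the rates in the tier diagram.
trigger_rate_hz = 100          # events per second after the trigger
event_size_mb = 1.0            # ~1 MByte per event

rate_mb_per_s = trigger_rate_hz * event_size_mb               # = 100 MB/s
seconds_per_year = 1e7                                        # assumed effective running time
volume_pb_per_year = rate_mb_per_s * seconds_per_year / 1e9   # MB -> PB

print(f"{rate_mb_per_s:.0f} MB/s, ~{volume_pb_per_year:.1f} PB/year raw")

This lands in the same ballpark as the 1.3 PB/year quoted on the earlier slide.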
                     A More Grid-like Model

[Cloud diagram: instead of a strict hierarchy, the LHC Computing Facility at CERN sits in a cloud of regional facilities and labs: Brookhaven and FermiLab (USA), Lancs (UK), plus centres in France, Italy, NL, Germany and elsewhere (Lab a, Lab b, Lab c, Lab m, Uni b, Uni n, Uni x, Uni y, ...), with Tier-2 centres, physics departments and desktops attached.]
                   Features of the Cloud Model

•  All regional facilities have 1/3 of the full reconstructed data
•  Allows more on-disk/fast-access space, saves tape
•  Multiple copies mean no need for tape backup
•  All regional facilities have all of the analysis data (AOD)
•  Resource broker can still keep jobs fairly local (a toy sketch follows this list)
•  Centres are Regional and NOT National
   -  Physicists from other regions should also have access to the computing resources
   -  Cost sharing is an issue
•  Implications for the Grid middleware on accounting
   -  Between experiments
   -  Between regions
   -  Between analysis groups
•  Also, different activities will require different priorities
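A toy illustration of the brokering and accounting ideas above: send a job to a regional facility that already holds the requested data, and record usage per site and per analysis group for later cost sharing. All site names, dataset names and numbers are invented for illustration; this is not any real resource broker.

# Toy sketch: data-aware brokering in the cloud model plus per-group accounting.
from collections import defaultdict

SITES = {
    "RAL":  {"datasets": {"ESD_1of3", "AOD"}, "free_cpus": 120},
    "Lyon": {"datasets": {"ESD_2of3", "AOD"}, "free_cpus": 300},
    "BNL":  {"datasets": {"ESD_3of3", "AOD"}, "free_cpus": 80},
}
usage = defaultdict(float)   # (site, group) -> CPU-hours charged

def submit(dataset, group, cpu_hours):
    """Choose the least-loaded site that holds the dataset and charge the group."""
    candidates = [s for s, info in SITES.items() if dataset in info["datasets"]]
    if not candidates:
        raise LookupError(f"no site holds {dataset}")
    site = max(candidates, key=lambda s: SITES[s]["free_cpus"])
    usage[(site, group)] += cpu_hours
    return site

print(submit("AOD", "higgs-group", 50))       # every regional facility holds the AOD
print(submit("ESD_2of3", "top-group", 200))   # only one facility holds this ESD third
print(dict(usage))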
                       Resource Estimates

•  Analysis resources? (a back-of-envelope check follows this list)
   -  20 analysis groups
   -  20 jobs/group/day = 400 jobs/day
   -  sample size: 10^8 events
   -  2.5 SI95·s/event ⇒ 10^11 SI95·s/day ≈ 1.2×10^6 SI95
   -  additional 20% for activities on smaller samples
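A minimal check of the arithmetic above, assuming (as the slide implicitly does) that each of the 400 daily jobs runs over the full 10^8-event sample:

jobs_per_day = 20 * 20                 # 20 groups x 20 jobs/group/day = 400
events_per_job = 1e8                   # sample size
cost_per_event = 2.5                   # SI95*s per event
seconds_per_day = 86400.0

si95_seconds_per_day = jobs_per_day * events_per_job * cost_per_event   # = 1e11
sustained_si95 = si95_seconds_per_day / seconds_per_day                 # ~ 1.2e6

print(f"{si95_seconds_per_day:.1e} SI95*s/day -> {sustained_si95:.2e} SI95 sustained")

Adding the extra 20% for work on smaller samples brings this to roughly 1.4×10^6 SI95.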
                      Rough Architecture

[Block diagram: the user works through a user interface to the Grid plus the experiment framework; the middleware layer (resource broker, Grid information service) dispatches to compute + store sites, where installation of software and environment takes place; a data catalogue and a job configuration/VDC/metadata service sit alongside.]
                                 Test Beds

•  EDG Test Bed 1
   -  Common to all LHC experiments
   -  Using/testing EDG test bed 1 release code
   -  Already running boxed fast simulation and installed full simulation
•  US ATLAS Test Bed
   -  Demonstrate the success of the grid computing model for HEP
      in data production, data access and data analysis
   -  Develop & deploy grid middleware and applications
      wrap layers around apps, simplify deployment
   -  Evolve into a fully functioning, scalable, distributed, tiered grid
•  NorduGrid
   -  Developing a regional test bed
   -  Light-weight Grid user interface, working prototypes, etc.
   -  See the talk by Aleksandr Konstantinov
                               EDG Release 1.2

•  EDG has a strong emphasis on middleware development; applications come second
•  ATLAS has been testing the 'stable' releases of the EDG software as they become available, as part of WP8 (ATLAS key contact: Silvia Resconi)
•  EDG Release 1.2 is under test by the Integration Team plus the Loose Cannons (experiment-independent people) on the development testbed at CERN
•  Standard requirements must be met before the ATLAS Applications people test a release:
   1. The development testbed "must" consist of at least 3 sites in 3 different countries (e.g. CERN, CNAF, RAL)
   2. There "must" be a long (> 24 hours) unattended period with a low error rate (< 1% of jobs failed)

http://pcatl0a.mi.infn.it/~resconi/validation/valid.html
EDG TestBed 1 Status (28 May 2002, 17:03)

[Screenshot: web interface showing the status of ~400 servers at the testbed 1 sites; 5 main production centres.]
GridPP Sites in Testbed(s)

[Map of the GridPP sites participating in the testbed(s) not reproduced.]
                           NorduGrid Overview

•  Launched in spring 2001, with the aim of creating a Grid infrastructure in the Nordic countries
•  Partners from Denmark, Norway, Sweden, and Finland
•  Initially the Nordic branch of the EU DataGrid (EDG) project testbed
•  Independent developments
•  Relies on funding from NorduNet2

http://www.nordugrid.org
                      US Grid Test Bed Sites

[Map of testbed sites: Lawrence Berkeley National Laboratory, Argonne National Laboratory, Brookhaven National Laboratory, U Michigan, Boston University, Indiana University, Oklahoma University, University of Texas at Arlington.]

•  US-ATLAS testbed launched February 2001
                  US Hardware and Deployment

•  8 gatekeepers: ANL, BNL, LBNL, BU, IU, UM, OU, UTA
•  Farms: BNL, LBNL, IU, UTA + multiple R&D gatekeepers
•  Uniform OS through kickstart
   -  Running RH 7.2
•  First stage deployment
   -  Pacman, Globus 2.0b, cernlib (installations)
   -  Simple application package
•  Second stage deployment
   -  Magda, Chimera, GDMP... (Grid data management)
•  Third stage
   -  MC production software + VDC

Many US names are mentioned later; thanks also to Craig Tull, Dan Engh, Mark Sosebee
                    Important Components

•  GridView - simple script tool to monitor the status of the test bed (Java version being developed)
•  Gripe - unified user accounts
•  Magda - MAnager for Grid Data
•  Pacman - package management and distribution tool
•  Grappa - web portal based on active notebook technology
                      Grid User Interface

•  Several prototype interfaces
   -  GRAPPA
   -  EDG
   -  NorduGrid (lightweight)
•  Nothing experiment-specific
   -  GRAT: line mode (and we will always need to retain line mode!)
•  Now defining an ATLAS/LHCb joint user interface, GANGA
   -  Co-evolution with Grappa
   -  Knowledge of the experiment OO architecture needed (Athena/Gaudi)
   Interfacing Athena/Gaudi to the GRID

[Diagram: a GUI (GANGA/Grappa) sits on top of the Athena/GAUDI application. On the input side it handles jobOptions/virtual data and algorithms; on the output side, histograms, monitoring and results. Both the GUI and the application connect to the GRID services, with the exact interfaces still open questions ("?").]
                                  GRAPPA

•  Based on the XCAT Science Portal, a framework for building personal science portals
•  A science portal is an application-specific Grid portal
•  Active notebook (caricatured in the sketch after this list)
   -  HTML pages to describe the features of the notebook and how to use it
   -  HTML forms which can be used to launch parameterizable scripts (transformation)
   -  Parameters stored in a sub-notebook (derivation)
•  Very flexible
•  Jython - access to Java classes
   -  Globus Java CoG kit
   -  XCAT
   -  XMESSAGES
•  Not every user has to write scripts
•  Notebooks can be shared among users
   -  Import/export capability

Shava Smallen, Rob Gardner
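The active-notebook idea above (a "transformation" is a parameterizable script, a "derivation" is a stored set of parameter values) can be caricatured in a few lines of Python. The function, parameter names and files below are invented for illustration and are not the real GRAPPA notebook machinery.

# Caricature of the active-notebook idea: a parameterizable script plus a
# stored, shareable set of parameter values.
import json

def atlfast_transformation(n_events, job_options, output):
    """Stand-in for the parameterizable script launched from the HTML form."""
    print(f"would run Atlfast on {n_events} events with {job_options} -> {output}")

# Parameters captured from the web form, stored as a "sub-notebook" (derivation).
derivation = {"n_events": 10000,
              "job_options": "AtlfastOptions.txt",
              "output": "atlfast.ntuple"}
with open("derivation.json", "w") as f:
    json.dump(derivation, f)

# Anyone the notebook is shared with can replay the same derivation later.
with open("derivation.json") as f:
    atlfast_transformation(**json.load(f))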
            GRAPPA/XCAT Science Portal Architecture

[Diagram: the user's web browser talks to the portal web server (Tomcat + Java servlets), which combines GSI authentication, a Jython interpreter and a notebook database, and submits to the Grid.]

The prototype can:
•  Submit Athena jobs to Grid computing elements
•  Manage JobOptions, record sessions
•  Stage inputs and collect outputs
•  It has been tested on the US ATLAS Grid Testbed
                 GANGA/Grappa Development Strategy

•  Completed a survey of existing technology and requirements
•  Must be Grid-aware but not Grid-dependent
   -  Still want to be able to 'pack and go' to a standalone laptop
•  Must be component-based
•  Interface technologies (standards needed → GGF)
   -  Programmatic API (e.g. C, C++, etc.)
   -  Scripting as glue, à la Stallman (e.g. Python)
   -  Others, e.g. SOAP, CORBA, RMI, DCOM, .NET, etc.
•  Defining the experiment software services to capture and present the functionality of the Grid service
                          Possible Designs

•  Two ways of implementation:
   -  Based on one of the general-purpose grid portals (not tied to a single application/framework):
      -  Alice Environment (AliEn)
      -  Grid Enabled Web eNvironment for Site-Independent User Job Submission (GENIUS)
      -  Grid access portal for physics applications (Grappa)
   -  Based on the concept of a Python bus (P. Mato):
      -  use whichever modules are required to provide the full functionality of the interface
      -  use Python to glue these modules together, i.e., allow interaction and communication between them (see the sketch after the next slide)
                                Python Bus

[Diagram: a Python software bus glues everything together. A local user drives a GUI on top of the bus; plugged into the bus are GaudiPython (talking to a GAUDI client and the workspaces DB), a Java module, an OS module, the EDG API (talking to the GRID and to the job configuration, bookkeeping and production DBs) and PythonROOT. A remote user reaches the same bus through an HTML page over the Internet. A minimal sketch of the gluing idea follows.]
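To make the Python-bus idea concrete, here is a minimal, purely schematic sketch: Python acts as the glue through which otherwise independent modules (experiment framework, Grid API, ROOT, ...) are reached. All class, module and file names are hypothetical stand-ins, not the real GaudiPython, EDG API or PythonROOT bindings.

# Minimal sketch of the "Python bus": a registry through which modules
# find and call each other.
class PythonBus:
    def __init__(self):
        self.modules = {}

    def plug(self, name, module):
        self.modules[name] = module          # e.g. "gaudi", "edg", "root"

    def get(self, name):
        return self.modules[name]

# Hypothetical stand-ins for the experiment framework and the Grid API.
class GaudiModule:
    def configure(self, job_options):
        print("configuring Athena/Gaudi job with", job_options)

class EDGModule:
    def submit(self, jdl):
        print("submitting to the Grid:", jdl)
        return "job-0001"                    # fake job identifier

bus = PythonBus()
bus.plug("gaudi", GaudiModule())
bus.plug("edg", EDGModule())

# A user-interface layer only ever talks to the bus, not to the modules directly.
bus.get("gaudi").configure("AtlfastOptions.txt")
job_id = bus.get("edg").submit("athena.jdl")
print("submitted", job_id)

The point of the design is that the GUI (or a remote HTML front end) only ever talks to the bus, so individual modules can be swapped without touching the user interface.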
                    Installation Tools

•  To use the Grid, deployable software must be installed on the Grid fabrics, and the deployable run-time environment established (Unix and Windows)
•  Need installable code and a run-time environment/configuration
•  Both ATLAS and LHCb use CMT for software management and environment configuration
•  CMT knows the package interdependencies and the external dependencies → it is the obvious tool to prepare the deployable code and to 'expose' the dependencies to the deployment tool (Christian Arnault, Chas Loomis)
•  A Grid-aware tool is then needed to deploy the above
•  PACMAN (Saul Youssef) is a candidate which seems fairly easy to interface with CMT
                          Installation Issues

•  Most Grid projects seem to assume either that code is pre-installed or that it can be dumped each time into the input sandbox
•  The only route for installing software through the Grid seems to be as data in Storage Elements
   -  In general these are non-local
   -  Hard to introduce directory trees etc. this way (file based)
•  How do we advertise installed code?
   -  Check it is installed by a preparation task sent to the remote fabric before/with the job (a toy sketch follows this list)
   -  Advertise that the software is installed in your information service, for use by the resource broker
•  Probably need both!
•  The local environment and external packages will always be a problem
   -  Points to a virtual machine idea eventually; Java?
•  Options?
   -  DAR: mixed reports, but CMS are interested
   -  PACKMAN from AliEn
   -  LCFG, OSCAR: not really suitable, more for site management?
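A toy version of the "preparation task" idea from the list above: a tiny script shipped before or with the job that reports whether the required release is already installed on the remote fabric. The release tag and candidate paths are invented for illustration.

# Toy preparation task: check whether the required release is installed
# on the worker node; exit code tells the caller what to do next.
import os
import sys

RELEASE = "3.2.1"                       # hypothetical release tag
CANDIDATE_ROOTS = ["/opt/atlas", os.path.expanduser("~/atlas")]

def release_installed(release):
    """Return the installation path if the release is found, else None."""
    for root in CANDIDATE_ROOTS:
        path = os.path.join(root, release)
        if os.path.isdir(path):
            return path
    return None

path = release_installed(RELEASE)
if path:
    print("release %s found at %s" % (RELEASE, path))
    sys.exit(0)
print("release %s missing - trigger an install or broker the job elsewhere" % RELEASE)
sys.exit(1)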
                   CMT and deployable code

•  Christian Arnault and Charles Loomis have a beta release of CMT that will produce package rpms, which is a large step along the way
   -  Still need to have minimal dependencies/clean code!
   -  Need to make the package dependencies explicit
   -  rpm requires root to install into the system database (but not for a private installation)
•  Developer and binary installations are being produced; these probably need further refinement
•  Work to expose the dependencies as PACMAN cache files is ongoing

•  Note: there is much work elsewhere on producing rpms of ATLAS code, notably in Copenhagen; this effort has the advantage that the full dependency knowledge in CMT can be exposed
                                pacman

•  Package manager for the grid, in development by Saul Youssef (Boston U, GriPhyN/iVDGL)
•  A single tool to easily manage installation and environment setup for the long list of ATLAS, grid and other software components needed to 'Grid-enable' a site
   -  fetch, install, configure, add to login environment, update
•  Sits on top of (and is compatible with) the many software packaging approaches (rpm, tar.gz, etc.)
•  Uses a dependency hierarchy, so one command can drive the installation of a complete environment of many packages
•  Packages are organized into caches hosted at various sites
   -  How to fetch something can be cached rather than the desired object itself
•  Includes a web interface (for each cache) as well as command line tools

An example package description (for SSLeay) follows.
# An encryption package needed by Globus
#
name        = 'SSLeay'
description = 'Encryption'
url         = 'http://www.psy.uq.oz.au/~ftp/Crypto/ssleay'
source      = 'http://www.psy.uq.oz.au/~ftp/Crypto/ssleay'
systems     = { 'linux-i386': ['SSLeay-0.9.0b.tar.gz','SSLeay-0.9.0b'],\
                'linux2'    : ['SSLeay-0.9.0b.tar.gz','SSLeay-0.9.0b'],\
                'sunos5'    : ['SSLeay-0.9.0b.tar.gz','SSLeay-0.9.0b'] }
depends     = []
exists      = ['/usr/local/bin/perl']
inpath      = ['gcc']
bins        = []
paths       = []
enviros     = []
localdoc    = 'README'
daemons     = []
install     = { \
   'root': ['./Configure linux-elf','make clean',                      \
       'make depend','make','make rehash','make test','make install'], \
   '*'   : ['./Configure linux-elf','make clean','make depend','make', \
       'make rehash','make test'] }

                         Grid Applications Toolkit

•  Horst Severini, Kaushik De, Ed May, Wensheng Deng, Jerry Gieraltowski, ... (US Test Bed)
•  Repackaged Athena-Atlfast (OO fast detector simulation) for the grid testbed (building on Julian Phillips's work and UK effort)
   -  Script 1: can run on any Globus-enabled node (requires transfer of ~17 MB of source)
   -  Script 2: runs on a machine with the packaged software preinstalled on the grid site
   -  Script 3: runs on AFS-enabled sites (the latest version of the software is used)
•  Other user toolkit contents
   -  check the status of grid nodes
   -  submit jobs (without worrying about the underlying middleware or ATLAS software)
   -  uses only basic RSL & globus-url-copy (sketched below)
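A rough sketch of what such toolkit scripts do under the hood: submit a job with a basic RSL string and pull the output back with globus-url-copy. The gatekeeper contact string, file paths and RSL contents are placeholders, and the command-line usage assumes a Globus 2-era installation; this is not the actual GRAT code.

# Rough sketch: submit via RSL, retrieve output via GridFTP.
import subprocess

GATEKEEPER = "gatekeeper.example.edu/jobmanager-pbs"   # placeholder contact string
RSL = '&(executable=/bin/sh)(arguments="run_atlfast.sh")(count=1)'

# Submit and wait for completion, streaming stdout back (globusrun -o).
subprocess.run(["globusrun", "-o", "-r", GATEKEEPER, RSL], check=True)

# Copy the produced ntuple back with globus-url-copy.
subprocess.run([
    "globus-url-copy",
    "gsiftp://gatekeeper.example.edu/home/atlas/atlfast.ntup",
    "file:///tmp/atlfast.ntup",
], check=True)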
                              Monitoring Tool

•  GridView - a simple visualization tool using the Globus Toolkit
   -  First native Globus application for the ATLAS grid (March 2001)
   -  Collects information using Globus tools; archival information is stored in a MySQL server on a second machine; data are published through a web server on a third machine (the collect-store-publish loop is sketched below)

Plans:
•  Java version
•  Better visualization
   -  Historical plots
   -  Hierarchical MDS information
   -  Graphical view of system health
•  New MDS schemas
•  Optimize archived variables
•  Publishing historical information through GIIS servers??
•  Explore discovery tools
•  Explore scalability to large systems

Patrick McGuigan
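A schematic stand-in for the collect-store-publish cycle described above: poll each gatekeeper, archive the result in a database, and let a web page read the archive. The table layout, host names and the use of ping instead of the Globus status commands are all simplifications; this is not the real GridView code.

# Schematic collect-store-publish loop in the spirit of GridView.
import sqlite3
import subprocess
import time

HOSTS = ["atlas00.example.edu", "heppc1.example.ac.uk"]   # placeholder gatekeepers

db = sqlite3.connect("gridview.db")
db.execute("CREATE TABLE IF NOT EXISTS status (host TEXT, ts REAL, alive INTEGER)")

def poll(host):
    """Return 1 if the host answers, else 0 (ping stands in for Globus checks)."""
    result = subprocess.run(["ping", "-c", "1", host], capture_output=True)
    return int(result.returncode == 0)

for host in HOSTS:
    db.execute("INSERT INTO status VALUES (?, ?, ?)", (host, time.time(), poll(host)))
db.commit()

# A separate web server can render the archived rows as a status page.
for row in db.execute("SELECT host, ts, alive FROM status ORDER BY ts DESC"):
    print(row)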
                 MDS Information

[Screenshot: listing of the available MDS object classes.]
More Details

[Screenshot not reproduced.]
                 Data Management Architecture

•  AMI (ATLAS Metadata Interface): query logical file names (LFNs) and their associated attributes and values
•  MAGDA (MAnager for Grid-based Data): manage replication and the physical location of files
•  VDC (Virtual Data Catalog): derive and transform LFNs
                         Managing Data - Magda

•  MAnager for Grid-based Data (essentially the 'replica catalogue' tool; a conceptual sketch follows this slide)
•  Designed for 'managed production' and 'chaotic end-user' usage
•  Designed for rapid development of components to support users quickly, with components later replaced by Grid Toolkit elements
   -  Deploy as an evolving production tool and as a testing ground for Grid Toolkit components
   -  GDMP will be incorporated
•  Application in the Data Challenges
   -  Logical files can optionally be organized into collections
   -  File management in production; replication to BNL; CERN and BNL data access
   -  GDMP integration, replication and end-user data access in DC1

Developments
•  Interface with AMI (ATLAS Metadata Interface, which allows user queries on Logical File Name collections; Grenoble project)
•  Interfaces to the Virtual Data Catalogue (see AV's talk)
•  Interfacing with the hybrid ROOT/RDBMS event store
•  Athena (ATLAS offline framework) integration; further grid integration
   Info:   http://www.usatlas.bnl.gov/magda/info
   Engine: http://www.usatlas.bnl.gov/magda/dyShowMain.pl
   T. Wenaus, W. Deng
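Conceptually, what a replica catalogue such as Magda provides is a mapping from a logical file name to its physical replicas, plus a policy for picking one. The sketch below invents the data layout, site and file names; it is not the real Magda schema or API.

# Conceptual replica-catalogue lookup: logical file name -> physical replica.
replica_catalogue = {
    "dc1.002000.simul.00001.root": [
        ("BNL",  "gsiftp://dcache.bnl.example/atlas/dc1/00001.root"),
        ("CERN", "gsiftp://castor.cern.example/atlas/dc1/00001.root"),
    ],
}

def best_replica(lfn, preferred_site):
    """Return a physical file name, preferring a replica at preferred_site."""
    replicas = replica_catalogue.get(lfn, [])
    for site, pfn in replicas:
        if site == preferred_site:
            return pfn
    return replicas[0][1] if replicas else None

print(best_replica("dc1.002000.simul.00001.root", "BNL"))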
                     Magda Architecture

[Architecture diagram not reproduced.]

•  DB access via perl, C++, java and cgi (perl) scripts
•  C++ and Java APIs are auto-generated off the MySQL DB schema
•  User interaction via web interface and command line
                              Conclusion

•  The Grid is the only viable solution to the ATLAS computing problem
   -  The problems of coherence across the Atlantic are large
   -  ATLAS (and CMS etc.) are 'at the sharp end', so we will force the divide to be bridged
•  Many applications have been developed, but they need to be refined/merged
   -  These revise our requirements; we must use LCG/GGF and any other forum to ensure that the middleware projects satisfy the real needs; this is not a test bed!
•  The progress so far is impressive and encouraging
   -  Good collaborations (especially ATLAS/LHCb)
•  The real worry is scaling up to the full system
   -  Money!
   -  Manpower!
   -  Diplomacy?!