Results of the LHCb experiment Data Challenge 2004


          Joël Closier
         CERN / LHCb
           CHEP’04
          The LHCb DC04 team
 DIRAC
  – Andrei Tsaregorodtsev, Vincent Garonne, Ian Stokes-Rees
 Production management
  – Joël Closier, Ricardo Graciani (LCG), Johan Blouw, Andrew
    Pickford … and the LHCb site managers
 LHCb Bookkeeping, Monitoring & accounting
  – Markus Frank, Carmine Cioffi, Manuel Sanchez, Ruben Vizcaya
 LCG-LHCb liaison
  – Flavia Donno, Roberto Santinelli
 The LCG-GDA team
  – Ian Bird, Laurence Field, Maarten Litmaath, Markus Schulz,
    David Smith, Zdenek Sekera, Marco Serra…


                         Result of LHCb DC04                      2
                  Outline
   Aims of the LHCb Data Challenge 2004
   Production model
   Performances of DC’04
   Lessons from DC’04
   Conclusions




                 LHCb DC’04 aims
 Main goal: gather information to be used for writing the
  LHCb computing Technical Design Report
   – Robustness test of the LHCb software and production system
      • Using software as realistic as possible in terms of performance
   – Test of the LHCb distributed computing model
      • Including distributed analyses
      • Realistic test of analysis environment, need realistic analyses
   – Incorporation of the LCG application area software into the LHCb
     production environment
   – Use of LCG resources (at least 50% of the production capacity)
   – 3 phases
      • Production : MC simulation and reconstruction
      • Stripping : Event pre-selection
      • Analysis


        LHCb DC04 aims (cont’d)
 Physics goals
   – HLT studies, consolidating efficiencies
   – Background/Signal studies, consolidate background estimates +
     background properties
 Requires quantitative increase in number of signal and
  background events compared to DC03:
   – 30 × 10^6 signal events
   – 15 × 10^6 specific background events
   – 125 × 10^6 background events (B inclusive + minimum bias, ratio 1:1.8)
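The requested sample sizes can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "ratio 1:1.8" means B inclusive : minimum bias (the slide does not say which way round), and the function name `split_by_ratio` is a hypothetical helper, not part of any LHCb tool.

```python
# Requested DC04 sample sizes, in events (numbers taken from the slide).
SIGNAL = 30e6
SPECIFIC_BACKGROUND = 15e6
INCLUSIVE_BACKGROUND = 125e6  # B inclusive + minimum bias together

# Assumption: "ratio 1:1.8" is read as B inclusive : minimum bias.
RATIO_B_INCLUSIVE = 1.0
RATIO_MIN_BIAS = 1.8


def split_by_ratio(total, part_a, part_b):
    """Split `total` into two pieces in the proportion part_a : part_b."""
    scale = total / (part_a + part_b)
    return part_a * scale, part_b * scale


b_inclusive, min_bias = split_by_ratio(
    INCLUSIVE_BACKGROUND, RATIO_B_INCLUSIVE, RATIO_MIN_BIAS
)
total_requested = SIGNAL + SPECIFIC_BACKGROUND + INCLUSIVE_BACKGROUND

print(f"B inclusive   : {b_inclusive / 1e6:.1f} M events")
print(f"Minimum bias  : {min_bias / 1e6:.1f} M events")
print(f"Total request : {total_requested / 1e6:.0f} M events")
```

Under that reading, the three categories add up to 170 M requested events.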




                                              Production
 Production done with the DIRAC system
   –   See CHEP04 talk, Track 4 - Distributed Computing Services, id 377


 DIRAC is deployed at each site participating in DC’04
 Central Services supporting the Data Challenge
   –   Production database
   –   Workload Management System
   –   Monitoring, Accounting
   –   Bookkeeping, AliEn File Catalog
 Technologies used by the production services
   – C++, Python, XML-RPC
   – Oracle and MySQL databases




                                  LHCb job
 Non LCG site
1.   DIRAC deployment (CE).
2.   DIRAC JobAgent:
     –   Check CE status.
     –   Request a DIRAC task (jdl).
     –   Install LHCb software if needed.
     –   Submit the job to the Local Batch System:
             –   Execute task.
             –   Check Steps.
             –   Upload results.
3.   DIRAC TransferAgent.

 LCG site
1.   Input SandBox:
     –   Small bash script (~50 lines).
         1. Check environment:
               •   Site, hostname, CPU, Memory, Disk Space…
         2. Install DIRAC:
               •   Download DIRAC tarball (~1 MB).
               •   Deploy DIRAC on WN.
         3. Execute the job:
               A. Request a DIRAC task (LHCb Simulation job).
               B. Execute task.
               C. Check Steps.
               D. Upload results.
2.   Retrieval of SandBox.
3.   Analysis of Retrieved Output SandBox.
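The order of operations on an LCG worker node (check the environment first, only then request a task) can be sketched as follows. This is a toy model, not DIRAC code: the function `run_pilot`, the in-memory `deque` standing in for the central task queue, the `disk_gb` key and the status strings are all hypothetical names chosen for illustration, and the DIRAC installation step is omitted.

```python
from collections import deque


def run_pilot(task_queue, environment, min_disk_gb=1.0):
    """Toy model of the agent sequence on a worker node (hypothetical names)."""
    # 1. Check environment before asking for any work.
    if environment.get("disk_gb", 0.0) < min_disk_gb:
        return "environment-check-failed"

    # 2. Install DIRAC: omitted in this sketch.

    # 3A. Request a task from the (here: in-memory) central queue.
    if not task_queue:
        return "no-task"
    task = task_queue.popleft()

    # 3B/3C. Execute the task step by step and check each step.
    for step in task["steps"]:
        if not step():
            return "step-failed"

    # 3D. Upload results (represented here by a flag on the task).
    task["uploaded"] = True
    return "done"


queue = deque([{"steps": [lambda: True, lambda: True]}])
print(run_pilot(queue, {"disk_gb": 5.0}))   # a healthy node takes the task
print(run_pilot(queue, {"disk_gb": 5.0}))   # the queue is now empty
```

In this model a worker node that fails the environment check returns before touching the queue, so the task stays available for another agent.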
                         Strategy
 Test sites:
   – Each site is tested with special and production-like jobs.
 Enable site :
   – DIRAC Workload Management System.
 Always keep jobs in the queues


 DIRAC: run the Local Agent continuously
   – Via cron jobs
   – Via runsv
   – Via daemon
 LCG: submit jobs continuously
   – Via cron job on the User Interface
 Note: from the DIRAC point of view, LCG is considered a site


                    Data Storage
 All output of the reconstruction phase (DST) is sent
  to CERN (as Tier 0)
 Intermediate files are not kept.
 DSTs are also stored at one of our 5 Tier 1 centres
   –   CNAF (Italy)
   –   Karlsruhe (Germany)
   –   Lyon (France)
   –   PIC (Spain)
   –   RAL (United Kingdom)




DC’04 performances



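                Phase 1 results
 [Figure: events produced over time during Phase 1. Labels on the plot:
  "DIRAC alone", "LCG in action", "LCG paused", "LCG restarted",
  "Phase 1 Completed", "186 M Produced Events".
  Rate annotations: 1.8 × 10^6/day and 3-5 × 10^6/day.]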
Daily performance
 [Figure: daily event production. Annotation on the plot: 5 million/day.]
Sites involved
 20 DIRAC sites
 43 LCG sites (8 also DIRAC sites)
 Used resources from non-LHCb countries, e.g. Hungary produced ~2M events
Simultaneous jobs (a snapshot)




            TIER storage
 Tier     Site         Number of events    Size (TB)
 Tier 0   CERN            187 557 231        62
 Tier 1   CNAF             37 129 350        12.6
 Tier 1   RAL              19 462 850         6.5
 Tier 1   PIC              16 505 010         5.4
 Tier 1   Karlsruhe        12 486 300         4
 Tier 1   Lyon              4 368 656         1.5
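The storage table can be summarised with a short calculation of the Tier 1 totals and the average stored size per event. The sketch below only re-uses the numbers in the table; it assumes 1 TB = 10^12 bytes (the slide does not state the convention), and the names `STORAGE`, `TIER1_SITES` and `kb_per_event` are introduced here for illustration.

```python
# (number of events, size in TB) per site, copied from the table.
STORAGE = {
    "CERN": (187_557_231, 62.0),
    "CNAF": (37_129_350, 12.6),
    "RAL": (19_462_850, 6.5),
    "PIC": (16_505_010, 5.4),
    "Karlsruhe": (12_486_300, 4.0),
    "Lyon": (4_368_656, 1.5),
}
TIER1_SITES = ["CNAF", "RAL", "PIC", "Karlsruhe", "Lyon"]


def kb_per_event(events, size_tb):
    """Average stored size per event in kB, assuming 1 TB = 10**12 bytes."""
    return size_tb * 1e12 / events / 1e3


tier1_events = sum(STORAGE[site][0] for site in TIER1_SITES)
tier1_size_tb = sum(STORAGE[site][1] for site in TIER1_SITES)

print(f"Tier 1 total: {tier1_events:,} events, {tier1_size_tb:.1f} TB")
for site, (events, size_tb) in STORAGE.items():
    print(f"{site:10s} {kb_per_event(events, size_tb):6.0f} kB/event")
```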

              DIRAC-LCG : events share
 [Figure: bar chart "LHCb DC'04", events (M) on a 0-200 scale, for Total,
  May, June, July and August, each bar split between DIRAC and LCG.]

                   50% of events were produced using LCG
DIRAC – LCG : CPU share
    376 CPU · Years

    Month    DIRAC : LCG    Share of DC’04
    May       88% : 12%         11%
    Jun       78% : 22%         25%
    Jul       75% : 25%         22%
    Aug       26% : 74%         42%
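The monthly figures can be combined into an overall LCG share of the CPU time by weighting each month's LCG fraction with that month's share of DC'04. This is a sketch under two assumptions not spelled out on the slide: that each ratio is ordered DIRAC : LCG (following the slide title), and that "x% of DC'04" is the month's fraction of the total CPU time. The name `MONTHS` is illustrative.

```python
# month -> (share of total DC'04 CPU, LCG fraction within that month)
# Assumption: ratios on the slide are ordered DIRAC : LCG.
MONTHS = {
    "May": (0.11, 0.12),
    "Jun": (0.25, 0.22),
    "Jul": (0.22, 0.25),
    "Aug": (0.42, 0.74),
}

total_weight = sum(weight for weight, _ in MONTHS.values())
lcg_cpu_share = sum(weight * lcg for weight, lcg in MONTHS.values())
dirac_cpu_share = total_weight - lcg_cpu_share

print(f"Monthly shares sum to {total_weight:.0%}")
print(f"LCG   share of CPU time: {lcg_cpu_share:.1%}")
print(f"DIRAC share of CPU time: {dirac_cpu_share:.1%}")
```

Under these assumptions the weighted LCG share of CPU time comes out at about 43%, a figure that can be set against the roughly 50% of events attributed to LCG.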




                   LCG performance
 211 k jobs submitted to LCG.
 After running: 113 k Done (Successful), 34 k Aborted.

 Status               Jobs (k)   % of submitted   % of remaining
 Submitted               211        100.0%
 Cancelled                26         12.2%
 Remaining               185         87.8%           100.0%
 Aborted (not Run)        37         17.6%            20.1%
 Running                 148         70.0%            79.7%
 Aborted (Run)            34         16.2%            18.5%
 Done                    113         53.8%            61.2%
 Retrieved               113         53.8%            61.2%

 LCG Efficiency: 61 %
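The quoted efficiency is the number of successfully completed jobs divided by the jobs remaining after cancellation. The sketch below recomputes the percentages from the job counts as printed (in thousands); because those counts are rounded, the recomputed values can differ from the table in the last digit. The dictionary `JOBS_K` and the helper `percent` are names introduced here for illustration.

```python
# Job counts in thousands, copied from the table.
JOBS_K = {
    "Submitted": 211,
    "Cancelled": 26,
    "Remaining": 185,
    "Aborted (not Run)": 37,
    "Running": 148,
    "Aborted (Run)": 34,
    "Done": 113,
}


def percent(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


for status, count in JOBS_K.items():
    of_submitted = percent(count, JOBS_K["Submitted"])
    of_remaining = percent(count, JOBS_K["Remaining"])
    print(f"{status:18s} {count:4d}  {of_submitted:5.1f}%  {of_remaining:5.1f}%")

# Efficiency as defined on the slide: Done jobs over Remaining jobs.
lcg_efficiency = percent(JOBS_K["Done"], JOBS_K["Remaining"])
print(f"LCG efficiency: {lcg_efficiency:.0f} %")
```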
DC’04 lessons


          Lessons learnt: DIRAC
 The concept of light, customizable and simple-to-deploy
  agents proved to be very effective
 Easy update procedure: bug fixes to DIRAC tools
  propagate quickly
 Application software installation is triggered by a running
  job
 Most of the central services were running on the same
  machine
   – Too many processes, high loads
   – Improve Server Availability
 Improve Error Handling and Reporting.


                 Lessons learnt: LCG
 Improve the OutputSandBox upload/retrieval mechanism:
    – Should also be available for Failed and Aborted Jobs.
 Improve reliability of CE status collection methods (timestamps?).
 Add intelligence on the CE or RB to detect and avoid large numbers of
  aborted jobs on start-up:
    – Prevent a misconfigured site from becoming a black hole.
 Need to collect LCG log information, and a tool to navigate it (including
  different JobIDs).
 Need a way to limit the CPU (and wall-clock) time:
    – The LCG Wrapper must issue appropriate signals to the User Job to allow graceful
      termination.
 How-to manuals:
    – Clear instructions to Site Managers on the procedure to shut down a site (for
      maintenance and/or upgrade).
    – Problems with site configurations (LCG config, firewalls, gridFTP servers...)




                            Conclusions
 LHCb DC’04 Phase 1 is over.
 The Production Target was achieved:
   – 186 M Events in 424 CPU years.
   – ~ 50% on LCG Resources (75-80% in the last weeks).
 LHCb Strategy successful:
   – Submitting “empty” DIRAC Agents to LCG has proven to be very flexible,
     allowing a success rate above that of LCG alone.
 Considerable room for improvement, in both DIRAC and LCG:
   – DIRAC needs to improve the reliability of its Servers:
       • a big step was already made during the DC.
   – LCG needs to improve single-job efficiency:
       • ~40% aborted jobs, ~10% did the work but failed from the LCG viewpoint.
   – In both cases, extra protection against external failures (network, unexpected
     shutdowns…) must be built in.
 Success was due to dedicated support from the LCG team and the DIRAC Site
  Managers
                          Other links
 CHEP04 talks:
   – File-Metadata Management System for the LHCb Experiment
       • (Track 4 - Distributed Computing Services) id 392
       • 27-Sep-2004         17:30
   – DIRAC Workload Management System
       • (Track 5 - Distributed Computing Systems and Experiences)   id 365
       • 29-Sep-2004         10:00
   – Grid Information and Monitoring System using XML-RPC and Instant
     Messaging for DIRAC
       • (Track 4 - Distributed Computing Services) id 368
       • 29-Sep-2004         10:00
   – DIRAC - The Distributed MC Production and Analysis for LHCb
       • (Track 4 - Distributed Computing Services) id 377
       • 30-Sep-2004         18:10
   – A Lightweight Monitoring and Accounting System for LHCb DC04
     Production
       • (Track 4 - Distributed Computing Services) id 388
       • 30-Sep-2004         17:30

