Batch Systems at Fermilab

        Preparing for the Grid:
        Changes in Batch Systems at Fermilab

               HEPiX Batch System Workshop
                    Karlsruhe, Germany
               Ken Schumacher, Steven Timm

May 12, 2005          Batch systems@FNAL--Batch   1
                       Workshop HEPiX Karlsruhe
• All big experiments at Fermilab (CDF, D0,
  CMS) are moving to grid-based
• This talk will cover the following:
     – Batch scheduling at Fermilab before the grid.
     – Changes of big Fermilab clusters to Condor
       and why it happened
     – Future requirements for batch scheduling at Fermilab

               Before Grid--FBSNG
• Fermilab had four main clusters, CDF
  Reconstruction Farm, D0 Reconstruction Farm,
  General Purpose Farm, CMS.
• All used FBSNG (Farms Batch System Next Generation).
• Most early activities on these farms were
  reconstruction of experimental data and
  generation of Monte Carlo.
• All referred to generically as “Reconstruction Farms”.

                 FBSNG scheduling in
                 Reconstruction Farms
• Dedicated reconstruction farm (CDF, D0)
     – Large cluster dedicated to one experiment
     – Small team of experts submits all jobs
     – Scheduling is trivial
• Shared reconstruction farm (General Purpose)
     – Small cluster shared by 10 experiments, each with one or more queues
     – Each experiment has a maximum quota of CPUs it can use at once
     – Each experiment has maximum share of farm it can use when farm is
     – Most queues do not have time limits. Priority is calculated taking into
       account the average time that jobs have been running in the queue
     – Special queues for I/O jobs that run on the head node and move data to
       and from mass storage
     – Guaranteed scheduling means that everything will eventually run
          • Other queues may be manually held to let a job run
          • May have to temporarily idle some nodes in order to let a large parallel job
            start up.
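The quota rule above can be illustrated with a minimal sketch of resource counting: a job is admitted only when the farm has enough free CPUs and the experiment stays within its quota. This is not FBSNG code; the `Queue` class, the function names, and the numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Hypothetical per-experiment queue (not an FBSNG data structure)."""
    name: str
    quota: int        # maximum CPUs the experiment may hold at once
    running: int = 0  # CPUs currently held by this queue's jobs


def can_start(queue, cpus_needed, free_cpus):
    """Resource counting: admit a job only if the farm and the quota both have room."""
    return cpus_needed <= free_cpus and queue.running + cpus_needed <= queue.quota


def start(queue, cpus_needed, free_cpus):
    """Start the job if allowed and return the farm's remaining free CPUs."""
    if not can_start(queue, cpus_needed, free_cpus):
        return free_cpus  # job stays queued; nothing changes
    queue.running += cpus_needed
    return free_cpus - cpus_needed
```

No load measurement is involved: the decision depends only on counters, which is what keeps such a scheduler simple and cheap to run.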

    FBSNG Advantages and Disadvantages

• Advantages
     – Light resource consumption by batch system daemons
     – Simple design—based on resource counting rather than load
       measuring and balancing
     – Cost--No per-node license fee
     – Customized for Fermilab Strong Authentication requirements
     – Quite reliable; the FBSNG software rarely if ever fails.
• Disadvantages
     – Designed strictly for Fermilab Run II production
     – Doesn’t have grid-friendly features such as x509 authentication,
       although they could be added.

     Grid can use any batch system,
              Why Condor?
• Free software (but you can buy support).
• Supported by large team at U. of Wisconsin (and not by Fermilab
• Widely deployed in multi-hundred node clusters.
• New versions of Condor allow Kerberos 5 and x509 authentication
• Comes with Condor-G which simplifies submission of grid jobs
• Condor-C components allow for interoperation of independent
  Condor pools.
• Some of our grid-enabled users take advantage of the extended
  Condor features, so it is the fastest way to get our users on the grid.
• USCMS production cluster at Fermilab has switched to Condor,
  CDF reconstruction farms cluster is switching.
• General Purpose Farms, which are smaller, also plan to switch to
  Condor to be compatible with the two biggest compute resources on

          Rise of Analysis Clusters
• Experiments now use multi-hundred node Linux clusters
  for analysis as well, replacing expensive central
     – CDF Central Analysis Facility (CAF) originally used FBSNG;
       it has now switched to Condor.
     – D0 Central Analysis Backend (CAB) uses PBS/Torque
     – USCMS User Analysis Facility (UAF) used FBSNG as a primitive
       load balancer for interactive shells; it will switch to a Cisco load
       balancer shortly.
• Heterogeneous job mix
• Many different users and groups have to be prioritized
  within the experiment

                      CAF software
• In CDF terms, CAF refers to the cluster and the software
  that makes it go.
• CDF collaborators (UCSD+INFN) wrote a series of
  wrappers around FBSNG referred to as “CAF”.
     – Wrappers let users connect to a running job to debug it, tail its
       files, and do many other things
     – Also added monitoring functions
     – Users are tracked by Kerberos principal and prioritized with
       different batch queues, but all jobs run under just a few user IDs,
       making management easy.
• dCAF is distributed CAF, the same setup replicated at
  dedicated CDF resources around the world.
• Info at
         CondorCAF in production
•   CDF changed batch system to Condor in analysis facility
•   Also rewrote monitoring software to work with Condor
•   Condor “computing on demand” capacity allows users to list files, tail
    files, debug on batch nodes.
•   Lots of work from the Condor team to get them going with Kerberos
    authentication and the large number of nodes (~700).
•   Now half of CDF reconstruction farm also running Condor
•   Rest of CDF reconstruction farm will convert once validation is complete
•   SAM is data delivery and bookkeeping mechanism
     – used to fetch data files, keep track of intermediate files, store the
     – Replaces user-written bookkeeping system that was high-maintenance
• Next step: GlideCAF, to make CAF work with Condor glide-ins
  across the grid on non-dedicated resources.
               Screen from CondorCAF

• D0 is using SAMGrid for all remote generation of Monte
  Carlo and reprocessing at several sites worldwide.
• D0 Farms at FNAL are biggest site.
• Special job managers written to do intelligent handling of
  production and Monte Carlo requests
• All job requests and data requests go through head
  nodes to the outside net. Significant scalability issues,
  but it is in production.
• D0 reconstruction farms at Fermilab will continue to use

                  Open Science Grid
•   Continuation of efforts that were begun in Grid3.
•   Integration testing has been ongoing since February
•   Provisioning and deployment is occurring as we speak.
•   At Fermilab, USCMS production cluster and General Purpose Farms will be
    initial presence on OSG.
•   10 Virtual Organizations so far, mostly US-based:
     –   USATLAS (ATLAS collaboration)
     –   USCMS (CMS collaboration)
     –   SDSS (Sloan Digital Sky Survey)
     –   fMRI (functional Magnetic Resonance Imaging, based at Dartmouth)
     –   GADU (Applied Genomics, based at Argonne)
     –   GRASE (Engineering applications, based at SUNY Buffalo)
     –   LIGO (Laser Interferometer Gravitational-Wave Observatory)
     –   CDF (Collider Detector at Fermilab)
     –   STAR (Solenoidal Tracker at RHIC—BNL)
     –   iVDGL (International Virtual Data Grid Laboratory)

      Structure of General Purpose
      Farms OSG Compute Element
• One node runs Globus Gatekeeper and does all
  communication with the grid
• Software comes from VDT (Virtual data toolkit,
• In this configuration this gatekeeper is also the
  Condor master. Condor software is part of VDT.
• Will make a separate Condor head node later
  once software configuration is stable.
• All grid software is exported by NFS to the
  compute nodes. No change to compute node
  install is necessary.

• Fermigrid is an internal project at Fermilab to make different
  Fermilab resources interoperate and be available to the
  Open Science Grid
• Fermilab will start with General Purpose Farms and CMS
  being available to OSG and to each other.
• All non-Fermi organizations will send jobs through
  common site gatekeeper.
• Site gatekeeper will route jobs to the appropriate cluster,
  probably using Condor-C, details to be determined.
• Fermigrid provides VOMS server to manage all the
  Fermilab-based Virtual Organizations
• Fermigrid provides GUMS server to map grid
  Distinguished Names to Unix userids.
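The two steps described above, mapping a grid identity to a local account and routing the job to the appropriate cluster, can be sketched as a table lookup. This is only an illustration under stated assumptions: it does not use the GUMS or Condor-C interfaces, and the table contents, userids, cluster names, and the function `map_and_route` are all hypothetical.

```python
# Hypothetical mapping table: VO name -> local Unix account and target cluster.
# A real site would obtain this from its mapping service, not a literal dict.
VO_MAP = {
    "uscms": {"userid": "uscms01", "cluster": "cms"},
    "sdss": {"userid": "sdss01", "cluster": "gpfarm"},
}


def map_and_route(dn, vo):
    """Return (userid, cluster) for a grid job, or raise if the VO is unknown."""
    entry = VO_MAP.get(vo.lower())
    if entry is None:
        # Refuse jobs from VOs the site has no mapping for.
        raise PermissionError(f"no mapping for VO {vo!r} (DN {dn})")
    return entry["userid"], entry["cluster"]
```

Keeping the mapping and the routing decision in one place means the site gatekeeper is the only component that needs to know about every VO.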

                      Current Farms Configuration

[Diagram. Labeled components: FBSNG, ENSTORE, RAID, FBS Submit,
GP Farms worker nodes (102 currently).]

                          Configuration with Grid

[Diagram. Labeled components: Fermigrid1 site gatekeeper; FNGP-OSG
gatekeeper; FNPCSRV1 FBSNG head node; ENSTORE; FBS Submit; Condor WN (14);
Condor WN (40, coming this summer); GP Farms FBSNG worker nodes
(102 currently). Job entry labels: "Job from" and "Job from Fermilab".]

• Scheduling
     – Current FBSNG installation in general purpose farms has
       complicated shares and quotas
     – Have to find best way to replicate this in Condor.
     – Hardest case to handle: low-priority long jobs come into the
       farm while it is idle and fill it up. Do we pre-empt? Suspend?
• Grid credentials and mass storage
     – Need to verify that we can use Storage Resource Manager and
       gridftp from compute nodes, not just head node.
• Grid credentials—authentication + authorization
     – Condor has Kerberos 5 and x509 authentication
     – Need way to pass these credentials through the Globus GRAM
       bridge to the batch system
     – Otherwise local as well as grid jobs end up running
       non-authenticated and trusting the gatekeeper.

                   Requirements 2
• Accounting and auditing
     – Need features to track which groups and which users are using
       the resources
     – VOs need to know who within their VO is using resources
     – Site admins need to know who is crashing their batch system
• Extended VO Privilege
     – Should be able to set priorities in the batch system and mass
       storage system by virtual organization and role.
     – In other words, Production Manager should be able to jump
       ahead of Joe Graduate Student in the queue.
• Practical Sysadmin concerns
     – Some grid user mapping scenarios envision hundreds of pool
       userids per VO.
     – Each of these would need an account, quota, home directory, etc.
     – Would be very nice to do as CondorCAF does and run with a few
       user IDs traceable back to a Kerberos principal or grid credential.
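Two of the requirements above, role-based priority and a small number of shared user IDs that remain traceable to a credential, can be sketched together with a priority queue. This is a toy model, not Condor or CondorCAF code; the role names, priority values, principals, and the shared account name are hypothetical.

```python
import heapq
import itertools

# Lower value is served first; the roles and numbers are illustrative only.
ROLE_PRIORITY = {"production": 0, "user": 10}

# Monotonic counter used as a tie-breaker so jobs keep FIFO order within a role.
_counter = itertools.count()


def submit(queue, credential, role, shared_userid):
    """Queue a job under a shared Unix account while recording who owns it."""
    entry = (ROLE_PRIORITY[role], next(_counter), credential, shared_userid)
    heapq.heappush(queue, entry)


def next_job(queue):
    """Pop the highest-priority job and return (credential, shared_userid)."""
    _priority, _seq, credential, shared_userid = heapq.heappop(queue)
    return credential, shared_userid
```

Because each queue entry carries the submitting credential, accounting can still attribute usage to an individual even though every job runs under the same local account.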

