                              Batch System
[Diagram: user commands (qsub, qdel, qstat) are sent to the Batch Server (pbs_server), which holds the batch server and cluster configuration, the job queue, and the state table. The Scheduler, which holds additional cluster configuration, exchanges node and job information with the Batch Server. Start, stop, and status requests control the Batch Server, and the Batch Server sends job, start, stop, and status requests to four execution hosts (pbs_mom).]

          Maui scheduler
• Seems to originate at Maui High
  Performance Computing Centre (MHPCC)
• But now available from
  in Covered Bridge Canyon, Utah

                Maui/PBS Integration
qmgr session:

    [martin@masternode martin]$ qmgr
    Max open servers: 4
    Qmgr: list server
    Server masternode
        server_state = Idle
        scheduling = False
        default_queue = dque
        log_events = 127
        mail_from = adm
        query_other_jobs = True
        resources_default.walltime = 00:01:00
        scheduler_iteration = 60
        node_pack = False
        pbs_version = OpenPBS_2.4

maui.cfg:

    # maui.cfg 3.2
    #
    # 18/5/04 built by maui with extras added by xCAT and the 12Mar04 version
    #
    SERVERHOST        masternode
    # primary admin must be first in list
    ADMIN1            root
    RMCFG[base]       TYPE=PBS
    RMPOLLINTERVAL    00:01:00
    SERVERPORT        42559
    SERVERMODE        NORMAL


          Maui Philosophy (1)
• Maui is particularly concerned about scheduling
  multiprocessor jobs
• How do you arrange for a matching set of processors
  to be simultaneously available for a single job?
• Maui tries to plan the execution of such jobs at a
  particular time when it expects sufficient
  processors to be available, on the basis of the
  maximum walltime parameters of the jobs.
• It establishes reservations on a set of processors
  for a job – ensuring all the processors are free at
  the planned time
[Figure: one row per processor. Each row shows a job (labelled 12340, 12341, 12343, 12344 or 12345) followed by a reservation for job 12345.]
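
The planning step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Maui's actual algorithm: the function name `earliest_start` and the `(processors, end_time)` tuples are hypothetical, and each running job is assumed to end exactly at its start time plus its maximum walltime.

```python
def earliest_start(now, free_procs, running, needed):
    """Return the earliest time at which `needed` processors are free.

    running: list of (procs, end_time) pairs, where end_time is the job's
    start time plus its maximum walltime (an assumption of this sketch).
    Returns None if the processors never become available.
    """
    if needed <= free_procs:
        return now
    available = free_procs
    # Walk through running jobs in the order they are expected to finish.
    for procs, end_time in sorted(running, key=lambda job: job[1]):
        available += procs
        if available >= needed:
            return end_time  # the reservation can start here
    return None


# Hypothetical cluster state: 1 idle processor, three running jobs.
running = [(2, 30), (1, 10), (4, 60)]
print(earliest_start(now=0, free_procs=1, running=running, needed=4))
```

Because the plan depends only on maximum walltimes, an overestimated walltime pushes the reservation later than necessary.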
           Maui Philosophy (2)
• As the reservations take effect, more and more
  processors become idle as the planned job time
  approaches
• A scheme called backfill tries to exploit these idle
  processors by running short single/few processor
  jobs out of priority order in the gaps
• Maximum efficiency is achieved by scheduling
  big jobs first and running small jobs in the gaps!
  Perhaps not what the users really want?
• Maui really cares about walltimes
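
A minimal sketch of the backfill idea, assuming a single gap described by a number of idle processors and the time left before the reservation starts. The function `backfill` and the job tuples are hypothetical and greatly simplified compared with what Maui does.

```python
def backfill(queue, idle_procs, gap):
    """Pick jobs that fit into the idle processors and finish within the gap.

    queue: list of (job_id, procs, walltime), highest priority first.
    gap:   time remaining before the reservation needs the processors.
    """
    chosen = []
    for job_id, procs, walltime in queue:
        # A job is eligible only if it fits in space and in time.
        if procs <= idle_procs and walltime <= gap:
            chosen.append(job_id)
            idle_procs -= procs
    return chosen


# Hypothetical queue: the two highest-priority jobs do not fit the gap.
queue = [
    ("big", 8, 120),
    ("long", 1, 90),
    ("short1", 1, 20),
    ("short2", 2, 30),
    ("short3", 2, 10),
]
print(backfill(queue, idle_procs=3, gap=30))  # ['short1', 'short2']
```

The lower-priority short jobs run ahead of "big" and "long", which is why accurate walltimes matter so much.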
                Job Priority (1)
• Jobs are selected for execution in priority order
• Priority is calculated as a linear combination of
  factors based on
   –   Credentials – who, class/queue,..
   –   Fair Share
   –   Resources requested
   –   Waiting time
   –   Target Service level – eg maximum wait
• Most sites would have most coefficients set to 0
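
As a rough illustration of a linear combination with most coefficients set to 0, the sketch below orders two jobs. The factor names, weights and values are all invented for the example and are not Maui parameter names.

```python
# Hypothetical weights: only fairshare and waiting time are switched on.
WEIGHTS = {
    "credential": 0,
    "fairshare": 10,
    "resource": 0,
    "queue_time": 1,
    "service_target": 0,
}


def priority(factors, weights=WEIGHTS):
    """Linear combination of the factors; missing factors count as 0."""
    return sum(weight * factors.get(name, 0) for name, weight in weights.items())


jobs = {
    "job_a": {"fairshare": 5, "queue_time": 30},
    "job_b": {"fairshare": -2, "queue_time": 90, "resource": 16},
}
# Highest priority first.
order = sorted(jobs, key=lambda job_id: priority(jobs[job_id]), reverse=True)
print(order)  # ['job_a', 'job_b']
```

Here job_b has waited longer and asks for more resources, but the resource weight is 0 and its fairshare penalty outweighs its waiting time.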

        Sample Priority Component
• Fairshare (FS) Component: Fairshare components allow a site to favor jobs
    based on short term historical usage. The Fairshare Overview describes the configuration
    and use of Fairshare in detail.
• After the brief reprieve from complexity found in the QOS factor, we come to the Fairshare
    factor. This factor is used to adjust a job's priority based on the historical percentage
    system utilization of the job's user, group, account, or QOS. This allows you to 'steer' the
    workload toward a particular usage mix across user, group, account, and QOS
    dimensions. The fairshare priority factor calculation is
        Priority += FSWEIGHT * MIN(FSCAP, (
            FSUSERWEIGHT    * DeltaUserFSUsage +
            FSGROUPWEIGHT   * DeltaGroupFSUsage +
            FSACCOUNTWEIGHT * DeltaAccountFSUsage +
            FSQOSWEIGHT     * DeltaQOSFSUsage +
            FSCLASSWEIGHT   * DeltaClassFSUsage))
• All '*WEIGHT' parameters above are specified on a per partition basis in the maui.cfg
    file. The 'Delta*Usage' components represent the difference in actual fairshare usage from
    a fairshare usage target. Actual fairshare usage is determined based on historical usage
    over the timeframe specified in the fairshare configuration. The target usage can be either a
    target, floor, or ceiling value as specified in the fairshare config file. The fairshare
    documentation covers this in detail but an example should help obfuscate things
    completely. Consider the following information associated with calculating the fairshare
    factor for job X.
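
The formula quoted above can be evaluated directly. The sketch below is not the manual's worked example for job X, which is not reproduced here: the weights, cap and delta values are invented, and the sign convention of the deltas is left to the caller.

```python
def fairshare_priority(fs_weight, fs_cap, weights, deltas):
    """Evaluate FSWEIGHT * MIN(FSCAP, sum of weight * delta).

    weights: per-dimension weights, e.g. {"user": ..., "group": ...}
    deltas:  per-dimension differences between actual and target usage;
             dimensions missing from `deltas` contribute nothing.
    """
    combined = sum(weights[dim] * deltas.get(dim, 0) for dim in weights)
    return fs_weight * min(fs_cap, combined)


# Hypothetical settings: only the user and group dimensions are weighted.
weights = {"user": 10, "group": 1, "account": 0, "qos": 0, "class": 0}
deltas = {"user": 2, "group": -1}

# 10*2 + 1*(-1) = 19, capped at 15, then scaled by FSWEIGHT = 2.
print(fairshare_priority(fs_weight=2, fs_cap=15, weights=weights, deltas=deltas))  # 30
```

The cap limits how much fairshare alone can contribute, whatever the size of the individual deltas.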
            Job Priority (2)
• Multiple queues/classes are but one factor
  in maui calculations and decisions
• Jobs are normally given a whole cpu or
  even a whole execution host
• Priorities are recalculated on every maui
  iteration – say 1 per minute
• Jobs selected for backfill can bypass higher
  priority jobs
• Jobs can be given priority increments or
  decrements according to whether their
  user's/group's/… recent usage is below or
  above target fairshare
• There is a selection of throttling
  parameters to prevent various forms of
  excessive behaviour – max jobs, max
  submission rate, …
• The administrator can set manual
  reservations – handy for shutting a node
  down at a particular time
• Standing reservations repeat – eg
  ScotGRID-Glasgow reserves a few nodes
  for short jobs 08:00 – 20:00 every day.
  – Backfill allows jobs of up to 12 hours on these
    nodes during the night
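
To illustrate how the daily 08:00 – 20:00 reservation limits what can be backfilled onto the reserved nodes overnight, here is a small sketch. The function name is hypothetical, and it only models long jobs competing with the reservation, not the short jobs the reservation exists for.

```python
from datetime import datetime, time, timedelta

RES_START = time(8, 0)   # reservation opens at 08:00
RES_END = time(20, 0)    # reservation closes at 20:00


def fits_before_reservation(start, walltime):
    """True if a job starting at `start` ends before the next reserved window."""
    if RES_START <= start.time() < RES_END:
        return False  # the reservation is active right now
    next_open = datetime.combine(start.date(), RES_START)
    if start.time() >= RES_END:
        next_open += timedelta(days=1)  # next window is tomorrow morning
    return start + walltime <= next_open


evening = datetime(2004, 5, 18, 20, 0)
print(fits_before_reservation(evening, timedelta(hours=12)))                       # True
print(fits_before_reservation(evening + timedelta(hours=1), timedelta(hours=12)))  # False
```

A 12 hour job only fits if it starts exactly when the reservation ends, so in practice shorter walltimes are more likely to be backfilled.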
              Node selection
• Some heterogeneity in the cluster may require all
  processors for a job to come from some subset for
  best performance eg sharing a Myrinet switch.
• Some constraints on node selection based on
  ownership may be demanded
• Maui has additional cluster configuration settings
  that can define sets of execution hosts as
  partitions (simple member list) or as nodesets
  (set defined by common node feature)
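
The difference between a partition (an explicit member list) and a nodeset (nodes sharing a feature) can be sketched as below. The node names, feature labels and helper functions are invented for illustration and are not Maui configuration syntax.

```python
# Hypothetical cluster: each node carries a set of feature labels.
nodes = {
    "node01": {"myrinet-a"},
    "node02": {"myrinet-a"},
    "node03": {"myrinet-b"},
    "node04": {"myrinet-b"},
}

# A partition is simply a named list of members.
partitions = {"physics": ["node01", "node03"], "bio": ["node02", "node04"]}


def nodeset(nodes, feature):
    """A nodeset is derived: every node that advertises the feature."""
    return sorted(name for name, features in nodes.items() if feature in features)


def pick_from_nodeset(nodes, feature, needed):
    """Take all processors for a job from one nodeset, or fail."""
    members = nodeset(nodes, feature)
    return members[:needed] if len(members) >= needed else None


print(nodeset(nodes, "myrinet-a"))               # ['node01', 'node02']
print(pick_from_nodeset(nodes, "myrinet-b", 2))  # ['node03', 'node04']
print(pick_from_nodeset(nodes, "myrinet-b", 3))  # None
```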
• Maui has a scheme for recording a usage
  profile over some period – eg a week
• The profile can then be played back with a
  different maui configuration in simulation
  mode to test new settings
• Quite a few “under construction” sections in
  the manual about this

   Resource Allocation Manager
• “Payment” for usage
• Maui can interwork with the QBank resource allocation
  manager
   – From Pacific Northwest National Laboratory (PNNL) in Richland,
     Washington
   – Reserves payment before job (lien) and takes actual payment for
     resources used after the job
• May be important when cluster is funded from many
  sources and value for money needs to be proved
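
The lien-then-charge pattern can be sketched as follows. This is not the QBank interface: the `Account` class, its methods and the processor-hour charging rate are all hypothetical.

```python
class Account:
    """Toy allocation account using a lien before the job and a charge after."""

    def __init__(self, balance):
        self.balance = balance
        self.liens = {}  # job_id -> amount held back

    def available(self):
        return self.balance - sum(self.liens.values())

    def reserve(self, job_id, procs, max_walltime, rate=1):
        """Place a lien for the worst case cost; refuse if funds are short."""
        cost = procs * max_walltime * rate
        if cost > self.available():
            return False
        self.liens[job_id] = cost
        return True

    def charge(self, job_id, procs, used_walltime, rate=1):
        """Release the lien and deduct what the job actually used."""
        self.liens.pop(job_id)
        cost = procs * used_walltime * rate
        self.balance -= cost
        return cost


account = Account(balance=1000)
account.reserve("12345", procs=4, max_walltime=100)   # lien of 400
print(account.available())                            # 600
account.charge("12345", procs=4, used_walltime=60)    # actual cost 240
print(account.available())                            # 760
```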

         Experience (1)
– OpenPBS and maui built and configured by IBM’s eXtreme
  Cluster Administration Tool (xCAT)
   • xCAT is not a product – more a kit of parts supplied to IBM
     customers to operate Linux clusters – some Open Source
   • xCAT includes scripts to build OpenPBS and Maui according to
     the xCAT scheme
– Fairshares used to balance between user groups
   • Calculated wrt an average over 7 days – decaying 20% per day
   • Most effective with a steady demand across all users/groups –
     less good when job submission is more peaks and troughs
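
One reading of "an average over 7 days, decaying 20% per day" is a weighted average in which each day's usage counts 0.8 times as much as the following day's. The sketch below assumes that interpretation; the function name and the usage figures are invented.

```python
def decayed_usage(daily_usage, decay=0.8):
    """Weighted average of daily usage, most recent day first.

    daily_usage[0] is today, daily_usage[1] is yesterday, and so on.
    Each step back in time multiplies the weight by `decay` (assumed 0.8).
    """
    weights = [decay ** age for age in range(len(daily_usage))]
    return sum(w * u for w, u in zip(weights, daily_usage)) / sum(weights)


# Two hypothetical groups with the same total usage over 7 days.
recent_burst = [70, 0, 0, 0, 0, 0, 0]
old_burst = [0, 0, 0, 0, 0, 0, 70]
print(decayed_usage(recent_burst) > decayed_usage(old_burst))  # True
```

A burst of jobs today weighs more heavily than the same burst a week ago, which is one reason peaky submission patterns are balanced less evenly than steady demand.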

              Experience (2)
• Standing reservation for short jobs during daytime
   –   Currently 3 nodes with a maximum walltime of 1 hour
   –   Intended for development/test runs
   –   Grid monitoring test jobs
   –   No experience yet of multiprocessor jobs, simulation,
       resource allocation management
• Bioinformatics group demonstrated that maui has a
  compiled limit of 4096 on the maximum number of
  jobs that can be in the queue!

• Maui documentation is extensive but not complete
• Maui is not keen on error messages
• Priority calculation is hard to get to grips with
• A misbehaving pbs_mom hangs both OpenPBS and Maui
   – ssh allnodes service pbs status
   – hope to use ganglia to spot cases
     where a whole execution host is in trouble
       • Ganglia's gmetad (that aggregates local data) contributes a load
         average of ~1 on our 1 GHz PIII.
         Looks like gmetad needs its own cpu

• The EDG (and LCG?) job submission
  system relies on sites giving an
  estimate of time before a job would start
  to execute – FIFO behaviour
• Maui does not execute jobs in
  submission order – non FIFO behaviour
• RB gets an unreliable estimate

• GridPP have a batch solution replacing
  OpenPBS with Torque and Maui – see
  words of Steve Traylen at
• A Google search on
           Maui lcg rpm
  reveals many other sites getting into maui

