

                 PWF Condor:
   What it is, Why you want it
      and How to use it

                 Bruce Beckles
e-Science Specialist, University Computing Service
What is PWF Condor?
          What is Condor?

• Specialised batch (unattended jobs)
  scheduling system:
   Distributed: jobs spread over multiple
    machines (nodes)
   'Cycle scavenging': can be set to use idle
    time on machines
   Cross-platform: UNIX/Linux, MacOS X, …
   Often regarded as "grid middleware"
          A Centralised Condor Pool

[Diagram: Submit Node, Central Manager and Execute Nodes]
• The Submit Node tells the Central Manager about a job;
  the Central Manager tells it which Execute Node it
  should send the job to, and the Submit Node sends the
  job to that Execute Node.
• Each Execute Node tells the Central Manager about
  itself; the Central Manager tells each Execute Node when
  to accept a job from a Submit Node.
• The Execute Node returns results to the Submit Node.

Centralised architecture:
Few Submit Nodes,
Many Execute Nodes
(Not all pools are like this)
     Types of Condor Job
• Different “Universes” (types of job):
   Vanilla: ordinary batch jobs
   Java: Java programs
   Standard: linked against Condor
    libraries (remote I/O, checkpointing)
   PVM, MPI: parallel jobs
   …etc

          What is PWF Condor?
• A restricted implementation of Condor on the PWF:
   Condor 6.6.10 (latest release of the current Stable series)
   More secure:
     • “Principle of least privilege”
     • Sanitised environment
   Machines (daemons) authenticate using Kerberos
   Vanilla and Java Universes only (limited support for
    Java Universe)
   Centralised architecture: single submit node (SSH
    access only)
   1 TB dedicated short-term storage provided on
    submit node (not backed up)
   Currently PWF Linux (SuSE 9.3) only
            PWF Condor: Architecture

[Diagram: as above, with PWF-specific details]
• Submit Node: PWF Linux server, SSH access only, with
  1 TB of dedicated short-term storage.
• Central Manager: PWF Linux server.
• Execute Nodes: PWF Linux machines.
• The Submit Node tells the Central Manager about a job;
  the Central Manager tells it which Execute Node it
  should send the job to, and the Submit Node sends jobs
  to the Execute Nodes.
• Each Execute Node tells the Central Manager about
  itself; the Central Manager tells each Execute Node when
  to accept a job from the Submit Node.
• Execute Nodes return results to the Submit Node.

All daemon-daemon communication
in the pool is authenticated
via Kerberos.
    Why you want PWF Condor

(and now, a word from our sponsors)

 Advantages for MCS sites (1)
• Allows machines' idle time to be productively used,
  relatively safely:
    Most PWF/MCS workstations are idle most of the time,
     especially late at night
• Allows academics to use MCS workstations for their research:
    With PWF Condor, your MCS cluster is now a research tool
     and so is even more valuable to your Department,
     improving the ROI of the MCS cluster
• Provides “cheap” computational resource for
  Departments with limited budgets:
    Your MCS cluster is already paid for, so no extra cost!
    Why purchase, install, administer and maintain a dedicated
     cluster if you can get equivalent resources already?
   Advantages for MCS sites (2)
• May free up expensive dedicated clusters:
   Suitable "embarrassingly parallel" jobs can be moved
    from existing clusters to PWF Condor, freeing those
    clusters for the jobs that really need such dedicated
    resources
• Satisfies growing demand for Condor with little
  administration overhead for MCS sites:
   Use of Condor in academic research is increasing,
    especially amongst the computationally-intensive
    disciplines
   …so if there's a demand for Condor in your
    Department, you can now easily satisfy it with PWF
    Condor
    Advantages for users (1)
• Ideal for “embarrassingly parallel” jobs (that
  are not too long):
   Simulations, models
   Parameter sweeps
• Same results, faster:
   Part II Physics project: 15 mins per run; 10,000
    runs required ≈ 105 days or about 15 weeks on a
    single machine (forget it)
   Assume about 120 PCs, 11 hours a day running
    PWF Condor ≈ 2 days (yay!)
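The arithmetic behind this estimate can be checked with a quick sketch (the run count, run length and pool size are the figures quoted above; the pool size is approximate):

```python
# Back-of-the-envelope check of the Part II Physics example above.
RUN_MINUTES = 15      # minutes per run
RUNS = 10_000         # total runs required
PCS = 120             # machines in the pool (approximate)
HOURS_PER_DAY = 11    # Condor window: 2100-0800

# Single machine, running round the clock:
single_days = RUNS * RUN_MINUTES / 60 / 24
print(f"Single machine: ~{single_days:.0f} days")   # ~104 days, i.e. ~15 weeks

# Pool of PCs, 11 hours a day:
runs_per_day = PCS * HOURS_PER_DAY * 60 // RUN_MINUTES
pool_days = RUNS / runs_per_day
print(f"PWF Condor pool: ~{pool_days:.1f} days")    # ~1.9 days
```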

       Advantages for users (2)
• More results, larger parameter sweeps:
    “Same results, faster” means user can get more results
    Biochemistry PhD student: 30-min simulation ≈ 40 runs a day
     on their PC… so they'll only run the simulation in what they
     believe to be the most interesting parameter range
    Assume about 120 PCs, 11 hours a day running PWF Condor ≈
     2,500 runs per day… an increase of over 60-fold, so now they
     can explore larger parameter ranges (…and maybe find
     unexpected behaviour)
• Significant help migrating to PWF Condor:
    We really want users for PWF Condor, so…
    …we will actively try to “hand-hold” users – during this initial
     phase – as they learn to use Condor
    Currently working with users in Chemistry, Biochemistry,
     Engineering and Physics
        What is PWF Condor?

(back to our regularly scheduled programme)

PWF Condor: Policy Details (1)
• Jobs only run between certain times
  (currently 2100 – 0800 every day):
   MCS sites can choose their own times
   Jobs can be submitted 24/7, but will only
    attempt to run between set times
• At start time (2100), idle machines boot
  themselves into PWF Linux / start Condor
• At end time (0800), Condor daemons are
  stopped and idle machines return to
  default operating system
PWF Condor: Policy Details (2)
• All users of PWF Condor have access
  to all participating PWF workstations
  in CS rooms:
   Titan Teaching Room 1
   Titan Teaching Room 2
   Phoenix Teaching Room
• MCS sites can choose to restrict
  which users can run jobs on their
  machines (not recommended)
PWF Condor: Policy Details (3)
• Machines running jobs display
  message to this effect on screen
• If a user starts using a machine that
  is running a job, the job is killed
• Job files on execute nodes deleted
  when job finished or killed
• …however, we do not guarantee
  privacy of your data or executables:
   Don’t use this with confidential data!
PWF Condor: Policy Details (4)
• Condor uses a “fair share” priority system
  for users:
   All users start out with equal priority
   Priority decreases as usage increases
   Priority then increases over time (with an
    exponential half-life) until it returns to normal
• Job preemption has been disabled:
   Jobs will run to completion, or until a user
    starts using a machine, or until PWF Condor
    end time (currently 0800)
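The fair-share behaviour described above can be illustrated with a toy model: a user's accumulated usage decays exponentially, so their priority recovers over time. The half-life value here is purely illustrative, not the pool's actual configuration:

```python
# Toy model of Condor-style fair-share priority recovery.
# HALF_LIFE_HOURS is an illustrative assumption, not the pool's setting.
HALF_LIFE_HOURS = 24.0

def decayed_usage(usage_hours: float, hours_since: float) -> float:
    """Effective usage after `hours_since` idle hours; lower is better priority."""
    return usage_hours * 0.5 ** (hours_since / HALF_LIFE_HOURS)

# A user who burned 100 machine-hours:
print(decayed_usage(100, 0))    # 100.0  (worst priority straight away)
print(decayed_usage(100, 24))   # 50.0   (one half-life later)
print(decayed_usage(100, 48))   # 25.0   (priority returning to normal)
```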

PWF Condor: Policy Details (5)
• Submit Node only provides short-term storage:
   Currently no quotas on 1 TB filesystem
   We periodically monitor filesystem usage:
     • Users will be warned, then files will be deleted
• Submit Node cannot cope with a job
  queue of more than about 4,000 jobs:
   Tries to stop job submission if too many jobs
    already in the queue

       PWF Condor: Limitations
• Condor jobs do not have access to users' home directories:
    Submit Node is PWF Linux server, so…
    …only access to home directory when user is logged in…
    …so files for Condor job must be stored locally on Submit Node, and
     results of Condor job will be also be stored locally on Submit Node
• Condor is bad at handling large numbers of short-running (5
  minutes or less) jobs:
    Recommend jobs should be at least 15 minutes long
    Users should “batch up” several shorter jobs to make a longer job
• Maximum run-time for jobs:
    Currently PWF Condor runs for 11 hour intervals
    Experience from UCL suggests that in this scenario the expected
     maximum run-time is about 5 hours
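The "batch up" advice above amounts to grouping many short tasks into fewer, longer Condor jobs. A minimal sketch, where the task list and chunk size are illustrative:

```python
# Sketch: group many short tasks so each Condor job runs long enough.
# Task strings and chunk size are illustrative, not a real PWF workload.
def batch(tasks, chunk_size):
    """Split a list of short tasks into chunks; each chunk becomes one job."""
    return [tasks[i:i + chunk_size] for i in range(0, len(tasks), chunk_size)]

# 200 five-minute tasks, 8 per job -> 25 jobs of ~40 minutes each:
tasks = [f"run --param {i}" for i in range(200)]
jobs = batch(tasks, 8)
print(len(jobs))      # 25
print(jobs[0][0])     # run --param 0
```

Inside each submitted job, a wrapper would then simply execute its chunk of commands in sequence.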

How do you use PWF Condor?

Getting access to PWF Condor
• Users will need a PWF account
• …then e-mail
   Ask for access to PWF Condor
   Explain intended use of PWF Condor
    (sample code, etc. is much appreciated):
     • …not because we are control freaks, but…
     • …not all jobs are suitable for PWF Condor
• If everything‟s OK, user will be added
  to appropriate ACLs

     Submitting a job (1)
• SSH to Submit Node:
  (This hostname will change in due course)
• Login with CRSid and PWF password
• Change to local submit directory:
    cd /submit/<CRSid>
• Upload executable and data files to
  Submit Node
          Submitting a job (2)
• Create a submit description file for the job
• Submit the job with the condor_submit command:
  condor_submit <submit_description_file>
• You can monitor the job with the condor_q command:
   But don't do this repeatedly as it will cripple the Submit
    Node
   …anyway, jobs won't execute until after 2100, so there's
    not much point monitoring them except between 2100
    and 0800.
• You can delete the job with the condor_rm command
     Submitting a job (3)
• Inspect job log file and job's
  standard error if there are any
  problems (job fails, etc.)
• When the job has finished, transfer
  output back to your machine
• Then delete job executable, data
  files, output files and log file

       Submit description files (1)
• Tells Condor everything it needs to know about the job:
      Type of job ("Universe")
      Executable
      Data files
      Executable's parameters
      Job requirements (memory, disk space, etc)
      Standard I/O redirections (stdout, stderr, etc)
    Log file
    …etc
• Traditionally have a cmd extension, e.g. my_job.cmd,
  but this is not mandatory
   Submit description files (2)
# Sample submit description file
# This is a comment
universe       = vanilla
executable     = my_prog
output         = my_prog.out
error          = my_prog.err
log            = my_prog.log
arguments      = -f data_file
transfer_input_files          = data_file,config_file
should_transfer_files         = YES
when_to_transfer_output       = ON_EXIT_OR_EVICT
queue


   Submit description files (3)
# Another sample submit description file
universe           = vanilla
executable         = /bin/echo
output             = out.$(Process)
error              = err.$(Process)
log                = example.log
arguments          = $(Process)

requirements       = Memory >= 64 && Disk >= 1024
rank               = KFlops

should_transfer_files        = YES
when_to_transfer_output      = ON_EXIT_OR_EVICT
transfer_executable          = False

notification       = Error

queue 100
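The `queue 100` statement submits 100 copies of the job, and `$(Process)` expands to the job's number, 0 through 99, in each copy. The sketch below shows the per-job filenames this sample would generate:

```python
# What the second sample's "queue 100" produces: 100 jobs whose
# $(Process) macro takes the values 0..99, giving per-job files.
N_JOBS = 100
output_files = [f"out.{p}" for p in range(N_JOBS)]
error_files = [f"err.{p}" for p in range(N_JOBS)]

print(output_files[0], output_files[-1])   # out.0 out.99
print(len(output_files))                   # 100
# Each job runs: /bin/echo <its process number>
```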

PWF Condor: Documentation
• Documentation is being prepared
  and will be made available via the
  CS website soon
• In the meantime, I've been helping
  interested users get to grips with
  PWF Condor
• Those already familiar with Condor
  will find using PWF Condor
  straightforward
• Any questions?

• E-mail
     Users wanting to use PWF Condor
     MCS sites wanting PWF Condor
     Any questions relating to PWF Condor
     Condor advice and queries

