Condor HTC


Condor is a so-called "specialized batch system". The purpose of a batch system is to let users
queue jobs, which are executed as soon as computing power becomes available. "Specialized"
refers to Condor's ability to manage a whole pool of machines that are willing to accept jobs. It
does so using internal tables of so-called ClassAds (short for classified advertisements). Each
machine that is willing to accept jobs is described by a ClassAd (containing information such as
architecture, memory, and CPU load), as is each job submitted by a user (containing the job's
requirements, such as architecture, disk use, and priority).
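To illustrate the matching idea, a job's requirements can be written as an expression over machine
attributes. The fragment below is a minimal sketch for a .condor file (see section 3), assuming the
Requirements command and the machine attributes Arch and Memory (in megabytes). The values are
made up, and the attributes advertised in your pool may differ; "condor_status -long" lists them.

```
# Only match machines of this architecture with at least 512 MB of memory
Requirements = (Arch == "X86_64") && (Memory >= 512)
```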
The task of Condor is to distribute the jobs to matching machines. For different purposes, Condor
provides different universes to run jobs in. The most important ones are "standard" and "vanilla". If
you want to run a Java application, or an application that is written using MPI or PVM interfaces,
ask your administrator for further information.
(1) The "standard" Universe
This is probably the best universe in which to run jobs. Its only inconvenience is that you have
to relink your program against a library that lets Condor map system calls back to your local
machine. Because jobs run with your UID, they appear to run on your own machine. Relinking
also lets Condor transparently create periodic checkpoints of your job. This is useful for two
reasons. First, if a machine crashes after working for hours, that time is not lost: Condor simply
takes the last checkpoint and migrates the job to a healthy machine. Second, if a user claims a
pool machine to work on locally (which happens sometimes!), Condor first suspends all its
activity on that machine and then migrates the job to another machine.
(2) The "vanilla" Universe
To run jobs without having to relink them, use the "vanilla" universe. This universe lets you run
programs whose object files or source files are not available (or which use shared memory,
which is not compatible with the Condor library). Typically this is useful for closed-source
programs such as Matlab, and for NS-2 (which cannot be linked against the Condor library). This
universe does not support checkpoints. Therefore, if a user claims local access to a machine,
your jobs will be suspended, then probably canceled and relaunched on another machine. In
general, jobs are rerun until they complete successfully.
(3) The .condor file
To submit a job to the Condor system, you have to create a small job specification file, a
so-called .condor file. At minimum, the .condor file must name the executable you wish to run
and the universe you want to submit it to. Furthermore, you have to specify the files that will be
used as StdIn, StdOut and StdErr for your program. This is required because all jobs run in
non-interactive mode, which should not pose a problem in most cases.
We will now look at the different definitions you can make in a .condor file.
The following list should be sufficient for all typical work needs.
Universe = (vanilla | standard)
This choice depends on whether you are able to relink your program against the Condor libraries.
If you can, use standard; if not, use vanilla.
Executable = (/full/path/executable | executable)
Giving the full path to your executable is recommended, because your $PATH variable is not
searched. Otherwise you have to run condor_submit from the directory where your executable is
located.
InitialDir = [/home/90days/username/simulation]
This attribute determines the root directory for your job. All other paths defined in this .condor
file are relative to this directory (the only exception is Executable). Make sure the directory exists.
Input = [relative/path/in]
Output = [relative/path/out]
Error = [relative/path/err]
These three attributes assign StdIn, StdOut and StdErr to your program. In general, Input is only
used for programs that wait for interaction on startup. Real data input is usually supplied through
Arguments and Transfer_Input_Files (see below). Output and Error should always be assigned. All
paths are relative to InitialDir.
Log = [relative/path/log]
You can specify where Condor saves its logfile. The logfile records, throughout execution, which
machine your job is running on, whether and how often it was migrated or checkpointed, how
much I/O was done, how long it took, and so on. This path is relative to InitialDir.
Arguments = [arg1 ... argN]
Here you list the arguments your program should be run with. If any of your arguments are
filenames, you should probably also list them in Transfer_Input_Files or Transfer_Output_Files
(see below).
Transfer_Input_Files = [relative/path/in1 ... relative/path/inN]
Transfer_Output_Files = [relative/path/out1 ... relative/path/outN]
These attributes tell Condor which files to copy to the pool machine before execution and which
ones to copy back to your local machine after termination. You do not have to list your Input file
and Output file here, but don't forget the files you have named in Arguments. Again, these paths
are relative to InitialDir.
GetEnv = (True | False)
When you set this attribute to true (by default it is false), Condor copies all your local shell
variables to the pool machine. This is vital for Matlab. Otherwise it is not recommended, as shell
variables are likely to refer to files that do not exist on the pool machines.
Queue [N]
This submits your job as you have described it up to this point in your .condor file. For
non-deterministic programs, that is, programs that do not always return the same result for a
fixed input set, you may want to indicate how many instances of this job you want to execute
(if omitted, N is assumed to be 1).
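Putting the attributes above together, a complete .condor file might look like the following
minimal sketch. The executable name "simulate" and the data file names are hypothetical, and the
file-transfer attributes are written with underscores, as Condor expects. Depending on the Condor
version and on whether the pool shares a file system, further file-transfer settings may be
needed; ask your administrator.

```
# job.condor: minimal sketch for the vanilla universe (file names are hypothetical)
Universe   = vanilla
Executable = /home/90days/username/simulation/simulate
InitialDir = /home/90days/username/simulation

# Input is omitted because this hypothetical program does not read from StdIn
Output = run.out
Error  = run.err
Log    = run.log

# Files named in Arguments are also listed for transfer
Arguments             = params.dat results.dat
Transfer_Input_Files  = params.dat
Transfer_Output_Files = results.dat

Queue
```

Comments start with "#" and are kept on their own lines here, since text following a value on the
same line would be read as part of that value.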
A particularity of .condor files is that you are allowed to override attributes that you have
assigned earlier. As it is possible to have several Queue entries in your .condor file, this lets
you specify several different instances of your program, using different input sets and/or
different output sets. With a little practice you will be able to run hundreds of different
simulations from only one .condor file and take a few days of vacation...
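As a sketch of this overriding behaviour, the following .condor file queues two instances that
share the same executable and logfile but use different inputs and outputs. The executable and
data file names are hypothetical.

```
# Shared settings, valid for every Queue entry below
Universe   = vanilla
Executable = /home/90days/username/simulation/simulate
InitialDir = /home/90days/username/simulation
Log        = runs.log

# First instance
Arguments            = small.dat
Transfer_Input_Files = small.dat
Output = small.out
Error  = small.err
Queue

# Second instance: only the attributes assigned again here change
Arguments            = large.dat
Transfer_Input_Files = large.dat
Output = large.out
Error  = large.err
Queue
```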
.condor variables
Condor provides a neat facility for creating separate files for different instances of your job.
When you use "Queue N", all processes read from and write to the same files if you have specified
static filenames above. To create unique filenames, you can use $(Process) in your filenames. It
is replaced automatically by the process number of the job instance.
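For example, the following sketch queues five instances that each write to their own output and
error files. The executable and file names are hypothetical; process numbers start at 0.

```
Universe   = vanilla
Executable = /home/90days/username/simulation/simulate
InitialDir = /home/90days/username/simulation

# $(Process) expands to 0, 1, 2, 3 and 4 for the five instances,
# giving run.0.out ... run.4.out and run.0.err ... run.4.err
Output = run.$(Process).out
Error  = run.$(Process).err
Log    = run.log

Queue 5
```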
(4) How to prepare your program
If you have the sources or the object files of your program, you may want to link it against the
Condor library in order to use the standard universe. This is done quite easily by prefixing your
usual compilation command with condor_compile.
For instance, if you link your object files with "gcc *.o -o program", you simply replace this with
"condor_compile gcc *.o -o program". If you have a Makefile, you may type "condor_compile make".
Most compilers are supported (cc, CC, gcc, g++, ld). Remember that certain technologies, such as
shared memory, are not compatible with condor_compile.
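Once relinked, the program can be submitted to the standard universe. A minimal sketch of a
matching .condor file follows, assuming the relinked executable "program" from the example above
lives in a hypothetical simulation directory. Because the standard universe maps system calls
back to your local machine, this sketch lists no files to transfer.

```
# Sketch for a relinked program in the standard universe (paths are hypothetical)
Universe   = standard
Executable = /home/90days/username/simulation/program
InitialDir = /home/90days/username/simulation
Output = program.out
Error  = program.err
Log    = program.log
Queue
```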
(5) How to submit your job
Once you have created your .condor file and, if necessary, relinked your executable, you can
submit your job to Condor. In general you will log on to icsil1-cluster with your user account,
change into the directory where your program lies if necessary, and submit your job by typing
"condor_submit job.condor".
Now all you have to do is wait until a resource becomes available and Condor schedules a part of
your job. If you wish, you can observe this procedure using condor_q and condor_status, or simply
wait until you get an e-mail from Condor saying that your job is done.
(6) How to monitor what Condor is currently doing
There are two essential tools that allow you to watch Condor at work. The first is condor_q, which
displays all the jobs you have submitted and whether they are running or waiting for execution.
The other is condor_status, which gives you a list of all machines in the Condor pool. It shows
who is currently using each machine (Unclaimed: nobody is using the resource. Claimed: Condor
is working on it. Owner: a user is working directly on the machine).