
Condor HTC


     Condor is a so-called "specialized batch system". The purpose of a "batch system" is to enable
     users to queue jobs which will be executed as soon as computing power becomes available. The
     "specialized" refers to the ability of Condor to manage a whole pool of machines which are willing
     to accept jobs. It does so by using internal tables containing so-called ClassAds (short for
     classified advertisements). Each machine that is willing to accept jobs is described by a ClassAd
     (containing information such as architecture, memory, and CPU load), and so is each job submitted
     by a user (containing the job's requirements, such as architecture, disk use, and priority).

     The task of Condor is to distribute the jobs to matching machines. For different purposes, Condor
     provides different universes to run jobs in. The most important ones are "standard" and "vanilla". If
     you want to run a Java application, or an application that is written using MPI or PVM interfaces,
     ask your administrator for further information.

    (1) The "standard" Universe

     This is probably the best universe in which to run jobs. Its only inconvenience is that you have
     to relink your program against a library which permits Condor to map system calls back to your
     local machine. As jobs run with your UID, they appear to run on your own machine. The
     relinking also permits Condor to (transparently) create periodic checkpoints of your job. This is
     useful for two reasons. First, if a machine crashes after your job has been running for hours,
     that time is not lost: Condor simply takes the last checkpoint and migrates the job to a (healthy)
     machine. Second, if a user claims a machine of the pool to work on locally (which happens
     sometimes!), Condor will first suspend all its activity on that machine and then migrate the job
     to another machine.

    (2) The "vanilla" Universe

     To run jobs without having to relink them, use the "vanilla" universe. This universe enables
     you to run programs whose object files or source files are not available (or which use shared
     memory, which is not compatible with the Condor library). Typically this is useful for
     closed-source programs such as Matlab and for NS-2 (which cannot be linked against the
     Condor library). This universe does not support checkpoints. Therefore, if a user claims local
     access to a machine, your jobs will be suspended, and probably canceled and relaunched on
     another machine. In general, jobs are rerun until they complete successfully.

    (3) The .condor file

     To submit a job to the Condor system you have to create a small job specification file, a
     so-called .condor file. At minimum, the .condor file must name the executable you wish to run
     and the universe to which you want to submit it. Furthermore, you have to specify the files that
     will be used as StdIn, StdOut and StdErr for your program. This is required since all jobs run
     in non-interactive mode, and it should not pose a problem in most cases.
     We will now look at the different definitions you can make in a .condor file.

    The following list should be sufficient for all typical work needs.

    Universe = (vanilla | standard) 
    Choose standard if you are able to relink your program against the Condor library; otherwise
    use vanilla.

    Executable = (/full/path/executable | executable) 
    It is recommended to indicate the whole path to your executable, because your $PATH variable
    will not be searched. Otherwise you have to run condor_submit from the directory where your
    executable is located.

    InitialDir = [/home/90days/username/simulation]
    This attribute determines the root directory for your job. All other paths defined in this
    .condor file will be relative to this directory (the only exception is Executable). Make sure
    that the directory exists.

    Input = [relative/path/in]
    Output = [relative/path/out]
    Error = [relative/path/err] 
    These three attributes assign StdIn, StdOut and StdErr to your program. In general, Input is
    only used for programs that wait for interaction on startup; real data input is usually passed
    via Arguments and Transfer_Input_Files (see below). Output and Error should always be assigned.
    All paths are relative to InitialDir.
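
    As a minimal sketch, the attributes described so far might be combined as shown here. The
    executable path and the three file names are hypothetical placeholders, and stdin.txt must
    already exist in InitialDir. Lines starting with # are comments; the closing Queue line,
    described further below, is what actually submits the job.

    ```condor
    # Sketch of a small .condor file; all paths and file names are placeholders.
    Universe   = vanilla
    Executable = /home/90days/username/bin/simulation
    InitialDir = /home/90days/username/simulation

    # The following paths are relative to InitialDir.
    Input      = stdin.txt
    Output     = stdout.txt
    Error      = stderr.txt

    Queue
    ```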

    Log = [relative/path/log]
    You can specify where Condor saves its logfile. The logfile records, for every stage of
    execution, on which machine your job ran, whether and how often it was migrated or
    checkpointed, how much I/O was done, how long it took, and so on. This path is relative to
    InitialDir.

    Arguments = [arg1 ... argN]
    Here you can indicate the arguments with which you want to run your program. If there are any
    filenames among your arguments, you should probably also list them in Transfer_Input_Files or
    Transfer_Output_Files (see below).

    Transfer_Input_Files = [relative/path/in1, ..., relative/path/inN]
    Transfer_Output_Files = [relative/path/out1, ..., relative/path/outN]
    This is where Condor learns which files it should copy to the pool machine before execution
    and which ones it has to copy back to your local machine after termination. Multiple files are
    separated by commas. You do not have to list your Input and Output files here, but don't forget
    the ones you pass in Arguments. Again, these paths are relative to InitialDir.
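
    A short sketch of how Arguments and the two transfer attributes fit together, assuming a
    hypothetical program that reads params.dat and topology.dat and writes result.dat. The
    attribute names are written with underscores, the spelling condor_submit expects, and
    multiple files are separated by commas. Depending on the Condor version and on whether the
    pool shares a file system, further settings such as should_transfer_files and
    when_to_transfer_output may be needed; ask your administrator.

    ```condor
    # Sketch only: the three file names are placeholders for your own files.
    # Every file named in Arguments is also listed in a transfer attribute.
    Arguments             = params.dat topology.dat result.dat
    Transfer_Input_Files  = params.dat, topology.dat
    Transfer_Output_Files = result.dat
    ```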

    GetEnv = (True | False) 
    When you set this attribute to True (the default is False), Condor will copy all your local
    environment variables to the pool machine. This is vital for Matlab. Otherwise it is not
    recommended, as environment variables are likely to refer to files that do not exist on the
    pool machines.

     Queue [N] 
     This submits your job as you have described it up to this point in your .condor file. For
     non-deterministic programs, that is, programs that do not always return the same result for a
     fixed input set, you may want to indicate how many instances of this job you want to execute
     (the default is 1).

     A particularity of .condor files is that you are allowed to overwrite attributes that you have
     assigned before. As it is possible to have several Queue entries in your .condor file, this enables
     you to specify several different instances of your program, using different input sets and/or
     different output sets. With a little practice you will be able to run hundreds of different
     simulations from only one .condor file and take a few days of vacation...
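
     The overwriting described above can be sketched as follows. The executable path, the output
     file names, and the meaning of the single argument (a problem size) are hypothetical. Each
     Queue line submits one job using the attribute values in effect at that point in the file.

     ```condor
     # Sketch: two runs of the same program with different arguments.
     Universe   = vanilla
     Executable = /home/90days/username/bin/simulation
     InitialDir = /home/90days/username/simulation

     # First job.
     Arguments  = 10
     Output     = small.out
     Error      = small.err
     Queue

     # Second job: overwrite only what changes; Universe, Executable
     # and InitialDir are kept from above.
     Arguments  = 1000
     Output     = large.out
     Error      = large.err
     Queue
     ```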

     .condor variables 
     Condor provides a neat facility to create separate files for different instances of your job.
     When you do a "Queue N", all instances read from and write to the same files if you specify
     static filenames above. To create unique filenames, you can use $(Process) in them. It will
     be replaced automatically by the process number of the job instance.
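
     A minimal sketch of this, with hypothetical paths and file names. Process numbers start at 0,
     so "Queue 10" yields the output files run_0.out through run_9.out. The logfile is shared
     here, since Condor records which job each log entry belongs to.

     ```condor
     # Sketch: ten instances of one program, each with its own output files.
     Universe   = vanilla
     Executable = /home/90days/username/bin/simulation
     InitialDir = /home/90days/username/simulation

     # $(Process) expands to 0, 1, ..., 9 for the ten instances.
     Output     = run_$(Process).out
     Error      = run_$(Process).err
     Log        = run.log

     Queue 10
     ```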

    (4) How to prepare your program

     If you have the sources or the object files of your program you may want to link it against the
     Condor library in order to use the standard universe. This is done quite easily by prefixing
     your usual compilation command with condor_compile.

     For instance, if you link your object files with "gcc *.o -o program", you simply replace
     this by "condor_compile gcc *.o -o program". If you have a Makefile, you may type
     "condor_compile make". Most compilers are supported (cc, CC, gcc, g++, ld). Remember
     that certain technologies such as shared memory are not compatible with condor_compile.

    (5) How to submit your job

     When you have created your .condor file and, if necessary, relinked your executable, you are
     ready to submit your job to Condor. In general you will log on to icsil1-cluster with your user
     account, change into the directory where your program lies if needed, and submit your job by
     typing "condor_submit job.condor".

     Now your task is to wait until a resource becomes available and Condor schedules a part of your
     job. If you wish you can observe this procedure using condor_q and condor_status, or simply wait
     until you get an e-mail from Condor that your job is done.

    (6) How to monitor what Condor is currently doing

     There are two essential tools that allow you to watch Condor at work. The first is condor_q, which
     will display all the jobs you have submitted and whether they are running or waiting for execution.

     The other tool is condor_status, which will give you a list of all machines in the Condor pool.
     You will find information about who is currently using each machine (Unclaimed: nobody is using
     the resource. Claimed: Condor is working on it. Owner: a user is working directly on the machine).
