Data driven scheduling in Condor

     Data-driven Workflow Planning
in Cluster Management Systems

  Srinath Shankar
  David J DeWitt

  Department of Computer Sciences
  University of Wisconsin-Madison, USA
        Data explosion in science

   Scientific applications – Traditionally
    considered as compute-intensive
   Data explosion in recent years
   Astronomy – hundreds of TB
       Sloan Digital Sky Survey
       LIGO – Laser Interferometer Gravitational-Wave Observatory
   Bioinformatics
       BIRN – Biomedical Informatics Research Network
       SwissProt – Protein database
      Scientific workflows and files
Jobs with dependencies organized in Directed Acyclic Graphs
     Large number of similar DAGs make up a workflow

                A, B, C and D are programs
      File1 and File2 are pipeline (intermediate) files
   FileInput is a batch input file -- common to all DAGs
        Distributed scientific computing

   Scientists have exploited distributed
    computing to run their programs and workflows
   One popular distributed computing system
    is Condor
   Condor harvests idle CPU cycles on
    machines in a network
   Condor has been installed on roughly
    113,000 machines across 1,600 clusters
    around the world
           But …
   Several advances have been made since the development of Condor in
    the '80s
   Machines are getting cheaper
       Organizations no longer rely solely on idle desktop machines for
        computing cycles
       The proportion of machines dedicated to Condor computing in a
        cluster is increasing
   Disk capacities are increasing
       A single machine may have 500 GB of disk space
       Thus, desktop machines may also have a lot of free disk space
   Dedicated and desktop machines have unused disk space
   Half a petabyte of disk space spread over a modest cluster of 1000 machines

   The volume of data processed by
    scientific applications is increasing.
   How can we leverage distributed disk
    space to improve data management in
    cluster computing systems (like Condor)?
   Step 1: Store workflow data across the
    disks of machines in a cluster
   Step 2: Schedule workflows based on
    data location – Exploit disk space to
    improve workflow execution times
      Overview of Condor
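[Figure: Condor architecture, showing an execute machine and a submit machine. Control flow carries machine info and job info; data flow carries user input data and output data between the submit machine and the execute machine.]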
         Job and workflow submission
   To submit a job, the user provides a “submit”
    file containing
       Complete job description – The input, output and
        error files, when to transfer these files etc.
       Machine preferences such as OS and CPU speed
   Workflows are managed in a separate layer
       The user specifies dependencies between jobs in a
        separate “DAG description” file
       A DAG manager process (DAGMan) on the submit
        machine continuously monitors job completion
       This process submits a job only when all its parents
        have completed
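
The dependency rule above (a job is submitted only when all its parents have completed) can be sketched in a few lines of Python. This is a minimal illustration and not DAGMan's actual implementation; the function name `ready_jobs` and the dictionary layout are assumptions made for the example.

```python
def ready_jobs(parents, completed, submitted):
    """Return jobs that are not yet submitted and whose parents have all completed.

    parents: dict mapping each job name to the set of its parent job names
    (a hypothetical layout chosen for this sketch).
    """
    return sorted(
        job for job, deps in parents.items()
        if job not in submitted and deps <= completed
    )


# Hypothetical three-job DAG: C depends on A and B.
parents = {"A": set(), "B": set(), "C": {"A", "B"}}

first = ready_jobs(parents, completed=set(), submitted=set())
second = ready_jobs(parents, completed={"A"}, submitted={"A", "B"})
third = ready_jobs(parents, completed={"A", "B"}, submitted={"A", "B"})
print(first, second, third)
```

In this sketch C is released only in the third call, once both of its parents are in the completed set.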
          Limitations of Condor

   The “source” of files in Condor is the
    submit machine, or perhaps a shared or
    third-party file system
       Inefficient handling of files during workflow execution
       Files always transferred to and from the
        submit machine
   The planner only handles single jobs
       It has no direct knowledge of job dependencies
       It only sees a job after DAGMan submits it.
     Distributed file caching
   Keep the files of a job on the disk of
    machines after execution
   Utilize local disks on execute machines
    as sources of files
   Schedule dependent jobs on same
    machine whenever feasible
   Avoid network file transfer
   Reduce overall workflow execution time
     Disk aware planning

   Goal – reduce workflow execution time
    by minimizing file transfers
   Planner must be aware of the locations
    of cached files
   Requires a planner that is also aware
    of workflow structure
        Two phase planning algorithm
   AssignDAGs : Each DAG in a workflow
    tentatively assigned to the best
    machine based on disk cache contents
   But, assigning whole DAGs ignores
    inter-job parallelism
   Parallelize : Exploit parallelism in DAG
    to distribute load
       Cost-benefit analysis used when
        scheduling dependent jobs on different machines
      Planning example

[Figure: Sample DAG. Jobs A and B produce files F1 and F2, which are the inputs of job C.]
   Suppose we have 4 machines available
    to run the workflow shown below
[Figure: Sample workflow consisting of 6 such DAGs.]
         Assignment of DAGs

   For each DAG in the workflow, we determine the
    machine that will result in earliest completion time
    for that DAG, and assign it to that machine.
   DAG runtime = Sum of job runtimes and file
    transfer times
       File transfer times depend on the cache contents of the machine
   Effectively, each DAG is treated like a single job
    in this phase.
   Schedule after AssignDAGs

       M1   M2   M3   M4
       A    A    A    A
       B    B    B    B
       C    C    C    C
       A         A
       B         B
       C         C

   Jobs in the same DAG are of the same color
   The schedule produced after AssignDAGs entails
    no transfer of intermediate files
     Assignment phase (contd.)

   While DAGs are being assigned, a
    cumulative runtime is maintained for
    each machine
   Once a DAG has been scheduled on a
    machine, we assume that machine
    caches the workflow batch input
    (common to all DAGs)
   Thus, batch input transfer times are
    not included in calculations of the
    runtime of other DAGs on that machine
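
The assignment phase described above can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the authors' planner: the function name `assign_dags`, the data layout, the example numbers, and the tie-breaking rule (the first machine in the list wins) are all hypothetical, and the per-DAG runtime is assumed to already include job runtimes and non-batch transfer times.

```python
def assign_dags(dags, machines, batch_transfer_time):
    """Greedy sketch of the AssignDAGs phase.

    dags: list of (dag_id, runtime) pairs; runtime is the sum of job runtimes
          and file transfer times, excluding the shared batch input.
    machines: list of machine names.
    batch_transfer_time: time to copy the batch input to a machine that does
          not cache it yet.
    """
    load = {m: 0.0 for m in machines}        # cumulative runtime per machine
    has_batch = {m: False for m in machines} # does the machine cache the batch input?
    schedule = {m: [] for m in machines}

    for dag_id, runtime in dags:
        # Completion time of this DAG on each candidate machine.
        finish = {
            m: load[m] + runtime + (0.0 if has_batch[m] else batch_transfer_time)
            for m in machines
        }
        best = min(machines, key=finish.get)  # ties go to the earlier machine
        load[best] = finish[best]
        has_batch[best] = True                # batch input is now cached there
        schedule[best].append(dag_id)
    return schedule, load


schedule, load = assign_dags(
    dags=[(f"dag{i}", 30.0) for i in range(1, 7)],
    machines=["M1", "M2", "M3", "M4"],
    batch_transfer_time=5.0,
)
print(schedule)
print(load)
```

With six identical DAGs and four machines, two machines end up with an extra DAG. Which two depends entirely on the tie-breaking rule assumed here.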
       Parallelization of DAGs
 After the assignment phase, the load on machines is uneven
    There are “extra” DAGs on a few heavily loaded machines
    There are some machines with a much lighter load
  Exploit inter-job parallelism to distribute load
  The “extra” DAGs are examined in turn.
  If two jobs in a DAG can be run in parallel, we
   try to move one of them to a lightly loaded machine
            Parallelization – Costs and benefits
   Cost of parallelization – When you move a job to
    a different machine than its parents and children,
    its input and output files have to be transferred to
    and from that machine.
   Cost = (input_size + output_size) / net_BW
       input_size and output_size are the sizes of the input
        files and output files for the job
       net_BW is the network bandwidth
       Cost is the time taken to perform data transfers to and
        from the different machines
   Benefit = Time saved due to parallel execution of jobs
       Final Schedule

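[Figure: Final schedule on machines M1 to M4, with time running downward. Each machine first runs one complete DAG (A, B, C). The jobs of the two remaining DAGs are then spread across machines, with A on M1 and M3 and B on M2 and M4.]
   In the final schedule, files are transferred from M2 to M1
    and from M4 to M3 (network file transfer of F2)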
        Parallelization (contd.)

   In the formula for the cost of
    parallelization, input_size and output_size
    are adjusted for files already cached on
    either machine
   If a job being considered for
    parallelization has no children,
    output_size is taken as 0 since its output
    files do not need to be transferred back
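
The cost formula and its two adjustments can be sketched as follows. This is an illustration only: the function names, the choice of MB and MB/s as units, and the decision rule "move the job only if the benefit exceeds the cost" are assumptions, since the slides say a cost-benefit analysis is used without giving the exact rule.

```python
def parallelization_cost(input_size, output_size, net_bw,
                         cached_input=0.0, cached_output=0.0,
                         has_children=True):
    """Transfer time incurred by moving a job to a different machine.

    Sizes are in MB and net_bw in MB/s (units assumed for this sketch).
    cached_input / cached_output: volume already cached where it is needed,
    which therefore does not have to be transferred.
    """
    in_mb = max(input_size - cached_input, 0.0)
    # A job without children does not need to send its output back.
    out_mb = max(output_size - cached_output, 0.0) if has_children else 0.0
    return (in_mb + out_mb) / net_bw


def should_move(time_saved, cost):
    """Hypothetical decision rule: parallelize only if benefit exceeds cost."""
    return time_saved > cost


NET_BW = 12.5  # MB/s, i.e. a 100 Mbps network

cost = parallelization_cost(1000.0, 1000.0, NET_BW)
leaf_cost = parallelization_cost(1000.0, 1000.0, NET_BW, has_children=False)
half_cached = parallelization_cost(1000.0, 1000.0, NET_BW, cached_input=500.0)
print(cost, leaf_cost, half_cached, should_move(600.0, cost))
```

All times are in seconds. A job with 1 GB of input and 1 GB of output costs 160 s to move under these assumptions, so moving it pays off only if it saves more than that in execution time.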

   Main feature is a database used to store
       File information – checksums, sizes, file type,
        file locations
       Job information – Files used by jobs, job
       Workflow schedules – Produced by the planner
   The Condor daemons were modified to
    directly connect to the database and
    perform inserts, updates and queries
     Role of database


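[Figure: The database sits between the execute machine (with its file cache) and the submit machine (with user data). Cache info is exchanged with the execute machine; workflow and file info is exchanged with the submit machine.]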
        Implementation – versioning

   Versions of input and executables are
    determined by checksums computed at
    submission time
   The versions of intermediate and output
    files are “derived” from the versions of the
    inputs and executables that produce them
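
One way to picture derived versions is to hash together the versions of everything that produced a file. The sketch below is an assumption-laden illustration: the slides do not say which checksum is used or how versions are combined, so SHA-1, the function names, and the inclusion of the output file name are all hypothetical choices.

```python
import hashlib


def content_version(data):
    """Version of an input or executable: a checksum of its contents."""
    return hashlib.sha1(data).hexdigest()


def derived_version(executable_version, input_versions, output_name):
    """Version of an intermediate or output file.

    Derived from the versions of the executable and inputs that produce it,
    so the file never has to be read to obtain its version.
    """
    h = hashlib.sha1()
    h.update(executable_version.encode())
    for v in sorted(input_versions):  # order-independent (a sketch assumption)
        h.update(v.encode())
    h.update(output_name.encode())
    return h.hexdigest()


exe = content_version(b"contents of the executable")
seq_v1 = content_version(b"input sequence, first version")
seq_v2 = content_version(b"input sequence, second version")

out_v1 = derived_version(exe, [seq_v1], "seq.blast")
out_v1_again = derived_version(exe, [seq_v1], "seq.blast")
out_v2 = derived_version(exe, [seq_v2], "seq.blast")
print(out_v1 == out_v1_again, out_v1 != out_v2)
```

The same executable and inputs always yield the same derived version, while changing any input yields a new one, which is what lets a cached output be matched to the job that would regenerate it.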
           Implementation – Distributed file caching
   Before a job executes on a machine, its input
    files are retrieved
       Files available in the machine's local cache are used
       Unavailable files are retrieved from other machines
        in the cluster. Any machine can serve as a file server
   After a job completes, its executable, input and
    output files are saved in the execute machine's
    disk cache.
   Once a job has completed, the database is
    updated with the new status and cache information
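
The retrieval and update steps above can be sketched with an in-memory dictionary standing in for the file-location table of the database. The function names, the dictionary layout, and the policy for choosing among several remote holders are assumptions for illustration, not the modified Condor daemons' behavior.

```python
def choose_source(file_id, local_machine, locations):
    """Decide where a job's input file should be read from.

    locations: dict mapping file_id to the set of machines caching it.
    Returns the local machine on a cache hit, another caching machine on a
    miss, or None when no machine caches the file.
    """
    holders = locations.get(file_id, set())
    if local_machine in holders:
        return local_machine
    # Deterministic pick among remote holders; the real policy is not given.
    return min(holders) if holders else None


def record_completion(job_files, machine, locations):
    """After a job completes, record its files as cached on that machine."""
    for f in job_files:
        locations.setdefault(f, set()).add(machine)


locations = {"F1": {"M2"}}

before = choose_source("F1", "M1", locations)   # remote fetch from M2
record_completion(["F1", "F3"], "M1", locations)
after = choose_source("F1", "M1", locations)    # now a local cache hit
missing = choose_source("F9", "M1", locations)  # cached nowhere
print(before, after, missing)
```

When the sketch returns None, the caller would have to fall back to wherever the file was originally provided.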
          Implementation – Workflow
   An entire workflow is submitted at one time
   The workflow submission tools directly update
    the database with job and workflow information
   This information includes files used by the
    workflow as well as job dependencies in the workflow
   The planner directly uses the information in the
    database. Thus:
       It has knowledge of job dependencies during planning
       It has knowledge of the locations of the relevant files
        during planning
        Performance testing
   Comparison of three systems
   ORIG – The original Condor system
   DAG-C – Our caching and DAG-oriented
    planning framework
   Job-C
       Same caching mechanism as DAG-C
       No DAG-based planning. When a job is
        ready, it is matched to the machine that
        caches most input
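
The Job-C matching rule can be sketched as below. The slides do not say whether "most input" means file count or data volume; this sketch assumes volume, and the function name and cache layout are hypothetical.

```python
def match_job_c(input_files, idle_machines, cache):
    """Pick the idle machine that caches the largest volume of a job's input.

    input_files: set of file names the job needs.
    cache: dict mapping machine name to a dict of {file name: size in MB}.
    Ties go to the machine listed first (an assumption of this sketch).
    """
    def cached_volume(machine):
        files = cache.get(machine, {})
        return sum(size for name, size in files.items() if name in input_files)

    return max(idle_machines, key=cached_volume)


# Parents A and B ran on different machines, so F1 and F2 are split.
split_cache = {"M1": {"F1": 1000}, "M2": {"F2": 1000}, "M3": {}}
# Both parents ran on M2, so M2 holds all of the child's input.
local_cache = {"M1": {"F1": 1000}, "M2": {"F1": 1000, "F2": 1000}, "M3": {}}

machines = ["M1", "M2", "M3"]
split_choice = match_job_c({"F1", "F2"}, machines, split_cache)
local_choice = match_job_c({"F1", "F2"}, machines, local_cache)
print(split_choice, local_choice)
```

In the split case no machine holds all of the child's input, so one file must cross the network whichever machine is chosen. This is the situation that job-at-a-time matching cannot plan around.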
     Description of setup

   Tested on BLAST and synthetic workflows
    with varying branch-in factor and pipeline volume
   Cluster of 25 execute machines – all machines
    were in the same network
   Two submit machines
   Network bandwidth was 100 Mbps
   No shared file system was used
   All experiments were run with initially clean disk caches
            The BLAST workflow
[Figure: BLAST DAG. seq -> blastall (3.1 MB) -> seq.blast -> javawrap -> seq.csv, seq.bin]
   Batch input (~4 GB): nr_db.psq (986 MB), nr_db.phr, nr.gz and a 23 MB database file
   Pipeline volume: seq.blast (~2 MB)

   BLAST is a sequence alignment workflow. Given a protein
    sequence “seq”, blastall checks a database of known proteins
    for any similarities. Proteins with similar sequences are
    expected to have similar properties. Javawrap converts the
    results into CSV and binary format for later use.
   BLAST results

[Figure: Running time (min) against number of DAGs (25, 50, 75, 100). Series include ORIG and DAG-C.]
     Sensitivity to pipeline volume

   F1, F2, G1 and G2 are
    distinct files
   10 minutes per job
   Varying size per file –
    100 MB, 1 GB, 1.5 GB, 2 GB
   50 DAGs per workflow
      Pipeline I/O results

[Figure: Running time (min) against size per file (100 MB, 1 GB, 1.5 GB, 2 GB). Series include ORIG and Job-C.]
         DAG breadth

   Files Fi, Gi are distinct
   Varied branching
    factor (n) from 3
    to 6
   10 min per job
   Tested a 50 DAG
    workflow with
    1GB per file
        DAG breadth results (1GB)

[Figure: Running time (min) against DAG breadth n (3, 4, 5, 6). Series include ORIG and Job-C.]
     Varying computation time

   Size of each file set to 1GB
   Varied the time per job from 10 to 30
    minutes (i.e. time per DAG from 80 to 240 minutes)
   Tested a 50 DAG workflow
   Increasing computation

[Figure: Running time (min) against running time per job (10, 15, 20, 25, 30 min) for ORIG, Job-C and DAG-C.]
          Results – Summary
   Job-C and DAG-C are better than ORIG
       In ORIG, all file traffic through submit machine
       In Job-C and DAG-C, files can be retrieved from
        multiple locations
   Thus, caching helps
   DAG-C is significantly better than Job-C
    when pipeline volume and branching factor are high
       In Job-C, parent jobs often run on different machines
       Output files have to be transferred to the machine
        where their child executes
   Thus, DAG-oriented planning helps
          Distributed file caching – other
   Scientists frequently reuse files (such as
    executables) – These can be used directly at
    their stored locations.
   Maintaining user data
       'What were the programs run to obtain this
        output?'
       'When did I last use a particular version of a
         Ongoing work
   Planning
       Evaluating planning overhead, dependence on DB
       Make planning scheme more responsive to job
        failure, machine failure
   A cache replacement policy based on an LRFU
    scheme has been implemented, but not
    validated (see paper for details). Ongoing work:
       Validating the cache replacement policy and
        determining the best policy for a workflow
        depending on the user's submission pattern
       Including the time needed for generating a file in
        estimates of its “cache-worthiness”
         Related work
   ZOO, GridDB – data centric workflow
    management systems
   Thain et al. – Pipeline and batch sharing in
    Grid workloads – HPDC 2003
   Romosan et al. – Coscheduling of
    computation and data on computer clusters –
    SSDBM 2005
   Bright et al. – Efficient scheduling and
    execution of scientific workflow tasks –
    SSDBM 2005
Questions?