Data-driven Workflow Planning
in Cluster Management Systems

  Srinath Shankar
  David J DeWitt

  Department of Computer Sciences
  University of Wisconsin-Madison, USA
        Data explosion in science

   Scientific applications – Traditionally
    considered as compute-intensive
   Data explosion in recent years
   Astronomy – hundreds of TB
       Sloan Digital Sky Survey
       LIGO – Laser Interferometer Gravitational-Wave
        Observatory
   Bioinformatics –
       BIRN – Biomedical Informatics Research Network
       SwissProt – Protein database
      Scientific workflows and files
Jobs with dependencies organized in Directed Acyclic Graphs
     Large number of similar DAGs make up a workflow




                A, B, C and D are programs
      File1 and File2 are pipeline (intermediate) files
   FileInput is a batch input file -- common to all DAGs
        Distributed scientific computing

   Scientists have exploited distributed
    computing to run their programs and
    workflows
   One popular distributed computing system
    is Condor
   Condor harvests idle CPU cycles on
    machines in a network
   Condor has been installed on roughly
    113,000 machines across 1,600 clusters
    around the world
           But …
   Several advances have been made since the development of Condor in
    the '80s
   Machines are getting cheaper
       Organizations no longer rely solely on idle desktop machines for
        computing cycles
       The proportion of machines dedicated to Condor computing in a
        cluster is increasing
   Disk capacities are increasing
       A single machine may have 500 GB of disk space
       Thus, desktop machines may also have a lot of free disk space
   Dedicated and desktop machines have unused disk space
   A modest cluster of 1000 such machines adds up to half a petabyte of
    disk space
        Focus

   The volume of data processed by
    scientific applications is increasing.
   How can we leverage distributed disk
    space to improve data management in
    cluster computing systems (like Condor) ?
   Step 1: Store workflow data across the
    disks of machines in a cluster
   Step 2: Schedule workflows based on
    data location – Exploit disk space to
    improve workflow execution times
      Overview of Condor
  [Figure: Overview of Condor. The submit machine sends job info and the
   execute machine sends machine info to the planner (control flow). User
   input data flows from the submit machine to the user process on the
   execute machine, and output data flows back to the user (data flow).]
         Job and workflow submission
   To submit a job, the user provides a “submit”
    file containing
       Complete job description – The input, output and
        error files, when to transfer these files etc.
       Machine preferences like OS, CPU speed and
        memory
   Workflows are managed in a separate layer
       The user specifies dependencies between jobs in a
        separate “DAG description” file
       A DAG manager process (DAGMan) on the submit
        machine continuously monitors job completion
        events
       This process submits a job only when all its parents
        have completed
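
A rough Python sketch of the DAGMan release rule just described (illustrative only, not Condor source; the three-job DAG here is hypothetical):

```python
# Minimal sketch of DAGMan-style job release (illustrative, not Condor source).
# "dag" maps each job to the set of parent jobs it depends on.
dag = {
    "A": set(),          # A has no parents
    "B": set(),          # B has no parents
    "C": {"A", "B"},     # C may be submitted only after A and B complete
}

completed = set()

def ready_jobs():
    """Jobs whose parents have all completed and which have not yet run."""
    return [j for j, parents in dag.items()
            if j not in completed and parents <= completed]

# Event loop: submit ready jobs, wait for their completion events, repeat.
while len(completed) < len(dag):
    for job in ready_jobs():
        print(f"submitting {job}")
        completed.add(job)   # stand-in for waiting on a completion event
```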
          Limitations of Condor

   The “source” of files in Condor is the
    submit machine, or perhaps a shared or
    third-party file system
       Inefficient handling of files during workflow
        execution
       Files always transferred to and from the
        submit machine
   The planner only handles single jobs
       It has no direct knowledge of job
        dependencies.
       It only sees a job after DAGMan submits it.
     Distributed file caching
   Keep the files of a job on the disk of
    machines after execution
   Utilize local disks on execute machines
    as sources of files
   Schedule dependent jobs on same
    machine whenever feasible
   Avoid network file transfer
   Reduce overall workflow execution
    time
      Disk-aware planning

   Goal – reduce workflow execution time
    by minimizing file transfers
   Planner must be aware of the locations
    of cached files
   Requires a planner that is also aware
    of workflow structure
         Two-phase planning algorithm
   AssignDAGs: Each DAG in a workflow is
    tentatively assigned to the best
    machine based on disk cache contents
   But assigning whole DAGs ignores
    inter-job parallelism
   Parallelize: Exploit parallelism in a DAG
    to distribute load
       Cost-benefit analysis used when
        scheduling dependent jobs on different
        machines
      Planning example

   Suppose we have 4 machines available to run the workflow shown below.

   [Figure: Sample DAG – jobs A and B produce pipeline files F1 and F2,
    which are consumed by job C. Sample workflow – six such DAGs.]
         Assignment of DAGs

   For each DAG in the workflow, we determine the
    machine that will result in earliest completion time
    for that DAG, and assign it to that machine.
   DAG runtime = Sum of job runtimes and file
    transfer times
       File transfer times depend on the cache contents of the
        machine
   Effectively, each DAG is treated like a single job
    in this phase.
   Schedule after AssignDAGs

  [Figure: Schedule on machines M1–M4 after AssignDAGs (time runs downward);
   jobs in the same DAG are shown in the same color. This schedule entails no
   transfer of intermediate files.]
     Assignment phase (contd.)

   While DAGs are being assigned, a
    cumulative runtime is maintained for
    each machine
   Once a DAG has been scheduled on a
    machine, we assume that machine
    caches the workflow batch input
    (common to all DAGs)
   Thus, batch input transfer times are
    not included in calculations of the
    runtime of other DAGs on that
    machine
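
A simplified Python sketch of the bookkeeping described on the last two slides (illustrative, not the actual planner code; treating each DAG as a single aggregate runtime, the uniform bandwidth, and all numbers in the example are assumptions):

```python
# Simplified sketch of the AssignDAGs phase. Assumes job runtimes and the
# batch-input size are known and that network bandwidth is uniform.
NET_BW = 100e6 / 8                               # assumed 100 Mbps link, in bytes/sec

def assign_dags(dags, machines, batch_input_size):
    load = {m: 0.0 for m in machines}            # cumulative runtime per machine
    has_batch_input = {m: False for m in machines}
    assignment = {}

    for name, job_runtime in dags:               # each DAG treated as one aggregate job
        best_machine, best_finish = None, float("inf")
        for m in machines:
            # The batch input is transferred only the first time a DAG lands
            # on this machine; afterwards it is assumed to be cached there.
            transfer = 0.0 if has_batch_input[m] else batch_input_size / NET_BW
            finish = load[m] + job_runtime + transfer
            if finish < best_finish:
                best_machine, best_finish = m, finish
        assignment[name] = best_machine
        load[best_machine] = best_finish
        has_batch_input[best_machine] = True
    return assignment

# Example: 6 identical DAGs (30 min of work each), 4 machines, 4 GB batch input.
print(assign_dags([(f"dag{i}", 1800) for i in range(6)],
                  ["M1", "M2", "M3", "M4"], 4e9))
```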
       Parallelization of DAGs
 After the assignment phase, the load on
  machines is uneven
   There are “extra” DAGs on a few heavily loaded
    machines
   There are some machines with a much lighter load
 Exploit inter-job parallelism to distribute load
 The “extra” DAGs are examined in turn.
 If two jobs in a DAG can be run in parallel, we
  try to move one of them to a lightly loaded
  machine.
            Parallelization – Costs and benefits
   Cost of parallelization – When you move a job to
    a different machine than its parents and children,
    its input and output files have to be transferred to
    and from that machine.
   Cost = (input_size + output_size) / net_BW
       input_size and output_size are the sizes of the input
        files and output files for the job
       net_BW is the network bandwidth
       Cost is the time taken to perform data transfers to and
        from the different machines
   Benefit = Time saved due to parallel execution of
    jobs
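
A worked illustration of this cost-benefit rule as a small Python sketch (the 100 Mbps bandwidth matches the experimental setup described later; the example sizes and time savings are made up):

```python
# Illustrative cost-benefit check for moving one job of a DAG to a different
# machine (a sketch of the formula above, not the actual planner code).
NET_BW = 100e6 / 8                    # assumed 100 Mbps network, in bytes/sec

def worth_parallelizing(input_size, output_size, time_saved):
    """Move the job only if the time saved by parallel execution exceeds
    the time spent shipping its input and output files over the network."""
    cost = (input_size + output_size) / NET_BW
    return time_saved > cost

# Shipping 1 GB of input and 1 GB of output costs ~160 s, so the move pays
# off only if parallel execution saves more than that.
print(worth_parallelizing(1e9, 1e9, time_saved=600))   # True
print(worth_parallelizing(1e9, 1e9, time_saved=100))   # False
```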
       Final Schedule

  [Figure: Final schedule on machines M1–M4 (time runs downward). Pipeline
   file F2 is transferred over the network from M2 to M1 and from M4 to M3.]
        Parallelization (contd.)

   In the formula for the cost of
    parallelization, input_size and output_size
    are adjusted for files already cached on
    either machine
   If a job being considered for
    parallelization has no children,
    output_size is taken as 0 since its output
    files do not need to be transferred back
          Implementation

   The main feature is a database used to store
       File information – checksums, sizes, file type,
        file locations
       Job information – Files used by jobs, job
        dependencies
       Workflow schedules – Produced by the
        planner
   The Condor daemons were modified to
    directly connect to the database and
     perform inserts, updates and queries
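
The actual schema is not given in the slides; the following sqlite3 sketch merely suggests the kind of tables such a database might hold (every table and column name here is invented):

```python
import sqlite3

# Hypothetical schema sketch for the planner database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    checksum   TEXT PRIMARY KEY,               -- version identifier
    name       TEXT,
    size_bytes INTEGER,
    file_type  TEXT                            -- batch input, pipeline, output, executable
);
CREATE TABLE file_locations (
    checksum   TEXT REFERENCES files(checksum),
    machine    TEXT                            -- machine whose disk cache holds the file
);
CREATE TABLE jobs (
    job_id      INTEGER PRIMARY KEY,
    workflow_id INTEGER
);
CREATE TABLE job_files (
    job_id   INTEGER REFERENCES jobs(job_id),
    checksum TEXT REFERENCES files(checksum)   -- files used by the job
);
CREATE TABLE job_dependencies (
    job_id    INTEGER REFERENCES jobs(job_id),
    parent_id INTEGER REFERENCES jobs(job_id)
);
CREATE TABLE schedules (
    job_id  INTEGER REFERENCES jobs(job_id),
    machine TEXT                               -- placement chosen by the planner
);
""")
conn.commit()
```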
     Role of database

  [Figure: Role of the database. The submit machine inserts workflow and
   file information into the database; the planner reads this information
   and writes workflow schedules back; execute machines exchange disk-cache
   information with the database. User data resides on the submit machine,
   the file cache on the execute machines.]
        Implementation – versioning

   Versions of input and executables are
    determined by checksums computed at
    submission time
   The versions of intermediate and output
    files are “derived” from the versions of the
    inputs and executables that produce them
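
A small Python sketch of this versioning scheme (illustrative only; the slides do not name the checksum algorithm, so MD5 stands in here):

```python
import hashlib

# Sketch of the versioning scheme above (illustrative, not the actual code).
def file_version(path):
    """Version of an input file or executable = checksum of its contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def derived_version(executable_version, input_versions):
    """Version of an intermediate or output file, derived from the versions
    of the executable and inputs that produce it."""
    h = hashlib.md5(executable_version.encode())
    for v in sorted(input_versions):
        h.update(v.encode())
    return h.hexdigest()
```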
           Implementation – Distributed
           Storage
   Before a job executes on a machine, its input
    files are retrieved
        Files available in the machine's local cache are used
        directly
       Unavailable files are retrieved from other machines
        in the cluster. Any machine can serve as a file server
   After a job completes, its executable, input and
     output files are saved in the execute machine's
    disk cache.
   Once a job has completed, the database is
    updated with the new status and cache
    information.
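
A Python sketch of this stage-in / stage-out behaviour (illustrative; CACHE_DIR, fetch_from_peer and update_database are hypothetical stand-ins for the real mechanisms):

```python
import os
import shutil
import socket

# Sketch of the stage-in / stage-out logic described above (illustrative).
CACHE_DIR = "/var/condor/cache"        # assumed location of the local disk cache

def stage_in(job_inputs, fetch_from_peer):
    """Place each input file in the job sandbox, preferring the local cache."""
    for checksum, name in job_inputs:
        cached = os.path.join(CACHE_DIR, checksum)
        if os.path.exists(cached):
            shutil.copy(cached, name)          # cache hit: use the local copy
        else:
            fetch_from_peer(checksum, name)    # cache miss: pull from another machine
            shutil.copy(name, cached)          # and keep a copy in the cache

def stage_out(job_outputs, update_database):
    """After the job completes, cache its outputs and record their location."""
    for checksum, name in job_outputs:
        shutil.copy(name, os.path.join(CACHE_DIR, checksum))
        update_database(checksum, machine=socket.gethostname())
```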
          Implementation – Workflow
          submission
   An entire workflow is submitted at one time
   The workflow submission tools directly update
    the database with job and workflow information
   This information includes files used by the
    workflow as well as job dependencies in the
    workflow
   The planner directly uses the information in the
    database. Thus
       It has knowledge of job dependencies during planning
       It has knowledge of the locations of the relevant files
        during planning
        Performance testing
   Comparison of three systems
   ORIG – The original Condor system
   DAG-C – Our caching and DAG-
    oriented planning framework
   Job-C
       Same caching mechanism as DAG-C
       No DAG-based planning. When a job is
        ready, it is matched to the machine that
        caches most input
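
A one-function Python sketch of the Job-C matching rule (illustrative; the machine names, file names and sizes in the example are made up):

```python
# Sketch of Job-C matching: when a job becomes ready, send it to the machine
# whose disk cache holds the most of the job's input (illustrative only).
def match_job(job_inputs, cache_contents):
    """job_inputs: list of (file, size); cache_contents: machine -> set of cached files."""
    def cached_bytes(machine):
        return sum(size for f, size in job_inputs if f in cache_contents[machine])
    return max(cache_contents, key=cached_bytes)

# Example with two machines and a 986 MB + 2 MB input set.
caches = {"M1": {"nr_db.psq"}, "M2": {"seq.blast"}}
print(match_job([("nr_db.psq", 986_000_000), ("seq.blast", 2_000_000)], caches))  # M1
```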
     Description of setup

   Tested on BLAST and synthetic workflows
    with varying branching factor and pipeline
    volume
   Cluster of 25 execute machines, all on
    the same network
   Two submit machines
   Network bandwidth was 100 Mbps
   No shared file system was used
   All experiments run with initially clean disk
    caches
            The BLAST workflow
  [Figure: The BLAST DAG. Batch input (~4 GB): the nr protein database
   (nr_db.psq, 986 MB; nr_db.pin, 23 MB; nr_db.phr; nr.gz). The blastall
   executable (3.1 MB) reads a sequence file seq and the database and
   produces seq.blast; the java-wrap executable (1 KB) converts seq.blast
   into seq.csv and seq.bin. Pipeline volume: seq.blast (~2 MB).]
  BLAST is a sequence alignment workflow. Given a protein
sequence “seq”, blastall checks a database of known proteins
   for any similarities. Proteins with similar sequences are
 expected to have similar properties. Javawrap converts the
       results into CSV and binary format for later use.
   BLAST results

  [Chart: Workflow running time (min) vs. number of DAGs (25, 50, 75, 100)
   for ORIG, Job-C and DAG-C.]
     Sensitivity to pipeline volume

   F1, F2, G1 and G2 are
    distinct files
   10 minutes per job
   Varying size per file –
    100MB, 1GB, 1.5 GB,
    2GB
   50 DAGs per workflow
      Pipeline I/O results

  [Chart: Workflow running time (min) vs. size per file (100 MB, 1 GB, 1.5 GB,
   2 GB) for ORIG, Job-C and DAG-C.]
         DAG breadth

   Files Fi and Gi are
    distinct
   Varied branching
    factor (n) from 3
    to 6
   10 min per job
   Tested a 50 DAG
    workflow with
    1GB per file
        DAG breadth results (1GB)

  [Chart: Workflow running time (min) vs. DAG breadth n (3 to 6) for ORIG,
   Job-C and DAG-C.]
     Varying computation time




   Size of each file set to 1GB
   Varied the time per job from 10 to 30
    minutes. (i.e. time per DAG from 80 to 240
    min)
   Tested a 50 DAG workflow
   Increasing computation

  [Chart: Workflow running time (min) vs. running time per job (10 to 30 min)
   for ORIG, Job-C and DAG-C.]
          Results – Summary
   Job-C and DAG-C are better than ORIG
       In ORIG, all file traffic goes through the submit machine
       In Job-C and DAG-C, files can be retrieved from
        multiple locations
   Thus, caching helps
   DAG-C is significantly better than Job-C
    when pipeline volume and branching factor are
    high
       In Job-C, the parents of a job often run on different
        machines
       Output files have to be transferred to the machine
        where their child executes
   Thus, DAG-oriented planning helps
          Distributed file caching – other
          benefits
   Scientists frequently reuse files (such as
    executables) – These can be used directly at
    their stored locations.
   Maintaining user data
       "What were the programs run to obtain this
        output?"
       "When did I last use a particular version of a
        file?"
         Ongoing work
   Planning
       Evaluating planning overhead, dependence on DB
        size
       Making the planning scheme more responsive to job
        failure and machine failure
   A cache replacement policy based on an LRFU
    scheme has been implemented, but not
    validated (See paper for details). Ongoing work
    includes
       Validating the cache replacement policy and
        determining the best policy for a workflow
         depending on the user's submission pattern
       Including the time needed for generating a file in
        estimates of its “cache-worthiness”
         Related work
   ZOO, GridDB – data centric workflow
    management systems
   Thain et al. – Pipeline and batch sharing in
    Grid workloads – HPDC 2003
   Romosan et al. – Coscheduling of
    computation and data on computer clusters –
    SSDBM 2005
   Bright et al. – Efficient scheduling and
    execution of scientific workflow tasks –
    SSDBM 2005
Questions?

				