DFRN: A New Approach for Duplication Based Scheduling for Distributed Memory Multiprocessor Systems*

Gyung-Leen Park, Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, Texas 76019-0015
Behrooz Shirazi, Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, Texas 76019-0015
Jeff Marquis, Parallel Technologies, Inc., 2000 North Plano Road, Richardson, Texas 75082

* This work has in part been supported by grants from NSF (CDA-9531535 and MIPS-9622593) and the State of Texas ATP 003656-087.

Abstract

   Duplication Based Scheduling (DBS) is a relatively new approach for solving multiprocessor scheduling problems. The problem is defined as finding an optimal schedule which minimizes the parallel execution time of an application on a target system. In this paper, we classify DBS algorithms into two categories according to the task duplication method used. We then present a new DBS algorithm that extracts the strong features of the two categories of DBS algorithms. Our simulation study shows that the proposed algorithm achieves considerable performance improvement over existing DBS algorithms of equal or lower time complexity. We analytically obtain the boundary condition for the worst case behavior of the proposed algorithm and also prove that the algorithm generates an optimal schedule for a tree-structured input directed acyclic graph.

1. Introduction

   Efficient scheduling of parallel programs, represented as a Directed Acyclic Graph (DAG), onto the processing elements of parallel and distributed computer systems is an extremely difficult and important problem [1-7]. The goals of the scheduling process are to efficiently utilize resources and to achieve the performance objectives of the application (e.g., to minimize the program's parallel execution time). Since it has been shown that the multiprocessor scheduling problem is NP-complete, many researchers have proposed scheduling algorithms based on heuristics. The scheduling algorithms can be classified into two general categories: algorithms that employ task duplication and algorithms that do not.

   Most of the non-duplication scheduling methods are based on the list scheduling algorithm [8] since they maintain a list of nodes according to their priorities. A list scheduling algorithm repeatedly carries out the following steps: (1) Tasks ready to be assigned to a processor are put onto a priority queue; tasks are assigned to processors based on some priority criteria, and a task becomes ready for assignment when all of its parents are scheduled. (2) Select a "suitable Processing Element (PE)" for assignment; typically, a suitable PE is one that can execute the task the earliest. (3) Assign the task at the head of the priority queue to this PE.

   Duplication Based Scheduling (DBS) is a relatively new approach to the scheduling problem [5, 9-14]. DBS algorithms are capable of reducing communication overhead by duplicating remote parent tasks on local processing elements. Like the non-duplication version of the problem, scheduling with duplication has been shown to be NP-complete [15]. Thus, many of the proposed DBS algorithms are based on heuristics. This paper classifies DBS algorithms into two categories according to the task duplication approach used: Scheduling with Partial Duplication (SPD) and Scheduling with Full Duplication (SFD).

   SPD algorithms do not duplicate the parents of a join node unless a parent is critical. A join node is defined as a node with an in-degree greater than one (i.e., a node with more than one incoming edge). Instead, they try to find the critical iparent, which is defined later in this paper as the immediate parent which gives the largest start time to the join node. The join node is scheduled on the processor where the critical iparent has been scheduled. Because of the limited task duplication, algorithms in this category have a low complexity but may not be appropriate for systems with high communication overhead.
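The generic three-step list-scheduling loop described above can be sketched as follows. This is a minimal illustration, not an algorithm from the paper: the task dictionary, the priority field, and the tie-breaking rule are assumptions, and interprocessor communication costs are ignored for brevity.

```python
import heapq

def list_schedule(tasks, num_pes):
    # tasks: dict id -> {"cost": int, "parents": set of ids, "priority": int}
    pe_free = [0] * num_pes        # time at which each PE becomes free
    finish = {}                    # task id -> completion time
    queued, done = set(), set()
    ready = []                     # (1) priority queue of ready tasks
    for t, info in tasks.items():  # entry tasks are ready immediately
        if not info["parents"]:
            heapq.heappush(ready, (-info["priority"], t))
            queued.add(t)
    schedule = []
    while ready:
        _, t = heapq.heappop(ready)  # task at the head of the queue
        earliest = max((finish[p] for p in tasks[t]["parents"]), default=0)
        # (2) a "suitable PE" is one that can start the task the earliest
        pe = min(range(num_pes), key=lambda k: max(pe_free[k], earliest))
        start = max(pe_free[pe], earliest)
        finish[t] = start + tasks[t]["cost"]
        pe_free[pe] = finish[t]
        schedule.append((t, pe, start, finish[t]))  # (3) assign the task
        done.add(t)
        # a task becomes ready once all of its parents are scheduled
        for u, info in tasks.items():
            if u not in queued and info["parents"] <= done:
                heapq.heappush(ready, (-info["priority"], u))
                queued.add(u)
    return schedule
```

Each entry of the returned schedule is a (task, PE, start, finish) tuple.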
They typically provide good schedules for an input DAG whose computation costs are strictly larger than its communication costs. CPM [12], SDBS [13], and FSS [18] belong to this category.

   SFD algorithms attempt to duplicate all the parents of a join node and apply the task duplication algorithm to all the processors that hold any of the parents of the join node. Thus, algorithms in this category have a higher complexity but typically show better performance than SPD algorithms. DSH [14], BTDH [11], LCTD [5, 10], and CPFD [9] belong to this category.

   A trade-off exists between algorithms in these two categories: performance (better application parallel execution time) versus time complexity (longer time to carry out the scheduling algorithm itself). This paper proposes a new DBS algorithm that attempts to achieve the performance of SFD algorithms with a time complexity approaching that of SPD algorithms. The proposed algorithm, called Duplication First and Reduction Next (DFRN), duplicates the parents of any join node as done in SFD algorithms but with reduced computational complexity.

   Our simulation study shows that the proposed algorithm achieves considerable performance improvement over existing algorithms of equal or lower time complexity while obtaining comparable performance to algorithms which have higher time complexities. It is also shown that the performance improvement grows as the Communication to Computation Ratio increases. This paper analytically obtains a boundary condition for the worst case performance of the proposed algorithm and also proves that the algorithm provides an optimal schedule for a tree-structured input DAG.

   The remainder of this paper is organized as follows. Section 2 presents the system model and the problem definition. Section 3 briefly covers the existing algorithms. The proposed DBS algorithm is presented in Section 4, which also contains the worst case and optimality analyses. The performance of the proposed algorithm is compared with that of the existing algorithms in Section 5. Finally, Section 6 concludes this paper.

2. System model and problem definition

   A parallel program is usually represented by a Directed Acyclic Graph (DAG), also called a task graph. As defined in [13], a DAG consists of a tuple (V, E, T, C), where V, E, T, and C are the set of task nodes, the set of communication edges, the set of computation costs associated with the task nodes, and the set of communication costs associated with the edges, respectively. T(Vi) is the computation cost of task Vi, and C(Vi, Vj) is the communication cost of the edge E(Vi, Vj) which connects tasks Vi and Vj. The edge E(Vi, Vj) represents the precedence constraint between the nodes Vi and Vj; in other words, task Vj can start its execution only after the output of Vi is available to Vj. When the two tasks Vi and Vj are assigned to the same processor, C(Vi, Vj) is assumed to be zero since intra-processor communication cost is negligible compared with interprocessor communication cost. The weights associated with nodes and edges are obtained by estimation [16].

   This paper defines two relations for precedence constraints. The Vi ⇒ Vj relation indicates the strong precedence relation between Vi and Vj: Vi is an immediate parent of Vj and Vj is an immediate child of Vi. The terms iparent and ichild are used to represent immediate parent and immediate child, respectively. The Vi → Vj relation indicates the weak precedence relation between Vi and Vj: Vi is a parent of Vj but not necessarily an immediate one. Vi → Vj and Vj → Vk imply Vi → Vk. Vi ⇒ Vj and Vj ⇒ Vk do not imply Vi ⇒ Vk, but do imply Vi → Vk. The relation → is transitive, and the relation ⇒ is not. A node without any parent is called an entry node and a node without any child is called an exit node.

   Graphically, a node is represented as a circle with a dividing line in the middle. The number in the upper portion of the circle is the node ID number and the number in the lower portion is the computation cost of the node. For example, for the sample DAG in Figure 1, the entry node is V1, which has a computation cost of 10. In the graph representation of a DAG, the communication cost of each edge is written on the edge itself. For each node, the incoming degree is the number of input edges and the outgoing degree is the number of output edges. For example, in Figure 1, the incoming and outgoing degrees of node V5 are 3 and 1, respectively. A few terms are defined here for a clearer presentation.

Definition 1: A node is called a fork node if its outgoing degree is greater than 1.

Definition 2: A node is called a join node if its incoming degree is greater than 1.

   Note that fork node and join node are not exclusive terms: one node can be both a fork and a join node, i.e., both of the node's incoming and outgoing degrees are greater than one. Similarly, a node can be neither a fork nor a join node, i.e., both of the node's incoming and outgoing degrees are one.
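The (V, E, T, C) model and the fork/join classification of Definitions 1 and 2 map directly onto a small data structure. The following sketch is illustrative; the dictionary-based representation and the class interface are assumptions, not from the paper.

```python
class DAG:
    """Task graph (V, E, T, C): T maps each node to its computation
    cost; C maps each edge (Vi, Vj) to its communication cost."""
    def __init__(self, comp_cost, comm_cost):
        self.T = dict(comp_cost)
        self.C = dict(comm_cost)
        self.children = {v: set() for v in self.T}
        self.parents = {v: set() for v in self.T}
        for (vi, vj) in self.C:
            self.children[vi].add(vj)   # Vi => Vj: strong precedence
            self.parents[vj].add(vi)

    def comm(self, vi, vj, same_processor):
        # intra-processor communication is assumed to cost zero
        return 0 if same_processor else self.C[(vi, vj)]

    def is_fork(self, v):   # Definition 1: outgoing degree > 1
        return len(self.children[v]) > 1

    def is_join(self, v):   # Definition 2: incoming degree > 1
        return len(self.parents[v]) > 1
```

Note that the two predicates are independent, matching the observation that a node can be both a fork and a join node, or neither.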
In the task graph of Figure 1, nodes V1, V2, V3, and V4 are fork nodes while nodes V5, V6, V7, and V8 are join nodes.

Definition 3: The Earliest Start Time, EST(Vi, Pk), and the Earliest Completion Time, ECT(Vi, Pk), are the times at which task Vi starts and finishes its execution on processor Pk, respectively.

Definition 4: The message arriving time (MAT) from Vi to Vj, or MAT(Vi, Vj), is the time at which the message from Vi arrives at Vj. If Vi and Vj are scheduled on the same processor Pk, MAT(Vi, Vj) becomes ECT(Vi, Pk). Otherwise, MAT(Vi, Vj) = ECT(Vi, Pk) + C(Vi, Vj).

Definition 5: An iparent of a join node is called its critical iparent if it provides the largest MAT to the join node. The critical iparent is denoted as Vi = CIP(Vj) if Vi is the critical iparent of Vj. More formally, Vi = CIP(Vj) if and only if MAT(Vi, Vj) > MAT(Vk, Vj) for all Vk, Vk ⇒ Vj, Vi ⇒ Vj, i ≠ k. If more than one iparent provides the same largest MAT, the CIP is chosen arbitrarily.

Definition 6: An iparent of a join node is called the decisive iparent of the join node if it provides the second largest MAT to the join node. The decisive iparent is denoted as Vi = DIP(Vj) if Vi is the decisive iparent of Vj. Formally, Vi = DIP(Vj) if and only if MAT(Vi, Vj) > MAT(Vk, Vj) for all Vk, Vk ⇒ Vj, Vk ≠ CIP(Vj), Vi ⇒ Vj, i ≠ k. If more than one iparent provides the same second largest MAT, the DIP is chosen arbitrarily. EST(Vj, Pc) becomes Max(ECT(CIP(Vj), Pc), MAT(DIP(Vj), Vj)) if Vj is scheduled without any task duplication on the processor Pc where CIP(Vj) has been scheduled.

Definition 7: The processor which holds the critical iparent of Vi is called the critical processor of Vi.

Definition 8: The critical path of a task graph is the path from an entry node to an exit node which has the largest sum of the computation and communication costs of the nodes and edges on the path. The Critical Path Including Communication cost (CPIC) is the length of the critical path including the communication costs in the path, while the Critical Path Excluding Communication cost (CPEC) is the length of the critical path excluding the communication costs in the path. For example, the critical path of the sample graph in Figure 1 consists of nodes V1, V4, V7, and V8. Then CPIC is T(V1) + C(V1, V4) + T(V4) + C(V4, V7) + T(V7) + C(V7, V8) + T(V8), which is 400, and CPEC is T(V1) + T(V4) + T(V7) + T(V8), which is 150.

Definition 9: The level of a node is recursively defined as follows. Let Lv(Vi) be the level of Vi. The level of an entry node V0 is zero: Lv(V0) = 0. For a non-join node Vj, Lv(Vj) = Lv(Vi) + 1, Vi ⇒ Vj. For a join node Vj, Lv(Vj) = Max(Lv(Vi)) + 1, Vi ⇒ Vj. For example, the levels of nodes V1, V2, V5, and V8 are 0, 1, 2, and 3, respectively. Even if we assume that there is an edge from node V1 to V5, the level of node V5 is still 2, not 1, since Lv(V5) = Max(Lv(Vi)) + 1, Vi ⇒ V5, for the join node V5.

[Figure 1. Sample DAG]

   As in existing DBS algorithms, the number of processors is assumed to be unbounded. The topology of the target system is also assumed to be a complete graph; i.e., all processors can directly communicate with each other. Thus, the multiprocessor scheduling process becomes a mapping of the task nodes in the input DAG to the processors in the target system with the goal of minimizing the execution time of the entire program. The execution time of the entire program after scheduling is called the parallel time, to distinguish it from the completion time of an individual task node.
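Definitions 5, 6, and 9 can be turned into short helper routines. In this sketch, `mat(vi, vj)` stands in for MAT(Vi, Vj) over the current partial schedule and is an assumed helper, as is the precomputed topological order of the nodes.

```python
def levels(parents, topo_order):
    """Definition 9: Lv(entry) = 0; otherwise Lv(Vj) is one more than
    the maximum level of its iparents (for a non-join node this is
    simply Lv(iparent) + 1, since it has a single iparent)."""
    lv = {}
    for v in topo_order:            # parents appear before children
        ps = parents[v]
        lv[v] = 0 if not ps else max(lv[p] for p in ps) + 1
    return lv

def cip_and_dip(join_node, parents, mat):
    """Definitions 5 and 6: the iparents providing the largest (CIP)
    and the second largest (DIP) MAT; ties are broken arbitrarily."""
    ranked = sorted(parents[join_node],
                    key=lambda p: mat(p, join_node), reverse=True)
    cip = ranked[0]
    dip = ranked[1] if len(ranked) > 1 else None
    return cip, dip
```

With these, the EST of a join node scheduled without duplication on the critical processor follows Definition 6: the maximum of the CIP's completion time and the DIP's message arriving time.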
3. Related work

   This section briefly covers several typical scheduling algorithms belonging to each category. They are used later in this paper for performance comparison.

3.1 Heavy Node First (HNF) algorithm

   The HNF algorithm [1] assigns the nodes in a DAG to processors level by level. At each level, the scheduler selects the eligible nodes for scheduling in descending order of computational weight, with the heaviest node (i.e., the node which has the largest computation cost) selected first. A node is selected arbitrarily if multiple nodes at the same level have the same computation cost. The selected node is assigned to the processor which gives the earliest start time to the node.

3.2 Linear Clustering (LC) algorithm

   The LC algorithm [17] is a traditional critical-path-based clustering method. The scheduler identifies the critical path, removes the nodes on the path from the DAG, and assigns them to a linear cluster. The process is repeated until no task node remains in the DAG. Each cluster is then scheduled onto a processor.

3.3 Fast and Scalable Scheduling (FSS) algorithm

   Like the SDBS [13] algorithm, the FSS [18] algorithm first calculates the start time and the completion time of each node by traversing the input DAG. The algorithm then generates clusters by performing a depth-first search starting from the exit node. During the task assignment process, only the critical tasks which are essential to establish a path from a particular node to the entry node are duplicated. The algorithm has a small complexity because of this limited duplication. If the number of processors available is less than the number needed, the algorithm executes a processor reduction procedure. In this paper, an unbounded number of processors is used for FSS for performance comparison.

3.4 Critical Path Fast Duplication (CPFD) algorithm

   The CPFD algorithm [9] classifies the nodes in a DAG into three categories: Critical Path Node (CPN), In-Branch Node (IBN), and Out-Branch Node (OBN). A CPN is a node on the critical path. An IBN is a node from which there is a path to a CPN. An OBN is a node which is neither a CPN nor an IBN. CPFD tries to schedule CPNs first; if there is any unscheduled IBN for a CPN, CPFD traces the IBN and schedules it first. OBNs are scheduled after all the IBNs and CPNs have been scheduled. The motivation behind CPFD is that the parallel time is likely to be reduced by scheduling CPNs first. Performance comparisons show that CPFD outperforms DSH and BTDH in most cases.

3.5 Comparison

   We have classified the existing DBS algorithms into two categories: SPD (Scheduling with Partial Duplication) and SFD (Scheduling with Full Duplication). Both the SPD and SFD approaches duplicate a fork node if the ichild of the fork node is not a join node. On the other hand, the SPD approach does not duplicate any parent of a join node except the CIP, while the SFD approach tries to duplicate all the parents. Naturally, there exists a trade-off between the two approaches to duplication based scheduling: better performance (smaller parallel time of the application, typically achieved by SFD algorithms) versus better running time (less time to carry out the scheduling process itself, typically achieved by SPD algorithms). Our goal is to introduce a new task duplication scheduling algorithm with a performance better than, but a running time comparable to, the SPD algorithms. Table I summarizes the time complexity of these algorithms and indicates the class each belongs to (i.e., whether it is an SPD or an SFD algorithm). Note that, for a DAG with V nodes, all the SFD algorithms have a complexity of O(V^4) while the SPD algorithms have a complexity of O(V^2).

Table I. Comparison of scheduling algorithms

    Scheduler    Classification     Complexity
    HNF          List Scheduling    O(V log V)
    LC           Clustering         O(V^3)
    DSH          SFD                O(V^4)
    BTDH         SFD                O(V^4)
    CPM          SPD                O(V^2)
    SDBS         SPD                O(V^2)
    FSS          SPD                O(V^2)
    LCTD         SFD                O(V^4)
    CPFD         SFD                O(V^4)

   As an illustration, Figure 2 presents the schedules obtained by each algorithm for the sample DAG of Figure 1. In this example, Pi represents processing element i; PT is the Parallel Time of the DAG; and [EST(Vi, Pk), i, ECT(Vi, Pk)] gives the earliest starting time and earliest completion time of task i.

P1: [0, 1, 10][10, 4, 70][190, 7, 260][260, 8, 270]
P2: [60, 3, 90][170, 6, 230]
P3: [60, 2, 80][160, 5, 210]
             (a) Schedule by HNF (PT = 270)

P1: [0, 1, 10][10, 4, 70][140, 7, 210][210, 8, 220]
P2: [0, 1, 10][10, 3, 40]
P3: [0, 1, 10][10, 2, 30]
P4: [0, 1, 10][10, 4, 70][100, 6, 160]
P5: [0, 1, 10][10, 4, 70][110, 5, 160]
             (b) Schedule by FSS (PT = 220)
P1: [0, 1, 10][10, 4, 70][190, 7, 260][260, 8, 270]
P2: [60, 3, 90][120, 5, 170]
P3: [60, 2, 80][170, 6, 230]
             (c) Schedule by LC (PT = 270)

P1: [0, 1, 10][10, 4, 70][70, 3, 100][110, 7, 180][180, 8, 190]
P2: [0, 1, 10][10, 3, 40]
P3: [0, 1, 10][10, 2, 30]
P4: [0, 1, 10][10, 4, 70][70, 3, 100][100, 6, 160]
P5: [0, 1, 10][10, 4, 70][70, 3, 100][100, 5, 150]
             (d) Schedule by DFRN (PT = 190)

P1: [0, 1, 10][10, 4, 70][70, 3, 100][100, 5, 150]
P2: [0, 1, 10][10, 3, 40][40, 4, 100][110, 7, 180][180, 8, 190]
P3: [0, 1, 10][10, 2, 30][30, 4, 90][100, 6, 160]
             (e) Schedule by CPFD (PT = 190)

Figure 2. Schedules by various schedulers

4. The proposed algorithm

4.1 Motivation

   When we ran existing scheduling algorithms on a DAG with about 400 nodes, we observed that an SFD algorithm takes about 50 minutes to generate a schedule while an SPD algorithm takes less than one second. We need a scheduler with a performance better than the SPD algorithms but with a running time adequate for applications consisting of a large number of tasks. Meeting this need became our goal, and the goal was achieved by employing a new task duplication approach called DFRN (Duplication First and Reduction Next).

   The DFRN approach behaves the same as the SPD and SFD approaches in handling fork nodes but differs in handling join nodes. An SFD algorithm recursively estimates the effect of each possible duplication and decides whether to duplicate each node one by one. As a consequence, for a DAG with V nodes, each node may be considered V times for duplication in the worst case.

   Unlike the SFD approach, DFRN first duplicates all parent nodes in a bottom-up fashion, up to the parent which has been scheduled on the same processor, without estimating the effect of the duplications; each duplicated task is then removed if it does not meet certain conditions. Also, SFD algorithms are applied to all the processors on which any iparent of the join node has been scheduled, whereas we observed that, after the duplication process, the completion time of the join node on the critical processor was in most cases shorter than on the other processors. Thus, DFRN applies the duplication only to the critical processor, with the expectation that the critical processor is the best candidate for the join node. These two differences provide an incomparably shorter running time than, but comparable performance to, the SFD algorithms, as shown in Section 5. On the other hand, the DFRN approach also achieves considerable performance improvement over the SPD approaches.

4.2 Description of the proposed algorithm

   The high-level description of the DFRN algorithm is shown in Figure 3. In this figure, the notations Pc, Pu, CIP, IP, LN, and JN denote the critical processor, an unused processor, the critical iparent, an iparent, the last node, and a join node, respectively. In addition, a new term used in the algorithm is defined first.

Definition 10: At any step of the scheduling process, the last node of processor Pi is the most recent node assigned to Pi. In Figure 2(a), the last nodes of P1, P2, and P3 are V8, V6, and V5, respectively.

   The term iparent, as used in the algorithm of Figure 3, indicates the copy of the iparent which has the minimum EST if images of the iparent exist on more than one processor. For example, in Figure 2(d), V3 on P2 is identified as the iparent of its ichild since EST(V3, P2) = 10 while EST(V3, Pk) = 70 for k = 1, 4, and 5. The critical iparent and the critical processor are used in the same way. For example, V3 on P2 is identified as the critical iparent if V3 is the critical iparent of any node; the critical processor is P2 in this case.

   Note that the algorithm is presented in a generic form so that any list scheduling algorithm can serve as the node selection algorithm, which decides which node is considered first. HNF is used as the node selection algorithm in this paper.

   In step (1), initialize() reads the input DAG and identifies the level of each node. All the nodes in the same level are sorted in descending order of their computation costs (as per the HNF heuristic). Step (2) considers each node according to the priority given in step (1). The node under consideration, Vi, may or may not be a join node. Steps (3) through (10) handle non-join nodes. The iparent in step (4) may or may not be the last node. If the iparent is the last node, Vi is scheduled after the iparent, as shown in step (6). If the iparent is not the last node, the tasks scheduled on the processor up to the iparent are copied (i.e., duplicated) to an unused processor, and Vi is then scheduled onto the unused processor so that the EST of Vi equals the ECT of the iparent. Otherwise, the EST of Vi would be increased by the computation time of the tasks between the iparent and the last node in the schedule. For example, V1 is the iparent of V3 in Figure 1, and V3 is going to be scheduled onto the processor where V1 has been scheduled. If V3 is scheduled on P1 in Figure 2(d), EST(V3, P1) = 70 since the ECT of the last node, V4, is 70.
If we copy the iparent, V1, to an unused processor, P2, and schedule V3 on P2, EST(V3, P2) = 10 since the ECT of the last node, V1, is 10.

   If Vi is a join node, the critical iparent of Vi and the critical processor are identified in step (12), and DFRN is applied to the join node in step (14) or (17) after the last node is handled in the same way. DFRN(Pa, Vi) consists of two procedures, try_duplication(Pa, Vi) and try_deletion(Pa, Vi), as shown in steps (21) and (22). try_duplication(Pa, Vi) first tries to duplicate the iparent giving the largest MAT to Vi. The procedure recursively searches the iparents, starting from Vi, in a bottom-up fashion until it finds a parent which has already been scheduled on Pa, as shown in steps (24) and (25). When it finds such a parent on Pa, it stops the search and duplicates the parents found so far, as shown in step (27). As a result, Vi is duplicated before Vj when Vi ⇒ Vj, and try_deletion(Pa, Vi) considers each duplicated node one by one in the same sequence.

   After the duplication step, try_deletion(Pa, Vi) decides whether to delete any of the duplicated tasks based on the two conditions in step (30). The first condition covers the case in which the output of the duplicated task is available earlier through a message from the task's copy on another processor than from the duplicated task itself; the duplicated task is then deleted since the duplication is not necessary. The second condition covers the case in which the duplication no longer decreases EST(Vi, Pc). By the second condition, the EST of any node obtained by the DFRN algorithm is guaranteed to be less than or equal to the EST of the same node obtained by SPD algorithms, since the second condition results in EST(Vi, Pc) ≤ MAT(DIP(Vi), Vi), while in the SPD approach EST(Vi, Pc) = MAT(DIP(Vi), Vi), assuming ECT(CIP(Vi), Pc) ≤ MAT(DIP(Vi), Vi). The parallel time obtained from DFRN is also less than or equal to that from an SPD algorithm since the parallel time is the largest ECT over all the nodes in a DAG. Note that the FSS code used in our comparison study in Section 5 is not a pure SPD algorithm, since the code assigns all the task nodes to one processor when the parallel time obtained is larger than the sum of the computation costs of the nodes in the input DAG.

(12)     identify CIP and Pc
(13)     if CIP is LN
(14)         DFRN(Pc, Vi)          // apply DFRN to Pc
(15)     else                      // if CIP is not LN
(16)         copy the schedule up to CIP onto Pu
(17)         DFRN(Pu, Vi)          // apply DFRN to Pu
(18)     endif
(19) endif
(20) end for

DFRN(Pa, Vi)
(21) try_duplication(Pa, Vi)
(22) try_deletion(Pa, Vi)

try_duplication(Pa, Vi)
(23) for each Vp such that
         MAT(Vp, Vi) ≥ MAT(Vq, Vi), Vp ⇒ Vi, Vq ⇒ Vi, p ≠ q,
         and Vp and Vq are not on Pa yet
             // from the node giving the largest MAT to
             // the node giving the smallest MAT
(24)     if there is any Vx such that
             MAT(Vx, Vp) ≥ MAT(Vy, Vp), Vx ⇒ Vp, Vy ⇒ Vp, x ≠ y,
             and Vx and Vy are not on Pa yet
             // if any IP of Vp is not scheduled on Pa
(25)         try_duplication(Pa, Vx)
             // traces the IP which is not on Pa
(26)     else // if all of its IPs are scheduled on Pa
(27)         schedule Vp onto Pa
             // duplicates the IP which is not on Pa
(28)     endif
(29) end for

try_deletion(Pa, Vi)
(30) delete any duplicated task Vk if
         (i) ECT(Vk, Pa) > MAT(Vk, Vd) or
         (ii) ECT(Vk, Pa) > MAT(DIP(Vi), Vi)
         // Vd is the ichild of Vk for which Vk is duplicated

Figure 3. Description of the DFRN algorithm

   For a DAG with V nodes, step (1) takes O(V^2) time to sort the nodes. Note that this complexity comes from the node selection heuristic, HNF, not from DFRN itself. It is trivial to see that steps (4) through (9) take O(V) time. Step (12) takes O(V) time to identify the critical iparent and the critical processor.
                                                               try_duplication(Pa, Vi) duplicates V nodes in the worst
Scheduling algorithm with DFRN
(1) initialize() // build a priority queue using HNF           case. Since try_duplication(Pa, Vi) duplicates parents by
(2) for each node Vi in the queue // in FIFO manner            the order of MAT, the sorting takes O(V2), which makes
(3) if Vi is not a JN            // Vi has only one IP         the complexity of the routine O(V2). try_deletion(Pa, Vi)
(4)         identify the IP                                    also takes O(V2) time since it considers deletion m times
(5)         if the IP is LN                                    and takes O(p) time for calculation of EST(Vi, Pa)
(6)              schedule Vi to the PE having the IP           whenever any node is deleted, where m is the number of
(7)         else // if the IP is not LN
(8)              copy the schedule up to the IP onto Pu
                                                               tasks duplicated and p is the number of deleted iparent of
                         // now the IP is LN in Pu             the node, m ≤ V, p ≤ V. Thus the complexity of
(9)              schedule Vi to Pu.                            try_deletion(Pa, Vi) becomes O(V2). The whole
(10)        endif                                              complexity becomes O(V3) since DFRN(Pa, Vi) is
(11) else           // if Vi is a join node
executed q times where q is the number of join nodes in         and Vj,k+1 are scheduled onto the same processor while
the DAG, q ≤ V.                                                 C(Vi,k , Vj,k+1) in the Ln(Vj,k+1) calculation can not be
                                                                zeroed by the definition of CPIC. The inductive step
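The duplicate-first, reduce-next procedure described above can be sketched in
a few lines of Python. This is an illustrative sketch only, not the authors'
implementation: the DAG, its costs, and the MAT/ECT bookkeeping are
hypothetical and heavily simplified (a single candidate processor, duplicates
executed back to back, and only condition (i) of step (30) applied).

```python
# Duplicate-first, reduce-next for one join node (V4), heavily simplified.

T = {"V1": 10, "V2": 20, "V3": 30, "V4": 15}            # computation costs
C = {("V1", "V2"): 25, ("V1", "V3"): 25,
     ("V2", "V4"): 40, ("V3", "V4"): 35}                # communication costs
parents = {"V1": [], "V2": ["V1"], "V3": ["V1"], "V4": ["V2", "V3"]}
remote_ect = {"V1": 10, "V2": 35, "V3": 5}              # ECT on other PEs

def mat(vp, vc):
    """Message arrival time of Vp's output at Vc, sent from another PE."""
    return remote_ect[vp] + C[(vp, vc)]

def try_duplication(pa, vi, dup_for):
    """Duplicate iparents of vi onto pa, largest MAT first, recursing
    bottom-up until ancestors already on pa are reached (steps 23-29)."""
    for vp in sorted(parents[vi], key=lambda p: -mat(p, vi)):
        if vp in pa:
            continue
        if any(vx not in pa for vx in parents[vp]):
            try_duplication(pa, vp, dup_for)   # trace iparents not on pa
        pa.append(vp)                          # duplicate vp onto pa
        dup_for[vp] = vi                       # ichild vp is duplicated for

def local_ect(pa):
    """ECT of each duplicate when pa's tasks run back to back."""
    ect, t = {}, 0
    for v in pa:
        t += T[v]
        ect[v] = t
    return ect

def try_deletion(pa, dup_for):
    """Delete a duplicate whose message from another PE would arrive no
    later than the local copy finishes (condition (i) of step 30 only)."""
    for vk in list(pa):
        if local_ect(pa)[vk] > mat(vk, dup_for[vk]):
            pa.remove(vk)

pa, dup_for = [], {}                 # processor chosen for join node V4
try_duplication(pa, "V4", dup_for)
print("after duplication:", pa)      # V1, V2, V3 all duplicated onto pa
try_deletion(pa, dup_for)
print("after deletion:", pa)         # V3's message arrives early; V3 goes
```

With these made-up costs, all three ancestors of V4 are duplicated first, and
the reduction pass then removes V3 because its message from the other
processor beats the local copy.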
4.3 Analysis of the proposed algorithm

    This section proves that the parallel time obtained by the proposed
algorithm always lies between CPEC and CPIC, which were defined in Section 2.
The proposed algorithm guarantees the worst-case completion time for any Vj
such that ECT(Vj) = ECT(Vi) + T(Vj), Vi ⇒ Vj, for a non-join node Vj, and
ECT(Vj) = Max(ECT(CIP(Vj)) + T(Vj), ECT(DIP(Vj)) + C(DIP(Vj), Vj) + T(Vj))
for a join node Vj, according to condition (ii) in step (30) of Figure 3. We
will also prove that the proposed algorithm generates an optimal schedule for
a tree-structured input DAG. In the proofs, we assume that there is only one
entry node and one exit node in the DAG. This does not limit the
applicability of the proofs, since any DAG can easily be transformed to this
type of DAG by adding a dummy node for each entry node and exit node; the
communication costs for the edges connecting the dummy nodes are zero. The
notations used in the proofs are summarized first as follows. The examples
next to the definitions are taken from the sample DAG in Figure 1.

•   Vr is the entry node and Ve is the exit node; e.g., Vr = V1, Ve = V8.
•   Ln(Vi) is the CPIC up to node Vi. Then Ln(Vr) = T(Vr) and Ln(Vj) =
    Ln(Vi) + C(Vi, Vj) + T(Vj), Vi ⇒ Vj; e.g., Ln(V7) = 340 and
    Ln(V8) = 400.
•   Vi,k is a task node Vi at level k.

    Note that the parallel time is the same as ECT(Ve), and the CPIC is the
same as Ln(Ve). Therefore, proving that the parallel time is always less than
or equal to CPIC is equivalent to proving ECT(Ve) ≤ Ln(Ve). The proof of
ECT(Ve) ≤ Ln(Ve) is by induction on the level of the nodes. A sketch of the
proof is described first, and the formal proof is presented next.
    Since the DAG has only one entry node, none of the nodes in level one is
a join node. The ECT of any node in level one is less than or equal to the
length of the critical path up to the node since, for any node Vi,1 in level
one, ECT(Vi,1) = T(Vr) + T(Vi,1) while Ln(Vi,1) = T(Vr) + C(Vr, Vi,1) +
T(Vi,1). The basis holds. We will prove that ECT(Vj,k+1) ≤ Ln(Vj,k+1) if
ECT(Vi,k) ≤ Ln(Vi,k), Vi,k ⇒ Vj,k+1, which is the inductive hypothesis. The
basic idea behind the inductive step is that Ln(Vj,k+1) = Ln(Vi,k) +
C(Vi,k, Vj,k+1) + T(Vj,k+1) while ECT(Vj,k+1) = ECT(Vi,k) + C(Vi,k, Vj,k+1) +
T(Vj,k+1), Vi,k ⇒ Vj,k+1, where ECT(Vi,k) ≤ Ln(Vi,k) from the hypothesis. In
addition, C(Vi,k, Vj,k+1) in the ECT(Vj,k+1) calculation can be zeroed if
Vi,k and Vj,k+1 are scheduled onto the same processor, while C(Vi,k, Vj,k+1)
in the Ln(Vj,k+1) calculation cannot be zeroed, by the definition of CPIC.
The inductive step compares ECT(Vj,k+1) and Ln(Vj,k+1) for all possible cases
exhaustively.

Theorem 1: For any input DAG, when using DFRN for scheduling,
ECT(Ve) ≤ Ln(Ve).
Proof)
1) Basis: ECT(Vi,1) ≤ Ln(Vi,1).
At level zero, ECT(Vr) = Ln(Vr) = T(Vr). For every Vi,1 with Lv(Vi,1) = 1,
Vi,1 is a non-join node since the DAG has only one entry node. ECT(Vi,1) =
T(Vr) + T(Vi,1) while Ln(Vi,1) = T(Vr) + C(Vr, Vi,1) + T(Vi,1). Hence
ECT(Vi,1) ≤ Ln(Vi,1), since C(Vr, Vi,1) ≥ 0.
2) Inductive Hypothesis: If ECT(Vi,k) ≤ Ln(Vi,k), then
ECT(Vj,k+1) ≤ Ln(Vj,k+1), Vi,k ⇒ Vj,k+1.
3) Inductive Step:
3.1) If Vj,k+1 is a non-join node, then ECT(Vj,k+1) = ECT(Vi,k) + T(Vj,k+1)
while Ln(Vj,k+1) = Ln(Vi,k) + C(Vi,k, Vj,k+1) + T(Vj,k+1). Therefore
ECT(Vj,k+1) ≤ Ln(Vj,k+1), since ECT(Vi,k) ≤ Ln(Vi,k) by the hypothesis and
C(Vi,k, Vj,k+1) ≥ 0.
3.2) If Vj,k+1 is a join node, then ECT(Vj,k+1) = Max(ECT(CIP(Vj,k+1)) +
T(Vj,k+1), ECT(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1)).
  3.2.1) If ECT(CIP(Vj,k+1)) + T(Vj,k+1) ≥ ECT(DIP(Vj,k+1)) +
  C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1), then ECT(Vj,k+1) = ECT(CIP(Vj,k+1)) +
  T(Vj,k+1) ≤ Ln(CIP(Vj,k+1)) + C(CIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤
  Ln(Vj,k+1), by the definition of Ln.
  3.2.2) If ECT(CIP(Vj,k+1)) + T(Vj,k+1) < ECT(DIP(Vj,k+1)) +
  C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1), then ECT(Vj,k+1) = ECT(DIP(Vj,k+1)) +
  C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤ Ln(DIP(Vj,k+1)) +
  C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤ Ln(Vj,k+1). Therefore
  ECT(Vj,k+1) ≤ Ln(Vj,k+1).

    The proposed algorithm generates an optimal schedule for a
tree-structured input DAG. If we can prove that ECT(Ve) = CPEC for any tree,
then our schedule is optimal, because CPEC is the achievable lower bound. It
is trivial to prove that no scheduler can generate a schedule shorter than
CPEC: a scheduler is not executing all the tasks in the critical path if the
schedule length is shorter than CPEC, which contradicts the system model
defined in Section 2.

Theorem 2: For any tree-structured input DAG, when using DFRN for
scheduling, ECT(Ve) = CPEC.
Proof)
i) Basis: ECT(Vi,0) = T(Vi,0).
ii) Inductive Hypothesis: If ECT(Vi,k) = T(Vi,k), then ECT(Vj,k+1) =
T(Vi,k) + T(Vj,k+1), Vi,k ⇒ Vj,k+1.
iii) Inductive Step: Since a tree does not have a join node, ECT(Vj,k+1) =
ECT(Vi,k) + T(Vj,k+1). Since ECT(Vi,k) = T(Vi,k) from the hypothesis,
ECT(Vj,k+1) = T(Vi,k) + T(Vj,k+1).

    The induction leads to ECT(Ve) = ∑k T(Vi,k), the sum of the computation
costs of the nodes Vi,k on the path from Vr to Ve, which equals CPEC, the
lower bound for any scheduler.
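The two bounds used in this analysis can be computed directly from the DAG by
a longest-path traversal. The following Python sketch uses hypothetical costs
and assumes, as the usage above suggests, that CPEC and CPIC denote the
longest entry-to-exit path excluding and including communication costs,
respectively; CPEC ≤ CPIC then holds because communication costs are
non-negative.

```python
# Longest-path computation of the two bounds: CPEC excludes communication
# costs, CPIC includes them.  The DAG and its costs are made up.

T = {"V1": 10, "V2": 20, "V3": 30, "V4": 15}            # computation costs
C = {("V1", "V2"): 25, ("V1", "V3"): 25,
     ("V2", "V4"): 40, ("V3", "V4"): 35}                # communication costs
children = {"V1": ["V2", "V3"], "V2": ["V4"], "V3": ["V4"], "V4": []}

def longest_path(v, with_comm):
    """Length of the longest path from v down to the exit node."""
    if not children[v]:
        return T[v]
    return T[v] + max(longest_path(c, with_comm)
                      + (C[(v, c)] if with_comm else 0)
                      for c in children[v])

cpec = longest_path("V1", with_comm=False)   # V1-V3-V4: 10+30+15
cpic = longest_path("V1", with_comm=True)    # V1-V3-V4: 10+25+30+35+15
print(cpec, cpic)
assert cpec <= cpic
```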
5. Performance comparison

      Table II. Comparison of running times (in seconds)

         N      HNF      FSS        LC       CPFD      DFRN
        100     0.3      0.01      4.16      15.17      0.48
        200     1.29     0.13     25.00     222.96      3.23
        300     3.18     0.23     77.54     894.77      8.24
        400     5.97     0.34    177.14    2782.56      17.3

    We generated 1000 random DAGs to compare the performance of DFRN with
existing scheduling
algorithms. We used three parameters whose effects we were interested in
investigating: the number of nodes, CCR (Communication to Computation Ratio),
and the average degree (defined as the ratio of the number of edges to the
number of nodes in the DAG). The numbers of nodes used are 20, 40, 60, 80,
and 100, while the CCR values used are 0.1, 0.5, 1.0, 5.0, and 10.0. CCR is
the ratio of the average communication cost to the average computation cost.
We used a parameter value to control the degree of the nodes in the DAG and
obtained the average degree from a number of resulting DAGs. Forty DAGs were
generated for each of the 25 combinations, for a total of 1000 DAGs.
    From the existing schedulers, we selected one list scheduling, one
clustering, one SPD, and one SFD algorithm for the performance comparison.
HNF is chosen from the list scheduling algorithms. LC is chosen from the
clustering algorithms since LC uses the typical critical-path-based method.
Since HNF is chosen as the node selection method, the effect of task
duplication can be easily seen by comparing HNF with DFRN. From the SPD
algorithms, a more recent one, FSS [18], is chosen for comparison. CPFD is
chosen from the SFD algorithms since it has been shown that CPFD outperforms
DSH and BTDH [9].
    For the performance comparison, we define one normalized performance
measure named Relative Parallel Time (RPT), which is the ratio of the
parallel time to CPEC. For example, if the parallel time obtained by DFRN is
200 and CPEC is 100, the RPT of DFRN is 2.0. A smaller RPT value is
indicative of a shorter parallel time. The RPT of any scheduling algorithm
cannot be lower than one, since CPEC is the lower bound.
    One of our objectives is to observe the trade-off between the
performance (the parallel time obtained) and the running time (the time taken
to generate a schedule) among the scheduling algorithms. Table II shows the
actual average running time of the five algorithms. The running time is the
user time obtained by the time command on a Sun Sparc10 workstation. For an
input DAG with 400 nodes, the time taken to get a schedule was 5.97 seconds
by HNF, 0.34 seconds by FSS, 2.95 minutes by LC, 17.3 seconds by DFRN, and
46.4 minutes by CPFD. There are significant differences among the running
times.
    Table III shows the result of the comparison between each pair of
algorithms. Each entry of the table consists of three elements in the format
"> a, = b, < c", which means that the algorithm in the row provides a longer
parallel time than the algorithm in the column a times, the same parallel
time b times, and a shorter parallel time c times. For example, to compare
DFRN and HNF, we look up DFRN in the fifth row and HNF in the first column,
or vice versa. In this case, the entry is "> 2, = 22, < 976", which means
that for the 1000 randomly generated DAGs, DFRN provides a longer parallel
time than HNF 2 times, the same parallel time 22 times, and a shorter
parallel time 976 times. The comparison shows that applying DFRN to HNF
shortens the parallel time in 97.6% of the cases. Comparing DFRN with LC,
which has the same complexity as DFRN, DFRN generates a shorter parallel time
829 times, the same parallel time 171 times, and a longer parallel time
never, while the running time of DFRN was shorter than that of LC. We also
confirmed that the parallel time obtained by DFRN is always less than CPIC in
the 1000 runs. On the other hand, DFRN generates a shorter parallel time than
CPFD 27 times, the same parallel time 685 times, and a longer parallel time
288 times. Note that DFRN provides the same parallel time as that obtained by
CPFD in 68.5% of the cases with 0.00006% of the running time of CPFD, which
implies the effectiveness of the DFRN approach. Due to the incomparably long
running time of CPFD, DFRN would be a good candidate for application programs
consisting of a large number of tasks. For a DAG with a very large number of
nodes, FSS will be appropriate because of its very short running time.

             Table III. Comparison of parallel times

              HNF        FSS       LC        CPFD      DFRN
     HNF     > 0       > 885     > 587     > 978     > 976
             = 1000    = 48      = 39      = 22      = 22
             < 0       < 67      < 374     < 0       < 2
     FSS     > 67      > 0       > 27      > 575     > 567
             = 48      = 1000    = 165     = 425     = 430
             < 885     < 0       < 808     < 0       < 3
     LC      > 374     > 808     > 0       > 829     > 829
             = 39      = 165     = 1000    = 171     = 171
             < 587     < 27      < 0       < 0       < 0
     CPFD    > 0       > 0       > 0       > 0       > 27
             = 22      = 425     = 171     = 1000    = 685
             < 978     < 575     < 829     < 0       < 288
     DFRN    > 2       > 3       > 0       > 288     > 0
             = 22      = 430     = 171     = 685     = 1000
             < 976     < 567     < 829     < 27      < 0
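Each "> a, = b, < c" entry in Table III is simply a win/tie/loss tally of
the parallel times two schedulers produced on the same set of DAGs. A tally
like the following Python sketch (with made-up parallel times) produces one
such entry.

```python
# Tallying one Table III entry: compare the parallel times two schedulers
# produced on the same DAGs.  The times below are made up for illustration.

times_a = [100, 120, 90, 110, 100]    # e.g., the row algorithm
times_b = [100, 130, 85, 140, 100]    # e.g., the column algorithm

longer  = sum(a > b for a, b in zip(times_a, times_b))
same    = sum(a == b for a, b in zip(times_a, times_b))
shorter = sum(a < b for a, b in zip(times_a, times_b))
print(f"> {longer}, = {same}, < {shorter}")   # prints "> 1, = 2, < 2"
```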

    Graphical representations of the performance comparison are shown in
Figure 4, Figure 5, and Figure 6 with respect to N (the number of nodes),
CCR, and the average degree, respectively. Each case in Figure 4 is an
average of 200 runs varying CCR and the degree. The average CCR value and
degree value are 3.3 and 3.8, respectively. As shown in Figure 4, the number
of nodes does not significantly affect the relative performance of the
scheduling algorithms. In other words, the performance comparison shows
similar patterns regardless of N. In the pattern, DFRN shows a much shorter
parallel time than the existing algorithms with equal or lower time
complexity, while it shows performance comparable to CPFD.
    CCR is a critical parameter. As CCR is increased, the performance gap
becomes larger, as shown in Figure 5. The difference among the five
algorithms was negligible until CCR reached one. But when CCR is 5, the RPTs
of HNF, FSS, LC, DFRN, and CPFD become 3.38, 2.57, 3.61, 1.67, and 1.61,
respectively. When CCR is 10, they are 5.79, 5.01, 7.68, 2.45, and 2.27,
respectively. As expected, duplication-based scheduling algorithms show
considerable performance improvement for a DAG with high CCR values. Various
values of the average degree do not significantly change the pattern of the
graph, but change its scale.

       [Plot of RPT vs. N for HNF, FSS, LC, DFRN, and CPFD]
       Figure 4. Comparison with respect to N

       [Plot of RPT vs. CCR for LC and DFRN]
       Figure 5. Comparison with respect to CCR

       [Plot of RPT vs. degree for HNF, FSS, LC, DFRN, and CPFD]
       Figure 6. Comparison with respect to degree

6. Conclusion

    This paper classified existing DBS algorithms into two categories, SPD
and SFD algorithms, according to the duplication method used for a join node.
SFD algorithms try to duplicate the iparents of a join node while SPD
algorithms do not. As a result, an SFD algorithm outperforms an SPD
algorithm, while its running time is incomparably longer than that of an SPD
algorithm. This paper presented a new duplication-based scheduling algorithm
(i.e., DFRN) that tries to combine the good features of the two approaches.
The motivation is to duplicate the iparents of a join node if the duplication
reduces the EST of the join node, as done in SFD algorithms, but without
adding much complexity, so that the new approach is well suited for
applications consisting of a large number of tasks.
    Unlike the existing methods, DFRN first duplicates all the duplicable
parents at once and then tries to delete them one by one if a duplicated task
does not meet certain conditions. We analytically obtained the boundary
condition for the worst-case performance of the proposed algorithm and also
proved that the algorithm generates an optimal schedule for a tree-structured
input DAG.
    Our performance study showed that DFRN has a running time comparable to
SPD and non-duplication scheduling algorithms, while outperforming such
algorithms by generating schedules with much shorter parallel times. Compared
to SFD algorithms, DFRN offers comparable performance with a running time
that is several orders of magnitude shorter.

Acknowledgment

    We would like to express our appreciation to Drs. Ishfaq Ahmad, Sekhar
Darbha, Dharma Agrawal, and their research groups for providing their
comments and the source code for the CPFD and FSS schedulers, which were used
in our performance comparison study.

References

[1] B. Shirazi, M. Wang, and G. Pathak, "Analysis and Evaluation of
Heuristic Methods for Static Task Scheduling," Journal of Parallel and
Distributed Computing, vol. 10, no. 3, 1990, pp. 222-232.
[2] B. Shirazi and A. R. Hurson, "Scheduling and Load Balancing: Guest
Editors' Introduction," Journal of Parallel and Distributed Computing, Dec.
1992, pp. 271-275.
[3] B. Shirazi and A. R. Hurson, "A Mini-track on Scheduling and Load
Balancing: Track Coordinator's Introduction," Hawaii Int'l Conf. on System
Sciences (HICSS-26), Jan. 1993, pp. 484-
[4] B. Shirazi, A. R. Hurson, and K. Kavi, "Scheduling & Load Balancing,"
IEEE Press, 1995.
[5] B. Shirazi, H.-B. Chen, and J. Marquis, "Comparative Study of Task
Duplication Static Scheduling versus Clustering and Non-Clustering
Techniques," Concurrency: Practice and Experience, vol. 7(5), Aug. 1995, pp.
371-389.
[6] M. Y. Wu, a dedicated track on "Program Partitioning and Scheduling in
Parallel and Distributed Systems," in the Hawaii Int'l Conference on System
Sciences, Jan. 1994.
[7] T. Yang and A. Gerasoulis, a dedicated track on "Partitioning and
Scheduling for Parallel and Distributed Computation," in the Hawaii Int'l
Conference on System Sciences, Jan. 1995.
[8] T. L. Adam, K. Chandy, and J. Dickson, "A Comparison of List Scheduling
for Parallel Processing Systems," Communications of the ACM, vol. 17, no.
12, Dec. 1974, pp.
[9] I. Ahmad and Y. K. Kwok, "A New Approach to Scheduling Parallel Programs
Using Task Duplication," Proc. of Int'l Conf. on Parallel Processing, vol.
II, Aug. 1994, pp.
[10] H. Chen, B. Shirazi, and J. Marquis, "Performance Evaluation of a Novel
Scheduling Method: Linear Clustering with Task Duplication," Proc. of Int'l
Conf. on Parallel and Distributed Systems, Dec. 1993, pp. 270-275.
[11] Y. C. Chung and S. Ranka, "Application and Performance Analysis of a
Compile-Time Optimization Approach for List Scheduling Algorithms on
Distributed-Memory Multiprocessors," Proc. of Supercomputing '92, Nov. 1992,
pp. 512-521.
[12] J. Y. Colin and P. Chretienne, "C.P.M. Scheduling with Small
Communication Delays and Task Duplication," Operations Research, 1991, pp.
680-684.
[13] S. Darbha and D. P. Agrawal, "SDBS: A Task Duplication Based Optimal
Scheduling Algorithm," Proc. of Scalable High Performance Computing Conf.,
May 1994, pp. 756-763.
[14] B. Kruatrachue and T. G. Lewis, "Grain Size Determination for Parallel
Processing," IEEE Software, Jan. 1988, pp. 23-32.
[15] C. H. Papadimitriou and M. Yannakakis, "Towards an
Architecture-Independent Analysis of Parallel Algorithms," Proc. of ACM
Symp. on Theory of Computing (STOC), 1988, pp. 510-513.
[16] M. Y. Wu and D. D. Gajski, "Hypertool: A Programming Aid for
Message-Passing Systems," IEEE Trans. on Parallel and Distributed Systems,
vol. 1, no. 3, Jul. 1990, pp. 330-340.
[17] S. J. Kim and J. C. Browne, "A General Approach to Mapping of Parallel
Computation upon Multiprocessor Architectures," Proc. of Int'l Conf. on
Parallel Processing, vol. III, 1988, pp. 1-8.
[18] S. Darbha and D. P. Agrawal, "A Fast and Scalable Scheduling Algorithm
for Distributed Memory Systems," Proc. of Symp. on Parallel and Distributed
Processing, Oct. 1995, pp. 60-63.