DFRN: A New Approach for Duplication Based Scheduling for Distributed Memory Multiprocessor Systems*

Gyung-Leen Park
Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, Texas 76019-0015
gpark@cse.uta.edu

Behrooz Shirazi
Dept. of Comp. Sc. and Eng., Univ. of Texas at Arlington, Arlington, Texas 76019-0015
shirazi@cse.uta.edu

Jeff Marquis
Parallel Technologies, Inc., 2000 North Plano Road, Richardson, Texas 75082

* This work has in part been supported by grants from NSF (CDA-9531535 and MIPS-9622593) and state of Texas ATP 003656-087.

Abstract

Duplication Based Scheduling (DBS) is a relatively new approach for solving multiprocessor scheduling problems. The problem is defined as finding an optimal schedule which minimizes the parallel execution time of an application on a target system. In this paper, we classify DBS algorithms into two categories according to the task duplication method used. We then present our new DBS algorithm that extracts the strong features of the two categories of DBS algorithms. Our simulation study shows that the proposed algorithm achieves considerable performance improvement over existing DBS algorithms with equal or less time complexity. We analytically obtain the boundary condition for the worst case behavior of the proposed algorithm and also prove that the algorithm generates an optimal schedule for a tree structured input directed acyclic graph.

1. Introduction

Efficient scheduling of parallel programs, represented as a Directed Acyclic Graph (DAG), onto processing elements of parallel and distributed computer systems is an extremely difficult and important issue [1-7]. The goals of the scheduling process are to efficiently utilize resources and to achieve performance objectives of the application (e.g., to minimize program parallel execution time). Since it has been shown that the multiprocessor scheduling problem is NP-complete, many researchers have proposed scheduling algorithms based on heuristics. The scheduling algorithms can be classified into two general categories: algorithms that employ task duplication and algorithms that do not employ task duplication.

Most of the non-duplication scheduling methods are based on the list scheduling algorithm [8] since they maintain a list of nodes according to their priorities. A list scheduling algorithm repeatedly carries out the following steps: (1) Tasks ready to be assigned to a processor are put onto a priority queue. Tasks are assigned to processors based on some priority criteria. A task becomes ready for assignment when all of its parents are scheduled. (2) Select a "suitable Processing Element (PE)" for assignment. Typically, a suitable PE is one that can execute the task the earliest. (3) Assign the task at the head of the priority queue to this PE.

Duplication Based Scheduling (DBS) is a relatively new approach to the scheduling problem [5, 9-14]. The DBS algorithms are capable of reducing communication overhead by duplicating remote parent tasks on local processing elements. As with non-duplication scheduling, the DBS problem has been shown to be NP-complete [15]. Thus, many of the proposed DBS algorithms are based on heuristics. This paper classifies DBS algorithms into two categories according to the task duplication approach used: Scheduling with Partial Duplication (SPD) and Scheduling with Full Duplication (SFD).

SPD algorithms do not duplicate the parent of a join node unless the parent is critical. A join node is defined as a node with an in-degree greater than one (i.e., a node with more than one incoming edge). Instead, they try to find the critical iparent, which is defined later in this paper as an immediate parent which gives the largest start time to the join node. The join node is scheduled on the processor where the critical iparent has been scheduled. Because of the limited task duplication, algorithms in this category have a low complexity but may not be appropriate for systems with high communication overhead. They typically provide good schedules for an input DAG where computation cost is strictly larger than communication cost. CPM [12], SDBS [13], and FSS [18] belong to this category.

SFD algorithms attempt to duplicate all the parents of a join node and apply the task duplication algorithm to all the processors that have any of the parents of the join node. Thus, algorithms in this category have a higher complexity but typically show better performance than SPD algorithms. DSH [14], BTDH [11], LCTD [5, 10], and CPFD [9] belong to this category.

A trade-off exists between algorithms in these two categories: performance (better application parallel execution time) versus time complexity (longer time to carry out the scheduling algorithm itself). This paper proposes a new DBS algorithm that attempts to achieve the performance of SFD algorithms with a time complexity approaching that of SPD algorithms. The proposed algorithm, called Duplication First and Reduction Next (DFRN), duplicates the parents of any join node as done in SFD algorithms but with reduced computational complexity.

Our simulation study shows that the proposed algorithm achieves considerable performance improvement over existing algorithms with equal or less time complexity while it obtains comparable performance to algorithms which have higher time complexities. It is also shown that the performance improvement becomes greater as the Communication to Computation Ratio is increased. This paper analytically obtains a boundary condition for the worst case performance of the proposed algorithm and also proves that the algorithm provides an optimal schedule for a tree structured input DAG.

The remainder of this paper is organized as follows. Section 2 presents the system model and the problem definition. Section 3 briefly covers the existing algorithms. The proposed DBS algorithm is presented in Section 4. Section 4 also contains the worst case and the optimality analysis. The performance of the proposed algorithm is compared with that of the existing algorithms in Section 5. Finally, Section 6 concludes this paper.

2. System model and problem definition

A parallel program is usually represented by a Directed Acyclic Graph (DAG), which is also called a task graph. As defined in [13], a DAG consists of a tuple (V, E, T, C), where V, E, T, and C are the set of task nodes, the set of communication edges, the set of computation costs associated with the task nodes, and the set of communication costs associated with the edges, respectively. T(Vi) is the computation cost for task Vi and C(Vi, Vj) is the communication cost for edge E(Vi, Vj), which connects tasks Vi and Vj. The edge E(Vi, Vj) represents the precedence constraint between the nodes Vi and Vj. In other words, task Vj can start execution only after the output of Vi is available to Vj. When the two tasks, Vi and Vj, are assigned to the same processor, C(Vi, Vj) is assumed to be zero since intra-processor communication cost is negligible compared with the interprocessor communication cost. The weights associated with nodes and edges are obtained by estimation [16].

This paper defines two relations for precedence constraints. The Vi ⇒ Vj relation indicates the strong precedence relation between Vi and Vj. That is, Vi is an immediate parent of Vj and Vj is an immediate child of Vi. The terms iparent and ichild are used to represent immediate parent and immediate child, respectively. The Vi → Vj relation indicates the weak precedence relation between Vi and Vj. That is, Vi is a parent of Vj but not necessarily the immediate one. Vi → Vj and Vj → Vk imply Vi → Vk. Vi ⇒ Vj and Vj ⇒ Vk do not imply Vi ⇒ Vk, but imply Vi → Vk. The relation → is transitive, and the relation ⇒ is not. A node without any parent is called an entry node and a node without any child is called an exit node.

Graphically, a node is represented as a circle with a dividing line in the middle. The number in the upper portion of the circle represents the node ID number and the number in the lower portion of the circle represents the computation cost for the node. For example, for the sample DAG in Figure 1, the entry node is V1, which has a computation cost of 10. In the graph representation of a DAG, the communication cost for each edge is written on the edge itself. For each node, incoming degree is the number of input edges and outgoing degree is the number of output edges. For example, in Figure 1, the incoming and outgoing degrees for the node V5 are 3 and 1, respectively. A few terms are defined here for a clearer presentation.

Definition 1: A node is called a fork node if its outgoing degree is greater than 1.

Definition 2: A node is called a join node if its incoming degree is greater than 1.

Note that the fork node and the join node are not exclusive terms, which means that one node can be both a fork and a join node; i.e., both of the node's incoming and outgoing degrees are greater than one. Similarly, a node can be neither a fork nor a join node; i.e., both of the node's incoming and outgoing degrees are at most one. In the task graph of Figure 1, nodes V1, V2, V3, and V4 are fork nodes while nodes V5, V6, V7, and V8 are join nodes.

Definition 3: The Earliest Start Time, EST(Vi, Pk), and Earliest Completion Time, ECT(Vi, Pk), are the times that a task Vi starts and finishes its execution on processor Pk, respectively.

Definition 4: A message arriving time (MAT) from Vi to Vj, or MAT(Vi, Vj), is the time that the message from Vi arrives at Vj. If Vi and Vj are scheduled on the same processor Pk, MAT(Vi, Vj) becomes ECT(Vi, Pk). Otherwise, MAT(Vi, Vj) = ECT(Vi, Pk) + C(Vi, Vj).

Definition 5: An iparent of a join node is called its critical iparent if it provides the largest MAT to the join node. The critical iparent is denoted as Vi = CIP(Vj) if Vi is the critical iparent of Vj. More formally, Vi = CIP(Vj) if and only if MAT(Vi, Vj) > MAT(Vk, Vj), for all Vk, Vk ⇒ Vj, Vi ⇒ Vj, i ≠ k. If more than one iparent provides the same largest MAT, CIP is chosen arbitrarily.

Definition 6: An immediate parent node of a join node is called the decisive iparent of the join node if it provides the second largest MAT to the join node. The decisive iparent is denoted as Vi = DIP(Vj) if Vi is the decisive iparent of Vj. Formally, Vi = DIP(Vj) if and only if MAT(Vi, Vj) > MAT(Vk, Vj), for all Vk, Vk ⇒ Vj, Vk ≠ CIP(Vj), Vi ⇒ Vj, i ≠ k. If more than one iparent provides the same second largest MAT, DIP is chosen arbitrarily. EST(Vj, Pc) becomes Max(ECT(CIP(Vj), Pc), MAT(DIP(Vj), Vj)) if Vj is scheduled without any task duplication on Pc where CIP(Vj) has been scheduled.

Definition 7: The processor which has the critical iparent for Vi is called the critical processor of Vi.

Definition 8: The critical path of a task graph is the path from an entry node to an exit node which has the largest sum of computation and communication costs of the nodes and edges on the path. The Critical Path Including Communication cost (CPIC) is the length of the critical path including communication costs in the path while the Critical Path Excluding Communication cost (CPEC) is the length of the critical path excluding communication costs in the path. For example, the critical path of the sample graph in Figure 1 consists of nodes V1, V4, V7, and V8. Then CPIC is T(V1) + C(V1, V4) + T(V4) + C(V4, V7) + T(V7) + C(V7, V8) + T(V8), which is 400. CPEC is T(V1) + T(V4) + T(V7) + T(V8), which is 150.

Definition 9: The level of a node is recursively defined as follows. The level of an entry node, V0, is zero. Let Lv(Vi) be the level of Vi. Then Lv(V0) = 0. Lv(Vj) = Lv(Vi) + 1, Vi ⇒ Vj, for non-join node Vj. Lv(Vj) = Max(Lv(Vi)) + 1, Vi ⇒ Vj, for join node Vj. For example, the levels of nodes V1, V2, V5, and V8 are 0, 1, 2, and 3, respectively. Even if there were an edge from node 1 to node 5, the level of node 5 would still be 2, not 1, since Lv(V5) = Max(Lv(Vi)) + 1, Vi ⇒ V5, for join node V5.

Figure 1. Sample DAG

Similar to existing DBS algorithms, the number of processors is assumed to be unbounded. The topology of the target system is also assumed to be a complete graph; i.e., all processors can directly communicate with each other. Thus, the multiprocessor scheduling process becomes a mapping of the task nodes in the input DAG to the processors in the target system with the goal of minimizing the execution time of the entire program. The execution time of the entire program after scheduling is called the parallel time, to distinguish it from the completion time of an individual task node.

3. Related work

This section briefly covers several typical scheduling algorithms belonging to each category. They are used later in this paper for performance comparison.

3.1 Heavy Node First (HNF) algorithm

The HNF algorithm [1] assigns the nodes in a DAG to processors level by level. At each level, the scheduler selects the eligible nodes for scheduling in descending order based on computational weight, with the heaviest node (i.e., the node which has the largest computation cost) selected first. The node is selected arbitrarily if multiple nodes at the same level have the same computation cost. The selected node is assigned to a processor which gives the earliest start time to the node.

3.2 Linear Clustering (LC) algorithm

The LC algorithm [17] is a traditional critical path based clustering method. The scheduler identifies the critical path, removes the nodes in the path from the DAG, and assigns the nodes in the path into a linear cluster. The process is repeated until there is no task node remaining in the DAG. Each cluster is then scheduled onto a processor.

3.3 Fast and Scalable Scheduling (FSS) algorithm

Like the SDBS algorithm [13], the FSS algorithm [18] first calculates the start time and the completion time of each node by traversing the input DAG. The algorithm then generates clusters by performing depth first search starting from the exit node. While performing the task assignment process, only critical tasks which are essential to establish a path from a particular node to the entry node are duplicated. The algorithm has a small complexity because of the limited duplication. If the number of processors available is less than that needed, the algorithm executes the processor reduction procedure. In this paper, an unbounded number of processors is used for FSS for performance comparison.

3.4 Critical Path Fast Duplication (CPFD) algorithm

The CPFD algorithm [9] classifies the nodes in a DAG into three categories: Critical Path Node (CPN), In-Branch Node (IBN), and Out-Branch Node (OBN). A CPN is a node on the critical path. An IBN is a node from which there is a path to a CPN. An OBN is a node which is neither a CPN nor an IBN. CPFD tries to schedule CPNs first. If there is any unscheduled IBN for a CPN, CPFD traces the IBN and schedules it first. OBNs are scheduled after all the IBNs and CPNs have been scheduled. The motivation behind CPFD is that the parallel time is likely to be reduced by trying to schedule CPNs first. Performance comparisons show that CPFD outperforms DSH and BTDH in most cases.

3.5 Comparison

We have classified the existing DBS algorithms into two categories: SPD (Scheduling with Partial Duplication) and SFD (Scheduling with Full Duplication). Both SPD and SFD approaches duplicate a fork node if the ichild of the fork node is not a join node. On the other hand, the SPD approach does not duplicate any parent except CIP for a join node while the SFD approach tries to duplicate all the parents. Naturally, there exists a trade-off between the two approaches to duplication based scheduling: better performance (smaller parallel time of the application, typically achieved by SFD algorithms) versus better running time (smaller time to carry out the scheduling process itself, typically achieved by SPD algorithms). Our goal is to introduce a new task duplication scheduling algorithm with a performance better than, but a running time comparable to, the SPD algorithms. Table I summarizes the time complexity of these algorithms and indicates the class of algorithms they belong to (i.e., whether they are SPD or SFD algorithms). Note that, for a DAG with V nodes, all the SFD algorithms have a complexity of O(V^4) while the SPD algorithms have a complexity of O(V^2).

Table I. Comparison of scheduling algorithms

Schedulers   Classification    Complexity
HNF          List Scheduling   O(V log V)
LC           Clustering        O(V^3)
DSH          SFD               O(V^4)
BTDH         SFD               O(V^4)
CPM          SPD               O(V^2)
SDBS         SPD               O(V^2)
FSS          SPD               O(V^2)
LCTD         SFD               O(V^4)
CPFD         SFD               O(V^4)

As an illustration, Figure 2 presents the schedules obtained by each algorithm for the sample DAG of Figure 1. In this example, Pi represents processing element i; PT is the Parallel Time of the DAG; and [EST(Vi, Pk), i, ECT(Vi, Pk)] represents the earliest starting time and earliest completion time of task i.

P1: [0, 1, 10][10, 4, 70][190, 7, 260][260, 8, 270]
P2: [60, 3, 90][170, 6, 230]
P3: [60, 2, 80][160, 5, 210]
(a) Schedule by HNF (PT = 270)

P1: [0, 1, 10][10, 4, 70][140, 7, 210][210, 8, 220]
P2: [0, 1, 10][10, 3, 40]
P3: [0, 1, 10][10, 2, 30]
P4: [0, 1, 10][10, 4, 70][100, 6, 160]
P5: [0, 1, 10][10, 4, 70][110, 5, 160]
(b) Schedule by FSS (PT = 220)

P1: [0, 1, 10][10, 4, 70][190, 7, 260][260, 8, 270]
P2: [60, 3, 90][120, 5, 170]
P3: [60, 2, 80][170, 6, 230]
(c) Schedule by LC (PT = 270)

P1: [0, 1, 10][10, 4, 70][70, 3, 100][110, 7, 180][180, 8, 190]
P2: [0, 1, 10][10, 3, 40]
P3: [0, 1, 10][10, 2, 30]
P4: [0, 1, 10][10, 4, 70][70, 3, 100][100, 6, 160]
P5: [0, 1, 10][10, 4, 70][70, 3, 100][100, 5, 150]
(d) Schedule by DFRN (PT = 190)

P1: [0, 1, 10][10, 4, 70][70, 3, 100][100, 5, 150]
P2: [0, 1, 10][10, 3, 40][40, 4, 100][110, 7, 180][180, 8, 190]
P3: [0, 1, 10][10, 2, 30][30, 4, 90][100, 6, 160]
(e) Schedule by CPFD (PT = 190)

Figure 2. Schedules by various schedulers

4. The proposed algorithm

4.1 Motivation

When we ran existing scheduling algorithms for a DAG with about 400 nodes, we observed that an SFD algorithm takes about 50 minutes to generate a schedule while an SPD algorithm takes less than one second. We need a scheduler with a performance better than SPD algorithms, but with a running time adequate for applications consisting of a large number of tasks. This need became our goal, and the goal was achieved by employing a new task duplication approach called DFRN (Duplication First and Reduction Next).

The DFRN approach behaves the same as the SPD and SFD approaches in handling fork nodes but differently in handling join nodes. An SFD algorithm recursively estimates the effect of a possible duplication and decides whether to duplicate each node one by one. As a consequence, for a DAG with V nodes, each node may be considered V times for duplication in the worst case. Unlike the SFD approach, DFRN first duplicates all parent nodes in a bottom-up fashion up to the parent which has been scheduled on the same processor, without estimating the effect of their duplications. Then each duplicated task is removed if the task does not meet certain conditions. Also, SFD algorithms are applied to all the processors on which any iparent of the join node has been scheduled. We observed that, in most cases, the completion time of the join node on the critical processor was shorter than those on other processors after the duplication process. Thus, DFRN applies the duplication only for the critical processor with the hope that the critical processor is the best candidate for the join node. Those two differences provide an incomparably shorter running time than, but comparable performance to, SFD algorithms as shown in Section 5. On the other hand, the DFRN approach also achieves considerable performance improvement over SPD approaches.

4.2 Description of the proposed algorithm

The high level description of the DFRN algorithm is shown in Figure 3. In this figure, the notations Pc, Pu, CIP, IP, LN, and JN are used for the critical processor, an unused processor, the critical iparent, iparent, the last node, and a join node, respectively. In addition, a new term used in the algorithm is defined first.

Definition 10: At any step of the scheduling process, the last node of processor Pi is the most recent node assigned to Pi. In Figure 2(a), the last nodes of P1, P2, and P3 are V8, V6, and V5, respectively.

The term iparent, as used in the algorithm in Figure 3, indicates the iparent which has the minimum EST if there is more than one iparent image across different processors. For example, in Figure 2(d), V3 on P2 is identified as the iparent of its ichild since EST(V3, P2) = 10 while EST(V3, Pk) = 70 for k = 1, 4, and 5. The critical iparent and the critical processor are used in the same way. For example, V3 on P2 is identified as the critical iparent if V3 is the critical iparent of any node. The critical processor is P2 in this case.

Note that the algorithm is presented in a generic form so that we can use any list scheduling algorithm as a node selection algorithm. The node selection algorithm decides which node is considered first. HNF is used as the node selection algorithm in this paper.

In step (1), initialize() reads the input DAG and identifies the level of each node. All the nodes in the same level are sorted in descending order of their computation costs (as per the HNF heuristic). Step (2) considers each node according to the priority given in step (1). The node under consideration, Vi, can be either a join node or not. Steps (3) through (10) handle non-join nodes. The iparent in step (4) may or may not be the last node. If the iparent is the last node, Vi is scheduled after the iparent as shown in step (6). If the iparent is not the last node, tasks scheduled on the processor up to the iparent are copied (i.e., duplicated) to an unused processor. Then Vi is scheduled onto the unused processor to make EST of Vi the same as ECT of the iparent. Otherwise, EST of Vi is increased due to the computation time of the tasks between the iparent and the last node in the schedule. For example, V1 is the iparent of V3 in Figure 1. V3 is going to be scheduled onto the processor where V1 has been scheduled. If V3 is scheduled on P1 in Figure 2(d), EST(V3, P1) = 70 since ECT of the last node, V4, is 70. If we copy the iparent, V1, to an unused processor, P2, and schedule V3 on P2, EST(V3, P2) = 10 since ECT of the last node, V1, is 10.

If Vi is a join node, the critical iparent of Vi and the critical processor are identified in step (12). DFRN is applied to a join node in steps (14) or (17) after handling the last node in the same way. DFRN(Pa, Vi) consists of two procedures, try_duplication(Pa, Vi) and try_deletion(Pa, Vi), as shown in steps (21) and (22). try_duplication(Pc, Vi) first tries to duplicate the iparent giving the largest MAT to Vi. The procedure recursively searches its iparent from Vi in a bottom-up fashion until it finds the parent which has already been scheduled on Pa, as shown in steps (24) and (25). When it finds the parent on Pa, it stops the search and duplicates the parents searched so far as shown in step (27). As a result, Vi is duplicated before Vj when Vi ⇒ Vj, and try_deletion(Pa, Vi) considers each duplicated node one by one in the same sequence.

After the duplication step, try_deletion(Pa, Vi) decides whether to delete any of the duplicated tasks based on the two conditions in step (30). The first condition is for the case when the output of the duplicated task is available earlier by a message from the task on another processor than from the duplicated task itself; the duplicated task is deleted since the duplication is not necessary. The second condition is for the case when the duplication does not decrease EST(Vi, Pc) any more. By the second condition, EST of any node obtained by the DFRN algorithm is guaranteed to be less than or equal to EST of the same node obtained by SPD algorithms, since the second condition results in EST(Vi, Pc) ≤ MAT(DIP(Vi), Vi) while EST(Vi, Pc) = MAT(DIP(Vi), Vi) assuming ECT(CIP(Vi), Pc) ≤ MAT(DIP(Vi), Vi) in the SPD approach. The parallel time obtained from DFRN is also less than or equal to that from an SPD algorithm since the parallel time is the largest ECT of all the nodes in a DAG. Note that the FSS code used in our comparison study in Section 5 is not a pure SPD algorithm since the code assigns all the task nodes to one processor when the parallel time obtained is larger than the sum of the computation costs of the nodes in the input DAG.

Scheduling algorithm with DFRN
(1)  initialize()  // build a priority queue using HNF
(2)  for each node Vi in the queue  // in FIFO manner
(3)    if Vi is not a JN  // Vi has only one IP
(4)      identify the IP
(5)      if the IP is LN
(6)        schedule Vi to the PE having the IP
(7)      else  // if the IP is not LN
(8)        copy the schedule up to the IP onto Pu  // now the IP is LN in Pu
(9)        schedule Vi to Pu
(10)     endif
(11)   else  // if Vi is a join node
(12)     identify CIP and Pc
(13)     if CIP is LN
(14)       DFRN(Pc, Vi)  // apply DFRN to Pc
(15)     else  // if CIP is not LN
(16)       copy the schedule up to CIP onto Pu
(17)       DFRN(Pu, Vi)  // apply DFRN to Pu
(18)     endif
(19)   endif
(20) end for

DFRN(Pa, Vi)
(21) try_duplication(Pa, Vi)
(22) try_deletion(Pa, Vi)

try_duplication(Pa, Vi)
(23) for each Vp, (MAT(Vp, Vi) ≥ MAT(Vq, Vi), Vp ⇒ Vi, Vq ⇒ Vi, p ≠ q, Vp and Vq are not on Pa yet)
       // from the node giving the largest MAT to the node giving the smallest MAT
(24)   if there is any Vx, (MAT(Vx, Vp) ≥ MAT(Vy, Vp), Vx ⇒ Vp, Vy ⇒ Vp, x ≠ y, Vx and Vy are not on Pa yet)
         // if any IP of Vp is not scheduled on Pa
(25)     try_duplication(Pa, Vx)  // traces the IP which is not on Pa
(26)   else  // if all its IPs are scheduled on Pa
(27)     schedule Vp onto Pa  // duplicates the IP which is not on Pa
(28)   endif
(29) end for

try_deletion(Pa, Vi)
(30) delete any duplicated task Vk if
       (i) ECT(Vk, Pa) > MAT(Vk, Vd) or
       (ii) ECT(Vk, Pa) > MAT(DIP(Vi), Vi)
     // Vd is the ichild of Vk for which Vk is duplicated

Figure 3. Description of the DFRN algorithm

For a DAG with V nodes, step (1) takes O(V^2) time for sorting the nodes. Note that this complexity comes from the node selection heuristic, HNF, not from DFRN itself. It is trivial to see that it takes O(V) time for steps (4) through (9). Step (12) takes O(V) time to identify the critical iparent and the critical processor. try_duplication(Pa, Vi) duplicates V nodes in the worst case. Since try_duplication(Pa, Vi) duplicates parents in the order of MAT, the sorting takes O(V^2), which makes the complexity of the routine O(V^2). try_deletion(Pa, Vi) also takes O(V^2) time since it considers deletion m times and takes O(p) time for calculation of EST(Vi, Pa) whenever any node is deleted, where m is the number of tasks duplicated and p is the number of deleted iparents of the node, m ≤ V, p ≤ V. Thus the complexity of try_deletion(Pa, Vi) becomes O(V^2). The whole complexity becomes O(V^3) since DFRN(Pa, Vi) is executed q times, where q is the number of join nodes in the DAG, q ≤ V.

4.3 Analysis of the proposed algorithm

This section proves that the parallel time obtained by the proposed algorithm always lies between CPEC and CPIC, which were defined in Section 2. The proposed algorithm guarantees the worst case completion time for any Vj, such that ECT(Vj) = ECT(Vi) + T(Vj), Vi ⇒ Vj, for a non-join node Vj and ECT(Vj) = Max(ECT(CIP(Vj)) + T(Vj), ECT(DIP(Vj)) + C(DIP(Vj), Vj) + T(Vj)) for a join node Vj according to condition (ii) in step (30) of Figure 3. We will also prove that the proposed algorithm generates an optimal schedule for a tree structured input DAG. In the proofs, we assume that there is only one entry node and one exit node in the DAG. This does not limit the applicability of the proofs since any DAG can be easily transformed to this type of DAG by adding a dummy node for each entry node and exit node; communication costs for the edges connecting the dummy nodes are zero. The notations used in the proofs are summarized first as follows. The examples next to the definitions are taken from the sample DAG in Figure 1.

• Vr is the entry node and Ve is the exit node; e.g., Vr = V1, Ve = V8.
• Ln(Vi) is CPIC up to node Vi. Then Ln(Vr) = T(Vr) and Ln(Vj) = Ln(Vi) + C(Vi, Vj) + T(Vj), Vi ⇒ Vj; e.g., Ln(V7) = 340 and Ln(V8) = 400.
• Vi,k is a task node Vi at level k.

Note that the parallel time is the same as ECT(Ve), and the CPIC is the same as Ln(Ve). Therefore, proving that the parallel time is always less than or equal to CPIC is equivalent to proving ECT(Ve) ≤ Ln(Ve). The proof for ECT(Ve) ≤ Ln(Ve) is done by induction on the level of the nodes. The sketch of the proof is described first and the formal proof is presented next.

Since the DAG has only one entry node, none of the nodes in level one is a join node. ECT of any node in level one is less than or equal to the length of the critical path up to the node since, for any node Vi,1 in level one, ECT(Vi,1) = T(Vr) + T(Vi,1) while Ln(Vi,1) = T(Vr) + C(Vr, Vi,1) + T(Vi,1). The basis holds. We will prove that ECT(Vj,k+1) ≤ Ln(Vj,k+1) if ECT(Vi,k) ≤ Ln(Vi,k), Vi,k ⇒ Vj,k+1, which is the inductive hypothesis. The basic idea behind the inductive step is that Ln(Vj,k+1) = Ln(Vi,k) + C(Vi,k, Vj,k+1) + T(Vj,k+1) while ECT(Vj,k+1) = ECT(Vi,k) + C(Vi,k, Vj,k+1) + T(Vj,k+1), Vi,k ⇒ Vj,k+1, where ECT(Vi,k) ≤ Ln(Vi,k) from the basis. In addition, C(Vi,k, Vj,k+1) in the ECT(Vj,k+1) calculation can be zeroed if Vi,k and Vj,k+1 are scheduled onto the same processor, while C(Vi,k, Vj,k+1) in the Ln(Vj,k+1) calculation cannot be zeroed by the definition of CPIC. The inductive step compares ECT(Vj,k+1) and Ln(Vj,k+1) for all possible cases exhaustively.

Theorem 1: For any input DAG, when using DFRN for scheduling, ECT(Ve) ≤ Ln(Ve).
Proof)
1) Basis: ECT(Vi,1) ≤ Ln(Vi,1).
At level zero, ECT(Vr) = Ln(Vr) = T(Vr). ∀ Vi,1, Lv(Vi,1) = 1, Vi,1 is a non-join node since the DAG has only one entry node. ECT(Vi,1) = T(Vr) + T(Vi,1) while Ln(Vi,1) = T(Vr) + C(Vr, Vi,1) + T(Vi,1). ECT(Vi,1) ≤ Ln(Vi,1) since C(Vr, Vi,1) ≥ 0.
2) Inductive Hypothesis: If ECT(Vi,k) ≤ Ln(Vi,k) then ECT(Vj,k+1) ≤ Ln(Vj,k+1), Vi,k ⇒ Vj,k+1.
3) Inductive Step:
3.1) If Vj,k+1 is a non-join node, then ECT(Vj,k+1) = ECT(Vi,k) + T(Vj,k+1) while Ln(Vj,k+1) = Ln(Vi,k) + C(Vi,k, Vj,k+1) + T(Vj,k+1). Therefore ECT(Vj,k+1) ≤ Ln(Vj,k+1).
3.2) If Vj,k+1 is a join node, then ECT(Vj,k+1) = Max(ECT(CIP(Vj,k+1)) + T(Vj,k+1), ECT(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1)).
3.2.1) If ECT(CIP(Vj,k+1)) + T(Vj,k+1) ≥ ECT(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1), then ECT(Vj,k+1) = ECT(CIP(Vj,k+1)) + T(Vj,k+1) ≤ Ln(CIP(Vj,k+1)) + C(CIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤ Ln(Vj,k+1) by definition.
3.2.2) If ECT(CIP(Vj,k+1)) + T(Vj,k+1) < ECT(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1), then ECT(Vj,k+1) = ECT(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤ Ln(DIP(Vj,k+1)) + C(DIP(Vj,k+1), Vj,k+1) + T(Vj,k+1) ≤ Ln(Vj,k+1). Therefore ECT(Vj,k+1) ≤ Ln(Vj,k+1).

The proposed algorithm generates an optimal schedule for a tree structured input DAG. If we can prove that ECT(Ve) = CPEC for any tree, then our schedule is optimal because CPEC is the lower bound achievable. It is trivial to prove that no scheduler can generate a schedule shorter than CPEC. A scheduler is not executing all the tasks in the critical path if the schedule length is shorter than CPEC, which contradicts the system model defined in Section 2.

Theorem 2: For any tree structured input DAG, when using DFRN for scheduling, ECT(Ve) = CPEC.
Proof)
i) Basis: ECT(Vi,0) = T(Vi,0).
ii) Inductive Hypothesis: If ECT(Vi,k) = T(Vi,k), then ECT(Vj,k+1) = T(Vi,k) + T(Vj,k+1), Vi,k ⇒ Vj,k+1.
iii) Inductive Step: Since a tree does not have a join node, ECT(Vj,k+1) = ECT(Vi,k) + T(Vj,k+1). Since ECT(Vi,k) = T(Vi,k) from the basis, ECT(Vj,k+1) = T(Vi,k) + T(Vj,k+1).
The induction leads to ECT(Ve) = ∑k ∑i T(Vi,k) = CPEC, which is the lower bound for any scheduler.

5. Performance comparison

We generated 1000 random DAGs to compare the performance of DFRN with existing scheduling algorithms. We varied three parameters whose effects we wanted to investigate: the number of nodes, CCR (Communication to Computation Ratio), and the average degree (defined as the ratio of the number of edges to the number of nodes in the DAG). The numbers of nodes used are 20, 40, 60, 80, and 100 while the CCR values used are 0.1, 0.5, 1.0, 5.0, and 10.0. CCR is the ratio of average communication cost to average computation cost. We gave a parameter value to control the degree of the nodes in the DAG and obtained the average degree from a number of resulting DAGs. 40 DAGs are generated for each of the 25 combinations, which makes 1000 DAGs.

From existing schedulers, we selected one list scheduling, one clustering, one SPD, and one SFD algorithm for performance comparison. HNF is chosen from the list scheduling algorithms. LC is chosen from the clustering algorithms since LC uses the typical critical path based method. Since HNF is chosen as the node selection method, the effect of task duplication can be easily seen by comparing HNF with DFRN. From SPD algorithms, a more recent one, FSS [18], is chosen for comparison. CPFD is chosen from SFD algorithms since it has been shown that CPFD outperforms DSH and BTDH [9].

For performance comparison, we define one normalized performance measure named Relative Parallel Time (RPT), which is the ratio of the parallel time to CPEC. For example, if the parallel time obtained by DFRN is 200 and CPEC is 100, RPT of DFRN is 2.0. A smaller RPT value is indicative of a shorter parallel time. The RPT of any scheduling algorithm cannot be lower than one since CPEC is the lower bound.

One of our objectives is to observe the trade-off [...]

[...] seconds by HNF, 0.34 seconds by FSS, 2.95 minutes by LC, 17.3 seconds by DFRN, and 46.4 minutes by CPFD. There are significant differences among the running times.

Table II. Comparison of running times (in seconds)

N     HNF    FSS    LC       CPFD      DFRN
100   0.3    0.01   4.16     15.17     0.48
200   1.29   0.13   25.00    222.96    3.23
300   3.18   0.23   77.54    894.77    8.24
400   5.97   0.34   177.14   2782.56   17.3

Table III shows the result of the comparison between each pair of algorithms. Each entry of the table consists of three elements in "> a, = b, < c" format, which means that the algorithm in the row produces a longer parallel time than the algorithm in the column in a cases, the same parallel time in b cases, and a shorter parallel time in c cases. For example, to see the comparison between DFRN and HNF, we look up DFRN in the fifth row and HNF in the first column, or vice versa. In this case, the entry is "> 2, = 22, < 976", which means that, for the 1000 randomly generated DAGs, DFRN produces a longer parallel time than HNF in 2 cases, the same parallel time in 22 cases, and a shorter parallel time in 976 cases. The comparison shows that applying DFRN to HNF shortens the parallel time in 97.6% of the cases. Compared with LC, which has the same complexity as DFRN, DFRN generates a shorter parallel time in 829 cases, the same parallel time in 171 cases, and a longer parallel time in no case, while the running time of DFRN was shorter than that of LC. We also confirmed that the parallel time obtained by DFRN is always less than CPIC in the 1000 runs. On the other hand, compared with CPFD, DFRN generates a shorter parallel time in 27 cases, the same parallel time in 685 cases, and a longer parallel time in 288 cases. Note that DFRN provides the same parallel time as that obtained by CPFD in 68.5% of the cases with 0.00006% of running time of CPFD, which implies the effectiveness of the DFRN approach. Due to the incomparably long running time of CPFD, DFRN would be a good candidate for application programs consisting of a large number of tasks. For a DAG with a very large number of nodes, FSS will be appropriate because of its very short running time.
between the performance (the parallel time obtained) and the running time (the time taken to generate a schedule) Table III. Comparison of parallel times among the scheduling algorithms. Table II shows the HNF FSS LC CPFD DFRN actual average running time of the five algorithms. The HNF >0 > 885 > 587 > 978 > 976 running time is the user time obtained by time command = 1000 = 48 = 39 = 22 = 22 on a Sun Sparc10 workstation. For an input DAG with <0 < 67 < 374 <0 <2 400 nodes, the time taken to get a schedule was 5.97 FSS > 67 >0 > 27 > 575 > 567 = 48 = 1000 = 165 = 425 = 430 RPT < 885 <0 < 808 <0 <3 8 LC > 374 > 808 >0 > 829 > 829 7 HNF = 39 = 165 = 1000 = 171 = 171 6 FSS < 587 < 27 <0 <0 <0 5 4 LC CPFD >0 >0 >0 >0 > 27 3 DFRN = 22 = 425 = 171 = 1000 = 685 2 < 978 < 575 < 829 <0 < 288 1 CPFD DFRN >2 >3 >0 > 288 >0 0 = 22 = 430 = 171 = 685 = 1000 0.1 0.5 1 5 10 < 976 < 567 < 829 < 27 <0 CCR Graphical representations of the performance Figure 5. Comparison with respect to CCR comparison are shown in Figure 4, Figure 5, and Figure 6 with respect to N (the number of nodes), CCR, and the average degree, respectively. Each case in Figure 4 is an RPT 12 average of 200 runs varying CCR and the degree. The 10 HNF average CCR value and degree value are 3.3 and 3.8, respectively. As shown in Figure 4, the number of nodes 8 FSS does not significantly affect the relative performance of 6 LC scheduling algorithms. In other words, the performance 4 DFRN comparison shows similar patterns regardless of N. In the 2 CPFD pattern, DFRN shows much shorter parallel time than 0 existing algorithms with equal or lower time complexity 1.5 3.1 4.6 6.1 while it shows a comparable performance to CPFD. Degree CCR is a critical parameter. As CCR is increased, the performance gap becomes larger as shown in Figure 5. The difference among 5 algorithms was negligible until Figure 6. Comparison with respect to degree CCR is one. 
But when CCR is 5, RPT of HNF, FSS, LC, DFRN, and CPFD become 3.38, 2.57, 3.61, 1.67, and 6. Conclusion 1.61, respectively. When CCR is 10, they are 5.79, 5.01, 7.68, 2.45, and 2.27, respectively. As expected, This paper classified existing DBS algorithms into duplication-based scheduling algorithms show two categories, SPD and SFD algorithms, according to considerable performance improvement for a DAG with the duplication method used for a join node. SFD high CCR values. Various values of the average degree algorithms try to duplicate iparents of a join node while do not significantly change the pattern of the graph but SPD algorithms do not. As a result, a SFD algorithm changes the scale of the graph. outperforms a SPD algorithms while its running time is incomparably longer than that of a SPD algorithms. This RPT paper presented a new duplication-based scheduling 3.5 algorithm (i.e, DFRN) by trying to combine good 3 HNF features of the two approaches. The motivation is to 2.5 FSS duplicate iparents for a join node if the duplication 2 LC reduces EST of the join node as done in SFD algorithms 1.5 DFRN but without adding much complexity so that the new 1 CPFD approach will be well suitable for applications consisting 0.5 of large number of tasks. 0 Unlike the existing methods, DFRN first duplicates all 20 40 60 80 100 the duplicable parents at once and tries to delete then one N by one later if the duplicated task does not meet certain conditions. We analytically obtained the boundary Figure 4. Comparison with respect to N condition for the worst case performance of the proposed algorithm and also proved that the algorithm generates an optimal schedule for a tree structured input DAG. Our performance study showed that DFRN has a run- Multiprocessors,” Proc. of Supercomputing’92, Nov. 1992, time comparable to SPD and non-duplicating scheduling pp. 512-521. algorithms, while out-performing such algorithms by [12] J. Y. Colin and P. Chretienne, “C.P.M. 
Scheduling with generating schedules with much shorter parallel times. Small Communication Delays and Task Duplication,” Operations Research, 1991, pp. 680-684. Compared to SFD algorithms, DFRN offers comparable [13] S. Darbha and D. P. Agrawal, “SDBS: A task duplication performance with a run-time which is several orders of based optimal scheduling algorithm,” Proc. of Scalable High magnitude shorter. Performance Computing Conf., May 1994, pp. 756-763. [14] B. Kruatrachue and T. G. Lewis, “Grain Size Acknowledgment Determination for parallel processing,” IEEE Software, Jan. 1988, pp. 23-32. [15] C. H. Papadimitriou and M. Yannakakis, “Towards an We would like to express our appreciation to Drs. architecture-independent analysis of parallel algorithms,” ACM Ishfaq Ahmad, Sekhar Darbha, Dharma Agrawal, and Proc. of Symp. on Theory of Computing (STOC), 1988, pp. their research groups for providing their comments and 510-513. the source code for CPFD and FSS schedulers which [16] M. Y. Wu and D. D. Gajski, “Hypertool: A Programming were used in our performance comparison study. Aid for Message-Passing Systems,” IEEE Trans. on Parallel and Distributed Systems, vol. 1, no. 3, Jul. 1990, pp. 330-340. [17] S. J. Kim and J. C. Browne, “A general approach to References mapping of parallel computation upon multiprocessor architectures,” Proc. of Int’l Conf. on Parallel Processing, vol [1] B. Shirazi, M. Wang, and G. Pathak, “Analysis and III, 1988, pp. 1-8. Evaluation of Heuristic Methods for Static Task Scheduling,” [18] S. Darbha and D. P. Agrawal, “A Fast and Scalable Journal of Parallel and Distributed Computing, vol. 10, No. 3, Scheduling Algorithm for Distributed Memory Systems,” Proc. 1990, pp. 222-232. of Symp. On Parallel and Distributed Processing, Oct. 1995, [2] B. Shirazi, A. R. Hurson, "Scheduling and Load Balancing: pp. 60-63. Guest Editors' Introduction," Journal of Parallel and Distributed Computing, Dec. 1992, pp. 271-275. [3] B. Shirazi, A. R. 
Hurson, "A Mini-track on Scheduling and Load Balancing: Track Coordinator's Introduction," Hawaii Int'l Conf. on System Sciences (HICSS-26), Jan. 1993, pp. 484- 486. [4] B. Shirazi, A. R. Hurson, K. Kavi, "Scheduling & Load Balancing," IEEE Press, 1995. [5] B. Shirazi, H.-B. Chen, and J. Marquis, “Comparative Study of Task Duplication Static Scheduling versus Clustering and Non-Clustering Techniques,” Concurrency: Practice and Experience, vol. 7(5), Aug. 1995, pp. 371-389. [6] M.Y. Wu, A dedicated track on “Program Partitioning and Scheduling in Parallel and Distributed Systems,” in the Hawaii Int’l Conference on Systems Sciences, Jan. 1994. [7] T. Yang and A. Gerasoulis, A dedicated track on “Partitioning and Scheduling for Parallel and Distributed Computation,” in the Hawaii Int’l Conference on Systems Sciences, Jan. 1995. [8] T. L. Adam, K. Chandy, and J. Dickson, “A Comparison of List Scheduling for Parallel Processing System,” Communication of the ACM, vol. 17, no. 12, Dec. 1974, pp. 685-690. [9] I. Ahmad and Y. K. Kwok, “A New Approach to Scheduling Parallel Program Using Task Duplication,” Proc. of Int’l Conf. on Parallel Processing, vol. II, Aug. 1994, pp. 47-51. [10] H. Chen, B. Shirazi, and J. Marquis, “Performance Evaluation of A Novel Scheduling Method: Linear Clustering with Task Duplication,” Proc. of Int’l Conf. on Parallel and Distributed Systems, Dec. 1993, pp. 270-275. [11] Y. C. Chung and S. Ranka, “Application and Performance Analysis of a Compile-Time Optimization Approach for List Scheduling Algorithms on Distributed-Memory
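
Illustrative sketch. As a small illustration of the quantities used in the analysis and in the performance comparison, the following Python sketch computes CPIC, CPEC, and RPT for a toy DAG. It is not part of the original study: the four-node DAG, its costs, and the function names critical_path_length and relative_parallel_time are hypothetical. The sketch assumes that Ln(Vj) takes the maximum of Ln(Vi) + C(Vi, Vj) over all iparents Vi of Vj, and that CPEC is the same longest path computed with every communication cost treated as zero.

```python
from functools import lru_cache

# Hypothetical fork-join DAG (NOT the sample DAG of Figure 1):
# V1 -> V2 -> V4 and V1 -> V3 -> V4.
T = {"V1": 10, "V2": 20, "V3": 30, "V4": 10}   # computation costs T(Vi)
C = {("V1", "V2"): 50, ("V1", "V3"): 40,       # communication costs C(Vi, Vj)
     ("V2", "V4"): 30, ("V3", "V4"): 60}


def critical_path_length(T, C, exit_node, include_comm):
    """Longest entry-to-exit path length.

    include_comm=True  -> CPIC (communication costs counted), i.e. Ln(Ve).
    include_comm=False -> CPEC (communication costs treated as zero).
    Assumption: a node's length is the maximum over all of its iparents.
    """
    parents = {}
    for (u, v) in C:
        parents.setdefault(v, []).append(u)

    @lru_cache(maxsize=None)
    def length(v):
        best = 0  # the entry node has no iparent, so Ln(Vr) = T(Vr)
        for u in parents.get(v, []):
            edge = C[(u, v)] if include_comm else 0
            best = max(best, length(u) + edge)
        return best + T[v]

    return length(exit_node)


def relative_parallel_time(parallel_time, cpec):
    """RPT: ratio of the parallel time of a schedule to CPEC."""
    return parallel_time / cpec


cpic = critical_path_length(T, C, "V4", include_comm=True)
cpec = critical_path_length(T, C, "V4", include_comm=False)
print(cpic, cpec, relative_parallel_time(200, 100))
```

With these hypothetical costs the sketch gives CPIC = 150 and CPEC = 50, so a schedule with parallel time 75 would have RPT = 1.5. The last printed value reproduces the example in the text, where a parallel time of 200 and a CPEC of 100 give an RPT of 2.0.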
