CMPE 511 COMPUTER ARCHITECTURE PRESENTATION REPORT

"Low-Cost Task Scheduling for Distributed-Memory Machines"
Andrei Radulescu and Arjan J. C. van Gemund

submitted by Bahadır Kaan Özütam
Boğaziçi University
Fall 2003

TABLE OF CONTENTS

Abstract
1. Introduction
2. Preliminaries
3. General Framework for List Scheduling with Static Priorities
3.1 Processor Selection
3.2 Task Selection
3.3 Complexity Analysis
3.4 Case Study
4. Extensions for List Scheduling with Dynamic Priorities
5. Conclusion

Abstract

List scheduling algorithms schedule tasks in order of priority, which can be computed either 1) statically, before the scheduling, or 2) dynamically, during the scheduling. This paper shows that list scheduling with statically computed priorities can be performed at a significantly lower cost than existing approaches, without sacrificing performance. The low complexity is achieved by using low-complexity methods for the most time-consuming parts of list scheduling algorithms, i.e., processor selection and task selection, while preserving the criteria used in the original algorithms.

1. Introduction

To obtain performance from a parallel program, it is very important to map the program efficiently onto the target system. This problem is referred to as task scheduling, the tasks being the schedulable units of a program. Scheduling parallel applications has been proven to be NP-complete, so heuristics are used to solve the problem efficiently. To be practical for large applications, scheduling heuristics must have a low complexity. For shared-memory systems, it has been proven that even a low-cost scheduling heuristic is guaranteed to produce linear speedup. In the distributed-memory case, however, communication must also be taken into account, which significantly complicates the problem; in this case, scheduling at low cost remains a challenge.
Distributed-memory scheduling heuristics exist for both bounded and unbounded numbers of processors. Assuming an unbounded number of processors is attractive, but it is not always applicable because the required number of processors is usually not available. Scheduling for an unbounded number of processors is, however, typically included as one step in multi-step scheduling methods for a bounded number of processors. Scheduling for a bounded number of processors can also be performed in a single step; single-step approaches usually produce better results than multi-step methods, but at a higher cost. Scheduling for a bounded number of processors can be performed either with or without task duplication. Duplicating tasks results in better performance, but significantly increases the scheduling cost. Non-duplicating heuristics have a lower complexity and still obtain good schedules. However, when compiling very large programs for very large systems, the complexity of current approaches is often prohibitive.

An important class of algorithms for a bounded number of processors is list scheduling. For a bounded number of processors, list scheduling algorithms perform well at a relatively low cost compared to other scheduling algorithms.

There are two approaches to list scheduling. The first is list scheduling with static priorities (LSSP). In LSSP, the priorities are computed beforehand and the tasks are scheduled in order of these priorities, each task on its best processor. Thus, at each scheduling step, first the task is selected, and then the processor for that task. If performance is the main concern, the best processor is the one enabling the earliest start time for the given task; if scheduling speed is the main concern, the best processor is the one becoming idle the earliest. The second approach is list scheduling with dynamic priorities (LSDP).
In LSDP, at each scheduling step a ready task and a processor for it are selected at the same time. The selection is based on priorities associated with task-processor pairs; the combination producing the earliest start or finish time is selected. LSDP has a more complex task and processor selection scheme: it is able to produce better schedules than LSSP, but at a significantly higher cost.

In this paper, it is shown that any LSSP algorithm can be performed at a significantly lower cost compared to existing approaches. Existing LSSP algorithms that already have a low time complexity run in O(V log V + (V + E) P) time, where V is the number of tasks, E is the number of dependences, and P is the number of processors. Using the proposed approach, the LSSP complexity is reduced to O(V log P + E) while maintaining comparable performance. The cost is reduced by 1) considering only two processors when selecting the destination processor for a given task (proven to preserve the original selection criterion), and 2) maintaining a partially sorted task priority queue in which only a fixed number of tasks are sorted. The approach is then generalized to a particular class of LSDP algorithms.

2. Preliminaries

A parallel program can be modeled by a directed acyclic graph (DAG) G = (V, E), where V is the set of vertices (tasks) and E is the set of edges (dependences).

[Figure 1. Example of a directed acyclic graph (DAG) modeling a parallel program.]

Each task t has a computation cost Tw(t). The edges, corresponding to task dependences, have communication costs Tc(t, t'). If two tasks t1 and t2 are scheduled on the same processor, Tc(t1, t2) is assumed to be zero. The communication-to-computation ratio (CCR) of a task graph is a measure of the graph's granularity and is defined as the ratio of the average communication cost to the average computation cost in the graph.
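The cost model just described can be sketched in a few lines of Python. This is a minimal illustration of ours, not code from the paper; the dictionaries Tw and Tc and the toy graph values are hypothetical.

```python
# A minimal sketch of the task graph cost model (hypothetical toy values).
# Tw maps each task to its computation cost; Tc maps each dependence edge
# (t, t') to its communication cost.
Tw = {"t0": 2, "t1": 2, "t2": 2}
Tc = {("t0", "t1"): 1, ("t0", "t2"): 4}

def ccr(Tw, Tc):
    """Communication-to-computation ratio (CCR): the average communication
    cost divided by the average computation cost of the graph."""
    return (sum(Tc.values()) / len(Tc)) / (sum(Tw.values()) / len(Tw))
```

For this toy graph, the CCR is ((1 + 4) / 2) / ((2 + 2 + 2) / 3) = 1.25, i.e., communication is on average a bit more expensive than computation.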
The task graph width (W) is defined as the maximum number of tasks that are not connected through a path. Usually W is less than V, but in the worst case it may be equal to V. Tasks with no input edges are called entry tasks, and tasks with no output edges are called exit tasks. The bottom level Tb(t) of a task is defined as the length of the longest path from that task to any exit task, where the length of a path is the sum of the communication and computation costs belonging to it.

A task is said to be ready if all its parents have been scheduled. Note that the number of ready tasks never exceeds W; this will be important for the technique. Once a task t has been scheduled, it is associated with a processor pt(t), a start time Ts(t), and a finish time Tf(t). A partial schedule is obtained when only a subset of the tasks has been scheduled.

The processor ready time of a processor p on a partial schedule is defined as the finish time of the last task scheduled on that processor:

Tr(p) = max { Tf(t) : t ∈ V, pt(t) = p }

Given a partial schedule, the processor becoming idle the earliest, pr, is defined as the processor with the minimum ready time:

Tr(pr) = min { Tr(p) : p ∈ P }

If more than one processor becomes idle at the same time, pr is selected randomly among them.

The last message arrival time of a ready task is defined as

Tm(t) = max { Tf(t') + Tc(t', t) : (t', t) ∈ E }

The enabling processor pe(t) of a ready task t is the processor from which the last message arrives. Again, if the same Tm(t) occurs for more than one processor, the enabling processor is selected randomly among them. Messages sent within the same processor are assumed to take zero communication time.
Therefore, we define the effective message arrival time of a ready task t on a processor p as

Te(t, p) = max { Tf(t') + Tc(t', t) : (t', t) ∈ E, pt(t') ≠ p }

The start time of a ready task, once it is scheduled on a processor, is defined as the maximum of the effective message arrival time and the processor's ready time:

Ts(t, p) = max { Te(t, p), Tr(p) }

3. General Framework for List Scheduling with Static Priorities

Analyzing LSSP algorithms, we can distinguish three parts:

Task priority computation takes at least O(E + V) time, since the whole graph must be traversed.

Task selection includes sorting the ready tasks according to their priorities and selecting, at each iteration, the task with the highest priority. Task selection takes O(V log W) time, since the ready tasks have to be maintained sorted.

Processor selection selects the best processor for the selected task, usually the processor on which the task can start the earliest. Processor selection takes O((E + V) P) time, since to find the earliest start time of a task, its start time must be computed for each processor.

The highest-complexity parts of LSSP algorithms are thus the task and processor selection parts. We now explain the approach that reduces the time complexity of these two parts.

3.1 Processor Selection

Selecting the processor on which a task starts the earliest need not consider all processors. It needs to consider only two: 1) the enabling processor pe(t), and 2) the processor becoming idle the earliest, pr. The start time of a task t on a processor p is defined as the maximum of 1) the effective message arrival time and 2) the time p becomes idle. Thus, the start time is minimized on one of the two processors that minimize these two components of the start time. Consequently, there are two candidates: the enabling processor and the processor becoming idle the earliest. This is formalized in Lemma 1.

Lemma 1. For any processor p ≠ pe(t): Te(t, p) = Tm(t).

Theorem 1 follows.

Theorem 1.
For a ready task t, one of the processors pe(t) or pr satisfies Ts(t, p) = min { Ts(t, px) : px ∈ P }.

Theorem 1 shows that restricting the selection to these two processors does not affect the performance of the algorithm. Although the resulting selection is very similar to that of the original algorithm, there is a minor difference: a ready task may be able to start at the same earliest time on several processors, and considering only two processors instead of all of them may change the tie-breaking in this case. As a consequence, there are a few cases in which the approach selects a different processor for a task than the original algorithm would. Still, the processor selection performs as accurately as the original one, while the complexity is significantly reduced from O((E + V) P) to O(V log P + E): O(E + V) to traverse the task graph and O(V log P) to maintain the processors sorted at each step.

3.2 Task Selection

The original task selection complexity is O(V log W). It can be reduced by sorting only a constant number of ready tasks. The task priority queue is therefore composed of 1) a sorted list of fixed size H, and 2) a first-in first-out (FIFO) list with O(1) access time. When a task becomes ready, it is added to the sorted list if there is room; otherwise it is added to the FIFO list. Consequently, as long as the sorted list is not full, there cannot be a task in the FIFO list. Tasks are always dequeued from the sorted list, and after a dequeue operation, if the FIFO list is not empty, one task is moved from the FIFO list to the sorted list. Reducing the size of the priority queue to H decreases the time complexity of sorting the tasks from O(V log W) to O(V log H).
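The two low-cost selection schemes described above can be sketched as follows. This is an illustrative sketch of ours, not the authors' implementation: a binary heap stands in for the sorted list of size H, and all names (HybridReadyQueue, select_processor, and their parameters) are hypothetical.

```python
import heapq
from collections import deque

class HybridReadyQueue:
    """Partially sorted task priority queue (Section 3.2): a sorted part
    holding at most H tasks plus a FIFO overflow list."""

    def __init__(self, H):
        self.H = H
        self.sorted_part = []  # heap of (-priority, task): highest priority first
        self.fifo = deque()    # overflow tasks in arrival order, O(1) access

    def enqueue(self, task, priority):
        # A newly ready task enters the sorted part only if there is room.
        if len(self.sorted_part) < self.H:
            heapq.heappush(self.sorted_part, (-priority, task))
        else:
            self.fifo.append((task, priority))

    def dequeue(self):
        # Tasks are always dequeued from the sorted part; afterwards one
        # task (if any) is promoted from the FIFO list.
        _, task = heapq.heappop(self.sorted_part)
        if self.fifo:
            t, prio = self.fifo.popleft()
            heapq.heappush(self.sorted_part, (-prio, t))
        return task

def select_processor(pe, Te_on_pe, Tm, Tr):
    """Two-candidate processor selection (Section 3.1). Compares the start
    time on the enabling processor pe with the start time on the processor
    pr becoming idle the earliest; on any p != pe, Te(t, p) = Tm(t) by
    Lemma 1. Tr maps each processor to its ready time. For brevity pr is
    found by a linear scan; keeping the processors in a heap gives the
    O(log P) per-task bound from the text."""
    pr = min(Tr, key=Tr.get)       # processor becoming idle the earliest
    ts_pe = max(Te_on_pe, Tr[pe])  # Ts(t, pe)
    ts_pr = max(Tm, Tr[pr])        # Ts(t, pr)
    return pe if ts_pe <= ts_pr else pr
```

For example, with H = 2, tasks made ready in the order t1 (priority 11), t2 (9), t3 (12) leave t3 in the FIFO list, so t1 is dequeued first even though t3 has the higher priority; t3 is promoted to the sorted part immediately afterwards. This is exactly the effect visible in the execution trace of Section 3.4.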
Note that H is still kept as part of the complexity, not dropped as a constant, because to achieve good performance H needs to be adjusted with P. A possible disadvantage of sorting only a limited number of tasks is that the highest-priority task may not be in the sorted list, but temporarily kept in the FIFO list. To minimize the probability of this, H must be kept large enough; at the same time, it must not be so large that it increases the time complexity. The experiments show that a queue size of H = P is required to maintain performance comparable to the original.

3.3 Complexity Analysis

The complexity of the resulting LSSP algorithm is as follows. Computing the task priorities takes O(E + V). For a fully sorted priority queue, task selection takes O(V log W); for a partially sorted priority queue of size H, it takes O(V log H). As a queue size of P gives good results, task selection with a partially sorted queue takes O(V log P). Finding the processor becoming idle the earliest takes O(log P) for each task, and O(V log P) for all tasks. As a result, the total complexity of the LSSP algorithm is O(V (log W + log P) + E) if a fully sorted queue is used, and O(V log P + E) if a partially sorted queue is used.

3.4 Case Study

In this section, the task and processor selection techniques are illustrated by applying them to a slightly modified version of MCP (Modified Critical Path). In MCP, the task having the highest bottom level has the highest priority. We modify MCP by using our task and processor selection techniques and name the resulting algorithm FCP (Fast Critical Path). It is applied to a 3-processor case, using a partially sorted queue of size 2, on the task graph of eight tasks given in Figure 2. Table 1 gives the execution trace of this case.
[Figure 2. The task graph of the case study in Section 3.4. Node labels t/Tw(t): t0/2, t1/2, t2/2, t3/3, t4/3, t5/3, t6/2, t7/2; edge labels give the communication costs (1, 1, 4 on the first level of edges; 1, 4, 1, 3, 2 on the second; 2, 1, 3 on the third).]

Ready tasks                          Scheduling
Sorted            FIFO               t     t -> p [Ts - Tf]
t0 [15]           -                  t0    t0 -> p0 [0 - 2]
t1 [11] t2 [9]    t3 [12]            t1    t1 -> p0 [2 - 4]
t2 [9] t3 [12]    t4 [6]             t3    t3 -> p1 [3 - 6]
t2 [9] t4 [6]     t5 [8]             t2    t2 -> p0 [4 - 6]
t4 [6] t5 [8]     t6 [6]             t5    t5 -> p2 [6 - 9]
t4 [6] t6 [6]     -                  t4    t4 -> p0 [6 - 9]
t6 [6]            -                  t6    t6 -> p1 [7 - 9]
t7 [2]            -                  t7    t7 -> p2 [11 - 13]

Table 1. The execution trace of the case study in Section 3.4 (bracketed values are bottom levels).

4. Extensions for List Scheduling with Dynamic Priorities

In this section, we explore the possibility of extending the results for LSSP algorithms to LSDP algorithms. In LSDP algorithms, priorities are associated with task-processor pairs, and at each iteration the pair having the highest priority is selected. The priorities change during the scheduling process and are recomputed at each iteration. For example, ETF schedules at each step the ready task that starts the earliest, on the processor where this earliest start time is obtained. ERT schedules at each step the ready task that finishes the earliest, on the processor where this earliest finish time is obtained. DLS defines its own priority, called the dynamic level, and schedules the task-processor pair having the highest dynamic level. In general, these algorithms use a dynamic priority of the form

π(t, p) = σ(t) + max { Te(t, p), Tr(p) }

where σ(t) is a value independent of the scheduling process, which can be computed before the scheduling starts. The σ(t) values for the different algorithms are σ(t) = 0 for ETF, σ(t) = Tw(t) for ERT, and σ(t) = -Tb(t) for DLS.

We separate two cases: the task starting on its enabling processor (the EP case) and the task starting on a non-enabling processor (the non-EP case). In the EP case, selecting the task having the highest priority on its enabling processor is performed in two steps.
First, on each processor, the tasks enabled by that processor are sorted according to their priority. Second, the processors are sorted according to the priorities of the tasks they enable. In the non-EP case, we observe that a task's priority on a non-enabling processor is minimized by the processor becoming idle the earliest. If a task and its enabling processor are selected, the task will have a higher priority than any task starting on a non-enabling processor; as a consequence, the EP case yields the task-processor pair with the highest priority, and the result is not affected.

Using this task and processor selection scheme, the task-processor pair having the highest priority can be found among three candidates: one for the EP case and two for the non-EP case. There are P task priority queues maintained for the EP case and two for the non-EP case; each task, however, is added to only three task priority queues, one for the EP case and two for the non-EP case. Two processor queues are also maintained, one for the EP case and one for the non-EP case. As a result, the time complexity of the LSDP algorithms becomes O(V (log W + log P) + E). This is already a significant improvement compared to the original O(W (E + V) P) time complexity. It can be further reduced by using a partially sorted priority queue; a queue size of P is required to maintain comparable performance, in which case the complexity becomes O(V log P + E).

5. Conclusion

In this paper, it is shown that list scheduling with static priorities can be performed at a significantly lower cost, without reducing performance. The approach is general and can be applied to any list scheduling algorithm with static priorities. In this framework, list scheduling algorithms have a low time complexity because low-complexity methods are used for the most time-consuming parts of the algorithms, namely processor selection and task selection.
Processor selection is performed by selecting between only two processors: the task's enabling processor and the processor that becomes idle the earliest. For task selection, instead of sorting all ready tasks, only a limited number of them are sorted. Using an extension of this method, the time complexity of a particular class of list scheduling algorithms with dynamic priorities can also be significantly reduced. Especially for the large problem and processor dimensions involved in real-world high-performance computing, the results indicate that the approach offers a superior cost-performance trade-off compared to current list scheduling algorithms.