CMPE 511 COMPUTER ARCHITECTURE

PRESENTATION REPORT





"Low-Cost Task Scheduling for Distributed-Memory

Machines"

Andrei Radulescu and Arjan J. C. van Gemund









submitted by

Bahadır Kaan Özütam







Boğaziçi University

Fall - 2003

TABLE OF CONTENTS

Abstract

1. Introduction

2. Preliminaries

3. General Framework for List Scheduling with Static Priorities

3.1 Processor Selection

3.2 Task Selection

3.3 Complexity Analysis

3.4 Case Study

4. Extensions for List Scheduling with Dynamic Priorities

5. Conclusion










Abstract





List scheduling algorithms schedule tasks in order of their priority. This priority can be
computed either 1) statically, before the scheduling, or 2) dynamically, during the scheduling.
This paper shows that list scheduling with statically computed priorities can be performed at
a significantly lower cost than existing approaches, without sacrificing performance. The low
complexity is achieved by using low-complexity methods for the most time-consuming parts
of list scheduling algorithms, i.e. processor selection and task selection, while preserving the
criteria used in the original algorithms.










1. Introduction





To obtain good performance from a parallel program, it is very important to map the program
efficiently to the target system. This problem is referred to as task scheduling, where the
tasks are the schedulable units of a program. Scheduling parallel applications has been proven
to be NP-complete, so heuristics are used to solve the problem efficiently. In order to be
practical for large applications, scheduling heuristics must have a low complexity.





For shared-memory systems, it has been proven that even a low-cost scheduling heuristic is
guaranteed to produce linear speedup. In the distributed-memory case, however,
communication must also be taken into account, which significantly complicates the problem.
In this case, low-cost scheduling remains a challenge.





Distributed-memory scheduling heuristics exist for both bounded and unbounded numbers of
processors. Scheduling for an unbounded number of processors is attractive, but it is not
always applicable because the required number of processors is usually not available. Such
heuristics are therefore typically applied as one step of multi-step scheduling methods for a
bounded number of processors.





Scheduling for a bounded number of processors can be performed either in a single step or
with multi-step methods. Single-step approaches usually produce better results, but at a
higher cost.





Scheduling for a bounded number of processors can be performed either with or without task
duplication. Duplicating tasks results in better performance but significantly increases the
scheduling cost. Non-duplicating heuristics have a lower complexity and still obtain good
schedules. However, when compiling very large programs for very large systems, the
complexity of current approaches is often prohibitive.





An important class of algorithms for a bounded number of processors is list scheduling. List
scheduling algorithms perform well at a relatively low cost compared to other scheduling
algorithms.










There are two approaches to list scheduling. The first is list scheduling with static
priorities (LSSP). In LSSP, the priorities are computed before scheduling and the tasks are
scheduled in order of these priorities. Each task is scheduled on its best processor. Thus, at
each scheduling step, first the task is selected and then the processor for this task is
selected. If performance is the main concern, the best processor is the one enabling the
earliest start time for the given task. However, if scheduling speed is the main concern, the
best processor is the one becoming idle the earliest when the task is scheduled.





The second approach is list scheduling with dynamic priorities (LSDP). In this case, at each
scheduling step a ready task and its processor are selected at the same time. The selection is
based on priorities associated with task-processor pairs, for example the combination that
produces the earliest start or finish time. LSDP has a more complex task and processor
selection scheme. It is able to produce better schedules than LSSP, but at a significantly
higher cost.





In this paper, it is shown that any LSSP algorithm can be performed at a significantly lower
cost than with existing approaches. Existing LSSP algorithms that already have a low time
complexity run in O(V log V + (V + E) P), where V is the number of tasks, E is the number of
dependences, and P is the number of processors. Using the proposed approach, the LSSP
complexity is reduced to O(V log P + E) while maintaining comparable performance.





The cost is reduced by

1) considering only two processors when selecting the destination processor for a

given task (proven to preserve the original selection criterion).

2) maintaining a partially sorted task priority queue in which only a fixed number of

tasks are sorted.





The approach is also generalized to a particular class of LSDP algorithms.










2. Preliminaries





A parallel program can be modeled by a Directed Acyclic Graph (DAG) G = (V, E), where V

is a set of vertices and E is a set of edges.









Figure 1. Example of a directed acyclic graph (DAG) modeling a parallel program.





Each task t has a computation cost Tw(t). The edges, which correspond to task dependences,
have a communication cost Tc(t, t’). If two tasks t1 and t2 are scheduled on the same
processor, Tc(t1, t2) is assumed to be zero.



The communication-to-computation ratio (CCR) of a task graph is a measure of the graph’s
granularity and is defined as the ratio of the average communication cost to the average
computation cost in the graph.
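
Written out as a formula, and assuming plain arithmetic means over the edges and over the tasks (the text does not say which average is meant), the definition reads:

```latex
\mathrm{CCR} =
  \frac{\frac{1}{|E|} \sum_{(t, t') \in E} T_c(t, t')}
       {\frac{1}{|V|} \sum_{t \in V} T_w(t)}
```

A large CCR indicates a fine-grained graph in which communication dominates; a small CCR indicates a coarse-grained graph.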










The task graph width (W) is defined as the maximum number of tasks that are mutually not
connected through a path. Usually W is less than V, but in the worst case it may be equal to V.



Tasks with no input edges are called entry tasks and tasks with no output edges are called exit

tasks.



The bottom level (Tb) of a task is defined as the length of the longest path from that task to

any exit task where the length is the sum of the communication costs and computation costs

belonging to that path.
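
As a rough illustration of this definition, the following Python sketch computes the bottom level of every task in a small task graph. The graph, its costs, and the names `bottom_levels`, `tw` and `tc` are assumptions made for the example and are not taken from the paper. The sketch also assumes that the path length includes the computation cost of the task itself.

```python
from functools import lru_cache

def bottom_levels(tw, tc):
    """Return {task: Tb(task)} for a DAG.

    tw maps each task to its computation cost Tw(t).
    tc maps each edge (t, child) to its communication cost Tc(t, child).
    """
    children = {t: [] for t in tw}
    for (src, dst) in tc:
        children[src].append(dst)

    @lru_cache(maxsize=None)
    def tb(t):
        # An exit task has no children, so its bottom level is just Tw(t).
        tail = max((tc[(t, c)] + tb(c) for c in children[t]), default=0)
        return tw[t] + tail

    return {t: tb(t) for t in tw}

# Hypothetical four-task diamond graph (not the graph of the case study).
tw = {"a": 2, "b": 3, "c": 1, "d": 2}
tc = {("a", "b"): 1, ("a", "c"): 4, ("b", "d"): 2, ("c", "d"): 1}
levels = bottom_levels(tw, tc)
print(levels)
```

Each task is evaluated once thanks to memoization, so the traversal visits every vertex and edge a constant number of times.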



A task is said to be ready if all its parents have been scheduled. Note that the number of
ready tasks never exceeds W. This will be important for our technique.



Once a task t has been scheduled, it has been associated with a processor pt(t) with a start time

Ts(t) and a finish time Tf(t).



A partial schedule is obtained when only a subset of the tasks has been scheduled. The
processor ready time of a processor p on a partial schedule is defined as the finish time of the
last task scheduled on that processor:

$$T_r(p) = \max \{\, T_f(t) : t \in V,\ p_t(t) = p \,\}$$



Given a partial schedule, we define the processor becoming idle the earliest (pr) to be the
processor with the minimum Tr:

$$T_r(p_r) = \min_{p \in P} T_r(p)$$

If more than one processor becomes idle at the same time, pr is selected randomly among
them. The last message arrival time of a ready task is defined as

$$T_m(t) = \max_{(t', t) \in E} \{\, T_f(t') + T_c(t', t) \,\}$$



The enabling processor pe(t) of a ready task t is the processor from which the last message
arrives. Here too, if the same Tm(t) occurs for more than one processor, the enabling
processor is selected randomly among them. Messages sent within the same processor are
assumed to take zero communication time. Therefore, we define the effective message arrival
time

$$T_e(t, p) = \max_{(t', t) \in E,\ p_t(t') \neq p} \{\, T_f(t') + T_c(t', t) \,\}$$



The start time of a ready task, once it is scheduled on a processor, is defined as the maximum
of the effective message arrival time and the processor’s ready time:

$$T_s(t, p) = \max \{\, T_e(t, p),\ T_r(p) \,\}$$
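
The definitions above translate almost directly into code. The following is a minimal Python sketch. The dictionaries `parents`, `proc_of`, `tf` and `tc`, the function names, and the small partial schedule are all assumptions introduced for illustration. An empty processor is assumed to have ready time 0.

```python
def ready_time(p, proc_of, tf):
    """Tr(p): finish time of the last task scheduled on p (0 if p is empty)."""
    return max((tf[t] for t in proc_of if proc_of[t] == p), default=0)

def last_message_arrival(t, parents, tf, tc):
    """Tm(t): latest arrival time over all incoming edges of t."""
    return max((tf[u] + tc[(u, t)] for u in parents[t]), default=0)

def effective_arrival(t, p, parents, proc_of, tf, tc):
    """Te(t, p): like Tm(t), but ignoring messages from parents placed on p."""
    return max((tf[u] + tc[(u, t)] for u in parents[t] if proc_of[u] != p),
               default=0)

def start_time(t, p, parents, proc_of, tf, tc):
    """Ts(t, p) = max(Te(t, p), Tr(p))."""
    return max(effective_arrival(t, p, parents, proc_of, tf, tc),
               ready_time(p, proc_of, tf))

# Hypothetical partial schedule: u finished on p0 at time 2, v on p1 at time 3.
parents = {"t": ["u", "v"]}
proc_of = {"u": "p0", "v": "p1"}
tf = {"u": 2, "v": 3}
tc = {("u", "t"): 4, ("v", "t"): 1}

for p in ("p0", "p1", "p2"):
    print(p, start_time("t", p, parents, proc_of, tf, tc))
```

In this example the last message comes from u, so p0 is the enabling processor of t, and t starts earlier there than on either of the other two processors.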










3. General Framework for List Scheduling with Static Priorities



If we analyze LSSP algorithms, we can distinguish three parts:

• Task priority computation, which takes at least O(E + V) time, since the whole graph
must be traversed.

• Task selection, which sorts the ready tasks according to their priorities and selects, at
each iteration, the task with the highest priority. Task selection takes O(V log W) time,
since the ready tasks have to be kept sorted.

• Processor selection, which selects the best processor for the selected task. This is
usually the processor on which the task can start the earliest. Processor selection takes
O((E + V) P) time, because the start time of the task must be computed for each
processor in order to find the earliest one.





We can see that the highest complexity parts of the LSSP algorithms are the task and

processor selection parts. Now we will explain the approach which reduces the time

complexity of these two parts.





3.1 Processor Selection





Selecting a processor on which a task starts the earliest need not consider all processors. It

needs to consider only two:

1. The enabling processor pe(t)

2. The processor becoming idle the earliest





The start time of a task t on a processor p is defined as the maximum between

1. the effective message arrival time

2. the time p becomes idle.





Thus, the start time is minimized on one of the two processors that minimize the two

components of the start time. Consequently, there are two candidates; the enabling processor

and the processor becoming idle the earliest.





This is formalized in Lemma 1.

Lemma 1. For a ready task t,

$$\forall p \neq p_e(t) : T_e(t, p) = T_m(t)$$





Theorem 1 follows.

Theorem 1. If t is a ready task, then one of the processors p in {pe(t), pr} satisfies

$$T_s(t, p) = \min_{p_x \in P} T_s(t, p_x)$$



Theorem 1 shows that restricting the selection to these two processors does not affect the
performance of the algorithm. The resulting selection is very similar to that of the original
algorithm, with one minor difference: a ready task may be able to start at the same earliest
time on several processors, and the tie may be broken differently depending on whether two
processors or all processors are considered. Consequently, there are a few cases in which the
approach selects a different processor for a task than the original algorithm does.





The processor selection is still as accurate as the original processor selection, but its
complexity is significantly reduced, from O((E + V) P) to O(V log P + E): O(E + V) is needed
to traverse the task graph and O(V log P) to keep the processors sorted at each step.
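
The two-candidate rule can be checked against the exhaustive rule with a small Python sketch. Here a ready task is described only by its incoming messages, given as hypothetical `(arrival_time, source_processor)` pairs, and by the processor ready times `tr`. These names and the data layout are assumptions for the example. The earliest-idle processor is found by a linear scan for brevity; a heap would be needed to reach the O(log P) bound discussed above.

```python
def start_time(p, tr, messages):
    """Ts(t, p) for a ready task with incoming (arrival, source_proc) messages."""
    te = max((arrival for arrival, src in messages if src != p), default=0)
    return max(te, tr[p])

def select_exhaustive(tr, messages):
    """Original criterion: evaluate every processor."""
    return min(start_time(p, tr, messages) for p in tr)

def select_two_candidates(tr, messages):
    """Evaluate only the enabling processor and the earliest-idle processor."""
    p_r = min(tr, key=tr.get)       # processor becoming idle the earliest
    candidates = {p_r}
    if messages:
        _, p_e = max(messages)      # source of the last message to arrive
        candidates.add(p_e)
    return min(start_time(p, tr, messages) for p in candidates)

# Hypothetical state: three processors, two incoming messages.
tr = {0: 5, 1: 3, 2: 7}
messages = [(9, 0), (4, 1)]
print(select_exhaustive(tr, messages), select_two_candidates(tr, messages))
```

Both functions return the same earliest start time. Here it is reached on the enabling processor, even though another processor becomes idle earlier.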





3.2 Task Selection





The original complexity of task selection is O(V log W). This can be reduced by sorting only
a fixed number of ready tasks. Thus, the task priority queue is composed of

1. A sorted list of fixed size H

2. A first-in first-out (FIFO) list.





Tasks are stored in the sorted list, and when the number of ready tasks exceeds the size of
the sorted list, the rest are placed in the FIFO list, which has O(1) access time. When a task
becomes ready, it is added to the sorted list if there is room; otherwise it is added to the
FIFO list. For this reason, as long as the sorted list is not full, the FIFO list is empty.
Tasks are always dequeued from the sorted list. After a dequeue operation, if the FIFO list
is not empty, one task is moved from the FIFO list to the sorted list.
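
A minimal Python sketch of such a queue is given below. The class name `PartiallySortedQueue` and its methods are assumptions for illustration, and a binary heap stands in for the sorted list. The usage example replays the order in which tasks become ready in the trace of Table 1 (Section 3.4), with a sorted part of size 2.

```python
import heapq
from collections import deque

class PartiallySortedQueue:
    """Ready-task queue: a sorted part of at most `size` tasks plus a FIFO list."""

    def __init__(self, size):
        self.size = size
        self.sorted_part = []   # max-heap emulated with negated priorities
        self.fifo = deque()

    def push(self, priority, task):
        if len(self.sorted_part) < self.size:
            heapq.heappush(self.sorted_part, (-priority, task))  # O(log H)
        else:
            self.fifo.append((priority, task))                   # O(1)

    def pop(self):
        """Remove and return the highest-priority task of the sorted part."""
        _, task = heapq.heappop(self.sorted_part)
        if self.fifo:
            # Refill the freed slot with the oldest task waiting in the FIFO list.
            priority, waiting = self.fifo.popleft()
            heapq.heappush(self.sorted_part, (-priority, waiting))
        return task

# Replay of the ready-task arrivals read off Table 1 (priorities in brackets there).
q = PartiallySortedQueue(size=2)
order = []
q.push(15, "t0"); order.append(q.pop())
for prio, task in [(11, "t1"), (9, "t2"), (12, "t3")]:
    q.push(prio, task)
order.append(q.pop())
q.push(6, "t4"); q.push(8, "t5")
order.append(q.pop()); order.append(q.pop())
q.push(6, "t6")
order.append(q.pop()); order.append(q.pop()); order.append(q.pop())
q.push(2, "t7"); order.append(q.pop())
print(order)
```

In the replay, t1 is dequeued before t3 although t3 has the higher priority, because t3 is still waiting in the FIFO list at that point.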





If we reduce the size of the priority queue to H, the time complexity of sorting tasks decreases

from O (V log W) to O (V log H). Note that H is still kept as part of the complexity, not

dropped as a constant. The reason is that for achieving a good performance, H needs to be

adjusted with P.





A possible disadvantage of sorting only a limited number of tasks is that the highest-priority
task may not be in the sorted list, but temporarily kept in the FIFO list. To minimize the
probability of this, H must be large enough. At the same time, it must not be so large that it
increases the time complexity. The experiments show that a size of P is required to keep the
performance comparable to the original.





3.3 Complexity Analysis





The complexity of the resulting LSSP algorithm is as follows. Computing task priorities takes
O(E + V). With a fully sorted priority queue, task selection takes O(V log W). With a
partially sorted priority queue of size H, task selection takes O(V log H); since a size of P
gives good results, this becomes O(V log P). Finding the processor becoming idle the earliest
takes O(log P) for each task, and O(V log P) for all tasks.





As a result, the total complexity of the LSSP algorithm is O (V (log W + log P) + E) if a fully

sorted queue is used and O (V log P + E) if a partially sorted queue is used.
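
For reference, the two totals can be obtained by adding the three parts (priority computation, task selection, processor selection). Absorbing the O(V) term into O(V log P) assumes P is at least 2:

```latex
\begin{align*}
\text{fully sorted queue:}\quad
  & O(E + V) + O(V \log W) + O(V \log P + E)
    = O\bigl(V (\log W + \log P) + E\bigr) \\
\text{partially sorted queue of size } P\text{:}\quad
  & O(E + V) + O(V \log P) + O(V \log P + E)
    = O(V \log P + E)
\end{align*}
```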





3.4 Case Study





In this section, the task and processor selection techniques are illustrated by applying them to
a slightly modified version of MCP (Modified Critical Path). In MCP, the task having the
highest bottom level has the highest priority. We modify MCP by using our task and
processor selection techniques and call the resulting algorithm FCP (Fast Critical Path). It
is applied to a 3-processor case, using a partially sorted queue of size 2. There are 8 tasks
(t0 to t7), with the task graph given in Figure 2. The execution trace is given in Table 1.





Figure 2. The task graph of the case study in Section 3.4. Nodes are labeled task /
computation cost: t0 / 2, t1 / 2, t2 / 2, t3 / 3, t4 / 3, t5 / 3, t6 / 2, t7 / 2.





Ready tasks (sorted)   Ready tasks (FIFO)   t     Scheduling: t -> p [Ts - Tf]

t0 [15]                -                    t0    t0 -> p0 [0 - 2]
t1 [11], t2 [9]        t3 [12]              t1    t1 -> p0 [2 - 4]
t3 [12], t2 [9]        t4 [6], t5 [8]       t3    t3 -> p1 [3 - 6]
t2 [9], t4 [6]         t5 [8]               t2    t2 -> p0 [4 - 6]
t5 [8], t4 [6]         t6 [6]               t5    t5 -> p2 [6 - 9]
t4 [6], t6 [6]         -                    t4    t4 -> p0 [6 - 9]
t6 [6]                 -                    t6    t6 -> p1 [7 - 9]
t7 [2]                 -                    t7    t7 -> p2 [11 - 13]

Table 1. The execution trace of the case study in Section 3.4 (values in brackets are task
priorities).
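
To show how the pieces fit together, here is a compact Python sketch of a scheduler in the spirit of FCP: bottom-level priorities, a partially sorted ready queue, and two-candidate processor selection. It is not the authors' implementation. The function name `fcp_schedule`, the data layout, and the four-task example graph are assumptions. The edges of Figure 2 are not reproduced here, so the example does not replay Table 1. The heap of idle processors uses lazy deletion, a simplification that lets it grow beyond P entries.

```python
import heapq
from collections import deque

def fcp_schedule(tw, tc, num_procs, queue_size):
    """Return {task: (processor, Ts, Tf)} for the DAG given by tw and tc."""
    children = {t: [] for t in tw}
    parents = {t: [] for t in tw}
    for (u, v) in tc:
        children[u].append(v)
        parents[v].append(u)

    tb = {}                                    # static priorities: bottom levels
    def bottom(t):
        if t not in tb:
            tb[t] = tw[t] + max((tc[(t, c)] + bottom(c) for c in children[t]),
                                default=0)
        return tb[t]
    for t in tw:
        bottom(t)

    sorted_part, fifo = [], deque()            # partially sorted ready queue
    def make_ready(t):
        if len(sorted_part) < queue_size:
            heapq.heappush(sorted_part, (-tb[t], t))
        else:
            fifo.append(t)

    tr = [0] * num_procs                       # processor ready times Tr(p)
    idle = [(0, p) for p in range(num_procs)]  # heap of (Tr(p), p), may hold stale entries
    missing = {t: len(parents[t]) for t in tw}
    sched = {}
    for t in tw:
        if missing[t] == 0:
            make_ready(t)

    while sorted_part:
        _, t = heapq.heappop(sorted_part)
        if fifo:                               # refill the sorted part from the FIFO list
            w = fifo.popleft()
            heapq.heappush(sorted_part, (-tb[w], w))

        while idle[0][0] != tr[idle[0][1]]:    # drop outdated ready times
            heapq.heappop(idle)
        candidates = {idle[0][1]}              # processor becoming idle the earliest
        if parents[t]:                         # enabling processor
            last = max(parents[t], key=lambda u: sched[u][2] + tc[(u, t)])
            candidates.add(sched[last][0])

        def ts(p):
            te = max((sched[u][2] + tc[(u, t)]
                      for u in parents[t] if sched[u][0] != p), default=0)
            return max(te, tr[p])

        p = min(candidates, key=ts)
        start = ts(p)
        sched[t] = (p, start, start + tw[t])
        tr[p] = start + tw[t]
        heapq.heappush(idle, (tr[p], p))
        for c in children[t]:
            missing[c] -= 1
            if missing[c] == 0:
                make_ready(c)
    return sched

# Hypothetical diamond graph, 2 processors, sorted part of size 2.
tw = {"a": 2, "b": 3, "c": 1, "d": 2}
tc = {("a", "b"): 1, ("a", "c"): 4, ("b", "d"): 2, ("c", "d"): 1}
schedule = fcp_schedule(tw, tc, num_procs=2, queue_size=2)
print(schedule)
```

With these communication costs, every task starts earliest on its enabling processor, so the sketch places the whole example graph on a single processor.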










4. Extensions for List Scheduling with Dynamic Priorities





In this section, we explore the possibility of extending the results for LSSP algorithms to

LSDP algorithms.





In LSDP algorithms, priorities are associated with task-processor pairs. At each iteration,
the task-processor pair having the highest priority is selected. The priorities change during
the scheduling process and are recomputed at each iteration.





For example, ETF schedules at each step the ready task that starts the earliest on the processor

where this earliest time is obtained. ERT schedules at each step the ready task that finishes the

earliest on the processor where this earliest time is obtained. DLS defines its priority called

dynamic level, and schedules the task and processor having the highest dynamic level.





In general, these algorithms have the following dynamic priority:

$$\pi(t, p) = \alpha(t) + \max \{\, T_e(t, p),\ T_r(p) \,\}$$

where α(t) is a value that is independent of the scheduling process and can be computed
before the scheduling starts. The α(t) values for the different algorithms are
α_ETF(t) = 0, α_ERT(t) = Tw(t), and α_DLS(t) = -Tb(t).
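
The following Python sketch evaluates this priority for the three algorithms. The function names, the name `alpha` for the static term, and the two example tasks are assumptions for illustration. The sketch treats the pair with the smallest value as the one with the highest priority, which corresponds to the earliest start for ETF, the earliest finish for ERT, and the largest dynamic level for DLS.

```python
def alpha(algorithm, tw, tb):
    """Static term of the dynamic priority for the algorithms named in the text."""
    if algorithm == "ETF":
        return 0
    if algorithm == "ERT":
        return tw
    if algorithm == "DLS":
        return -tb
    raise ValueError(f"unknown algorithm: {algorithm}")

def dynamic_priority(algorithm, tw, tb, te, tr):
    """alpha(t) + max(Te(t, p), Tr(p)); smaller values are preferred here."""
    return alpha(algorithm, tw, tb) + max(te, tr)

# Hypothetical ready tasks competing for one processor with Tr(p) = 4.
tasks = {
    "x": {"tw": 5, "tb": 12, "te": 3},
    "y": {"tw": 1, "tb": 3, "te": 6},
}
best = {
    alg: min(tasks, key=lambda t: dynamic_priority(
        alg, tasks[t]["tw"], tasks[t]["tb"], tasks[t]["te"], 4))
    for alg in ("ETF", "ERT", "DLS")
}
print(best)
```

In this example ETF and DLS pick x, which starts first and has the larger bottom level, while ERT picks y, which finishes first.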





We distinguish two cases: the task starts on its enabling processor (the EP case), or the task
starts on a non-enabling processor (the non-EP case).





In the EP case, selecting the task having the highest priority on its enabling processor is

performed in two steps. First, on each processor the tasks enabled by that processor are sorted

according to their priority. Second, the processors are sorted according to the priorities of the

tasks enabled by them.





In the non-EP case, we observe that a task's priority on a non-enabling processor is minimized
by the processor becoming idle the earliest. If a task and its enabling processor are selected,
that task has a higher priority than any task starting on a non-enabling processor.
Consequently, the EP case yields the task and processor with the highest priority, and the
result is not affected.





Using this task and processor selection scheme, we are able to find the task-processor pair
having the highest priority in three tries: one for the EP case and two for the non-EP case.
P task priority queues are maintained for the EP case and two for the non-EP case.
However, each task is added to only three task priority queues: one for the EP case and two
for the non-EP case. Two further processor queues are maintained, one for the EP case and
one for the non-EP case. As a result, the time complexity of the LSDP algorithms becomes
O(V (log W + log P) + E). This is already a significant improvement over the original
O(W (E + V) P) time complexity.





This can be further reduced by using partially sorted priority queues. A size of P is required
to maintain comparable performance. In this case the complexity becomes O(V log P + E).










5. Conclusion





In this paper, it is shown that list scheduling with static priorities can be performed at a

significantly lower cost, without reducing performance. The approach is general and can be

applied to any list scheduling algorithm with static priorities.





In this framework, the list scheduling algorithms have low time complexity because
low-complexity methods are used for the most time-consuming parts of the algorithms, namely
processor selection and task selection. Processor selection is performed by choosing between
only two processors: the task's enabling processor and the processor which becomes idle the
earliest. For task selection, instead of sorting all tasks, only a limited number of tasks are
sorted.





Using an extension of this method, we can also significantly reduce the time complexity of a

particular class of list scheduling algorithms with dynamic priorities.





Especially for the large problem and processor dimensions involved in real-world
high-performance computing, the results indicate that the approach offers a superior
cost-performance trade-off compared to current list scheduling algorithms.








