
 "Low-Cost Task Scheduling for Distributed-Memory Machines"
                         Andrei Radulescu and Arjan J. C. van Gemund

                   submitted by
               Bahadır Kaan Özütam

                Boğaziçi University
                    Fall - 2003

Abstract
1. Introduction
2. Preliminaries
3. General Framework for List Scheduling with Static Priorities
      3.1 Processor Selection
      3.2 Task Selection
      3.3 Complexity Analysis
      3.4 Case Study
4. Extensions for List Scheduling with Dynamic Priorities
5. Conclusion


Abstract

List scheduling algorithms schedule tasks in order of their priority. This priority can be computed either 1) statically, before the scheduling, or 2) dynamically, during the scheduling. This paper shows that list scheduling with statically computed priorities can be performed at a significantly lower cost than existing approaches, without sacrificing performance. The low complexity is achieved by using low-complexity methods for the most time-consuming parts of list scheduling algorithms, i.e., processor selection and task selection, while preserving the criteria used in the original algorithms.

1. Introduction

To obtain performance from a parallel program, it is very important to map the program efficiently to the target system. This problem is referred to as task scheduling, where the tasks are the schedulable units of a program. Scheduling parallel applications has been proven to be NP-complete, so heuristics are used to solve the problem efficiently. In order to be practical for large applications, scheduling heuristics must have a low complexity.

For shared-memory systems, it has been proven that even a low-cost scheduling heuristic is guaranteed to produce linear speedup. In the distributed-memory case, however, communication must also be taken into account, which significantly complicates the problem. In this case, low-cost scheduling remains a challenge.

Distributed-memory scheduling heuristics exist for both bounded and unbounded numbers of processors. Assuming an unbounded number of processors is attractive, but it is not always applicable, because the required number of processors is usually not available. Heuristics for an unbounded number of processors are typically used as one step of multi-step scheduling methods for a bounded number of processors.

Scheduling for a bounded number of processors can be performed in a single step as well as in multiple steps. Single-step approaches usually produce better results, but at a higher cost.

Scheduling for a bounded number of processors can be performed either with or without task duplication. Duplicating tasks results in better performance but significantly increases the scheduling cost. Non-duplicating heuristics have a lower complexity and still obtain good schedules. However, when compiling very large programs for very large systems, the complexity of current approaches is often prohibitive.

An important class of algorithms for a bounded number of processors is list scheduling. For a bounded number of processors, list scheduling algorithms perform well at a relatively low cost compared to other scheduling algorithms.

There are two approaches to list scheduling. The first is list scheduling with static priorities (LSSP). In LSSP, the priorities are computed beforehand and the tasks are scheduled in order of these priorities, each task on its best processor. Thus, at each scheduling step, first the task is selected, and then the processor for that task. If performance is the main concern, the best processor is the one enabling the earliest start time for the given task. If speed is the main concern, however, the best processor is the one becoming idle the earliest once the task is scheduled.

The second approach is list scheduling with dynamic priorities (LSDP). In this case, at each scheduling step a ready task and its processor are selected at the same time. The selection is based on priorities associated with task-processor pairs; the combination that produces the earliest start and finish times is selected. LSDP has a more complex task and processor selection scheme. It is able to produce better schedules than LSSP, but at a significantly higher cost.

In this paper, it is shown that any LSSP algorithm can be performed at a significantly lower cost than existing approaches. Existing LSSP algorithms with an already low time complexity have a complexity of O (V log V + (V + E) P), where V is the number of tasks, E is the number of dependences, and P is the number of processors. Using the proposed approach, the LSSP complexity is reduced to O (V log P + E) while maintaining the performance.

The cost is reduced by
       1) considering only two processors when selecting the destination processor for a
       given task (proven to preserve the original selection criterion), and
       2) maintaining a partially sorted task priority queue in which only a fixed number of
       tasks are sorted.

The approach is also generalized to a particular class of LSDP algorithms.

2. Preliminaries

A parallel program can be modeled by a Directed Acyclic Graph (DAG) G = (V, E), where V
is a set of vertices and E is a set of edges.



Figure 1. Example of A Directed Acyclic Graph (DAG) modeling a parallel program.

Each task t has a computation cost Tw(t). The edges, corresponding to task dependencies, have a communication cost Tc(t, t'). If two tasks t1 and t2 are scheduled on the same processor, Tc(t1, t2) is assumed to be zero.

The communication-to-computation ratio (CCR) of a task graph is a measure of the graph's granularity, defined as the ratio of the average communication cost to the average computation cost in the graph.
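As a minimal illustration (a hypothetical helper, not from the paper), the CCR can be computed directly from the graph's cost lists:

```python
def ccr(computation_costs, communication_costs):
    """Communication-to-computation ratio (CCR): the average edge
    communication cost divided by the average task computation cost."""
    avg_comm = sum(communication_costs) / len(communication_costs)
    avg_comp = sum(computation_costs) / len(computation_costs)
    return avg_comm / avg_comp
```

A CCR below 1 indicates a coarse-grained graph in which computation dominates communication.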

The task graph width (W) is defined as the maximum number of tasks that are not connected through a path. Usually W is less than V, but in the worst case it may be equal to V.

Tasks with no input edges are called entry tasks and tasks with no output edges are called exit tasks.

The bottom level (Tb) of a task is defined as the length of the longest path from that task to
any exit task where the length is the sum of the communication costs and computation costs
belonging to that path.

A task is said to be ready if all of its parents have been scheduled. Note that the number of ready tasks never exceeds W. This will be important for our technique.

Once a task t has been scheduled, it is associated with a processor pt(t), a start time Ts(t), and a finish time Tf(t).

A partial schedule is obtained when only a subset of the tasks have been scheduled. The
processor ready time of a processor p on a partial schedule is defined as the finish time of the
last task scheduled on that processor.

        Tr(p) = max { Tf(t) : t ∈ V, pt(t) = p }

Given a partial schedule, we define the processor becoming idle the earliest (pr) to be the
processor with the minimum Tr.

        Tr(pr) = min { Tr(p) : p ∈ P }

If more than one processor becomes idle at the same time, pr is randomly selected among them. The last message arrival time of a ready task is defined as

        Tm(t) = max { Tf(t') + Tc(t', t) : (t', t) ∈ E }

The enabling processor pe(t) of a ready task t is the processor from which the last message arrives. Here, too, if the same Tm(t) occurs for more than one processor, the enabling processor is randomly selected among them. Messages sent within the same processor are assumed to take zero communication time. Therefore, we define the effective message arrival time

        Te(t, p) = max { Tf(t') + Tc(t', t) : (t', t) ∈ E, pt(t') ≠ p }

The start time of a ready task, once it is scheduled on a processor, is defined as the maximum of the effective message arrival time and the processor's ready time.

        Ts(t, p) = max { Te(t, p), Tr(p) }
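The definitions above translate directly into code. The sketch below assumes a hypothetical partial-schedule representation (dicts `finish` for Tf, `proc` for pt, and `preds` mapping each task to its incoming `(parent, Tc)` edges); it is an illustration, not the paper's implementation:

```python
def processor_ready_time(p, finish, proc):
    """Tr(p): finish time of the last task scheduled on processor p."""
    times = [finish[t] for t in finish if proc[t] == p]
    return max(times) if times else 0

def effective_message_arrival(t, p, finish, proc, preds):
    """Te(t, p): latest message arrival from parents NOT scheduled on p
    (messages within the same processor take zero communication time)."""
    arrivals = [finish[t2] + c for (t2, c) in preds[t] if proc[t2] != p]
    return max(arrivals) if arrivals else 0

def start_time(t, p, finish, proc, preds):
    """Ts(t, p) = max { Te(t, p), Tr(p) }."""
    return max(effective_message_arrival(t, p, finish, proc, preds),
               processor_ready_time(p, finish, proc))
```

For example, if task c has parents a (on p0, Tf = 2, Tc = 1) and b (on p1, Tf = 3, Tc = 4), then scheduling c on p1 zeroes the heavy message from b and yields the smaller start time.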

3. General Framework for List Scheduling with Static Priorities

If we analyze LSSP algorithms, we can distinguish three parts:

       1. Task priority computation, which takes at least O (E + V) time, since the
          whole graph must be traversed.
       2. Task selection, which includes keeping the ready tasks sorted according to their
          priorities and selecting, at each iteration, the task with the highest priority.
          Task selection takes O (V log W) time, since the ready tasks have to be
          maintained in sorted order.
       3. Processor selection, which selects the best processor for the selected task,
          usually the processor on which the task can start the earliest. Processor
          selection takes O ((E + V) P) time, because to find the earliest start time of a
          task, its start time must be computed on each processor.

We can see that the highest-complexity parts of LSSP algorithms are task selection and processor selection. We now explain the approach that reduces the time complexity of these two parts.

3.1 Processor Selection

Selecting the processor on which a task starts the earliest need not consider all processors. It needs to consider only two:
       1. the enabling processor pe(t), and
       2. the processor becoming idle the earliest, pr.

The start time of a task t on a processor p is defined as the maximum between
       1. the effective message arrival time
       2. the time p becomes idle.

Thus, the start time is minimized on one of the two processors that minimize its two components. Consequently, there are two candidates: the enabling processor and the processor becoming idle the earliest.

This is formalized in lemma 1.

Lemma 1.
         ∀ p ≠ pe(t) : Te(t, p) = Tm(t)

Theorem 1 follows.

Theorem 1.

        If t is a ready task, then one of the processors pe(t) and pr satisfies
        Ts(t, p) = min { Ts(t, px) : px ∈ P }.

From theorem 1 we can see that restricting the selection to these two processors does not affect the performance of the algorithm. Although the resulting selection is very similar to that of the original algorithm, there is a minor difference: a ready task may be able to start at the same earliest time on several processors, and considering only two processors instead of all of them may change which one is chosen in that case. As a consequence, there are a few cases in which the approach selects a different processor for a task than the original algorithm does.

The processor selection still performs as accurately as the original processor selection does, but the complexity is significantly reduced from O ((E + V) P) to O (V log P + E): we need O (E + V) to traverse the task graph and O (V log P) to maintain the processors sorted at each scheduling step.
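A minimal sketch of this two-candidate rule (hypothetical helper names; it assumes the processors are kept in a min-heap keyed by Tr(p), which gives the O (log P) per-task cost):

```python
import heapq

def push_ready_time(proc_heap, ready_time, p):
    """Maintain the (Tr(p), p) min-heap in O(log P) per update."""
    heapq.heappush(proc_heap, (ready_time, p))

def select_processor(t, enabling_proc, proc_heap, start_time):
    """Consider only pe(t) and the earliest-idle processor pr (the heap
    root), and return whichever gives the smaller start time Ts(t, p)."""
    _, pr = proc_heap[0]
    candidates = {enabling_proc, pr}   # at most two processors examined
    return min(candidates, key=lambda p: start_time(t, p))
```

Only two start-time evaluations are needed per task, instead of one per processor.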

3.2 Task Selection

The original complexity of task selection is O (V log W). This can be reduced by sorting only a constant number of ready tasks. To this end, the task priority queue is composed of
        1. a sorted list of fixed size H, and
        2. a first-in first-out (FIFO) list.

Ready tasks are stored in the sorted list and, when their number exceeds its size, the rest is kept in the FIFO list, which has O(1) access time. When a task becomes ready, it is added to the sorted list if there is room; otherwise it is added to the FIFO list. Consequently, as long as the sorted list is not full, there cannot be a task in the FIFO list. Tasks are always dequeued from the sorted list. After a dequeue operation, if the FIFO list is not empty, a task is moved from the FIFO list to the sorted list.
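The two-part queue described above can be sketched as follows (an illustration under the assumption that a higher priority value means more urgent; not the paper's code):

```python
import heapq
from collections import deque

class PartiallySortedQueue:
    """Ready-task queue: a sorted part of fixed size H plus a FIFO."""
    def __init__(self, H):
        self.H = H
        self.sorted_part = []      # max-heap via negated priorities
        self.fifo = deque()        # O(1) overflow storage

    def push(self, task, priority):
        if len(self.sorted_part) < self.H:
            heapq.heappush(self.sorted_part, (-priority, task))
        else:
            self.fifo.append((task, priority))

    def pop(self):
        # Always dequeue from the sorted part, then refill it from
        # the FIFO so the sorted part stays full while tasks remain.
        _, task = heapq.heappop(self.sorted_part)
        if self.fifo:
            t2, p2 = self.fifo.popleft()
            heapq.heappush(self.sorted_part, (-p2, t2))
        return task
```

Note that a task overflowing into the FIFO is dequeued only after it migrates into the sorted part, so the globally highest-priority task may temporarily wait in the FIFO.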

If we reduce the size of the priority queue to H, the time complexity of sorting tasks decreases from O (V log W) to O (V log H). Note that H is still kept as part of the complexity and not dropped as a constant, because achieving good performance requires H to be adjusted with P.

A possible disadvantage of sorting only a limited number of tasks is that the highest-priority task may not be in the sorted list but temporarily kept in the FIFO list. To minimize the probability of this, H must be kept large enough; at the same time, it must not be so large that it increases the time complexity. The experiments show that a size of H = P is required to maintain performance comparable to the original.

3.3 Complexity Analysis

The complexity of the resulting LSSP algorithm is as follows. Computing task priorities takes O (E + V). For a fully sorted priority queue, the task selection takes O (V log W); for a partially sorted priority queue of size H, it takes O (V log H). Since a size of H = P gives good results, the task selection with a partially sorted queue of size P takes O (V log P). Finding the processor becoming idle the earliest takes O (log P) for each task, and O (V log P) for all tasks.

As a result, the total complexity of the LSSP algorithm is O (V (log W + log P) + E) if a fully
sorted queue is used and O (V log P + E) if a partially sorted queue is used.

3.4 Case Study

In this section, the task and processor selection techniques are illustrated by applying them to a slightly modified version of MCP (Modified Critical Path). In MCP, the task having the highest bottom level has the highest priority. We modify MCP by using our task and processor selection techniques and name the resulting algorithm FCP (Fast Critical Path). It is applied to a 3-processor case, using a partially sorted queue of size 2. There are eight tasks, t0 through t7, with the task graph given in figure 2. Table 1 gives the execution trace of this case.

[Figure 2: the case-study task graph of eight tasks t0..t7, labeled with their computation costs (t0/2, t1/2, t2/2, t3/3, t4/3, t5/3, t6/2, t7/2); edges carry communication costs of 1, 3, or 4.]
Figure 2. The task graph of the case study in section 3.4

       Ready tasks (sorted)   Ready tasks (FIFO)   Selected   Scheduling t -> p [Ts - Tf]

       t0 [15]                -                    t0         t0 -> p0 [0 - 2]
       t1 [11], t2 [9]        t3 [12]              t1         t1 -> p0 [2 - 4]
       t3 [12], t2 [9]        t4 [6], t5 [8]       t3         t3 -> p1 [3 - 6]
       t2 [9], t4 [6]         t5 [8]               t2         t2 -> p0 [4 - 6]
       t5 [8], t4 [6]         t6 [6]               t5         t5 -> p2 [6 - 9]
       t4 [6], t6 [6]         -                    t4         t4 -> p0 [6 - 9]
       t6 [6]                 -                    t6         t6 -> p1 [7 - 9]
       t7 [2]                 -                    t7         t7 -> p2 [11 - 13]

Table 1. The execution trace of the case study in section 3.4

4. Extensions for List Scheduling with Dynamic Priorities

In this section, we explore the possibility of extending the results for LSSP algorithms to
LSDP algorithms.

In LSDP algorithms, priorities are associated with pairs of a task and a processor. At each iteration, the pair of task and processor having the highest priority is selected. The priorities change during the scheduling process and are recomputed at each iteration.

For example, ETF schedules at each step the ready task that starts the earliest, on the processor where this earliest start time is obtained. ERT schedules at each step the ready task that finishes the earliest, on the processor where this earliest finish time is obtained. DLS defines its own priority, called the dynamic level, and schedules the task-processor pair having the highest dynamic level.

In general, these algorithms use a dynamic priority of the form

         λ(t, p) = σ(t) + max { Te(t, p), Tr(p) }

where σ(t) is a value that is independent of the scheduling process and can therefore be computed before the scheduling starts. The σ(t) values for the different algorithms are σ(t) = 0 for ETF, σ(t) = Tw(t) for ERT, and σ(t) = -Tb(t) for DLS.
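These priorities differ only in a static per-task term added to the task's start time. A small sketch (hypothetical dict-based signatures; `sigma` stands for the per-algorithm static component):

```python
def sigma_etf(t, Tw, Tb):
    return 0            # ETF: minimize the start time itself

def sigma_ert(t, Tw, Tb):
    return Tw[t]        # ERT: minimize finish time = start + Tw(t)

def sigma_dls(t, Tw, Tb):
    return -Tb[t]       # DLS: maximize the dynamic level Tb(t) - start

def dynamic_priority(sigma, t, p, Tw, Tb, Te, Tr):
    """sigma(t) + max { Te(t, p), Tr(p) }, evaluated per (task, processor);
    the pair minimizing this value is selected at each step."""
    return sigma(t, Tw, Tb) + max(Te[(t, p)], Tr[p])
```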

We distinguish two cases: the task starts on its enabling processor (the EP case), or the task starts on a non-enabling processor (the non-EP case).

In the EP case, selecting the task having the highest priority on its enabling processor is performed in two steps. First, on each processor, the tasks enabled by that processor are sorted according to their priority. Second, the processors are sorted according to the priorities of the tasks they enable.

In the non-EP case, we observe that a task's priority on a non-enabling processor is minimized by the processor becoming idle the earliest. If a task and its enabling processor are selected, the task will have a higher priority than any task starting on a non-enabling processor. As a consequence, the EP case will yield the task and processor with the highest priority, and the result will not be affected.

Using this task and processor selection scheme, we are able to find the task-processor pair having the highest priority in three tries: one for the EP case and two for the non-EP case. There are P task priority queues maintained for the EP case and two for the non-EP case; moreover, each task is added to three task priority queues, one for the EP case and two for the non-EP case. Two processor queues are also maintained, one for the EP case and one for the non-EP case. As a result, the time complexity of the LSDP algorithms becomes O (V (log W + log P) + E). This is already a significant improvement compared to the original O (W (E + V) P) time complexity.

This can be further reduced by using partially sorted priority queues. A size of P is required to maintain comparable performance; in this case the complexity becomes O (V log P + E).

5. Conclusion

In this paper, it is shown that list scheduling with static priorities can be performed at a
significantly lower cost, without reducing performance. The approach is general and can be
applied to any list scheduling algorithm with static priorities.

In this framework, the list scheduling algorithms have a low time complexity because low-complexity methods are used for the most time-consuming parts of the algorithms: processor selection and task selection. Processor selection is performed by choosing between only two processors: the task's enabling processor and the processor which becomes idle the earliest. For task selection, instead of sorting all ready tasks, only a limited number of tasks are kept sorted.

Using an extension of this method, we can also significantly reduce the time complexity of a
particular class of list scheduling algorithms with dynamic priorities.

Especially given the large problem and processor dimensions involved in real-world high-performance computing, the results indicate that the approach offers a superior cost-performance trade-off compared to current list scheduling algorithms.

