
Chapter 5: Workload Analysis
This chapter is concerned with the nature and composition of workloads in NOWs and the way in which those workloads can be classified. This is a key precursor to the design of a load sharing scheme and to the selection of load sharing policies. The new model of process behaviour provides data for the workload analysis. The rich-information approach taken in this work facilitates differentiation between tasks based on their specific resource requirements. This in turn permits detailed workload descriptions. The findings of previous works that have involved significant workload analysis are discussed. Existing task classification techniques are evaluated and a new descriptive framework, suitable for use in a load sharing context, is outlined. The extent of the benefits of detailed workload descriptions, and the extent to which workload composition and intensity are defining parameters of a load sharing policy, are evaluated.

Task signatures illustrated in this chapter have been generated on the ‘water’ processing node unless otherwise stated. For clarity, only task-signature elements that are relevant to each particular illustration are shown in each case. Results that represent per-task values are based on the most-recent values in the task-signature; results that represent general behaviour of a task-type are based on the average values in the task-signature. See section 4.9 for processing node configuration details. The task-types used in the analysis are described in appendix A.

5.1 Introduction

Workload analysis can be defined as the study of the types and concentration of tasks that are found in a given system. Workload analysis has been somewhat neglected by developers of load sharing schemes. A minority of load sharing publications have addressed the issue of workload analysis in any depth. More commonly, load sharing models assume that tasks are homogeneous or differ only in their processing-time requirement, while the majority of implemented load sharing schemes presume limiting workload characteristics and/or

restrict support to specific types of task. A detailed review of models and schemes has been provided in chapter 2. Previous detailed workload analyses have been carried out in [LO86,FZ87]; however, these analyses have not decomposed the behaviour of tasks. [KUN91] describes tasks as generally consisting of a number of coarse phases such as: read data file, process data, write data file. Terms such as computation-intensive and short-lived are often used imprecisely and not as part of a clearly defined descriptive framework. Examples include [KC91,SVE90,TN95]. This undermines the analysis and leads to vague workload descriptions.

The model’s support for workload analysis

The model defined in chapter 4 follows the rich-information approach to load sharing. This is in contrast to the majority of load sharing models that use limited information concerning the availability of resources and the characteristics of tasks. Several features of the model’s design directly facilitate workload analysis:
• Factorisation of tasks’ execution-time into resource-type related fractions.
• Separate representation of the load on each main resource-type.
• Maintenance of a resource-use signature for each task-type (a minimal sketch of such a signature follows below).
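To make the third of these features concrete, the sketch below shows a minimal per-task-type signature record holding the resource-use elements that appear in the tables later in this chapter (response-time, run, disk, network, TTY, NFS, IPC, inactive and memory use). It is an illustrative sketch only; the class and field names are assumptions and do not reflect the actual implementation described in chapter 7.

```python
from dataclasses import dataclass

# Illustrative sketch of a resource-use signature for one task-type.
# Time elements are in Jiffys (0.01 s); memory use is in Kilobytes.
@dataclass
class TaskSignature:
    task_type: str            # e.g. "hanoi", "tsp", "createintfile"
    response_time: int = 0    # total time the task is in the system
    run: int = 0              # CPU execution time
    disk: int = 0             # time using / waiting for the disk
    network: int = 0          # time using / waiting for the network
    tty: int = 0              # terminal I/O time
    nfs: int = 0              # remote (NFS) file access time
    ipc: int = 0              # inter-process communication time
    inactive: int = 0         # time blocked, e.g. waiting for user input
    memory_use: int = 0       # memory footprint in Kilobytes
```

One such record would be maintained per task-type, holding the most-recent and average values observed for that task-type.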

5.2 The role of workload analysis

To develop load sharing policies that are effective it is important that workloads are understood. Workloads can be characterised by a number of factors that include:
• Intensity, i.e. the number of tasks present or the utilisation of resources, at the processing node.
• The average arrival-rate of tasks.
• The average service-time of tasks.
• The types of task present, in terms of the types of resources they use.

These factors are dependent on the specific application context in which a system is used, typically differ from node to node, and fluctuate with time.

5.3 Classification of tasks

This section first investigates existing task classification techniques and then introduces a new framework for the description of tasks. The framework follows the rich-information approach.

5.3.1 Existing techniques

There are three main ways in which tasks have been categorised:

1. The most common approach is to use a coarse differentiation based on the predominant type of resource used. For example, a task that primarily uses the CPU would be described as CPU-intensive or compute-intensive. In [LO86] 9.5 million tasks are examined and classified by their major resource-type use. It is concluded that nearly all tasks can be classified simply as either disk-intensive or CPU-intensive. This method of classification is unrepresentative of tasks that make use of more than a single resource-type.

2. The second common method of classification is in terms of the communication intensity of tasks. It is generally accepted that tasks that have a high computation-to-communication ratio are better suited to remote execution since the message overhead is relatively low. Terms such as coarse-grained are often used imprecisely to describe such tasks; examples are found in [HAG86,KAL88]. [SHU94] defines medium-grained parallel tasks as those in which components execute for between 1 and 100 milliseconds. The definition is used in the context of tightly-coupled systems. Macharia [MAC90] defines three levels of granularity:

fine:    execution-time ≈ g^0 * communication-time
medium:  execution-time ≈ g^1 * communication-time
large:   execution-time ≈ g^2 * communication-time

where g is the grain factor, which is used to ensure that the definitions are architecture independent. Macharia’s work is based on tightly-coupled systems, in which there is a relatively low per-message communication cost; a value of g = 10 was found suitable. The granularity of tasks is a more significant issue in loosely-coupled systems than in tightly-coupled systems due to the much higher per-message cost, which can
significantly affect execution performance. Thus in loosely-coupled systems it is desirable to localise the components of finer-grained applications to minimise the communications overhead. In this sense the definition of granularity must reflect the ‘looseness’ of the coupling in terms of the per-message communication cost. For load sharing in loosely-coupled systems a definition of granularity is required to help decide whether or not a component task of a distributed application can execute efficiently remotely. The communications cost incurred must not outweigh the processing speedup achieved. The significance of communication granularity is illustrated in [ESC95], in which the speedup factors achieved by two parallel tasks in a four-node loosely-coupled system are investigated. The first, Mandelbrot, requires that each node compute a large number of points, each point being independent of its neighbours. It is relatively coarse-grained and achieves a near-linear speedup of 3.8. The second, a heat propagation simulation in which the temperature at each point is dependent on the temperature of neighbouring points, requires considerable inter-process communication (IPC). This finer-grained task achieves a poorer speedup of only 1.9.

3. The third way in which tasks are commonly classified is by their longevity. Where short tasks are concerned, the migration overhead can outweigh any gains made by executing the tasks at a faster, or less-loaded, node. The History scheme [SVE90] filters tasks which historically are found to execute for short amounts of time so that they are not migrated. Such tasks are termed short-lived. Condor [LLM88] selects long-running tasks for transfer to idle workstations. The problems with such a system of classification are its coarseness and its vagueness. It is quite reasonable to describe processes that execute for hours as long-lived. Consider a task that executes for 5 minutes: is this task short-lived? The answer depends on the execution environment and the context in which the question is asked.


In a loosely-coupled load sharing context, the divide between short-lived and long-lived tasks should be defined as the cutoff point at which migration becomes beneficial. Such a distinction is system dependent and affected by the migration mechanism employed. Non preemptive transfers can be achieved at very low cost. In contrast, preemptive transfers are relatively costly in terms of processing time, network bandwidth and execution delay to the migrating process. In either case, transferred tasks must execute for sufficient time after transfer to make possible a performance gain greater than the cost (in terms of delay in particular). Thus the threshold execution-time that divides short-lived and long-lived tasks will be higher in systems where a preemptive transfer mechanism is employed.

5.3.2 A new descriptive framework

Weaknesses in the ways that tasks are commonly classified have been identified. The unique design of the model permits greater precision in task (and thus workload) descriptions. This increased precision of description is advantageous for the development of load sharing policies. A new descriptive framework, designed to overcome some of the identified weaknesses, is proposed in this section. Tasks are classified based on information held in their unique resource-use signatures. Workloads are then described based on their task composition. Three methods of task classification are collected into a framework:

1. Tasks are primarily classified by their resource-use intensity. For a given resource-type, the use-intensity of a specific task-type is determined as the fraction (percentage) of execution-time spent using the particular resource, on average. The exception is memory intensity, which is represented as memory use (Kilobytes) per unit of execution-time (Jiffys). Table 5.3.2-1 shows typical signatures for a number of task-types. Also shown are tasks’ use-intensity values for each resource-type.
Task-type (and arguments) | response-time (Jiffys) | run (Jiffys) | run (%) | disk (Jiffys) | disk (%) | network (Jiffys) | network (%) | TTY-IO (Jiffys) | TTY-IO (%) | inactive (Jiffys) | memory use (Kilobytes) | memory intensity (KB/Jiffy)
hanoi12 | 484 | 157 | 33 | 0 | 0 | 0 | 0 | 324 | 67 | 0 | 76 | 0.157
tsp11 | 1054 | 1051 | 100 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 44 | 0.042
createintfile (2 million records) | 534 | 307 | 57 | 214 | 40 | 0 | 0 | 0 | 0 | 0 | 44 | 0.082
readwriteint (2 million records) | 486 | 435 | 90 | 47 | 10 | 0 | 0 | 0 | 0 | 0 | 44 | 0.091
vi (typical sample) | 6323 | 5 | 0 | 4 | 0 | 0 | 0 | 2 | 0 | 6311 | 156 | 0.025
CPUmem20 | 343 | 343 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20612 | 60.093
monte-carlo-pi (10 million points) | 1248 | 1247 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40 | 0.032

Table 5.3.2-1: Task signatures showing their resource-use fractions as both absolute and intensity (percentage) values
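The intensity values in table 5.3.2-1 can be derived directly from the absolute signature elements, as the following sketch illustrates (the dictionary layout and names are assumptions for illustration; small differences from the tabulated percentages can arise from rounding and from averaging over several samples).

```python
def use_intensities(sig: dict) -> dict:
    """Derive resource-use intensities from a signature's absolute values.

    Time elements are absolute Jiffys; 'memory_use' is in Kilobytes.
    Each time element becomes a percentage of the execution (response) time;
    memory becomes Kilobytes per Jiffy of execution time.
    """
    total = sig["response_time"]
    intensities = {res: round(100 * sig.get(res, 0) / total)
                   for res in ("run", "disk", "network", "tty_io", "inactive")}
    intensities["memory"] = round(sig["memory_use"] / total, 3)   # KB/Jiffy
    return intensities

# Example: the hanoi12 signature from table 5.3.2-1.
hanoi12 = {"response_time": 484, "run": 157, "disk": 0, "network": 0,
           "tty_io": 324, "inactive": 0, "memory_use": 76}
print(use_intensities(hanoi12))
# -> run ~32%, tty_io ~67%, memory ~0.157 KB/Jiffy (the table quotes 33% CPU)
```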

Using this approach, the hanoi12 task, which by the classic method of primary resource use would be described as simply IO-intensive, may be more accurately described as 67% IO-intensive and 33% CPU-intensive. This distinction is important in a load sharing scheduling context as it indicates that the task will be affected by both CPU load and IO load, and provides an indication of the relative sensitivity of the task to each type of load. Table 5.3.2-1 also shows, for example: (1) tasks such as createintfile and readwriteint, which might be assumed to be disk-intensive, are actually found to be 57% and 90% CPU-intensive respectively; and (2) the CPUmem20 task-type is 100% CPU-intensive in terms of execution time and has a memory-intensity of over 60 KB/Jiffy, which indicates the task-type is likely to be highly sensitive to memory load (the effects of paging).

2. Granularity.

In the specific context of load sharing in loosely-coupled systems, a definition of granularity is required as a means of distinguishing between tasks which are coarse-grained enough to execute efficiently remotely and those that are not. In this sense it is only necessary to define a threshold value that indicates the crossover point. Computation time is the portion of execution time attributable to actually computing results, i.e. it does not include communication time. The computation-time gain achieved by transferring a component of a distributed application between a specific pair of processing nodes can be defined as:
computation-time gain = computation-time_LOCAL - computation-time_REMOTE        (1)

where LOCAL implies that all components of a distributed application execute at the same node and REMOTE implies that the application is distributed across nodes.

To achieve efficient remote execution of application components, the computation-time gain must be greater than the total communication delay incurred, so an aggregate execution-time gain is made. Thus granularity can be defined as:
granularity = coarse-grained    where    computation-time gain > total communication time
granularity = fine-grained      where    computation-time gain ≤ total communication time        (2)
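Definitions (1) and (2) can be encoded directly; the sketch below is illustrative only, with the LOCAL and REMOTE computation-time measurements and the total communication time supplied by the caller.

```python
def computation_time_gain(comp_local: float, comp_remote: float) -> float:
    """(1): computation-time gain from executing the component remotely (Jiffys)."""
    return comp_local - comp_remote

def classify_granularity(comp_local: float, comp_remote: float,
                         total_communication_time: float) -> str:
    """(2): coarse-grained if the computation-time gain exceeds the total
    communication time incurred, otherwise fine-grained."""
    gain = computation_time_gain(comp_local, comp_remote)
    return "coarse-grained" if gain > total_communication_time else "fine-grained"

# Example: a gain of 524 Jiffys against a communication delay of 4370 Jiffys
# (values of this order appear in table 5.3.2-2 below) is fine-grained.
print(classify_granularity(1218, 694, 4370))   # "fine-grained"
```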

A 100% compute-intensive application (monte-carlo-pi) is used to illustrate the issue of granularity. A parallel, variable-grained version has been developed. The application is configured to have a farmer that issues work and any number of workers that are each initially given a fraction of the total samples to compute. Each time a worker completes its current work it passes the partial result to the farmer, requesting more work. This continues until the farmer has received all of the partial results, so that the final result can be computed. The amount of work issued at a time (the number of samples to compute), and thus the granularity of the application, is governed by the grain-size.

For the granularity investigation a single worker is used. A number of different-granularity configurations are investigated, each representing a different (fixed-grain) application for the purposes of this experiment. The total computation requirement, i.e. the total number of samples, is the same for all configurations, at 1,000,000. For each configuration, there are two scenarios:
1. The worker is local to (at the same node as) the farmer. In this case the main cause of delay is context-switching between the farmer and the worker.
2. The worker is at a remote processing node. In this case the main cause of delay is network latency.

The processing nodes in the development system have different CPU performances. Expressed as ratios against the slowest node, the performances are: earth 1.00, air 2.14, fire 4.26, and water 10.84 (fastest).

If the farmer is always executed at the slowest node, a performance gain is expected when the worker is executed at a remote (faster) node, so long as the communication granularity is sufficiently coarse.
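The farmer/worker structure described above can be sketched as follows. This is a simplified, single-process illustration only: the real application distributes the worker to another node, and all names here are assumptions.

```python
import random

class Farmer:
    """Issues work in blocks of grain_size samples and accumulates partial results."""
    def __init__(self, total_samples: int):
        self.remaining = total_samples
        self.inside = 0
        self.total = 0

    def request_work(self, grain_size: int) -> int:
        issued = min(grain_size, self.remaining)   # next block of samples to compute
        self.remaining -= issued
        return issued

    def submit_result(self, inside: int, samples: int) -> None:
        self.inside += inside                      # partial result from the worker
        self.total += samples

    def result(self) -> float:
        return 4.0 * self.inside / self.total

def worker(farmer: Farmer, grain_size: int) -> None:
    """Repeatedly request grain_size samples, compute a partial estimate, return it."""
    while True:
        samples = farmer.request_work(grain_size)
        if samples == 0:
            break                                  # no work left
        inside = sum(1 for _ in range(samples)
                     if random.random() ** 2 + random.random() ** 2 <= 1.0)
        farmer.submit_result(inside, samples)      # one message exchange per block

# The smaller the grain-size, the more request/result exchanges (communication)
# are needed to cover the same 1,000,000 samples.
farmer = Farmer(total_samples=1_000_000)
worker(farmer, grain_size=10_000)
print(farmer.result())                             # approximately pi
```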


Table 5.3.2-2 illustrates the behaviour and granularity categorisations for nine different configurations of the parallel Monte-Carlo approximation of pi application.
config | farmer location | worker location | grain-size (samples) | response-time (Jiffys) | computation time (Jiffys) | computation-time gain (Jiffys) | communication delay incurred (Jiffys) | granularity
1 | earth | earth | 1000000 | 1218 | 1218 | N/A | N/A | Coarse
1 | earth | air | 1000000 | 694 | 694 | 524 | 0 | Coarse
1 | earth | fire | 1000000 | 387 | 387 | 831 | 0 | Coarse
1 | earth | water | 1000000 | 190 | 190 | 1028 | 0 | Coarse
2 | earth | earth | 100000 | 1224 | 1218 | N/A | N/A | Coarse
2 | earth | air | 100000 | 661 | 694 | 524 | 0 | Coarse
2 | earth | fire | 100000 | 381 | 387 | 831 | 0 | Coarse
2 | earth | water | 100000 | 161 | 190 | 1028 | 0 | Coarse
3 | earth | earth | 10000 | 1264 | 1218 | N/A | N/A | Coarse
3 | earth | air | 10000 | 707 | 694 | 524 | 13 | Coarse
3 | earth | fire | 10000 | 414 | 387 | 831 | 27 | Coarse
3 | earth | water | 10000 | 195 | 190 | 1028 | 5 | Coarse
4 | earth | earth | 1000 | 1606 | 1218 | N/A | N/A | Coarse
4 | earth | air | 1000 | 1068 | 694 | 524 | 374 | Coarse
4 | earth | fire | 1000 | 712 | 387 | 831 | 325 | Coarse
4 | earth | water | 1000 | 482 | 190 | 1028 | 292 | Coarse
5 | earth | earth | 100 | 5083 | 1218 | N/A | N/A | Fine
5 | earth | air | 100 | 5064 | 694 | 524 | 4370 | Fine
5 | earth | fire | 100 | 3745 | 387 | 831 | 3358 | Fine
5 | earth | water | 100 | 3397 | 190 | 1028 | 3207 | Fine
6 | earth | earth | 50 | 8409 | 1218 | N/A | N/A | Fine
6 | earth | air | 50 | 9369 | 694 | 524 | 8675 | Fine
6 | earth | fire | 50 | 6853 | 387 | 831 | 6466 | Fine
6 | earth | water | 50 | 6256 | 190 | 1028 | 6066 | Fine
7 | earth | earth | 30 | 14347 | 1218 | N/A | N/A | Fine
7 | earth | air | 30 | 15272 | 694 | 524 | 14578 | Fine
7 | earth | fire | 30 | 11213 | 387 | 831 | 10826 | Fine
7 | earth | water | 30 | 10647 | 190 | 1028 | 10457 | Fine
8 | earth | earth | 10 | 37495 | 1218 | N/A | N/A | Fine
8 | earth | air | 10 | 42120 | 694 | 524 | 41426 | Fine
8 | earth | fire | 10 | 33771 | 387 | 831 | 33384 | Fine
8 | earth | water | 10 | 30277 | 190 | 1028 | 30087 | Fine
9 | earth | earth | 1 | 359337 | 1218 | N/A | N/A | Fine
9 | earth | air | 1 | 428961 | 694 | 524 | 428267 | Fine
9 | earth | fire | 1 | 346530 | 387 | 831 | 346143 | Fine
9 | earth | water | 1 | 310324 | 190 | 1028 | 310134 | Fine

Calculation of values:
Computation time: approximated using the response-time values of configuration 1, which involves minimal communication.
Computation-time gain: from (1), using the response-time data for configuration 1.
Communication delay incurred: approximated as response-time - computation time.
Granularity: from (2).

Table 5.3.2-2: Granularity classification of a number of differently-grained configurations of the parallel Monte-Carlo approximation of pi

The total computation requirement can be approximated as the response-time of the first configuration which only needs a single message to be sent from the worker at the end of its processing.


Using (1), the computation-time gains are calculated. These are the same in each configuration because the total computation requirement is kept constant over all configurations. As the grain-size is reduced, the communication delay increases. This causes responsiveness to fall dramatically, as shown in column five of table 5.3.2-2. The results show that configurations 5-9 incur a communication delay greater than the computation-time gains achieved. The distributed execution of these configurations is inefficient and, using (2), they are classified as fine-grained.

3. Longevity.

Migration delay is a factor that should be considered when determining the longevity rating of tasks. For a given system, a threshold value can be derived that represents the minimum post-migration task-execution time required by a task to make migration feasible (i.e. a performance gain is possible but not guaranteed). The preemptive migration policy used in [HD96] is to choose the task that has the highest probability of executing for longer than the transfer delay period. [CHO90] shows that load sharing can be beneficial when the time-cost of a task-transfer is as much as 30% of the service-time of the task concerned. In simulations in [ELZ86B] the performance of load sharing policies is found to degrade rapidly as transfer costs exceed 25% of processing cost. These results indicate that:
tasks should only be transferred if they are expected to execute for at least four times the transfer delay period, after transfer, in order to significantly reduce the occurrence of inefficient task transfers. (3)

The formula used to derive the threshold depends on the task transfer mechanism employed. With preemptive migration mechanisms a task can migrate at any point in its execution, so it is important to consider the elapsed time at the point of transfer and hence to reason about the remaining execution time. The approach taken is to assume that:


(on average) at the point of migration a task is at its mid-point in execution. (4)

This concurs with results in [HD96], in which it is found that the probability that a process which has already used t seconds of CPU-time will use at least an additional t seconds of CPU-time is about 0.5.
Let transfer delay = d (assumed constant for a given system)
Let response-time = r (task specific)

For preemptive migration environments, longevity is defined in the framework as:
longevity = long-lived     where    r > 2 * 4d
longevity = short-lived    where    r ≤ 2 * 4d        from (3) and (4).

With non preemptive transfer mechanisms a task is moved prior to execution and thus all of its execution-time is spent at the remote node. For non preemptive migration environments longevity is defined in the framework as:
longevity = long-lived     where    r > 4d
longevity = short-lived    where    r ≤ 4d        from (3).
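These rules can be captured directly, as in the sketch below (illustrative only; d and r are assumed to be expressed in the same units, e.g. Jiffys).

```python
def longevity(r: float, d: float, preemptive: bool) -> str:
    """Classify a task as long-lived or short-lived.

    Non preemptive transfers: long-lived if r > 4d, from (3).
    Preemptive transfers: long-lived if r > 2 * 4d, from (3) and (4), since on
    average only half of the task's execution remains at the point of migration.
    """
    threshold = (2 * 4 * d) if preemptive else (4 * d)
    return "long-lived" if r > threshold else "short-lived"

# Example with a non preemptive transfer delay of 12 Jiffys (threshold 48 Jiffys):
print(longevity(r=484, d=12, preemptive=False))   # long-lived (e.g. hanoi 12)
print(longevity(r=2,   d=12, preemptive=False))   # short-lived (e.g. fib)
```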

The longevity definitions ensure that a task meets minimum requirements to be eligible for transfer but do not guarantee a net performance gain. The Concert load sharing scheme developed in this work (see chapter 7) uses a non preemptive transfer mechanism. The time-delay cost of transfers has been found to be almost always less than 0.12 seconds with an average of 0.08 seconds. The value d = 0.12 seconds is used rather than the average value to ensure a safety margin. A task can be classified as long-lived in this environment if its average response-time is greater than 48 Jiffys (0.48 seconds). Table 5.3.2-3 shows that the tsp, hanoi, and MM tasks are long-lived in the Concert environment (based on typical sample signatures) whilst the fib and nroot tasks are short-lived.


task and arguments | response-time (Jiffys) | run (Jiffys) | disk (Jiffys) | network (Jiffys) | TTY (Jiffys) | NFS (Jiffys) | IPC (Jiffys) | inactive (Jiffys) | memory (Kilobytes)
tsp 12 | 12846 | 12844 | 0 | 0 | 2 | 0 | 0 | 0 | 92
hanoi 12 | 484 | 176 | 0 | 0 | 308 | 0 | 0 | 0 | 76
fib 1,000,000 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 36
MM 250 | 697 | 697 | 0 | 0 | 0 | 0 | 0 | 0 | 12
nroot 4 10,000 | 23 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 36

Table 5.3.2-3: Classification of a sample of tasks by longevity

This method of classification is used as an eligibility check to filter tasks that are expected to execute for too short a time to recover the costs of migration. Tasks that meet the longevity criterion can be passed to the load sharing policy for consideration. The History and Stealth load sharing schemes have incorporated forms of longevity filtering. However, these schemes assume the response-time of the task will follow a pattern based only on previous response-times, i.e. current load levels are not taken into account. In chapter 4 the potential severity of performance penalties caused by load has been demonstrated. Consider a (nominally) short-lived task submitted to a node that has significant memory-load. Executing the task in this environment may cause a significant slowdown factor in its response-time (see figure 4.7.3.1-1). There is a high likelihood that the task will execute as a long-lived task if executed locally, thus it should be eligible for migration. The model facilitates the prediction of the response-time for a given task instance prior to the longevity rating being assigned. This takes into account the resource-use behaviour of the task-type (which is encapsulated in its signature) and the level of load at the local node (see section 4.8).

Workload descriptions

The term workload is used to collectively describe the tasks present at a processing node or within a system of nodes. The task classification framework introduced above is based on the model described in chapter 4 and thus distinguishes between task-types by their behaviour and resource usage. This approach is extended to describe workloads:


A workload can be described in terms of its task composition by category. In this way there are three aspects to workload description:
1. In terms of resource-use. For example, the workload is said to be primarily compute-intensive if the majority of tasks are primarily compute-intensive as defined by the framework.
2. In terms of granularity. A workload is described as 60% fine-grained if 60% of the tasks are fine-grained as defined by the framework.
3. In terms of longevity. A workload is described as 70% long-lived if 70% of the tasks are long-lived as defined by the framework.

5.3.3 Applicability of the framework to load sharing

During the development of a load sharing scheme it is advantageous to know the typical patterns of workload composition and intensity for a system. Such information aids the selection of appropriate mechanisms and policies. For example, simpler policies, perhaps based only on the CPU run-queue length, can be employed for homogeneous workloads. Highly variable workloads require a load sharing policy that can differentiate between the diverse resource requirements of tasks and schedule accordingly. The resource-use intensity method of classification can be used by a load sharing policy to select the execution location for each task based on a mapping between the resource requirements of the specific task and the resource availability at processing nodes. In a load sharing context the granularity classification technique can be used when composite (distributed or parallel) applications are scheduled, to prevent the inefficient migration of components of fine-grained applications. In a load sharing context the longevity method of classification can be used as a migration eligibility check to filter short-lived tasks that are expected to execute for too short a time to recover the costs of migration.

5.4 Workload composition for development and testing

During the development of a load sharing scheme or codified model it is important that the workloads used are highly representative of those found in the target domain. Generally, stable workloads ease the stress on a load sharing scheme whilst highly variable workloads are more demanding. As variability in workloads increases, so does

the complexity of the load sharing problem. [DL97] observes that the performance of distributed load sharing policies is sensitive to variance in task service-times and inter-arrival times. More sophisticated policies may be required for systems in which workload variability is generally high. If the development workloads are too restricted, representing only a fraction of the actual task behaviour space, then it is likely that test results will be unrealistically optimistic. This is likely to be true of many of the models that have been based on restrictive workload assumptions; examples include [ELZ86A,FIN90,FMP94]. On the other hand, testing a scheme with a workload that varies excessively compared to the expected real workloads is likely to lead to pessimistic results. As it is difficult to predict the actual effect that an inappropriate workload will have on test results, it is important that development workloads are carefully devised to ensure that they are adequately representative (see section 5.11).

This work primarily focuses on general-purpose workload scenarios in which wide variance in task-behaviour can occur. For this reason, workloads are developed that represent a wide state-space of possible task behaviour and resource-use intensity. Many systems will in fact have more restricted workloads. Each of these cases can be treated later as a special (simpler) instance of the general case. By tackling the general (difficult) case, the work herein has applicability to all systems.

5.4.1 The effects of different task mixes

The extent to which a workload is composed of similar or dissimilar tasks can affect the response-time of those tasks. Tasks that compete for the same resource-types experience worse slowdowns than when the tasks each primarily use different resources. In this sense, workloads consisting of similar tasks (in terms of resource-use) can be said to be competitive, while workloads in which tasks each primarily use different resources can be said to be complementary.

The first illustration compares the responsiveness of a compute-intensive task-type (a travelling-salesman-problem solver) in increasingly competitive compute-intensive background workloads; with the responsiveness of the same task-type in more


complementary disk-intensive background workloads. The slowdown factors encountered as the background workload intensity increases are shown in table 5.4.1-1.
background task | number of background tasks | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | TTY (Jiffys) | memory use (Kilobytes) | slowdown factor
tsp 999 | 0 | 1051 | 1050 | 0 | 1 | 44 | 1.000
tsp 999 | 1 | 2170 | 1065 | 1104 | 1 | 44 | 0.484
tsp 999 | 2 | 3243 | 1068 | 2174 | 0 | 44 | 0.324
tsp 999 | 3 | 4349 | 1068 | 3279 | 0 | 44 | 0.242
disk_const | 0 | 1053 | 1050 | 1 | 2 | 44 | 1.000
disk_const | 1 | 2051 | 1149 | 900 | 2 | 44 | 0.513
disk_const | 2 | 2596 | 1167 | 1426 | 2 | 44 | 0.406
disk_const | 3 | 3920 | 1178 | 2740 | 2 | 44 | 0.269

Note: tsp999 is 100% compute-intensive, disk_const is 18.5% disk-intensive.

Table 5.4.1-1: Comparison of the effects of compute-intensive workloads and disk-intensive workloads on the performance of the compute-intensive task tsp11
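As a worked reading of the table: each slowdown factor is the response-time with no background tasks divided by the loaded response-time; for example 1051 / 2170 ≈ 0.484 for one tsp999 background task, and 1053 / 2051 ≈ 0.513 for one disk_const background task.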

The second illustration compares the responsiveness of a disk-intensive task-type (sort) in increasingly competitive disk-intensive background workloads; with the responsiveness of the same task-type in more complementary compute-intensive workloads. In each case the same background tasks are used as in the first illustration. The text-file to be sorted consists of 9,006,916 characters in 214,602 lines. For consistency the original file is restored after each execution of sort. The slowdown factors encountered as the background workload intensity increases are shown in table 5.4.1-2.
background task | number of background tasks | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | disk (Jiffys) | memory use (Kilobytes) | slowdown factor
disk_const | 0 | 1935 | 1271 | 48 | 616 | 1072 | 1.000
disk_const | 1 | 6300 | 1228 | 787 | 4285 | 1072 | 0.307
disk_const | 2 | 7733 | 1302 | 1770 | 4661 | 1072 | 0.250
disk_const | 3 | 9105 | 1265 | 3236 | 4603 | 1072 | 0.213
tsp 999 | 0 | 2003 | 1267 | 53 | 683 | 1072 | 1.000
tsp 999 | 1 | 3866 | 1075 | 1714 | 1076 | 1072 | 0.518
tsp 999 | 2 | 5690 | 1060 | 3038 | 1592 | 1072 | 0.352
tsp 999 | 3 | 7243 | 1075 | 4113 | 2055 | 1072 | 0.277

Table 5.4.1-2: Comparison of the effects of disk-intensive workloads and compute-intensive workloads on the performance of the disk-intensive task sort

In each illustration it can be seen that the slowdowns are less severe with complementary task mixes than with competitive task mixes. This implies that careful placement of tasks, based on knowledge of the existing workload, can be beneficial. The rich-information approach permits this.

5.5 Task-arrival rates

Some models of load sharing have assumed that task-arrival rates are homogeneous across all nodes, for example [BAN93]. This is primarily a convenient simplification to


those models and is not realistic in general. In fact, differences in task-arrival rates at nodes are one of the main causes of load imbalance in many systems. As discussed in section 2.6.3, many models assume the task-arrival process to be Poisson. [TL89] observes that users spend at least 1 minute thinking between task submissions on average, and assumes a Poisson distribution with a mean of 1 task per minute. Where user-interactive tasks predominate, there may be times when the task-arrival rate drops to zero at many nodes for significant periods due to lunch-times, night-times and weekends. Such use-patterns represent significant inefficiency in resource utilisation. A load sharing scheme can take advantage by migrating load to the unused nodes. This is the approach adopted in Freedman-Sharp and DAWGS, and in earlier schemes including Sprite and Condor.

The way in which a particular system is used, in terms of the types of application present, can have a significant effect on the task-arrival pattern:
• Manufacturing, control, and monitoring systems are likely to have quite regular or predictable task-arrival patterns.
• Business systems and other user-interactive systems tend to be driven by user-related events, which are random, in terms of timing, to a first approximation.
• Batch computing still has its place. It is used in application areas which include scientific computing, payroll and auditing applications. Typically the task-arrival rate will be low as the workload will consist of few, large tasks.

The task-arrival rate can be dependent in part on the service time of tasks, since the results from one task are often required prior to the initiation of another task. In such a scenario, a specific task cannot ‘arrive’ (start execution) until all earlier tasks in the task-graph on which the subject task has dependencies have completed.

5.5.1 Significance when designing a load sharing scheme

The frequency of load measurement should relate to the frequency of major events, including task arrival, which cause load levels to change.


The extent to which task-arrival rates fluctuate may be a determining factor in selecting the load information dissemination technique. For example, where the rate is highly variable, state-change dissemination is most suitable. Periodic dissemination is appropriate for systems in which the arrival-rate variance is low.

5.6 Task response-times

Recall that response-time has been defined in section 4.2 as the sum of the time a task spends executing and the time it spends waiting to use each different resource-type, i.e. the total time the task is in the system. Many models have been based on the assumption that tasks are homogeneous, each having the same fixed response-time. This can be a dangerous assumption. The effects of the two forms of turbulence identified in section 4.5 ensure that there is a very low probability that a task will execute in exactly the same way twice. Even if tasks are homogeneous and the input parameters remain constant, the complex interaction with other tasks through communication and competition for resources adds a pseudo-random element to response-time.

A slightly more realistic approach is to assume that tasks are the same in nature but have service times that follow some known distribution. The Poisson distribution and other exponential distributions are popular. This assumption fails to recognise the large variations in task behaviour that can occur between task-types. Svensson in [SVE90] suggests that distributions for CPU-time requirement are far from exponential. [TL89] assumes a workload consisting of short, interactive tasks and few long-running compute-intensive tasks. This assumption is given credence by the results of a process-lifetime distribution study in [HD96], which finds that typical workloads consist of predominantly shorter tasks and few long tasks. The variance in workloads is found to be greater than that of an exponential distribution. In general agreement with earlier work in [LO86], the main finding is that a distribution of 1/T is more representative of service-times than an exponential distribution, where T is process lifetime. This is because the exponential distributions lack the tail of long-lived tasks. The 1/T distribution is compared with popular exponential distributions in figure 5.6-1.


Figure 5.6-1: Comparison of the 1/T and exponential (Poisson with mean = 1, and e^-x) distributions [plot of event probability (0-1) against interval (1-10) for the three distributions]

5.6.1 Significance when designing a load sharing scheme

The predominant duration of tasks within a workload may affect the choice of load measurement and dissemination techniques and frequencies; this is appropriate for specialised systems. In general-purpose systems the composition of workloads is more variable and less predictable. The greater the variance in task-types and response-times, the lower the applicability of the most popular load index, the CPU run-queue length, which provides no information regarding the constitution of workloads. A more detailed load index (consisting of multiple load metrics) is needed to represent the effects of high-variance workloads. The performance of some load sharing policies is sensitive to the mean execution time of tasks. There is a greater risk of inefficient transfers when tasks have predominantly short execution times.

5.7 Detailed treatment of turbulence

The new concept of turbulence is developed further in this section as a framework for describing ways in which a task's execution behaviour can be influenced by factors internal and external to the task. The taxonomy described in section 4.5 provides structure for the discussion.

5.7.1 Extrinsic turbulence

Extrinsic turbulence is caused by the competition for resources that arises between tasks in the system. Much of this competition occurs between local tasks for local resources. In addition, resources such as the network (communication bandwidth) and remote files are contended for by tasks at several nodes.

Section 4.7 deals with extrinsic turbulence in considerable detail. Ways in which load on each of the major resource-types arises have been investigated. Numerous metrics which represent the way that load affects the behaviour of tasks are evaluated. The most representative metrics are retained in the model. Ways in which the effects of load on the response-time of tasks can be predicted are developed in section 4.8.

Quantification of extrinsic turbulence is difficult. As an initial approach, the number of dimensions of extrinsic turbulence that are present is determined. In this approach each resource that is contended for contributes one dimension of extrinsic turbulence. For a given node, there are a specific number of resources that can be contended for between processes. A task that executes in a system that consists of processing nodes with, for example, four resource-types (CPU, disk, memory and network interface) can therefore experience a maximum of four dimensions of extrinsic turbulence. However, task-types that use a subset of resources are only affected by load on those resources. The matrix multiplication task MM predominantly uses CPU and memory; this task-type is therefore sensitive to two dimensions of extrinsic turbulence.

5.7.2 Intrinsic turbulence

Intrinsic turbulence arises from the internal behaviour of tasks and the way in which this behaviour is affected by direct and indirect changes in the input parameters. There are many possible causes of intrinsic turbulence, specific to the task-type concerned. A number of task-types are used to investigate the nature of intrinsic turbulence and the way in which it affects the response-time of tasks. Extrinsic turbulence is avoided as far as possible during this investigation.

Complexity factors of intrinsic turbulence

This sub-section is concerned with issues encompassed by traditional complexity analysis (CA). For example, a matrix multiplication program requires memory and processing time in some proportion to the sizes of the matrices. The relationship is fixed in the design of the algorithm and is a measure of its computational complexity. For the purposes of discussion each source of variance is termed a dimension. The matrix multiplication example is subject to a single dimension of intrinsic turbulence (the size of the matrices).

Once the number of dimensions of intrinsic turbulence has been identified for a given task-type, there still remains the problem of quantifying the turbulence in each dimension. This can be achieved for individual task-types theoretically, by examination of the algorithm, or empirically over several executions, changing only one input parameter at a time and keeping extrinsic turbulence constant over all executions. In many systems that need an estimate of a task's resource requirements or response-time, for example real-time schedulers and some load sharing schemes, the user is required to define the behaviour of task-types. Such an approach has a number of drawbacks that include:
• Lack of transparency to users.
• Requirement that users understand the functional behaviour of all task-types that they use.
• Susceptibility to description errors, leading to incorrect behaviour predictions.
• Description error can compound as the number of dimensions of intrinsic turbulence increases.
• Generally, every task-type is considered a special case.

The codified model facilitates empirical quantification of intrinsic turbulence through the use of the automatically maintained task signatures. This is generally applicable to all task-types and removes the need for programmer or user intervention. A number of task-types are used to demonstrate the utility of signatures in this respect. The towers-of-Hanoi task (hanoi) execution-time is found to approximately double with each increment in problem size (the number of disks moved between the towers, passed as a parameter). It is said to have computational complexity O(2^n), where n represents the problem size. In contrast, the travelling-salesman solution task (tsp) execution-time increases in a factorial relationship with the problem size (the number of cities in the tour). It has computational complexity O(n!). Tables 5.7.2-1 and 5.7.2-2 show the effect of intrinsic turbulence in a single dimension on the hanoi and tsp task-types respectively. The processing node used in these experiments ('water') has a Pentium 200MHz CPU and 32MB RAM.


Problem size (no. disks) | Response-time (seconds) | Increase due to problem size increment (ratio to previous value)
8 | 0.25 | N/A
9 | 0.51 | 2.04
10 | 1.08 | 2.12
11 | 2.31 | 2.14
12 | 4.48 | 1.94
13 | 10.35 | 2.31
14 | 21.75 | 2.10
15 | 45.47 | 2.09
16 | 95.19 | 2.09
17 | 199.99 | 2.10

Table 5.7.2-1: Intrinsic turbulence in towers-of-Hanoi

Problem size (no. cities) | Response-time (seconds) | Increase due to problem size increment (ratio to previous value)
9 | 0.13 | N/A
10 | 1.03 | 7.92
11 | 10.96 | 10.64
12 | 129.08 | 11.78
13 | 1658.67 | 12.85
14 | 22962.00 | 13.84

Table 5.7.2-2: Intrinsic turbulence in the travelling-salesman problem

Table 5.7.2-3 illustrates the effect of a single dimension of intrinsic turbulence (the number of random points to generate) on the resource-use behaviour of the monte_carlo_pi task-type.
Argument value (number of random points generated) | response-time (Jiffys) | run (Jiffys) | memory use (Kilobytes)
1,000,000 | 126 | 126 | 40
2,000,000 | 251 | 251 | 40
4,000,000 | 501 | 501 | 40
8,000,000 | 1002 | 1002 | 40
16,000,000 | 2003 | 2003 | 40
32,000,000 | 4005 | 4005 | 40
64,000,000 | 8008 | 8008 | 40

Table 5.7.2-3: The effect of intrinsic turbulence on the monte_carlo_pi task-type

The task-type is 100% CPU-intensive, and this intensity is unaffected by intrinsic turbulence. In this task-type the response-time increases in direct proportion to the increase in problem size; it has computational complexity O(n). The signature results show that the memory requirement of the task-type is unaffected by problem size.

The createintfile task-type creates a file of a user-specified number of records. Each record consists of a single integer field containing a random number. createintfile is subject to intrinsic turbulence in the size of the file to create, as shown in table 5.7.2-4.
Argument value (number of records in created file) | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | disk (Jiffys) | memory use (Kilobytes)
500,000 | 75 | 64 | 1 | 10 | 44
1,000,000 | 140 | 127 | 0 | 13 | 44
2,000,000 | 381 | 305 | 10 | 66 | 44
4,000,000 | 877 | 620 | 21 | 236 | 44
8,000,000 | 1639 | 1294 | 52 | 293 | 44
16,000,000 | 3539 | 2685 | 117 | 737 | 44
* 500,000 | 270 | 84 | 4 | 182 | 44

Table 5.7.2-4: The effect of intrinsic turbulence on the createintfile task-type


Records are created serially in a loop. Memory requirement is unaffected by the number of loop iterations and thus unaffected by intrinsic turbulence in the size of the file to create. The CPU-time requirement (the run element of the signature) and the disk element clearly increase with file size. However, the patterns are less clear than in the monte-carlo-pi example. The deviation from simple patterns is attributed to the randomness in disk-access time, as identified in section 4.7.2. The final case, marked (*), involved truncation of a very large existing file, unlike the first case in which the file did not exist before the task was executed. This is a form of turbulence not treated by CA. This example illustrates that intrinsic turbulence can be present even when parameter values remain unchanged.

The readwriteint task-type reads a file created by createintfile and writes it to another file, i.e. it is a specialised record-by-record file-copy utility. readwriteint is subject to intrinsic turbulence in file size, as shown in table 5.7.2-5.
Number of data records in file | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | disk (Jiffys) | memory use (Kilobytes)
500,000 | 104 | 99 | 0 | 5 | 44
1,000,000 | 246 | 221 | 1 | 24 | 44
2,000,000 | 587 | 480 | 14 | 93 | 44
4,000,000 | 1607 | 993 | 35 | 579 | 44
8,000,000 | 4537 | 2148 | 119 | 2270 | 44

Table 5.7.2-5: The effect of intrinsic turbulence on the readwriteint task-type

The effects of intrinsic turbulence on this task-type are similar to the effects on the createintfile task-type. Note, however, the substantial increases in total disk-access time caused by increasing file size. This effect is attributed to disk buffering and caching activities; see section 4.7.2.1.

The nroot task-type uses recursive successive approximation to determine the nth root of a number. There are three input parameters: root, number, and number-of-approximations. This quite simple task-type is used to illustrate some of the effects of multiple dimensions of intrinsic turbulence. First, each parameter is changed in isolation, then multiple parameters are changed at a time. The results are shown in table 5.7.2-6.


parameter #1 root (n) | parameter #2 number | parameter #3 number-of-approximations | run (Jiffys) | memory use (Kilobytes)
100 | 5 | 2000 | 4 | 36
200 | 5 | 2000 | 6 | 36
300 | 5 | 2000 | 8 | 36
400 | 5 | 2000 | 10 | 36
500 | 5 | 2000 | 12 | 36
600 | 5 | 2000 | 14 | 36
200 | 2 | 2000 | 6 | 36
200 | 5 | 2000 | 6 | 36
200 | 10 | 2000 | 6 | 36
200 | 20 | 2000 | 6 | 36
200 | 40 | 2000 | 6 | 36
200 | 60 | 2000 | 6 | 36
500 | 5 | 500 | 4 | 36
500 | 5 | 1000 | 7 | 36
500 | 5 | 2000 | 12 | 36
500 | 5 | 4000 | 22 | 36
500 | 5 | 8000 | 43 | 36
100 | 100 | 15000 | 20 | 36
100 | 200 | 2000 | 4 | 36
150 | 100 | 12000 | 22 | 36
200 | 25 | 9000 | 21 | 36
200 | 50 | 6000 | 15 | 36
300 | 12.5 | 18000 | 52 | 36

Table 5.7.2-6: The effect of multiple dimensions of intrinsic turbulence on the nroot task-type

When only the root argument is varied, a clearly identifiable relationship between the value of the argument and the CPU-time requirement of the task emerges. The value of the number whose root is to be found has no effect on the behaviour of the task. This is a simple but important observation, showing that a task's sensitivity to turbulence can vary across parameters. Isolated changes in the number-of-approximations parameter are found to relate almost linearly to the resulting change in CPU-time requirement. However, when several parameters change at a time the resulting CPU-time requirement pattern is less obvious.

If the exact functional nature of a task-type is known, or a case-history of a number of differently parameterised samples is available (the approach taken in this work), then the effect of complexity factors on that task-type can in general be identified, as with the tasks illustrated above. Once a relationship has been established it can be extrapolated for new values of the parameter. For example, the response-time for hanoi with a parameter value of 18, in the absence of extrinsic turbulence, can be predicted to be approximately 400 seconds.
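A minimal sketch of this kind of extrapolation from a signature case-history is shown below, assuming (as observed for hanoi) an approximately constant growth ratio per unit increase in the parameter; the function name and history layout are illustrative.

```python
def extrapolate_response_time(history: dict, new_param: int) -> float:
    """Extrapolate response-time for a new parameter value from a case-history
    of {parameter value: response-time} samples, assuming geometric growth."""
    params = sorted(history)
    # Average per-increment growth ratio over the recorded samples.
    ratios = [(history[b] / history[a]) ** (1.0 / (b - a))
              for a, b in zip(params, params[1:])]
    ratio = sum(ratios) / len(ratios)
    last = params[-1]
    return history[last] * ratio ** (new_param - last)

# Using the later hanoi samples from table 5.7.2-1 (response-times in seconds):
hanoi_history = {14: 21.75, 15: 45.47, 16: 95.19, 17: 199.99}
print(round(extrapolate_response_time(hanoi_history, 18)))
# -> roughly 420 seconds, of the same order as the estimate given above
```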


Non-complexity factors of intrinsic turbulence

This sub-section focuses on aspects of intrinsic turbulence that fall outside the scope of CA. The non-complexity aspects of intrinsic turbulence arise dynamically at run-time from the execution context of the process. Examples of task-types that are affected by non-complexity intrinsic turbulence include:
• sort, affected by 3 forms of intrinsic turbulence: size of file, location of file (local or remote) and initial degree of ‘sortedness’ of the file.
• grep, affected by 4 forms of intrinsic turbulence: size of file, location of file (local or remote), size of pattern to search for and number of ‘hits’.
• editors (such as vi) and word processors, affected by 4 forms of intrinsic turbulence: size of file, location of file (local or remote), average user-response time interval and number of editing events.

Probably the most difficult aspect of intrinsic turbulence is human interaction. The extent to which a task-type is sensitive to turbulence arising from interaction with users is dependent to some extent on the computation-to-I/O ratio of the task-type. Consider an editor: most of its execution-time will be spent waiting for user input. The response-time of the task is therefore largely dictated by the think-time and typing speed of the user. An editor thus has a very low computation-to-I/O ratio, and the level of load on resources such as the CPU and disk is of relatively low significance in determining the response-time of the task. A scientific computing application that requires initial interaction for configuration purposes, and then runs independently for a relatively long time, will have a high computation-to-I/O ratio and thus the effects of non-complexity intrinsic turbulence will be low.

Quantification of intrinsic turbulence

The use of the concept of dimensions is only the first step to quantifying turbulence. How much turbulence is present in each dimension, and how can it be measured? One possible approach is to introduce the concept of distance (the difference between parameter values). For example, the task instances monte_carlo_pi 1000000 and monte_carlo_pi 2000000 could be said to be a distance of 1 million apart in a single dimension. In this way, it may be possible to employ vector notation to represent the extent of turbulence across multiple dimensions. However, parameter values are used

internally in a wide variety of ways. They may govern the number of times a loop is executed, be incorporated into a mathematical formula, be used to index a record in a database, etc. The extent to which a file is in sorted order is unknown in general prior to the initiation of a sort utility, but this factor can have a significant effect on the response-time of the utility. Therefore distance, measured as above, is an unreliable representation of the amount of intrinsic turbulence. There is also the problem of how to deal with task-types which have variable numbers of parameters, such as the cc task-type, which compiles together one or more source files with zero or more control parameters. Due to the time limits imposed, quantification of intrinsic turbulence in complex cases is left for further work.

5.8 Distributed Applications

Scheduling distributed applications introduces additional complexities that include:
• Whether the application is suitable for physical distribution across processing nodes or whether the components should be kept at the same node. This is a matter primarily of communication granularity.
• If physically distributed, how are the locations of the components decided?
• To what extent does the choice of execution location affect performance?

A simple RPC-based Client/Server database application (db1) has been developed to illustrate these issues. The db1Client task-type requests a user-supplied number of 1-Kilobyte records from the server. Records are requested one at a time, in succession. The db1Server task-type is inactive between requests. On receiving a request, the appropriate record is read from the data file and transferred to the client. The server treats each record request as an atomic activity. Table 5.8-1 shows the signature data for the db1Client task-type under a wide range of db1 configurations.


Configuration table section | client location | server location | no. data records to retrieve (client's user-supplied parameter) | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | network (Jiffys) | inactive (Jiffys) | memory use (Kilobytes)
1 | water | water | 100 | 50 | 12 | 0 | 0 | 38 | 288
1 | water | water | 200 | 146 | 22 | 3 | 0 | 121 | 488
1 | water | water | 400 | 492 | 40 | 6 | 0 | 446 | 888
1 | water | water | 800 | 1860 | 86 | 4 | 0 | 1770 | 1688
1 | water | water | 1600 | 7359 | 139 | 5 | 0 | 7215 | 3288
2 | fire | water | 100 | 133 | 19 | 3 | 1 | 107 | 288
2 | fire | water | 200 | 312 | 35 | 0 | 3 | 271 | 488
2 | fire | water | 400 | 807 | 60 | 3 | 5 | 738 | 888
2 | fire | water | 800 | 2501 | 137 | 12 | 12 | 2338 | 1688
2 | fire | water | 1600 | 8283 | 270 | 15 | 9 | 7988 | 3288
3 | fire | fire | 100 | 99 | 23 | 1 | 0 | 74 | 288
3 | fire | fire | 200 | 284 | 35 | 0 | 0 | 248 | 488
3 | fire | fire | 400 | 947 | 72 | 5 | 0 | 869 | 888
3 | fire | fire | 800 | 3561 | 152 | 7 | 0 | 3401 | 1688
3 | fire | fire | 1600 | 13880 | 325 | 23 | 0 | 13531 | 3288
4 | water | fire | 100 | 153 | 11 | 2 | 2 | 132 | 288
4 | water | fire | 200 | 391 | 13 | 0 | 0 | 374 | 488
4 | water | fire | 400 | 1123 | 32 | 2 | 6 | 1083 | 888
4 | water | fire | 800 | 3799 | 70 | 4 | 11 | 3704 | 1688
4 | water | fire | 1600 | 13864 | 131 | 7 | 12 | 13703 | 3288
5 | earth | earth | 400 | 3589 | 328 | 40 | 0 | 3219 | 904
5 | earth | air | 400 | 4099 | 280 | 9 | 5 | 3803 | 904
5 | earth | fire | 400 | 1453 | 305 | 5 | 6 | 1135 | 904
5 | earth | water | 400 | 1055 | 299 | 8 | 5 | 741 | 904
5 | air | earth | 400 | 3545 | 228 | 6 | 4 | 3307 | 904
5 | air | air | 400 | 3778 | 212 | 24 | 0 | 3542 | 904
5 | air | fire | 400 | 1322 | 205 | 8 | 7 | 1147 | 904
5 | air | water | 400 | 970 | 219 | 6 | 5 | 741 | 904
5 | fire | earth | 400 | 3436 | 61 | 7 | 6 | 3361 | 888
5 | fire | air | 400 | 3772 | 65 | 4 | 4 | 3698 | 888
5 | fire | fire | 400 | 948 | 94 | 27 | 0 | 826 | 888
5 | fire | water | 400 | 796 | 59 | 6 | 7 | 723 | 888
5 | water | earth | 400 | 3375 | 42 | 0 | 4 | 3329 | 888
5 | water | air | 400 | 3549 | 21 | 3 | 4 | 3521 | 888
5 | water | fire | 400 | 1125 | 34 | 2 | 8 | 1074 | 888
5 | water | water | 400 | 489 | 31 | 17 | 0 | 441 | 888

Table 5.8-1: Signatures of the db1 database application client

The first and third sections of the table illustrate the effect of increasing the number-of-records parameter on the response-time of the client when it is local to the server. The response-time is found to increase in a worse-than-linear fashion due to database file-access behaviour at the server. The inactive element values are caused by the client blocking between sending a request to, and receiving a reply from, the server. This includes the time required by the server to process the request. The second and fourth sections of the table illustrate that the same trend emerges when the client is remote. The response-time in each case is worse than the equivalent local-client case due to network delay.
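The request/reply structure that produces this blocking behaviour can be sketched as follows (an illustrative sketch only: the real application uses RPC, and the function names and record handling here are assumptions).

```python
def db1_client(request_record, number_of_records: int) -> list:
    """Fetch records one at a time, in succession, from the db1 server.

    Each call blocks until the server has read the record from its data file
    and returned it; this waiting is what accumulates on the client's
    inactive state in table 5.8-1.
    """
    records = []
    for record_number in range(number_of_records):
        record = request_record(record_number)    # blocking request/reply exchange
        records.append(record)                    # each record is ~1 Kilobyte
    return records

# Minimal stand-in for the RPC stub, for illustration only:
def fake_request_record(record_number: int) -> bytes:
    return bytes(1024)                            # a 1-Kilobyte record

print(len(db1_client(fake_request_record, 400)))  # 400 records retrieved
```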


In the db1Client signatures, delay accrues primarily on the inactive state, rather than the network state, because RPC blocking is based on the select system call rather than blocking sockets (see section 4.6.1). The final section of the table compares the sixteen possible combinations of client and server location in order to determine the relative performance sensitivity of the application to the client and server locations respectively. The results reveal that the responsiveness of the db1Client task-type is dictated mainly by the performance of the processing node at which the server is located, the client location having little significance (see column 5 of table 5.8-1). The server performance ranking is found to be much more in line with nodes’ CF values than their DF values. This indicates that the db1Server task-type is primarily compute-intensive while it is servicing requests. However, between servicing requests the server is inactive, which will lower its signature-based CPU-intensity rating.

5.9 Local versus remote access to resources

This section investigates the performance overheads associated with remote access to resources. Data files are used as realistic examples of remotely accessed resources in the discussion. There are three ways in which a remote resource can be accessed:
1. Remote access; for example, NFS provides transparent access to remote files but there is a performance penalty.
2. Move the resource to the process. This is only possible with data resources (e.g. files) rather than physical resources. This approach involves copying the whole file prior to access and can seriously affect the responsiveness of the subject process.
3. Move the process to the resource (process migration). This approach has the advantage that it is applicable for use with all types of resource.

The sort task-type is used as a vehicle to compare the three approaches. Four different-sized input files are used in the experiments, as described in table 5.9-1. These are sometimes local to the process and sometimes remote. The output file is always directed to the original location of the input file.


file name | type of data | number of characters | number of lines
sortfile1 | text | 111,074 | 6,886
sortfile2 | text | 6,581,433 | 151,401
sortfile3 | text | 9,006,916 | 214,602
sortfile4 | text | 13,162,866 | 302,802

Table 5.9-1: Description of the input data files (sort)

Remote access

The effect of remote access to files on the response-time of sort is investigated. The results are shown in table 5.9-2.
process location | input file location | input file | response-time (Jiffys) | run (Jiffys) | ready (Jiffys) | disk (Jiffys) | network (Jiffys) | NFS (Jiffys) | slowdown factor
water | water | sortfile1 | 13 | 9 | 0 | 4 | 0 | 0 | 1.000
water | fire | sortfile1 | 137 | 12 | 4 | 1 | 14 | 105 | 0.095
water | water | sortfile2 | 1042 | 646 | 24 | 372 | 0 | 0 | 1.000
water | fire | sortfile2 | 7900 | 570 | 43 | 4 | 818 | 6460 | 0.132
water | water | sortfile3 | 2163 | 1260 | 57 | 846 | 0 | 0 | 1.000
water | fire | sortfile3 | 11238 | 1118 | 79 | 41 | 1031 | 8968 | 0.192
water | water | sortfile4 | 4401 | 1859 | 91 | 2450 | 0 | 0 | 1.000
water | fire | sortfile4 | 16407 | 1585 | 102 | 81 | 1542 | 13096 | 0.268
fire | fire | sortfile1 | 23 | 21 | 0 | 2 | 0 | 0 | 1.000
fire | water | sortfile1 | 124 | 15 | 1 | 0 | 4 | 103 | 0.185
fire | fire | sortfile2 | 1359 | 1115 | 218 | 25 | 0 | 0 | 1.000
fire | water | sortfile2 | 7397 | 1286 | 290 | 0 | 779 | 5039 | 0.184
fire | fire | sortfile3 | 2913 | 2059 | 751 | 102 | 0 | 0 | 1.000
fire | water | sortfile3 | 11675 | 1944 | 594 | 1 | 1030 | 8104 | 0.250
fire | fire | sortfile4 | 4816 | 3288 | 1146 | 381 | 0 | 0 | 1.000
fire | water | sortfile4 | 16896 | 2908 | 742 | 0 | 1544 | 11700 | 0.285

Table 5.9-2: Comparison of the effects of local and remote file access on the performance of sort

Note that when the input file is remote, the disk-state time is reduced and execution-time accumulates on the NFS-state accordingly. The resulting slowdown factors are found to be quite significant.

Move the resource

One alternative to remote access is to copy the file to the local node (i.e. the process execution site), access the file locally, and then copy the updated file back to the original file location. This avoids the remote access penalties illustrated in table 5.9-2. However, the approach introduces file copying costs. The costs of copying the files across the network (in terms of the task signatures for the copy utility cp) are shown in table 5.9-3.
copy from   copy to   file copied   response-time   run     ready   disk    NFS     memory used
                                    (Jiffys)        (Jiffys)(Jiffys)(Jiffys)(Jiffys)(Kilobytes)
fire        water     sortfile1     71              1       2       1       66      68
fire        water     sortfile2     3923            272     20      93      3538    68
fire        water     sortfile3     5344            380     31      68      4865    68
fire        water     sortfile4     7915            540     49      176     7150    68
water       fire      sortfile1     86              9       1       0       75      68
water       fire      sortfile2     4003            498     150     0       3354    68
water       fire      sortfile3     5508            677     245     0       4585    68
water       fire      sortfile4     7987            1267    374     4       6341    68

Table 5.9-3 File transfer costs represented in terms of the cp task signatures

The total costs for the move-the-resource approach are shown in table 5.9-4. These values are obtained by combining: 1, the costs of copying the input file to the process location; 2, the local access costs (from table 5.9-2); and 3, the costs of copying the output file back to the remote site. Note that the second copy stage is only needed when the data resource is modified by the process that uses it, as is the case with sort.
process location   input file (original location)   input file   response-time (Jiffys)
water              fire                             sortfile1    170
water              fire                             sortfile2    8968
water              fire                             sortfile3    13015
water              fire                             sortfile4    20303
fire               water                            sortfile1    180
fire               water                            sortfile2    9285
fire               water                            sortfile3    13765
fire               water                            sortfile4    20718

Table 5.9-4 Total costs for the move-the-resource approach (copy + local access + copy)

In this example, file-copy time makes up the largest component of the total time-costs of move-the-resource. Remote access is more efficient than move-the-resource in this example. However, the copy component of move-the-resource is a one-off cost, so a task that makes a sufficient number of repeated accesses to the same file will eventually gain by using this technique.

Move the process to the resource

The third approach, move-process-to-resource, is more efficient than move-the-resource when the cost of moving the process (once) is less than that of moving the resource (possibly twice). It can also be more efficient than remote access, but this depends on the number of access events that occur: the remote access approach is relatively more efficient when fewer accesses are required. The relative efficiency of the move-process-to-resource approach is affected by the choice of preemptive or non-preemptive process migration, the latter usually having much lower latency than the former. The load sharing scheme developed (see chapter 7) implements a non-preemptive task migration mechanism. The design of this mechanism is such that the migration delay between any directed pair of processing nodes is independent of the task-type involved. The average migration delay has been found to be 8 Jiffys. The highest recorded value, 12 Jiffys, is used in the calculations to ensure that findings are safe.
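To make the trade-off concrete, the sketch below (Python, illustrative only) recomputes the single-access totals for sortfile2 with the process arriving at 'water' and the file resident at 'fire', using the measured figures from tables 5.9-2 and 5.9-3 and the 12-Jiffy migration delay quoted above. The repeated-access extrapolation is an assumption added here, not a measured result: it treats the copy-in and copy-back of move-the-resource as one-off costs and assumes each further access costs the same as the measured single run.

    # Cost comparison of the three access strategies for sortfile2, in Jiffys.
    # The process arrives at 'water'; the input file resides at 'fire'.
    REMOTE_ACCESS  = 7900   # one sort run over NFS             (table 5.9-2)
    LOCAL_AT_WATER = 1042   # one sort run, file local at water (table 5.9-2)
    LOCAL_AT_FIRE  = 1359   # one sort run, file local at fire  (table 5.9-2)
    COPY_TO_WATER  = 3923   # cp fire -> water                  (table 5.9-3)
    COPY_BACK      = 4003   # cp water -> fire                  (table 5.9-3)
    MIGRATION      = 12     # worst-case task-transfer delay

    def remote(n):        return n * REMOTE_ACCESS
    def move_resource(n): return COPY_TO_WATER + n * LOCAL_AT_WATER + COPY_BACK
    def move_process(n):  return MIGRATION + n * LOCAL_AT_FIRE

    for n in (1, 2, 10, 30):
        print(n, remote(n), move_resource(n), move_process(n))
    # n = 1 reproduces the single-access totals of 7900, 8968 and 1371 Jiffys.

On these figures remote access is never the cheapest of the three; move-process-to-resource wins for small access counts, and move-the-resource overtakes it at roughly 25 repeated accesses because the copy places the file on the faster 'water' processor.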

The total costs for the move-process-to-resource approach are shown in table 5.9-5. These values are obtained by combining the costs of transferring the sort task (12 Jiffys) with the costs of local access to the file (from table 5.9-2).
process' arrival site   file location (process' execution site)   input file   response-time (Jiffys)
water                   fire                                      sortfile1    35
water                   fire                                      sortfile2    1371
water                   fire                                      sortfile3    2925
water                   fire                                      sortfile4    4828
fire                    water                                     sortfile1    25
fire                    water                                     sortfile2    1054
fire                    water                                     sortfile3    2175
fire                    water                                     sortfile4    4413

Table 5.9-5 Total costs for the move-process-to-resource approach (task transfer + local access)

Table 5.9-6 provides a summary of the costs of each approach, using direct access to a local resource as a control case. The figures shown are the averages over the four data files. Results are application-scenario dependent and are generated in the absence of load.
process location   local access to the file          remote access to     move-the-resource   move-process-to-resource
(initial)          (used as a control)               the file via NFS                         (process migration)
water              1904.75                           8920.50              10614.00            2289.75
fire               2277.75                           9023.00              10987.00            1916.75

Table 5.9-6 Comparison of the different approaches to access to remote resources (file example). Average performance over the four file sizes. All values in Jiffys.
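The summary figures in table 5.9-6 follow directly from the per-file response-times of the earlier tables. The short check below (Python, times in Jiffys, 'water' row only) is included purely to illustrate how the averages are formed; it is not part of the thesis tooling.

    # Reproducing the 'water' row of table 5.9-6 by averaging the per-file
    # response-times from tables 5.9-2, 5.9-4 and 5.9-5 for a process that is
    # initially located at 'water'.
    local_access  = [13, 1042, 2163, 4401]        # file already at water
    remote_access = [137, 7900, 11238, 16407]     # file at fire, accessed via NFS
    move_resource = [170, 8968, 13015, 20303]     # copy + local access + copy back
    move_process  = [35, 1371, 2925, 4828]        # migrate to fire + local access

    mean = lambda xs: sum(xs) / len(xs)
    print([mean(c) for c in (local_access, remote_access, move_resource, move_process)])
    # -> [1904.75, 8920.5, 10614.0, 2289.75]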

The low latency of the task-transfer mechanism of Concert favours process migration; the effective access times are only marginally worse than those of the control. Process migration is particularly favourable where the destination processor is faster than the arrival-node processor, as shown in table 5.9-6 ('water' having a faster processor than 'fire').

5.10 Parallel applications

In this section the model is used to investigate the behaviour of parallel applications in loosely-coupled systems. Provided that the granularity of the component tasks is large enough, such applications can operate efficiently.

Static allocation of application components to execution sites

For a performance-heterogeneous system it is possible to derive a formula that determines the best distribution of a problem, given a fixed number of nodes and a fixed number of application components.

To illustrate, a parallel version of the Monte-Carlo approximation of pi has been devised. The application is initially configured as 10 equal-sized components (2,000,000 samples each). The problem is to allocate the components over the processing nodes such that the response-time of the application as a whole is minimised. Note that this is a fixed-grain version, unlike that used earlier in section 5.3.2. The fixed-grain configuration is used to prevent the example allocation problem from becoming too simple. As a control, table 5.10-1 shows the performance of the equivalent non-parallel version of monte-carlo-pi (20,000,000 samples) at each node.
node    response-time (Jiffys)   run (Jiffys)   memory use (Kilobytes)
earth   23991                    23926          56
air     11932                    11902          56
fire    6767                     6755           40
water   2493                     2493           40

Table 5.10-1 Signatures of the non-parallel monte_carlo_pi application with 20 million sample points

Therefore, in the absence of load, the performance of the non-parallel task is optimised by placing it on 'water'. The issues involved in the scheduling of the parallel task are: 1, can the performance be further improved by placing some of the components on the slower nodes? and 2, how can the best placement pattern be found? The static allocation placement problem is stated as: given the different processing speeds of the nodes, and assuming no load at the nodes, find the best distribution of the application. The monte-carlo-pi task-type has been found to be 100% CPU-intensive (see table 5.3.2-1), therefore only differences in the CPU performance of each processing node need be considered. The CPU performance factors for the processing nodes are:
CF_earth = 1.0, CF_air = 2.14, CF_fire = 4.26 and CF_water = 10.84

A formula can be derived to find the optimum distribution of the application. This is stated as:
Let e = integer number of workers at earth
    a = integer number of workers at air
    f = integer number of workers at fire
    w = integer number of workers at water

Minimise the largest term in { e/CF_earth, a/CF_air, f/CF_fire, w/CF_water }        (5)

where e + a + f + w = number of equal-sized component tasks (10 in this example)
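As a concrete illustration, the sketch below (Python; a minimal sketch rather than part of the thesis tooling) evaluates formula (5) exhaustively over all integer allocations of the 10 components, using the CF values quoted above and, like the theoretical figures that follow, ignoring communication costs. It reports the allocation with the smallest largest-term together with the theoretical speedup relative to placing all 10 components on 'water'.

    # Exhaustive evaluation of formula (5) over integer allocations of 10 components.
    from itertools import product

    CF = {"earth": 1.0, "air": 2.14, "fire": 4.26, "water": 10.84}
    COMPONENTS = 10
    base = COMPONENTS / CF["water"]      # all components on the fastest node

    best = None
    for e, a, f in product(range(COMPONENTS + 1), repeat=3):
        w = COMPONENTS - e - a - f
        if w < 0:
            continue
        largest = max(e / CF["earth"], a / CF["air"], f / CF["fire"], w / CF["water"])
        if best is None or largest < best[0]:
            best = (largest, (e, a, f, w))

    print(best[1], round(base / best[0], 3))
    # -> (0, 1, 2, 7): earth 0, air 1, fire 2, water 7, with a theoretical
    #    speedup of 1.429, i.e. distribution 14 of table 5.10-2.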

To investigate the effect of different assignment patterns, a simple methodology is followed: 1, Primarily use the fastest node. 2, Progressively move components onto the slower nodes until no further improvement can be achieved. In all cases the farmer process (results collector) is executed at the 'water' node. Both theoretical and empirical results for a number of distributed configurations are shown in table 5.10-2. The theoretical results discount communication costs.
Distribution   Number of worker components on:   Theoretical speedup   Results-collector signature (executed on 'water')            Speedup
case           water   fire   air   earth        factor, from (5)      response-time   run   network   memory (KB)                  achieved
1              6       4      0     0            0.982                 2754            1     2753      40                           0.895
2              10      0      0     0            1.000                 2466            1     2465      40                           1.000 (base case)
3              7       1      1     1            0.923                 2457            1     2456      40                           1.004
4              5       3      1     1            0.923                 2436            2     2434      40                           1.012
5              6       2      1     1            0.923                 2435            1     2434      40                           1.013
6              6       2      2     0            0.987                 2434            2     2432      40                           1.013
7              5       3      2     0            0.987                 2433            1     2432      40                           1.014
8              5       2      2     1            0.923                 2429            1     2428      40                           1.015
9              9       1      0     0            1.111                 2225            2     2223      40                           1.108
10             6       3      1     0            1.310                 2058            3     2055      40                           1.198
11             7       3      0     0            1.310                 2056            1     2055      40                           1.199
12             8       2      0     0            1.250                 1982            1     1981      40                           1.244
13             8       1      1     0            1.250                 1979            1     1978      40                           1.246
14             7       2      1     0            1.429                 1743            1     1742      40                           1.415

(Response-time, run and network values in Jiffys.)

Table 5.10-2 Empirical results - fixed-grain, 2 million samples per worker

The imposed restriction (10 equal-sized chunks) is now lifted. The application becomes finely divisible, as the total number of samples to be computed (20 million) can be arbitrarily divided amongst the workers. In this way the distribution of load can be adjusted further, to map more closely onto the availability of processing time at nodes. Note that the fine divisibility of the application does not necessarily result in fine granularity. In this particular example each worker sends only its final partial result to the farmer upon completion; a single message. The methodology for the evaluation is to place one task component at each processing node; the sizes of the components are decided according to the formula:
Let e = number of samples to be computed by the worker at earth
    a = number of samples to be computed by the worker at air
    f = number of samples to be computed by the worker at fire
    w = number of samples to be computed by the worker at water

Minimise the largest term in { e/CF_earth, a/CF_air, f/CF_fire, w/CF_water }        (6)

where e + a + f + w = total number of samples to be computed (20,000,000 in this example)


The sample allocations derived by this means are given in table 5.10-3.
node    samples allocation, from (6)
earth   1,096,000
air     2,346,000
fire    4,671,000
water   11,887,000

Table 5.10-3 Theoretically optimum sample allocations to workers at nodes
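The allocations in table 5.10-3 follow directly from (6): the largest term is minimised when every term is equal, which makes each node's share proportional to its CF value. The sketch below (Python, using the CF values quoted earlier) is a minimal illustration of that calculation; small differences from the table arise from the rounding used in the thesis figures.

    # Closed-form solution of formula (6): equalise the terms, so each node's
    # share of the samples is proportional to its CF value.
    CF = {"earth": 1.0, "air": 2.14, "fire": 4.26, "water": 10.84}
    TOTAL = 20_000_000

    shares = {node: TOTAL * cf / sum(CF.values()) for node, cf in CF.items()}
    print({node: round(s) for node, s in shares.items()})
    # -> approximately the allocations of table 5.10-3

    # Theoretical speedup relative to running all samples on 'water',
    # ignoring communication overheads:
    print(round(sum(CF.values()) / CF["water"], 3))   # 1.683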

The speedup factor expected using this distribution is 1.683. As before, the theoretical result discounts communication overheads. The actual results achieved using the theoretically optimal distribution are shown in table 5.10-4.
number of samples allocated for computation by the worker on:   signature results for the results-collector component (executed on 'water')            speedup
water        fire        air         earth                      response-time (Jiffys)   run (Jiffys)   network (Jiffys)   memory use (KB)              achieved
11,887,000   4,671,000   2,346,000   1,096,000                  1596                     1              1595               40                           1.545

Table 5.10-4 Empirical results - theoretically optimum distribution

Using the same base performance as in table 5.10-2 (response-time = 2466 Jiffys), the speedup achieved through this more precise distribution is 1.545. The importance of performance-heterogeneity support in the model can be demonstrated by examining the effects of an even distribution (each node must compute the same number of samples). Such a distribution would occur if the nodes were assumed to have the same processing capability. The results are shown in table 5.10-5.
number of samples allocated for computation by the worker on:   signature results for the results-collector component of the application               speedup
water        fire        air         earth                      response-time (Jiffys)   run (Jiffys)   network (Jiffys)   memory use (KB)              achieved
5,000,000    5,000,000   5,000,000   5,000,000                  6034                     1              6033               40                           0.409

Table 5.10-5 Empirical results - even distribution

Using the same base performance as in table 5.10-2 (response-time = 2466 Jiffys), a significant slowdown is observed. The performance advantage of the faster nodes is lost because the final result cannot be computed until the slowest node has computed its share of the samples.

Dynamic allocation of application components to execution sites

The limitations of static allocation have been addressed in chapter 2. A dynamic load sharing policy should be capable of automatic, dynamic placement of the component tasks of parallel applications, taking into account the performance-heterogeneity of processing nodes and current load levels. The rich-information approach taken in this work permits the resource-use characteristics of the component tasks to be incorporated into the distribution strategy. Dynamic allocation is given detailed treatment in chapter 7, in which the performance of a number of load sharing policies is compared.

Load sharing within applications

The novel concept of developing applications to 'self load balance' is investigated in this section. The parallel version of the Monte-Carlo approximation of pi is re-engineered so that a farmer process allocates chunks of work to workers as they demand it (this is the same version as used in section 5.3.2). In this experiment a worker is placed at each processing node (in section 5.3.2 only a single worker was used). In this way the workers at the faster nodes handle more work, in direct proportion to the performance differences between the processing nodes. This approach has the additional benefits of automatically taking the current load levels at nodes into account and of being tolerant to the failure of all but one worker. Its disadvantages include: increased communications overhead; sensitivity to the reduced grain-sizes of the work chunks that are needed in order to permit the best possible distribution; greater complexity in the design of the application; and the fact that the approach is not universally applicable. The results achieved with a number of different configurations of the application are shown in table 5.10-6.
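As an illustration of the demand-driven pattern described above, the sketch below implements a single-machine analogue in Python using a process pool. The thesis experiments run separate workers on different NOW nodes under Concert, so the pool size, grain size and sample count here are purely illustrative; the point is only that idle workers come back for another chunk, so faster (or less loaded) workers automatically process a larger share of the samples.

    # Minimal single-machine sketch of the self-load-balancing farmer/worker
    # pattern: the farmer hands out fixed-size chunks on demand, so the split
    # of work adapts to worker speed without any explicit allocation formula.
    import random
    from multiprocessing import Pool

    def worker(samples):
        # one chunk of the Monte-Carlo estimate: count points inside the quarter circle
        hits = 0
        for _ in range(samples):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                hits += 1
        return hits

    def farmer(total_samples=2_000_000, grain=100_000, worker_count=4):
        chunks = [grain] * (total_samples // grain)
        with Pool(worker_count) as pool:
            # imap_unordered hands the next chunk to whichever worker asks for it
            hits = sum(pool.imap_unordered(worker, chunks))
        return 4.0 * hits / total_samples

    if __name__ == "__main__":
        print(farmer())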

farmer location        worker configuration                          grain size   response-time   run       network   memory use   speedup
                                                                     (samples)    (Jiffys)        (Jiffys)  (Jiffys)  (Kilobytes)  achieved
water                  one worker at water only                      2,000,000    2498            1         2497      64           1.000 (base case)
water                  one worker at each node                       2,000,000    1764            2         1762      64           1.416
water                  one worker at each node                       1,000,000    1637            3         1633      64           1.526
water                  one worker at each node                       500,000      1546            2         1542      64           1.616
water                  one worker at each node                       250,000      1549            4         1545      64           1.613
water                  one worker at each node (best performance)    100,000      1523            6         1514      64           1.640
water                  one worker at each node                       50,000       1534            28        1502      64           1.628
water                  one worker at each node                       20,000       1568            48        1503      64           1.593
water                  one worker at each node                       10,000       1630            90        1522      64           1.533
water                  one worker at each node                       5,000        1747            156       1564      64           1.430
water                  one worker at each node                       3,000        1909            273       1572      64           1.309
water                  one worker at each node                       2,000        2111            418       1604      64           1.183
water                  one worker at each node                       1,800        2292            450       1689      64           1.090
water                  one worker at each node                       1,600        2660            507       1979      64           0.939
water                  one worker at each node                       1,000        3607            802       2572      64           0.693
water                  one worker at each node                       500          4140            1600      2032      64           0.603
earth (slowest node)   one worker at each node                       100,000      1599            55        1539      80           1.562

Table 5.10-6 Performance results for various configurations of the self-distributing parallel application (farmer signatures)

The results show:

1. As the grain-size is initially reduced, the response-time of the application improves. This is because the work can be distributed across the nodes in better proportion at finer grain sizes.
2. The application becomes fine-grained (as defined in section 5.3.2) at a grain-size of about 1700, this being the point at which the communication overhead cancels out the performance gains achieved.
3. The final row in the table indicates that the location of the farmer is of low significance in determining the performance of the application.

Parallel applications summary

Parallel applications can execute efficiently in loosely-coupled systems. However, performance is highly sensitive to the granularity of the component tasks and to the distribution of those tasks over the available nodes. This indicates that:

1. It is beneficial to have available a clear description of the application's component tasks (resource-use, granularity, and longevity) at the time of scheduling.
2. The resource configuration and current workloads at processing nodes should be known at the time of scheduling.
3. A dynamic scheduler is required for systems in which the relative load level at processing nodes fluctuates.

5.11 Workload generation techniques

For the development, testing, and evaluation of load sharing schemes it is important that realistic workloads are used. Although it is possible to use real workloads by testing on a live system, this is generally not desirable. To solve the problem, various techniques for workload generation have been devised. This section examines these techniques and evaluates their effectiveness.

5.11.1 The minimalist approach

This section identifies example cases in which the performance of load sharing schemes is evaluated simplistically, implying a lack of formal workload analysis.

In [ZAY87] the efficiency of Accent's relocation mechanism is evaluated using seven "representative" tasks. The tasks are chosen to represent different behaviour, but there is no evidence to suggest that this small number of tasks is sufficiently representative of the entire task behaviour space.

Artificial workloads used during the development of MOSIX consisted of a number of identical "IO bound" tasks [BS85]. No definition of the term "IO bound" was provided. Such a workload represents only a fraction of the possible behaviour of tasks.

[PD97] compares five load sharing algorithms, taking into account communication costs. However, the evaluation is based entirely on the performance of a single task-type (matrix multiplication). Background workloads of varying intensity are used, but these only place load on the CPU.

In [AE87] only two task-types are used to evaluate remote execution efficiency: calculation of the lowest prime below 10,000,000; and compilation of the sort utility.

The PMS scheme is tested with a single task-type (a ray tracer) [FRE91]. The task-type is long-lived, having a response-time of approximately an hour, and is CPU and memory intensive. The testing is subjective, and not representative of general workloads.

The Monash scheme [LS97] is tested with three C programs identified only by their source-code size.

5.11.2 Simple workload descriptions

This section identifies example cases in which workloads are represented by simple descriptions.

In [KAR97] workloads are represented by two parameters: intensity, the average CPU utilisation on all hosts; and pattern, the standard deviation of the CPU utilisation on all hosts.

[DAN95] and [DL97] use four parameters to describe workloads: the task mean inter-arrival time and its coefficient of variation, and the task mean service-time and its coefficient of variation.

Such workload descriptions permit simulation of many different workload scenarios by adjusting the parameters. However, there are limits to the accuracy and scope of representation of real-world workloads achievable by the simulations. For example, it is not possible to represent the effect of specific mixes of tasks, each with their own resource requirements.

5.11.3 Detailed investigations

This section identifies cases in which workload generation is based on more-detailed workload analysis.

A performance evaluation of Stealth is described in [KC91]. Workloads are generated from scripts to simulate users. A series of six different tasks is executed in a loop, each task separated by a 5-second sleep to simulate user think-time. The script is also used to start and stop a single background workload task. The background task is run at low priority whilst the simulated-user tasks are run at high priority. This approach is quite realistic except that: 1, think-time should not be assumed constant; 2, workloads consisting of more than one background task should be supported; and 3, the extent to which artificially changing task priorities is realistic needs investigation.

In [FZ87] a number of task-execution scripts are used to simulate the work generation characteristics of users. From an extended study of task-type popularity in a unix system, 30 frequently used commands were selected for use in the scripts. The commands used are all of the user-utility category, including df, du, ps and wc. These tasks tend to have quite short execution times and by their nature are not generally good candidates for process migration. The scripts are grouped into high, moderate and low workload types based on the types of task used and the amount of delay between task executions, which is used to simulate user think-time. The workloads thus generated are representative specifically of user-interactive environments, in which a use-session typically consists of the execution of many short tasks.

In [KUN91] a single generic task is devised for generating workloads. Workloads are characterised in four dimensions: arrival process, CPU-time requirements, IO requirements, and memory requirements. The generic task: 1, allocates memory; 2, reads data from a file; 3, performs simple computation in a loop; and 4, writes the data records back to the file. The generic task is probabilistically configured to exhibit IO-intense, compute-intense, or balanced behaviour. Workloads are made up of a number of generic task instances. This approach is basically sound. However, it means that all tasks exhibit essentially the same behaviour, although differently parameterised. The probabilistic configuration is a barrier to repeatable experiments.

[KI92] presents a user-oriented workload generator in which workloads are generated using artificial tasks configured on the basis of real workload trace data. The workload generator is specifically designed to support file-system performance investigations and thus the task activity is focused on file-handling activities. Users can specify any particular workload distribution to model file usage.

5.11.4 Standard Workloads

There is a need for a set of standard workloads for load sharing evaluation. This would help overcome some of the difficulties of comparing the performance of different schemes, highlighted in section 2.3.10.


The Standard Performance Evaluation Corporation (SPEC) [SPEC00] provides benchmarks for the performance evaluation of a number of aspects of modern computer systems. Benchmarks are provided for the evaluation of hardware, for example CPUs and multi-processor architectures; and for the evaluation of software such as NFS, Java Virtual Machines, server-side Java and graphics-intensive applications. The High Performance Computing benchmark suite (HPC) consists of three very large applications suitable for stressing high-end computing systems. Of these, GAMESS (General Atomic and Molecular Electronic Structure System) is an explicitly parallel benchtest, components of which can be distributed to accomplish load balancing on high-performance tightly-coupled architectures.

The SPEC benchmarks are highly focussed, each based around a single task-type or a small number of task-types. Each benchmark emphasises the performance of a specific component of the system. The HPC benchmark suite is representative of the computation-intense workloads found in high-performance scientific computing, rather than of the workloads of general-purpose systems with a wide range of resource requirements and behavioural variance. In particular, short tasks and interactive tasks are omitted.

5.11.5 The workload generation approach adopted in this work

Artificially generated workloads that represent a wide range of behaviour are needed for the development and evaluation of the model described in chapter 4, and of the load sharing scheme and various load sharing policies described in chapter 7. A number of goals are first identified:

• A very wide space of task behaviour must be represented.
• The behaviour of task-types should be representative of real tasks.
• Variable workload intensities and mixes (in terms of resource utilisation) are required.
• Experiments should be automated.
• Experiments should be repeatable.

To ensure a wide space of behaviour, in terms of resource-type use and resource-use intensity, a mix of real and artificial tasks is used. Real tasks, i.e. those which solve real problems, are used because they capture real behaviour patterns. Artificial tasks that can be tailored to exhibit specific, predictable behaviour patterns are also used.

Performance evaluations involve measuring the response-time of specific subject tasks, executed in the foreground, in the presence of a workload of background tasks. Background workloads need to be quite stable to ensure repeatability of experiments and consistency within experiments. To this end, special versions of some task-types are adapted to have continuous, consistent behaviour. Varying numbers and mixes of the background tasks are used to create a wide range of workload intensities.

To facilitate automated, repeatable experiments, tasks are executed under controlled background workload conditions. Job-scripts are used which can start and stop any number and mix of background tasks, as well as starting the subject tasks under investigation. The load levels prior to subject task execution are recorded to a log file, along with the resource-use signature data obtained when the subject task completes. A whole series of tests can be combined into a single job-script, for example to evaluate the performance of a specific task-type under a wide range of background workload conditions, or to evaluate the performance of a large number of different tasks under consistent workload conditions.

During the development of the model, and of the subsequent load sharing scheme, experiments involving artificially generated workloads have been carried out for four main purposes: 1, to test the correctness of mechanisms; 2, to investigate the relative usefulness of various load metrics; 3, to investigate ways to predict the behaviour of tasks based on the past behaviour of the relevant task-type and current load levels; and 4, to investigate the performance of a number of load sharing policies.

5.12 Summary

This chapter has illustrated the importance of detailed workload analysis and the benefits offered in this regard by the rich-information approach, and the use of signatures in particular. The ways in which the model supports workload analysis have been identified.

Existing workload classification techniques have been analysed and weaknesses identified. A new framework for task and workload description, based on the rich-information approach, has been presented. The utility of the new model and framework has been demonstrated through the detailed analysis of sample tasks.

A number of significant issues concerning the nature of workloads in NOWs have been investigated. These include: the effects of different task mixes, task-arrival rates and response-times; causes and effects of turbulence in task behaviour; the relative suitability of different methods for accessing remote resources; and distribution issues specific to the scheduling of parallel and distributed applications.

Workload generation techniques have been investigated and the problems that can arise from using an inappropriate technique have been highlighted. A workload generation technique that combines the use of real and artificial tasks and facilitates automated, repeatable experiments has been described. This technique has been used during the development of the model and load sharing scheme.
