Document Sample

INFORMS Journal on Computing informs ® Vol. 23, No. 2, Spring 2011, pp. 174–188 issn 1091-9856 eissn 1526-5528 11 2302 0174 doi 10.1287/ijoc.1100.0392 © 2011 INFORMS Sequential Grid Computing: Models and Computational Experiments Sam Ransbotham Carroll School of Management, Boston College, Chestnut Hill, Massachusetts 02467, sam.ransbotham@bc.edu Ishwar Murthy Indian Institute of Management, Bangalore 560076, India, ishwar@iimb.ernet.in Sabyasachi Mitra, Sridhar Narasimhan College of Management, Georgia Institute of Technology, Atlanta, Georgia 30332 {saby.mitra@mgt.gatech.edu, sri.narasimhan@mgt.gatech.edu} T hrough recent technical advances, multiple resources can be connected to provide a computing grid for processing computationally intensive applications. We build on an approach, termed sequential grid comput- ing, that takes advantage of idle processing power by routing jobs that require lengthy processing through a sequence of processors. We present two models that solve the static and dynamic versions of the sequential grid scheduling problem for a single job. In the static and dynamic versions, the model maximizes a reward function tied to the probability of completion within service-level agreement parameters. In the dynamic version, the static model is modiﬁed to accommodate real-time deviations from the plan. We then extend the static model to accommodate multiple jobs. Extensive computational experiments highlight situations (a) where the models provide improvements over scheduling the job on a single processor and (b) where certain factors affect the quality of solutions obtained. Key words: grid computing; stochastic shortest path; dynamic programming History: Accepted by S. Raghavan, Area Editor for Telecommunications and Electronic Commerce; received October 2007; revised April 2009, February 2010; accepted March 2010. Published online in Articles in Advance July 2, 2010. 1. Introduction Grid computing is largely viewed in the literature Advances in technology have made it possible to con- as a mechanism for implementing parallel computing. nect numerous disparate systems to create a virtual In parallel computing, an application is written grid of computing resources that can be exploited to to execute on multiple machines concurrently by solve computationally intensive problems (Rosenberg dividing large computations into numerous smaller 2004). Known by various related terms such as grid calculations that are executed in parallel. By enabling computing, utility computing, and Web-based com- multiple machines to work on the application in puting, the concept has received signiﬁcant attention parallel, the total time taken for completion can be recently in the academic and practitioner litera- reduced signiﬁcantly. The topics addressed in the lit- ture (Bhargava and Sundaresan 2004, Kumar et al. erature on parallel grid computing include grid archi- 2009, Meliksetian et al. 2004, Shalf and Bethel 2003, tectures (Meliksetian et al. 2004), distributed data Stockinger 2006). The last few years have also wit- management (Venugopal et al. 2006), distributed pro- nessed the growth of computationally demanding cessing for biological and visualization applications applications, particularly in the scientiﬁc (Korpela (Hansen and Johnson 2003), reliability of grid archi- et al. 2001), biological (Deonier et al. 2005, Ellisman tectures (Levitin et al. 2006), task scheduling in a grid et al. 2004), and business (Krass 2003) ﬁelds, that are environment (Kaya and Aykanat 2006, Rosenberg impractical to perform on a single resource. Grid com- 2004), and market design for grid computing (Bapna puting has emerged as a cost-effective method for et al. 2006, 2008). providing an infrastructure for such computationally Unfortunately, widespread diffusion of parallel intensive applications, and several vendors (e.g., IBM, computing is not without impediments. In particular, Sun, and Hewlett-Packard) are developing technology the development of software that can take advantage to enable a grid computing environment (Chang et al. of grid resources is difﬁcult. As noted by Donald and 2004, Eilam et al. 2004). Martonosi (2006, p. 14), “Writing parallel programs 174 Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 175 is much more difﬁcult and costly than sequential future task assignments and send only the pertinent programming ” Furthermore, complexities such as software modules to those locations. Third, sequen- synchronization of access to resources and interpro- tial operations are recognized as one of the necessary cess communications in single-machine environments fundamental building blocks of modern systems (e.g., are exacerbated in the context of grid computing. van der Aalst and Kumar 2003). Thus, for applications According to Boeres and Rebello (2004, p. 426), speciﬁcally designed for parallel computing, sequen- “If writing efﬁcient programs for stable, dedicated tial grid algorithms can still augment the performance parallel machines is difﬁcult, for the grid the problem of application segments for which parallel algorithms is even harder. This factor alone is sufﬁcient to inhibit are not possible. Therefore, our perspective on sequen- the wide acceptance of grid computing.” In addition, tial grid computing is that it is not an alternative to even when parallel programs are feasible, the cost of parallel grid computing but rather another mechanism program conversions can be prohibitive (Donald and to accrue additional beneﬁts. Martonosi 2006). In this paper, we develop two models and atten- In this research, we explore another dimension of dant solution procedures for the sequential grid com- grid computing that avoids some of the grid imple- puting environment that optimally routes a single mentation obstacles described above and increases application or job based on the stochastic availabil- utilization of a grid infrastructure but has received ity of idle resources at each time period at each limited attention in the research literature. In addition processor. The ﬁrst model is a static model that to parallel processing, the completion time for com- determines at the start of processing which machine putationally intensive applications can be reduced will process the application in each time period. The by having machines work on an application sequen- second is a dynamic model that provides a policy tially over time. It is typical, particularly in a corpo- for allocating the job to a machine in each time rate setting, for different computing resources to have period based on the current status of the job. Both utilization rates that vary dramatically over time— models are benchmarked against a single machine machine utilization will experience peak periods as assignment in which the entire job is processed by well as lean periods that may not be concurrent for just one machine. The static model is computation- all machines. This is particularly true for grid net- ally efﬁcient, is simple to implement, and requires works that are geographically dispersed or that are little overhead information; the dynamic model is assigned to different functions within an organiza- computationally demanding and requires more over- tion. Even when machines are colocated in a central- head information but may provide superior solutions. ized data center, such as for a vendor providing on- Based on the static model, we also present an heuris- demand computing resources to a large number of tic for scheduling multiple jobs on the grid. clients, utilization rates of machines allotted to each client will vary based on client usage characteristics. This research makes two main contributions to the In such a situation, a large background application emerging literature on grid computing in the infor- can be routed sequentially through several machines mation systems area. First, although most optimal on the grid, thereby taking advantage of their lean task scheduling algorithms have focused on parallel periods to reduce overall job completion time. grid computing, we address the optimal schedul- Our research builds on this concept of sequential grid ing problem in the sequential grid computing envi- computing (Berten et al. 2006, Buyya et al. 2002, Sonmez ronment. Second, through extensive computational and Gursoy 2007, Yu and Buyya 2006). At ﬁrst glance, experiments, we characterize the conditions under sequential grid computing offers three primary advan- which sequential grid computing provides beneﬁts tages. First, unlike parallel grid computing, it is not over using a single machine to process the job, and we necessary to rewrite applications to take advantage of also identify the conditions under which the dynamic parallel processing—a major implementation bottle- model provides superior schedules compared with neck. In contrast, sequential grid computing requires the static model. These computational experiments an interface mechanism that allows for the software to provide a proof of concept for our sequential grid be processed by different machines at different times. computing scheduling models and yield insights into Second, relatively simple grid architectures can imple- when the beneﬁts associated with it are the most ment the concept in practice. Once it is estimated pronounced. where each task will be executed, the relevant soft- The rest of this paper is organized as follows. In §2, ware can be sent in advance to those locations. As each we describe the sequential grid computing environ- task is completed, its intermediate state can be sent ment and the task assignment problem in detail. In §3, to the location where the next task is to be executed. we focus on scheduling a single job and deﬁne static Even in environments where the computing resource and dynamic models along with their respective solu- availability is stochastic, one can predict the likely tion procedures. In §4, we extend the static model to Ransbotham et al.: Sequential Grid Computing 176 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS incorporate the scheduling of multiple jobs simulta- be selected after considering the trade-off between neously on the grid. In §§5 and 6, we illustrate the accuracy and computational efﬁciency. Smaller time effectiveness of the static and dynamic models using buckets allow the scheduling algorithms to incorpo- computational experiments with thousands of ran- rate idle CPU time estimates at a lower level of granu- domly generated problems that vary in complexity larity, but they increase computational effort. Because and requirements. Section 7 discusses our computa- idle CPU times in each time bucket are estimates, the tional results, and §8 concludes this paper. duration of a time bucket should reﬂect the periodic- ity of such estimates. If it is only possible to estimate idle CPU times hourly, there is no beneﬁt obtained 2. Sequential Grid Computing from time buckets that are less in duration. Environment Processors in a grid typically handle a large number Following the common architectures of grid comput- of internal (nongrid) tasks for which the CPU require- ing (e.g., Joseph et al. 2004, Meliksetian et al. 2004), we ments are small. Furthermore, the actual CPU cycles consider a centralized, software-based grid manager. required by these tasks, as well the number that would The grid manager sells idle computing resources to arrive in a given time period, is uncertain. Assuming one or more buyers who each have one or more com- that these tasks arrive independently, then because of puting jobs. Each such job k has an expected resource c the central limit theorem (Andersen and Dobri´ 1987), requirement of Uk central processing unit (CPU) cycles it follows that the CPU cycles utilized on a machine and a deadline of Dk time units from the time of in a given time period follow a normal distribution. submission. Similar to recent economic models for Because the total amount of CPU cycles available in grid computing (Bapna et al. 2006, 2008; Sonmez and any time period is ﬁxed, the idle CPU cycles avail- Gursoy 2007), we map the probability of job comple- able on a processor after processing all the nongrid tion into economic terms. The price (or reward) for tasks would also follow a normal distribution. Thus, completing a job k of size Uk within deadline Dk is we assume that in time period t on processor l, the idle labeled Rk . If job k is not completed within the dead- CPU cycles available (cl t ) follow a normal distribu- line, the grid manager incurs a penalty cost Ck as part tion with a mean of l t and a variance of l t and are of a service-level agreement (SLA). estimated by the grid manager from historical data. For example, banks perform several lengthy jobs at Consider that the grid manager has at her disposal the end of each day as batch processes (e.g., check a set M of m processors ( M = m) for executing a job k processing, daily interest calculations, automated requiring Uk CPU cycles with a deadline of Dk . Given clearinghouse transactions, etc.) Consider the end- the stochastic nature of resource availability, the grid of-day batch process for automated clearinghouse job may not be completed within the deadline. The (ACH) transactions. Because ACH transactions are grid manager assigns a job to processors in each time routinely processed, there is likely to be a reliable esti- period to maximize the expected net reward (ENR). mate of the processing time required based on the If, for some schedule sk , the probability of meeting the number of transactions. Because the account updates deadline is ℘ sk , then are needed before the start of the next business day, there is also an associated deadline. Furthermore, data ENR k = ℘ sk × Rk − 1 − ℘ sk × Ck (1) consistency requirements may dictate that the trans- For a single job, maximizing Equation (1) is the actions are processed in sequential order. If the pro- equivalent of maximizing ℘ sk . When the grid man- cessing is outsourced to a grid provider, the SLA ager must schedule multiple jobs, we assume that the would provide for payments and penalties based on time horizon is divided into similar time buckets, but completion within deadlines. Even if the processing the job deadlines (Dk ) may differ. The ENR for multi- is not outsourced, payments and penalties reﬂect the ple jobs is computed as follows; however, because the customer service, inconvenience, and reputation costs schedules are not independent, we cannot maximize associated with noncompletion within deadlines. Equation (2) by independently maximizing ℘ sk for Before accepting the job, the grid manager must each job: ﬁrst estimate the probability of completing the job. This requires estimating the resources available K (expressed in CPU cycles) on each machine on the ENR = ℘ sk × Rk − 1 − ℘ sk × Ck (2) k=1 grid. However, estimations of available resources vary not only by machine but also by time. To incorporate In the context of the banking ACH example, con- the variation of resource availability by time, we par- sider that the grid provider has access to a set of tition the time horizon Dk into Tk distinct time periods mainframes located around the world that are used (see Bapna et al. 2008 for a similar discrete treatment to process a variety of other online jobs for the bank, of time). The number of distinct time periods should such as teller and ATM transactions, account queries, Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 177 and internal bank systems processing. For an external Zhou 1999), total job time and makespan (Allahverdi grid provider, the mainframes can also be used to and Mittenthal 1995), and the number of successful process transactions for other clients. An estimate of jobs when machines are unreliable (Herbon et al. 2005, the processing time required for the ACH transac- Pinedo and Ross 1980). Although a review of this lit- tion batch job can be generated from historical data. erature is beyond the scope of this paper, there are Estimates of the CPU cycles available at each main- two unique characteristics of the sequential grid com- frame can also be generated from historical availability puting environment studied here. First, unlike the data. The time buckets can be of ﬁxed duration (e.g., stochastic shop-ﬂoor scheduling literature, we con- 30-minute intervals), or they can be adjusted to reﬂect sider a single large job that is routed from one proces- the periodicity of the estimates or ﬁne-tuned over time sor to another utilizing unused CPU cycles until the to optimize performance. Once an optimal schedule job is completed. Because processors on a grid may is determined, ACH transactions can be distributed be geographically dispersed or may serve different in advance to the mainframes based on the optimal clients of a grid provider, their peak utilizations do not schedule (with some overlap because the actual pro- occur concurrently. Thus, it is advantageous to route cessing may deviate from the optimal schedule as a single job through multiple processors to exploit a result of the stochastic nature of available CPU unused CPU cycles, a situation that does not have an cycles). At the end of each day, the grid manager can equivalent in the stochastic shop-ﬂoor scheduling lit- receive other batch jobs to process (e.g., for calculating erature. Second, we focus on optimizing completion account daily interest, check processing, etc.) and will probabilities within a speciﬁed time period. Existing need to optimize the schedules of multiple batch jobs research in the scheduling literature has not modeled simultaneously. The reward and penalty functions for both of these dimensions of relevance to the sequential each job will reﬂect the importance of the job and the grid computing environment. disutility from missing processing deadlines. 2.2. Network Representation We initially focus on the case where the grid manager 2.1. Related Literature is examining the request for a single job that needs There is a large body of research in the opera- to be processed on the grid, and we later extend the tions management and operations research areas on analysis to multiple jobs. For simplicity of presenta- stochastic shop-ﬂoor scheduling problems that are rel- tion, we drop the subscript k that denotes a job from evant to the sequential grid models described here our discussion. The problem can be represented on (Allahverdi and Mittenthal 1995, Baker and Scudder an acyclic directed network G N A . As shown in 1990, Balut 1973, Kise and Ibaraki 1983, Pinedo and Figure 1, the node set N is arranged into rows and Ross 1980). The basic problem studied in this litera- columns. The rows are labeled from 0 to T + 1. Rows 0 ture is that of scheduling a number of jobs on mul- and T + 1 consist of singleton nodes, with the former tiple machines with stochastic processing times and representing the source node (start of processing) and failure probabilities so as to optimize a variety of the latter representing the terminal node n (end of performance measures such as number of tardy jobs processing). Nodes belonging to rows 1 through T are (Balut 1973, Kise and Ibaraki 1983), earliness and tar- arranged into m columns, with each column associ- diness penalties (Baker and Scudder 1990, Cai and ated with a processor. Thus, a node at the intersection n Column 1 Column m Row T Li, j Row 2 m+2 Row 1 2 m+1 1 Figure 1 Network Representation of the Sequential Grid Ransbotham et al.: Sequential Grid Computing 178 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS of row t and column i (1 ≤ i ≤ m, 1 ≤ t ≤ T ) represents 3. Sequential Grid Models for a processor i in time period t. Hence, the node set N Single Job consists of mT + 2 nodes, numbered as shown. The arc set A includes continuation arcs and trans- 3.1. Problem Formulation fer arcs. First, there is an arc from source node 1 to We now construct two models to help the grid man- every node in row 1. Similarly, there is an arc from ager schedule a single job on the grid. We then describe solution procedures for optimally solving each node in row t to every node in row t + 1, where these two models. The ﬁrst is a static model that gen- t=1 T . Finally, an arc i j ∈ A with j lying at the erates a static schedule (a list of T processors, one intersection of column l and row t (t = 1 T and for each time period, to which the job is assigned). l=1 m) represents the action of assigning the job This schedule is sent with the job so that it can be to processor l in time period t. The length of each such routed by each processor at the end of each time arc (Li j ) represents the stochastic availability of CPU period. Thus, the overhead information that needs to cycles from processor l in time period t. A complete be transmitted with the job to implement the static path from node 1 to node n represents a schedule. model is minimal. The second is a dynamic model, the Let p denote a complete path and P denote the col- output of which is not a predetermined schedule but lection of all such paths p in G N A . We deﬁne an rather an optimal policy. Let ut represent the cumula- arc i j to be a continuation arc if the processors cor- tive CPU cycles obtained by the job thus far in time responding to nodes i and j in G N A are the same period t. The optimal policy speciﬁes, for each node and a transfer arc if the processors corresponding to i in G N A and for each possible value of ut at that and j in G N A are different. If i j is a continua- node (1 ≤ ut ≤ U ), the next processor that the job is tion arc with node j being associated with row (time assigned to in period t + 1. Clearly, to implement the period) t and column (processor) l, then its length Li j dynamic model, signiﬁcantly more control informa- corresponds to cl t deﬁned earlier and follows a nor- tion needs to be transmitted with the job. In addi- mal distribution with a mean of l t and a variance tion, the computational requirements of the dynamic model are several orders of magnitude higher. of l t . Also, if i j is a transfer arc, then there is a transfer cost fl1 l2 . We express the transfer cost in 3.2. The Static Model for Single-Job Assignment terms of processing cycles because processing time is In the discussion for scheduling a single job, we drop lost as a result of the transfer of the job from l1 to l2 . the subscript k (denoting the job) for simplicity of Hence, for such arcs, the length Li j has a mean of presentation. Let Lp denote the length of a path p in ( l t − fl1 l2 ) and a variance of l t . G N A ; i.e., In the context of our banking example, the columns Lp = Li j (3) in Figure 1 represent designated bank mainframes i j ∈p that are available to the grid manager for batch pro- We maximize ENR for which it sufﬁces to maxi- cessing. The rows are the time buckets into which the mize the probability of completion within the dead- batch processing duration is divided. For example, all line. Accordingly, the static model, denoted as PS-1 U.S. batch processing can be scheduled between mid- (indicating the scheduling of a single job), can now be night and 4:00 a.m. Eastern Standard Time, divided stated as selecting a path in G N A that maximizes into eight time buckets (rows) of 30 minutes dura- the probability of completion: tion each. It is important to note that although a job could potentially be transferred to another proces- PS-1 Maximize ℘ Lp ≥ U p ∈ P sor at any time by communicating its intermediate state (variable values, registers, temporary ﬁles, etc.), where ℘ denotes probability. In PS-1, because each path p ∈ P is composed of such transfers are signiﬁcantly easier at key transition arcs whose lengths are independent and normally dis- points (e.g., at the end of a module). Clearly, such tributed random variables, the path length Lp is also transition points may not coincide with time buckets normally distributed with a mean of p and a vari- because processing times are stochastic. However, if ance of p , where p = i j ∈p i j and p = i j ∈p i j . software programs are written in a modular fashion Because ℘ Lp ≥ U in (PS-1) is monotonic in the (a common software engineering practice), the grid √ reduced Gaussian, g p p = p − U / p , PS-1 is manager will need to wait a short time at the end of reduced to the deterministic equivalent of maximiz- a time bucket for the module to complete before tran- ing g p p over the set P . sitioning the job. Thus, deviations from the schedule To evaluate the computational complexity of PS-1, will be small and our algorithms are likely to provide we consider the decision version of PS-1—does there computationally efﬁcient approximations. exist a path p ∈ P such that g p p ≥ L? For the Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 179 special case when L = 0, we simply solve the longest- Lemma 3. The function g is quasi convex in and path problem on this acyclic graph with the arc length for all > U and > 0 (Case 1) and quasi concave in i j for each i j ∈ A. If the path length is greater and for all ≤ U and > 0 (Case 2). than or equal to U , then the answer is yes; other- wise, it is no. If L is not zero, then path variance 3.2.3. Pruning Rules. Based on the previously makes the problem more complex. When L > 0 and stated results, the algorithmic approach we use recog- the maximum mean path has a length that exceeds U , nizes and prunes as many subpaths as possible that Nikolova et al. (2006) provide a quasi-polynomial are not part of the optimal path. The algorithm incor- time whose running time is O n log n . Thus, the porates two basic approaches to pruning nonoptimal existence of a true polynomial-time algorithm for subpaths based on the lemmas: (a) local preference rela- such instances is an open question. If L < 0 and the tions (based on Lemma 2) and (b) upper bound compar- maximum mean path length is less than U , the deci- isons (based on Lemma 3). The pruning signiﬁcantly sion version of PS-1 is NP-complete (Karger et al. improves the performance of the stochastic shortest- 1997, Nikolova et al. 2006). path algorithm. 3.2.1. Best Single-Processor Assignment Algo- In pruning based on local preference relations, we use rithm. First, we describe an easy algorithm that we two rules based on the two cases identiﬁed above. refer to in the computational experiments as the best Consider p1 j and p2 j denoting two subpaths from single-processor assignment (PA-1). If the path p is node 1 (start node) to node j. The subpath p1 j dom- restricted to the use of continuation arcs alone, then inates p2 j if there is at least one feasible extension of determining the optimal path becomes easy. This is p1 j to node n that is at least as good as all feasible the case when the job is assigned to the same machine extensions of p2 j to node n. In such a case, the sub- over all time periods. Let P ⊂ P denote the subset of path p2 j can be discarded (pruned). The rules that paths that consists of continuation arcs alone. Note that determine the conditions when one subpath domi- because P = m, PA-1 can be solved quickly through nates another are referred to as local preference rela- enumeration: tions. The following two pruning rules are based on PA-1 Maximize ℘ Lp ≥ U p ∈ P Lemma 2 for Case 1 and Case 2, respectively; Rule 1 3.2.2. Characteristics of the Reduced Gaussian. applies for Case 1 and Rule 2 applies for Case 2. The solution method presented here for PS-1 is a Rule 1. The subpath p1 j dominates p2 j if modiﬁcation of the stochastic shortest-path algorithm (a) p1 j ≥ p2 j and (b) p1 j ≤ p2 j . (Murthy and Sarkar 1998) to suit the special struc- ture of PS-1. Lemmas 1, 2, and 3 present some results Rule 2. The subpath p1 j dominates p2 j if about the nature of the function g p p that will be (a) p1 j ≥ p2 j and (b) p1 j ≥ p2 j . used in our algorithm to solve PS-1. Some of these results are straightforward while others are based on In terms of pruning based on upper bound com- results that have appeared earlier in the literature, parisons, the basic algorithmic approach is to com- most notably in Henig (1990). For that reason, we pare the best extension of a newly created path pnew j state Lemmas 1, 2, and 3 here without proof. from node 1 to node j to a current best-known fea- Lemma 1. Consider two paths p1 p2 ∈ P such that sible path pI . If the best extension of pnew j results (i) p1 ≥ U and (ii) p2 < U . Then, g p1 p1 > in a path that is no better than pI , then pnew j can g p2 p2 for any p1 p2 > 0. be discarded. Let p j denote a path from node j to The signiﬁcance of this result is that if there exists node n (terminal node) whose mean and variance are ˆ a path p ∈ P whose p ≥ U , then all paths p whose denoted as j and j , respectively. The best extension mean length p < U can be ignored. The existence ˆ of pnew j can be obtained by solving the subproblem question can be answered efﬁciently by solving a sin- gle longest-path problem (not stochastic) on G N A SPS-1 Maximize g ¯ j ¯j p j ∈ P j using i j as arc lengths for each i j ∈ A. If the answer to the existence question is positive (Case 1), where ¯ j = new j + j , ¯j = new j + j , and P j is then we restrict our attention to only those paths p ∈ P the set of all feasible paths from j to node n (terminal whose p ≥ U . If the answer is negative (Case 2), then node). Of course, SPS-1 is as hard to solve as PS-1 we know that p < U for all p ∈ P . Thus, PS-1 is par- and hence we consider suitable relaxations of SPS-1 titioned into two dichotomous cases. that utilize Lemma 3 to obtain an upper bound on Lemma 2. The function g is increasing in and the best extension of pnew j by taking advantage of decreasing in for all > U and > 0 (Case 1) while the quasi-convex (Case 1) and quasi-concave (Case 2) increasing in and increasing in for all ≤ U and nature of g. This value is compared to a current best- > 0 (Case 2). feasible path pI and is pruned accordingly. Ransbotham et al.: Sequential Grid Computing 180 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 3.2.4. Algorithmic Approach for the Static Model. To solve PD-1 using dynamic programming, we For simplicity of presentation, we omit the details frame the recursive Bellman equation (see Equa- of the algorithm used to solve PS-1. The approach tion (4)) in the following way. Suppose that the value is based on a well-known labeling procedure (see function Ft i ui denotes the optimal ENR from stage Murthy and Sarkar 1998) that uses the pruning rules t onward given that, as represented by the state st , the described earlier. The procedure starts at node 1 and job is at node i, having obtained ui cumulative units proceeds towards node n, processing nodes sequen- of CPU thus far. Furthermore, let pi j k denote the tially. At each node, the procedure stores all the probability of obtaining k units of CPU from travers- nondominated paths from node 1 to that node. The ing arc i j , for k = 0 U − ui . Let the probabil- two pruning methods described earlier substantially ity of obtaining more than U − ui CPU cycles from improve the performance of the labeling procedure traversing arc i j be pi j U + . The recursive Bellman (Murthy and Sarkar 1998). When node n is reached, equation is the procedure picks the best path from the pruned set of nondominated paths. If ℘∗ is the corresponding U −ui optimal completion probability for the best path, the Ft i ui = Max pi j k × Ft+1 j ui + k i j ∈f i optimal ENR from the static model can be obtained k=0 from substituting ℘∗ in (1). + pi j U + × Ft+1 j U (4) 3.3. The Dynamic Model for Single-Job Assignment The term within the parentheses in (4) is the expected The dynamic model is a stochastic control problem value function in stage t + 1 if the grid manager that is solved using dynamic programming. To frame chooses to traverse link i j . The optimal value of this problem as a dynamic program, consider it as Ft i ui is obtained by choosing the link i j that consisting of T stages. At each stage t, the job is in maximizes this expected value. The recursive equa- state st , deﬁned by the tuple i ui , where i ∈ N is a tion is solved by working backward from the last node in G N A , and ui is the cumulative amount of row T . The boundary conditions that apply for all processing units obtained by the job thus far. Further- nodes j in row T is FT j k = −C (penalty for noncom- more, at stage t, imagine a random process t ∈ W pletion) for k = 0 U − 1 and FT j k = R (reward that generates the arc lengths ci j randomly from their for completion) for k ≥ U . The solution to PD-1 cor- respective distributions for each i j emanating out responds to the value function F1 (Allahverdi and of node i. Traversing arc i j corresponds to assign- Mittenthal 1995). The computational effort required ing the job on machine j from which the actual CPU to solve PD-1 using the recursive Equation (4) is time obtained is a random variable drawn from a nor- O n2 U 2 . In summary, the dynamic model develops a mal distribution with a mean of i j and a variance of policy that speciﬁes for each node i ∈ N and for each i j , the realization of which is known only after the value of ui ≤ U (where ui is the CPU cycles obtained grid manager has taken a decision. The grid manager thus far by the job) the machine where the job will now has to choose a decision xt from a feasible choice set, X st ; i.e., xt ∈ X st . Here, X st constitutes the be processed in the next time period. However, trans- forward star f i , the set of all arcs i j ∈ A that orig- mitting this policy to the distributed grid manager inate from i. Using a decision rule ht S × W → X, the software at each location requires more control infor- grid manager takes the decision xt , i.e., xt = ht st t , mation to be attached and signiﬁcantly greater com- which amounts to selecting an arc i j ∈ f i . As a putation time. result, the job moves to a new state st+1 in stage t + 1. 3.3.1. Comparing the Dynamic and Static The sequence of decision rules T = h0 h1 hT Models. The optimal policy obtained from solving constitutes a policy. In simple terms, the policy will PD-1 is superior to the optimal solution obtained specify for each node i ∈ N and for each value of from solving the static model (PS-1) because the ui ≤ U (i.e., for each state st ) the optimal decision xt dynamic model implicitly includes the static solution (which node in G N A to move to in the next stage). and therefore evolves a policy that is at least as good As a practical matter (because U is relatively large) as the static solution. To illustrate, consider the simple an approximation ui is assumed to take on a discrete graph shown in Figure 2 that illustrates the mean and set of values (or states), 0 1 U . Let denote the variance ( i j i j ) of the CPU obtained by traversing set of all feasible policies. Because of the ﬁniteness each arc i j . Suppose that the CPU required U = 45 of N , f i , and U , the state space S and the decision units. From the static model, the optimal path is set X are also ﬁnite. As a result, is also ﬁnite. Let 1 − 2 − 4 − 5 and not 1 − 2 − 3 − 5 because the the value function T be ENR as deﬁned in Equa- standard normal √ associated with path 1 − 2 − 4 − 5 tion (1). The dynamic programming model is is z1 = 60 − 45 / 73 = 1 76, whereas that associated √ PD-1 Maximize T T ∈ with 1 − 2 − 3 − 5 is z2 = 50 − 45 / 25 = 1 00. This Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 181 (25, 16) (5, 0) (20, 9) 3 1 2 5 (35, 64) 4 (5, 0) Figure 2 Static and Dynamic Paths for a Simple Graph implies that path 1 − 2 − 4 − 5 must be traversed the mean–variance pair l l can be obtained as irrespective of the actual CPU obtained upon arriving l = i j ∈pl i j and l = i j ∈pl i j . If a job k is at node 2. Instead, after reaching node 2, if it is assigned to machine l, then the probability of its com- √ discovered that 30 units have been obtained so far, pletion, ℘k l = Pr z ≤ l − Uk / l , can be deter- traversing path 2 − 3 − 5 would yield a better chance mined using the normal distribution. Accordingly, the of meeting the requirement of 45 units than the path ENR obtained from assigning job k to machine l can 2 − 4 − 5. The z value√ associated with the former is be determined as ENRk l = Rk + Ck ℘k l − Ck for each z = 30 + 25 + 5 − 45 / 16 = 3 75 and that associated √ k=1 K and l = 1 m. Because m K, addi- with the latter is z = 30 + 35 + 5 − 45 / 64 = 3 13. tional m − K dummy jobs are created whose ENR Therefore, the chance of meeting the deadline require- is zero when assigned to any machine l. PA-K can ment is better by following a policy that allows for be solved as the classical single assignment problem, varying the route based on information available at where m jobs are assigned to m processors so that the node 2. total ENR is maximized. 4. Sequential Grid Models for 4.2. Multiperiod Static Assignment Problem with Multiple Jobs Multiple Jobs We now examine the case where buyers approach the Although PA-K can be solved efﬁciently, the quality grid manager with requests for processing K jobs on of the solution obtained may not be good because it the grid with K > 1. Each job k requires Uk units and does not take advantage of sequential grid comput- carries a reward Rk if it is completed on time and a ing. We now consider the assignment of K jobs to K of penalty Ck otherwise. We assume that there are a suf- the m available machines while allowing the assign- ﬁcient number of processors on the grid; i.e., m K. ment to vary over the T periods. However, the assign- Furthermore, each processor can process only one ments over the T periods are determined a priori and grid-supplied job at a time. We consider two heuristic are hence static. Relating this problem to the graph approaches for scheduling the K jobs on the grid. Both in Figure 1, each job k traverses the acyclic network maximize the ENR (Equation (2)). The ﬁrst approach from node 1 to node n. Such a traversal amounts to is the single-period assignment problem, which is a assigning job k to different processors over T periods. direct extension of PA-1. Each job k is assigned to Because each processor can process at most one grid- a different processor l, and this assignment remains supplied job in each time period, the K paths are node unchanged over the entire duration of T time periods. disjoint except for node 1 and node n. The problem The second approach is the multiperiod static assign- then is to determine K node disjoint paths, one for ment problem and is a direct extension of the static each job, so that the total ENR is maximized. This model (PS-1). Each job k is assigned to a different problem is a direct extension of PS-1 to K jobs and is processor l, but each job is allowed to be processed therefore referred to as PS-K. by different processors in each time period. However, 4.2.1. Computational Complexity of PS-K. It can like PS-1, the schedule that is developed is considered be shown that problem PS-K is NP-hard. The deci- static because it does not change based on the state sion version of PS-K is as follows: Does there exist achieved at a node. K node disjoint paths in the acyclic graph G N A 4.1. Single-Period Assignment for Multiple Jobs such that the total ENR is at least W ? We show Because this problem is a direct extension of PA-1, in the Online Supplement (available at http://joc. we will refer to it as PA-K, where K jobs have to be pubs.informs.org/ecompanion.html) that the decision assigned to m different machines. As described for version of PS-K is NP-complete. Given an acyclic PA-1, let P ⊂ P denote the set of paths in G N A graph G N A , where each i j ∈ A has an integer- consisting of only continuation arcs. Hence, P = m valued arc length ci j , problem MaxMinD-K is deﬁned and traversing each such path amounts to the job as that of ﬁnding K node disjoint paths so that the being processed by a single speciﬁc machine. All path length of the longest path amongst these K paths pl ∈ P are node disjoint except for the starting paths is minimized, which is known to be NP-hard. and ending nodes. Associated with each path pl ∈ P , We show that the decision version of MaxMinD-K Ransbotham et al.: Sequential Grid Computing 182 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS reduces to an instance of the decision version of PS-K. 5.1. Parameters for Problem Instances The theorem is stated here without proof. The results excerpted for presentation are based on a subset of 3,100 instances with varying job sizes and Theorem 1. The decision version of PS-K for K ≥ 2 is estimates of the mean and variance of CPU cycles NP-complete (proof is in the Online Supplement). available at each processor in each time period. As a 4.2.2. An Efﬁcient Heuristic for PS-K. Because benchmark for the static and dynamic models we PS-K is shown to be NP-hard, it is reasonable to also estimated the probability of job completion for explore fast heuristics that derive good workable each of the 3,100 instances, assuming that the job was schedules. In the next section, we empirically explore assigned to the single best processor for all time peri- the following simple heuristic. The K jobs are sorted ods (PA-1). PA-1 estimates the best-case completion in decreasing order of Rk +Ck , the sum of the reward probability without sequential grid computing. The and penalty. It is assumed that this ordering is consis- mean CPU cycles available at the processor in each tent with the ordering by Uk ; that is, jobs with greater time period were randomly selected from a uniform computational requirements carry a greater price and distribution ( 95 105 units). The corresponding vari- penalty. The heuristic involves solving K number of ance was also selected from a uniform distribution PS-1 in sequence. The ﬁrst PS-1 problem solved uses ( 5 10 units). Transfer cost was ﬁxed at one CPU the original parameters i j i j for each i j ∈ A. cycle to evaluate situations where transfer costs are As a result, a static path is obtained where each inter- low because high transfer costs will simply impede mediate node corresponds to a machine assignment. sequential grid computing. To simulate peak loads After determining the ENR associated with the ﬁrst of machines, during one-third of randomly chosen job, the machines used are removed from considera- time periods the available CPU cycles were reduced tion for subsequent jobs. This process is repeated K to 20% of the maximum capacity. The metric of CPU times, after which we have schedules for all K jobs. cycles is intended to be an abstract relative measure of resources required to resources available rather than a speciﬁc absolute measure; the models can also be 5. Computational Results for applied to speciﬁc resources (e.g., CPU, memory, or Single-Job Models storage). Interestingly, a grid of 15 personal comput- To evaluate the performance of the static and dynamic ers was used to run the computational experiments. models for a single-job assignment, we coded the two models PS-1 and PD-1 using C++ and ran several 5.2. Model Performance and Job Size thousand instances using randomly generated input First, we investigated the effects of job size on the data. The purpose of our computational experiments relative performances of PS-1 and PD-1. We used the was twofold: (a) to understand the factors that affect CPU requirements of the submitted job as the focal the beneﬁts from sequential grid computing by com- metric. The grid was composed of 100 machines oper- paring the completion probabilities provided by the ating over ﬁve time periods. Figure 3 shows that the static and dynamic models (PS-1 and PD-1, respec- improvement over PA-1 is most pronounced within a tively) with that obtained by performing the job on range in the middle section of the ﬁgure, with a 100% the same machine (PA-1), and (b) to understand the maximum improvement in probability of completion. factors that affect the difference in completion prob- The intuition behind these results is straightforward. abilities obtained by the static versus dynamic mod- For a small job where the probability of completion els. The ﬁrst analysis determines the conditions when is nearly one, there is little beneﬁt in routing the job sequential grid computing provides the greatest ben- through multiple processors because even a single eﬁts, and the second analysis explores whether the processor provides good solutions. Conversely, when beneﬁts from the dynamic model outweigh its addi- tional complexity. 1.10 Probability of completion We focus on the impact of three characteristics of Single machine (PA-1) 0.90 Static plan (PS-1) the sequential grid-computing environment on com- Dynamic policy (PD-1) 0.70 pletion probabilities—(1) the job size (using the CPU cycles required as a representative metric), (2) the grid 0.50 resources available (using the number of processors 0.30 available as a representative metric), and (3) the het- 0.10 erogeneity of available grid resources (using the vari- ance of CPU cycles available at each processor as a –0.10 400 420 440 460 480 500 520 540 560 580 600 representative metric). These three factors capture key Job size (CPU cycles) differences in grid environments likely to affect the beneﬁts from sequential grid computing. Figure 3 Performance and Job Size Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 183 the job size is so high that even the static and dynamic able to achieve better than a 50% probability of com- models yield low probabilities of completion, there pletion. In contrast, with the static and dynamic is once again little beneﬁt from multiple proces- models, each new processor added to the grid sors. Within these extremes, sequential grid comput- provides incrementally more ﬂexibility to the grid ing provides signiﬁcant improvement over the sin- manager and quickly increases the probability of gle machine best case. Although these general results job completion to one. The improvement is high- hold for any variation in parameters of the peak est at smaller grid sizes and then diminishes but period, the beneﬁts of the sequential grid-computing still remains substantial throughout the experiments. models are more pronounced as either (a) the length Again, little performance difference is seen between of the peak period increases or (b) the resource avail- the static plan and dynamic policy. ability during the peak period is reduced. Little per- formance difference is seen between the static plan 5.4. Model Performance and Resource and dynamic policy. Heterogeneity We investigated the effects of the resource heterogene- 5.3. Model Performance and Resources Available ity on completion probability. In this experiment, we Next, we investigated the impact of the resources used the variance of CPU cycles available at each available to the grid manager on the performance processor in each time period as the focal metric. of the two models (PS-1 and PD-1) and the beneﬁts The grid was composed of 100 machines operating from sequential grid computing. We used the num- over ﬁve time periods. The variance of CPU cycles ber of processors on the grid as the focal metric. The available in each period for each processor was ran- job required 430 CPU cycles and the grid operated domly generated from an uniform distribution ( 1 V over ﬁve time periods. Based on these parameters, the units), where V is the value shown on the x axis of results depicted in Figure 4 show the improvement Figures 5, 6, and 7. This experiment explored three in probability of completion relative to PA-1, the best demand scenarios—low (425 CPU cycles required; single-machine case. see Figure 5), medium (475 CPU cycles required; see As the number of processors available increases, Figure 6), and high (525 CPU cycles required; see under the sequential grid models (both the static Figure 7). plan and dynamic policy), the completion probability For the results shown in Figure 5, a small enough increases dramatically at the initial stages (Figure 4) job demand was selected such that it was likely that while the improvement ﬂattens out as the completion a single machine had mean CPU cycles available to probability reaches close to one. On the other hand, complete the job. Thus, with low variance in CPU the completion probability in the single-processor cycles available, the probability of job completion in case exhibits slower stepwise improvement as the the single-machine best case is high. The probability number of available processors increases. The iden- of completion diminishes in the single-machine case tity of the best processor changes infrequently as pro- as the variance in CPU cycles increases. However, the cessors are added in the single-processor case. For sequential grid computing models are robust to the example, in our reported result, the 14th machine increase in variance because the grid manager is able added has a large capacity and dramatically increases to work around potential problems. the completion probability. This machine is selected For the results shown in Figure 6, a medium-sized in future samples because no subsequent machines job demand was selected such that it was unlikely match its capacity. The single-processor case is not that a single machine had mean CPU cycles available to complete the job, but there was still a relatively high availability of processing power on the grid com- 1.10 pared to the job size. The medium job size is indepen- dent of resource heterogeneity. The single-processor Probability of completion 0.90 case is unlikely to complete the job; the sequential 0.70 grid models are almost guaranteed to ﬁnish. Again, 0.50 there is minimal difference between the static plan and dynamic policy. 0.30 Single machine (PA-1) For the results shown in Figure 7, a high-demand Static plan (PS-1) 0.10 Dynamic policy (PD-1) job size was selected such that it was unlikely that a single machine had mean CPU cycles available – 0.10 0 10 20 30 40 50 to complete the job, and there was a relatively low Resources available (machine units) availability of processing power on the grid com- pared to the job size. Single-machine assignment is Figure 4 Performance and Resources unlikely to complete the job with any variation in Ransbotham et al.: Sequential Grid Computing 184 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 1.10 5.5. Comparison of a Dynamic Policy Probability of completion 0.90 vs. a Static Plan The computational experiments show deﬁnite evi- 0.70 dence of performance beneﬁts from using the 0.50 sequential grid models. However, in the majority of 0.30 cases, there were few differences between the static Single machine (PA-1) Static plan (PS-1) plan and the dynamic policy. The dynamic policy sub- 0.10 Dynamic policy (PD-1) sumes the static plan; therefore, it is possible to use – 0.10 0 10 20 30 40 50 60 the dynamic policy alone. Unfortunately, the dynamic Low demand: Resource heterogeneity (variance) policy requires signiﬁcantly more computational time and routing overhead. Figure 5 Performance and Heterogeneity with Low Demand To provide evidence of the contrast in computa- tional requirements, we investigated the effects of the instance size on the calculation runtimes of the 1.10 static (PS-1) and dynamic (PD-1) models. With a sim- Probability of completion 0.90 ilar setup as in previous experiments, this trial used Single machine (PA-1) ﬁve processing periods to complete a job of 500 CPU 0.70 Static plan (PS-1) Dynamic policy (PD-1) cycles. The results are shown in Figure 8. As expected, 0.50 PS-1 requires little processing time compared to PD-1. 0.30 Furthermore, as the problem size increases, the pro- cessing time for the dynamic model increases signif- 0.10 icantly while the corresponding processing times for – 0.10 the static model remains at a relatively constant low 0 10 20 30 40 50 level. For reference, the runtimes reported were found Medium demand: Resource heterogeneity (variance) using a 2.13 GHz Pentium processor with 2.0 GB of memory. Figure 6 Performance and Heterogeneity with Medium Demand Therefore, if sequential grid models are used, when should a grid manager select a dynamic policy over 1.10 a static plan? We investigated thousands of problem Single machine (PA-1) instances with varying parameters and discovered Probability of completion 0.90 Static plan (PS-1) Dynamic policy (PD-1) two situations when the dynamic policy can be 0.70 advantageous over the static plan—when the job is behind schedule and when this deviation occurs 0.50 ` during the early stages of the job. The experiments 0.30 described below have a similar setup as prior exper- iments with a grid of 50 machines available over 0.10 10 time periods for a job requiring 1 040 CPU cycles. – 0.10 We compare the probability of completion from the 0 10 20 30 40 50 High demand: Resource heterogeneity (variance) dynamic policy over the static plan. Figure 7 Performance and Heterogeneity with High Demand 2,500 Static plan (PS-1) resource heterogeneity; however, the sequential grid 2,000 Dynamic policy (PD-1) Runtime (seconds) models begin with a low probability of completion 1,500 but rapidly increase as the heterogeneity increases. The sequential models achieve 50% or higher prob- 1,000 ability of completion at higher levels of variance 500 in machine availability. In this scenario, again there is clear value to sequential grid computing as the 0 sequential models route the job intelligently through 0 10 20 30 40 50 60 70 80 the grid. Interestingly, the dynamic policy shows a –500 Problem size (machine units) slight increase in performance; we explore this differ- ence further in the next section. Figure 8 Runtime for the Static and Dynamic Models Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 185 5.6. Comparative Performance by Time 0.09 dynamic policy (PD-1) vs. static plan (PS-1) Increase in probability of completion using Period 1 Until Deadline 0.08 Period 5 First, we examine the relative performance of the Period 9 0.07 dynamic policy versus the static plan as the deadline for completion approaches and the job is behind or 0.06 ahead of schedule. At any time period t, let ut be the 0.05 cumulative amount of CPU time obtained by the job thus far. We quantify the deviation of the job from 0.04 plan through the variable Zt , deﬁned as the difference 0.03 between the expected value of the remaining avail- able CPU times on the static path (reduced by any 0.02 applicable transfer costs), T −1 cl k − fk k+1 , and the k=t+1 0.01 amount of processing required to complete the job, 0 U − ut , divided by the square root of the variance –4 –2 0 2 4 T −1 in the remaining available processing, k=t+1 vl k . Job status (Z-values) Thus, positive values of Zt represent a job ahead of schedule, and negative values of Zt represent a Figure 10 Comparison of Dynamic vs. Static by Job Status job behind schedule. For the static plan, we ﬁrst determine the overall static schedule and calculate because both models are likely to complete success- the probability of completion assuming that the job fully. Thus, for jobs that are behind schedule, the remains on the original static schedule irrespective of dynamic policy preserves more options for complet- the value of ut (and hence Zt ). For the dynamic plan, ing jobs until much later in the processing schedule. we use the value of ut to determine the new optimal path from the stored dynamic policy for that node. 5.7. Comparative Performance by Job Status Figure 9 depicts the increase in probability of com- Alternatively, we can view the results from the per- pletion from using the dynamic policy over the static spective of the relative performance of the dynamic plan for deviations that occur at different time peri- policy versus the static plan by the job status. Fig- ods. In Figure 9, we use the speciﬁc Zt -value shown to ure 10 depicts the increase in probability of comple- calculate ut and the corresponding completion proba- tion from using the dynamic policy over the static bilities from the static and dynamic models. For jobs plan for a range of job states (Zt -values). The data that are signiﬁcantly behind schedule (Zt 0), there are generated in exactly the same way as in Figure 9. is some increase in probability of completion by using However, in Figure 10, each line in represents the time the dynamic policy when the deviations occur during period in the processing schedule where the deviation early periods. However, as the deadline approaches occurs. When the deviation occurs early (period 1), there is little chance of recovery for either the dynamic the dynamic model provides improvements even policy or static plan. Alternatively, for jobs that are when the job is signiﬁcantly behind schedule (Zt 0). signiﬁcantly ahead of schedule, the dynamic policy When the deviation occurs during later periods, the provides little increase in probability of completion dynamic model shows improvements only when the deviation is small (Zt is close to 0). Overall, the value of the dynamic program is highest when the devia- dynamic policy (PD-1) vs. static plan (PS-1) Increase in probability of completion using 0.10 tions occur early in the processing schedule. 0.09 0.08 6. Model Performance with 0.07 0.06 Multiple Jobs We now consider the efﬁcacy of the static model 0.05 when multiple jobs are scheduled. We use the num- 0.04 ber of jobs available as the focal metric, keeping 0.03 the grid size constant. Because we do not evaluate 0.02 the dynamic alternative, we can consider larger grid Z = –1 Z=0 sizes. For the experiment reported, the grid contains 0.01 Z=1 100 machines evaluated across 10 time periods. Jobs 0 0 2 4 6 8 10 generated required an average of 1 040 CPU cycles. Time period Expected revenue for each job was set at US$2 per CPU cycle requested, and the penalty was allowed Figure 9 Comparison of Dynamic vs. Static by the Time Until Deadline to vary uniformly from $100 to $500. Transfer costs Ransbotham et al.: Sequential Grid Computing 186 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 160,000 Improvements over the single-processor best case. For Single machine 140,000 small jobs with relatively low CPU requirements, Expected net reward Static plan 120,000 the single-machine case provides a high completion 100,000 probability, and the improvements that result from 80,000 the sequential grid models (both static and dynamic) 60,000 are low. Likewise, for large jobs with low comple- 40,000 tion probability, the sequential grid models provide 20,000 little improvement because they too cannot com- 0 plete the job. Between these two extremes, the static 0 10 20 30 40 50 60 70 80 90 100 Number of jobs submitted and dynamic models provide signiﬁcant beneﬁts. Experiments indicate that when the single-processor Figure 11 Expected Net Reward for Multiple Jobs best case provides low probabilities of completion (around 20%), sequential grid models provide the 1,800 greatest beneﬁts, increasing the completion probabil- Expected net reward per job 1,600 ity to around 70%. 1,400 Increasing resources. As resources are added to a 1,200 sequential grid, completion probability increases dra- 1,000 matically but reaches saturation quickly. Unlike the 800 Single machine (PA-K) parallel grid environment, sequential grid models use 600 Static plan (PS-K) only one processor in each period. Thus, additional 400 resources initially increase the options available to the 200 0 grid manager, but the beneﬁts are muted as resources 0 10 20 30 40 50 60 70 80 90 100 increase further. In the single-processor best case, the Number of jobs submitted improvement in probability of completion is stepwise and idiosyncratic. An interesting implication is that Figure 12 Expected Net Reward per Job for Multiple Jobs in sequential grid computing, the grid size can be kept fairly small to obtain most of the beneﬁts with- were kept constant at one CPU cycle for any change out signiﬁcantly increasing the complexity for the grid of machine. As mentioned previously, the optimal manager. scheduling of multiple jobs (PS-K) is itself a difﬁcult Impact of heterogeneity in grid resources. As uncer- problem. For this experiment, a greedy heuristic was tainty regarding the idle capacity at each processor in used where jobs were scheduled on the grid sequen- each time period increases, the beneﬁts of the sequen- tially based on a descending order of potential rev- tial grid over the single-processor case increases. enue. This was compared to PA-K, i.e., where each of At low levels of demand relative to the capacity of a the K jobs was assigned to one of K (K ≤ m) machines. single machine, the increase is marginal. At medium The result is depicted in Figure 11. demand levels, the sequential models allow for job At very low numbers of requested jobs, the differ- completion irrespective of the resource heterogeneity. ences in expected net reward of the static plan rela- At high levels of demand, both the single-processor tive to the same machine assignment are fairly small. and sequential grid models can beneﬁt from increased Relatively quickly, however, the static plan is able to variability; however, the sequential models consis- take advantage of grid resources to provide increased tently outperform the single-processor case. positive ENR. Because negative ENR jobs would not Difference between dynamic and static models. be accepted, ENR increases with the number of jobs Although the dynamic model always provides submitted. However, Figure 12 depicts the ENR per superior solutions to the static model by design, job and illustrates the consistent superiority of the extensive experiments indicate that the difference static plan over single-machine assignment. The static is small. At the same time, the dynamic model is plan allows the grid manager to accept many jobs that computationally intensive, requires large overhead could not be accepted because of negative ENR in the information and goes through dramatically longer single-machine assignment case. runtimes. However, when a job is behind schedule especially during the early stages, the dynamic policy has greater ability to recover. The static model is 7. Discussion of Computational computationally efﬁcient and easy to implement, and Results it also provides good solutions. The computational results highlight several interest- Handling of multiple jobs. The sequential grid mod- ing observations about the sequential grid-computing els prove robust in multiple-job scenarios. The lim- environment. its of single-machine assignment are quickly reached Ransbotham et al.: Sequential Grid Computing INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS 187 while sequential grid models continue to extract addi- Baker, K. R., G. D. Scudder. 1990. Sequencing with earliness and tional value with each additional job considered. Fur- tardiness penalties: A review. Oper. Res. 38(1) 22–36. Balut, S. J. 1973. Scheduling to minimize the number of late jobs thermore, the average per-job expected net reward when set-up and processing times are uncertain. Management is stable and consistently higher than single-machine Sci. 19(11) 1283–1288. assignment. Bapna, R., S. Das, R. Garﬁnkel, J. Stallaert. 2006. A continuous auc- tion model for stochastic grid resource pricing and allocation. Workshop Inform. Tech. Systems WITS , Milwaukee, Association 8. Conclusions of Information Systems, Atlanta, 1–6. In this paper, we deﬁned a grid-computing model Bapna, R., S. Das, R. Garﬁnkel, J. Stallaert. 2008. A market design (termed sequential grid computing) that has signiﬁ- for grid computing. INFORMS J. Comput. 20(1) 100–111. Berten, V., J. Goossens, E. Jeannot. 2006. On the distribution cant advantages in processing large jobs. In sequen- of sequential jobs in random brokering for heterogeneous tial grid computing, a computationally intensive job compuational grids. IEEE Trans. Parallel Distrib. Systems 17(2) is routed through several processors toward comple- 113–124. tion but is assigned to one processor during each Bhargava, H. K., S. Sundaresan. 2004. Computing as utility: Manag- ing availability, commitment, and pricing through contingent time period. We also deﬁned two models (static and bid auctions. J. Management Inform. Systems 21(2) 201–227. dynamic) that solve the routing problem associated Boeres, C., V. E. F. Rebello. 2004. EasyGrid: Towards a framework with sequential grid computing—that is, determining for the automatic Grid enabling of legacy MPI applications. the processor to which the job is assigned for each Concurrency Comput.: Practice Experience 16(5) 425–432. Buyya, R., D. Abramson, J. Giddy, H. Stockinger. 2002. Eco- time period. The static model is computationally efﬁ- nomic models for resource management and scheduling in cient and easy to implement, and it also provides grid computing. Concurrency Comput.: Practice Experience 14(13– good solutions under a variety of conditions, whereas 15) 1507–1542. the dynamic model is computationally intensive and Cai, X., S. Zhou. 1999. Stochastic scheduling on parallel machines subject to random breakdowns to minimize expected costs for requires more overhead information to be transmitted earliness and tardy jobs. Oper. Res. 47(3) 422–437. with the job. Our computational experiments provide Chang, K., A. Dasari, H. Madduri, A. Mendoza, J. Mims. 2004. evidence that the sequential grid computing models Design of an enablement process for on demand applications. have signiﬁcant beneﬁt when compared to the single- IBM Systems J. 43(1) 190–203. Deonier, R. C., S. Tavaré, M. S. Waterman. 2005. Computational processor best case. Genome Analysis: An Introduction. Springer, New York. The research can be extended in several ways. First, Donald, J., M. Martonosi. 2006. An efﬁcient, practical paralleliza- although we have shown the beneﬁts of sequential tion methodology for multicore architecture simulation. IEEE grid computing and provided a proof of concept, the Comput. Architecture Lett. 5(2) 14–17. Eilam, T., K. Appleby, J. Breh, G. Breiter, H. Daur, S. A. Fakhouri, software architecture and protocols required to imple- G. D. H. Hunt et al. 2004. Using a utility computing framework ment the environment are a signiﬁcant future research to develop utility systems. IBM Systems J. 43(1) 97–120. issue. Second, we have assumed ﬁxed time buckets in Ellisman, M., M. Brady, D. Hart, F.-P. Lin, M. Müller, L. Smarr. 2004. both the static and dynamic models. Determining the The emerging role of biogrids. Comm. ACM 47(11) 52–57. Hansen, C., C. Johnson. 2003. Graphics applications for grid com- optimal size of the scheduling time interval is a difﬁ- puting. IEEE Comput. Graph. Appl. 23(2) 20–21. cult research problem and will depend on the modu- Henig, M. I. 1990. Risk criteria in a stochastic knapsack problem. larity of the application, the transfer cost, and the size Oper. Res. 38(5) 820–825. of the state information that must be transmitted with Herbon, A., E. Khmelnitsky, I. Ben-Gal. 2005. Using a pseudo- stochastic approach for multiple-parts scheduling on an unre- the job. Third, the procedures and protocols required liable machine. IIE Trans. 37(3) 189–199. to implement the models in a peer-to-peer environ- Joseph, J., M. Ernest, C. Fellenstein. 2004. Evolution of grid com- ment (without a centralized grid manager) are also puting architecture and grid adoption models. IBM Systems J. future research issues and are of particular relevance 43(4) 624–645. Karger, D., R. Motwani, G. D. S. Ramkumar. 1997. On approximat- in the Internet environment. Fourth, we have assumed ing the longest path in a graph. Algorithmica 18(1) 82–98. independence of available processing times at each Kaya, K., C. Aykanat. 2006. Iterative-improvement-based heuristics processor, an assumption that may not be realistic if for adaptive scheduling of tasks sharing ﬁles on heterogeneous processor failures or nongrid jobs are related. Fifth, the master-slave environments. IEEE Trans. Parallel Distrib. Systems 17(8) 883–896. amount of CPU time required by a job may be difﬁcult Kise, H., T. Ibaraki. 1983. On Baluts algorithm and NP-completeness to determine and can be treated as a random variable for a chance-constrained scheduling problem. Management Sci. in the models. Finally, models that combine parallel 29(3) 384–388. and sequential grid computing will enable the beneﬁts Korpela, E., D. Werthimer, D. Anderson, J. Cobb, M. Leboisky. 2001. SETI@home—Massively distributed computing for SETI. Com- of both grid computing paradigms. put. Sci. Engrg. 3 78–83. Krass, P. 2003. Grid computing. CFO.com (November 17), http:// www.cfo.com/article.cfm/3010943. References Kumar, S., K. Dutta, V. Mookerjee. 2009. Maximizing business value Allahverdi, A., J. Mittenthal. 1995. Scheduling on a two-machine by optimal assignment of jobs to resources in grid computing. ﬂowshop subject to random breakdowns with a makespan Eur. J. Oper. Res. 194(3) 856–872. objective function. Eur. J. Oper. Res. 81(2) 376–387. Levitin, G, Y.-S. Dai, H. Ben-Haim. 2006. Reliability and perfor- Andersen, N. T., V. Dobri´ . 1987. The central limit theorem for c mance of star topology grid service with precedence con- stochastic processes. Ann. Probab. 15(1) 164–177. straints on subtask execution. IEEE Trans. Reliab. 55(3) 507–515. Ransbotham et al.: Sequential Grid Computing 188 INFORMS Journal on Computing 23(2), pp. 174–188, © 2011 INFORMS Meliksetian, D. S., J.-P. Prost, A. S. Bahl, I. Boutboul, D. P. Currier, Shalf, J., E. W. Bethel. 2003. The grid and future visualization sys- S. Fibra, J.-Y. Girard et al. 2004. Design and implementation of tem architectures. IEEE Comput. Graph. Appl. 23(2) 6–9. an enterprise grid. IBM Systems J. 43(4) 646–664. Sonmez, O. O., A. Gursoy. 2007. A novel economic-based schedul- Murthy, I., S. Sarkar. 1998. Stochastic shortest path problems with ing heuristic for computational grids. Internat. J. High Perfor- piecewise-linear concave utility functions. Management Sci. mance Comput. Appl. 21(1) 21–29. 44(11, Part 2) S125–S136. Stockinger, H. 2006. Grid computing: A critical discussion on busi- Nikolova, E., J. A. Kelner, M. Brand, M. Mitzenmacher. 2006. ness applicability. IEEE Distrib. Systems Online 7(6) 1–8. Stochastic shortest paths via quasi-convex maximization. Proc. van der Aalst, W. M. P., A. Kumar. 2003. XML-based schema deﬁ- 2006 Eur. Sympos. Algorithms ESA ’06 , Zurich. Lecture Notes in Computer Science, Vol. 4168. Springer, Berlin, 552–563. nition for support of interorganizational workﬂow. Inform. Sys- Pinedo, M. L., S. M. Ross. 1980. Scheduling jobs subject to tems Res. 14(1) 23–46. nonhomogeneous Poisson shocks. Management Sci. 26(12) Venugopal, S., R. Buyya, K. Ramamohanarao. 2006. A taxonomy 1250–1257. of data grids for distributed data sharing, management, and Rosenberg, A. L. 2004. On scheduling mesh-structured computa- processing. ACM Comput. Surv. 38(1) Article 3. tions for Internet-based computing. IEEE Trans. Comput. 53(9) Yu, J., R. Buyya. 2006. A taxonomy of workﬂow management sys- 1176–1186. tems for grid computing. J. Grid Comput. 3(3–4) 171–200.

DOCUMENT INFO

Shared By:

Categories:

Tags:

Stats:

views: | 7 |

posted: | 10/4/2011 |

language: | English |

pages: | 15 |

OTHER DOCS BY n.rajbharath

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.