Fault-tolerant distributed global computations against malicious attacks

Majid Khonji
September 10, 2009

Master Thesis, in partial fulfillment of the requirements for the degree of Master of Security, Cryptology and Coding of Information Systems, Ensimag. Supervised by Jean-Louis Roch & Clément Pernet.

Abstract

Global computing platforms provide huge computing power that can be used for several applications. However, the platform has some fault ratio and may be targeted by attacks. Throughout this work, we study different ways of performing secure parallel computations over untrusted resources, using algorithm-based fault tolerance (ABFT) to correct erroneous results. We focus our study on the case of modular computations in a Global Computing System (GCS). We may compute a function modulo several primes in parallel and then lift the residues to the original result. However, the final result could be corrupted due to errors or byzantine faults; therefore we use a CRT-based ABFT. In order to tolerate byzantine faults, we need to add some redundancy to the computation in an efficient way, so as to assure result correctness without wasting too many resources. In this work, we study different possibilities of performing these computations on different types of resources, and study the various risks involved. We propose an online scheme for modular computation which computes certified results of user functions using a minimum of trusted resources. The model is supported with a detailed analysis and a risk study for several types of failures and attacks. Finally, we provide a design specification and an implementation of the proposed model, with an example computing the matrix determinant.

Preface

This paper is the result of our master thesis project carried out at the LIG lab in Montbonnot, in partial fulfillment of the requirements for the degree of Master of Security, Cryptology and Coding of Information Systems at Ensimag, France.
Laboratoire d'Informatique de Grenoble (LIG) is a research center in parallel and distributed computing. The master thesis is a joint project with Thomas Stalinski, who studied and developed ABFT during his master thesis; his algorithms are used throughout this project. I would like to thank my supervisors Jean-Louis Roch and Clément Pernet for their invaluable assistance and feedback throughout the project. A great deal of thanks is extended to Salim Ouari for the fruitful discussions about this work. Many thanks to Thomas Stalinski and all members of LIG for their ultimate cooperation and support.

Contents

1 Background on distributed systems problems & ABFT
  1.1 Models
  1.2 Fault models
  1.3 Distributed problems
    1.3.1 Consensus
    1.3.2 Byzantine agreement
    1.3.3 Resource allocation problems
  1.4 Impossibility of consensus and byzantine agreement in some models
  1.5 Algorithm-based fault tolerance (ABFT)
    1.5.1 Fault-tolerant polynomial interpolation
    1.5.2 Error correction by Chinese Remainder Theorem
2 Related works & positioning our contribution
  2.1 Related works
    2.1.1 Checkpoint and restart
    2.1.2 Replication based (voting)
    2.1.3 Spot-checking & blacklisting
    2.1.4 ABFT and GCS defects
  2.2 Our contribution
    2.2.1 Current limitations
    2.2.2 Contribution rationale
    2.2.3 General assumptions
3 Theoretical study: efficient modular computations using minimum reliable resources
  3.1 Types of resources
    3.1.1 Untrusted resource
    3.1.2 Semi-trusted resources
    3.1.3 Trusted resources
  3.2 Why ABFT instead of replication?
  3.3 Modular function parallelism
  3.4 Impossibility of getting a lifted result from modular computations using only untrusted resources
    3.4.1 Modular agreement problem
    3.4.2 Byzantine reduction
  3.5 Possibility of getting a lifted result from modular computations using one trusted resource
  3.6 Inefficiency of using untrusted resources for decoding and trusted for certifying
4 Proposed online post-certification model for modular computation
  4.1 Why an online model?
  4.2 Model architecture
  4.3 Component details
    4.3.1 Front-end
    4.3.2 Decoder
    4.3.3 Data-proxy
    4.3.4 Certifier
    4.3.5 Public GCS
  4.4 Component interactions
    4.4.1 Control flow
    4.4.2 Data buffering
5 Model analysis
  5.1 Communication channels risk analysis & man-in-the-middle attacks
    5.1.1 Front-end – Decoder
    5.1.2 Front-end – Certifier
    5.1.3 Front-end – Public GCS
    5.1.4 Front-end – Data-proxy
    5.1.5 Decoder – Data-proxy
    5.1.6 Within GCS
  5.2 Impacts of resources failure
    5.2.1 Untrusted resources failure
    5.2.2 Semi-trusted resources failure
    5.2.3 Trusted resources failure
6 Design specifications
  6.1 Components structure
  6.2 Class diagrams
  6.3 Interface interactions
7 Implementation & analysis
  7.1 Kaapi library
  7.2 Grid5000
  7.3 Online certifier
    7.3.1 Deployment over Grid5000
    7.3.2 Package structure
  7.4 Installing & using online-certifier
  7.5 How to write a user function (Fibonacci function example)
  7.6 Read/write atomicity
  7.7 Prime generation
  7.8 Implementation amortize technique
  7.9 An application of matrix determinant computations
  7.10 Current limitations

Bibliography

List of Figures

0.1 Overview of cloud computing
2.1 Error rate of majority voting for various values of m and f [16]
4.1 System architecture on different types of resources
4.2 Job growth by doubling
4.3 Classical components interaction
6.1 Deployment diagram
6.2 Communicators class diagram
6.3 User functions class diagram
6.4 Certifier
6.5 Public GCS
6.6 Interface interactions through communicators and the data types exchanged
7.1 Grid5000 ssh access
7.2 Online certifier logo
7.3 Online certifier deployment on Grid5000
7.4 Different amortize functions
7.5 ρ-amortize vs f

List of Algorithms

1.1 CRT
3.1 Byzantine agreement reduction
3.2 Certification algorithm
4.1 Basic front-end algorithm
4.2 Decoder manager algorithm

Introduction

Beyond the "More than Moore" law, parallelism has become a standard: personal computers use several cores to increase computing power, rather than just speeding up the clock as in the past. On a larger scale, we have clusters and grid computing, where multiple machines cooperate to produce an even bigger computational power. Global Computing Systems (GCS) consist of several geographically scattered computational nodes that behave as a single computational entity. These nodes can be volunteers, who donate extra CPU cycles that add to the overall GCS power.

Global Computing Systems are becoming more and more effective, since they can provide huge and relatively cheap computing power which can be used to solve problems in several fields, such as mathematical problems and cryptographic challenges. One popular example is distributed.net, which solved the RSA RC5-56 challenge in 1997 using thousands of volunteers' personal computers[1]. Another example is the Folding@home project, which aims to solve problems related to protein folding behavior, in order to understand and cure several human diseases. This network utilizes volunteer machines such as PS3 nodes, PCs and GPUs running a simple client application which harvests only spare processing power. The PS3 network alone reached 1 PetaFlop with about 799,103 PS3s running the Folding@home client; the overall network has achieved around 4.6 PetaFlops[2].

Such power can also support a business demand-and-supply chain. For example, one can rent 100 machines to operate for one hour on some heavy computation, rather than buying one machine to run for 100 hours. Another business model is to pay volunteers for their CPU time, as is done in several startup companies[16].

Cloud computing is a new trend in the computing industry, where software is provided as a service.
End users no longer need to worry about the internal structure of the hardware, or about the way the software is deployed; they simply use the service and pay according to usage. The term cloud includes both the software provided and the underlying hardware. Figure 0.1 shows an overview of cloud computing (the image is taken from http://en.wikipedia.org/wiki/Cloud_computing [30 August 2009]). A cloud is a Public Cloud when it is made available in a pay-as-you-go manner to the general public. The term Private Cloud refers to an internal data-center (hardware & software), or to other organizations' services which are not available to the general public. From the hardware point of view, cloud users have the ability to pay for computing power on a short-term basis (e.g., CPU by the hour and storage by the day). They can rent resources when they need them, and release them when they are no longer useful[21].

Figure 0.1: Overview of cloud computing

Global Computing Systems have many forms and structures, from a master-slave structure such as SETI@home[3], to fully decentralized p2p as mostly seen in file sharing systems. Moreover, some grids consist of fully controlled computing nodes, whereas others consist of uncontrolled nodes. The uncontrolled environment raises an important reliability issue: such a powerful system could produce bad computations due to malicious attacks, which must be handled by the system. These attacks cannot be completely contained using traditional means of cryptography and network security, whereas in traditional controlled grid systems these techniques can be sufficient. In order to enhance reliability in such an uncontrolled environment, there are basically two strategies: a priori prevention and a posteriori verification. Prevention forces the use of proper software and files.

At the user level, one can use code encryption to track individual executions, for instance by computing checksums at various levels of the code[4]. However, code encryption cannot protect against the use of correct code on faulty inputs, which can be crafted smartly by an attacker; therefore, it should be applied judiciously to each specific environment. At the system level, prevention could be achieved by embedding digital rights management (DRM) technology inside the machines' hardware, so that application vendors can lock out users upon any inappropriate usage violation. An example of system prevention is the TCPA/Palladium initiative[5]. However, many people are against system prevention, as they consider it an infringement of their freedom. The other strategy, a posteriori verification, which is the one considered in this paper, can be described as a set of tests on the final computations which decides whether the global computations are correct, within an acceptable error rate and with a bounded error probability.

ABFT (algorithm-based fault tolerance) consists in introducing redundancy into the computations in a clever way, avoiding brute-force replication; the objective is to tolerate byzantine, possibly malicious, errors. This technique is especially well suited to large-scale parallel and distributed computations that run on resources that cannot be blindly trusted, such as peer-to-peer or cloud computing. In particular, ABFT is well suited to compute-intensive applications in arithmetic, cryptology or linear algebra, where ABFT relies on the algebraic properties of some codes, like Reed-Solomon for instance.

Global computing systems should always tolerate an error margin in the global results, which might be due to network failures or disconnected nodes. However, the results could also be massively attacked, and an a posteriori check is then needed to ensure that the ABFT can still correct the errors and obtain the final computation.
On the other hand, these tests have some cost and should be performed on trusted resources, which are known to be limited and costly. Therefore, a broad question is how we can make the most of the power of untrusted resources to perform certified computations, regardless of the type of fault or attack, while keeping the use of trusted resources as limited as possible. In this work, we want to construct a model which uses as few trusted resources as possible and enables the end user to perform certified computations transparently, without the need to worry about internal details, the underlying network, or error correction.

This thesis consists of 7 chapters. Chapter 1 briefly provides the required background for this thesis, such as distributed system models and problems, together with examples of ABFT and how they work. In chapter 2, we present related works in the field, and position our work among them. Chapter 3 is a theoretical study of several possibilities for performing trusted computations over different types of resources. In chapter 4, we propose an online model for modular computations, and discuss the logic behind the model components and several other choices. In chapter 5, we study the risks involved and the impacts of resource failures and attacks. Chapter 6 provides the design specification of the model. Finally, chapter 7 provides implementation details and analysis.

Chapter 1
Background on distributed systems problems & ABFT

In distributed systems, the types of solvable and unsolvable problems are mainly determined by the assumptions we make about the distributed environment, i.e., the distributed model. A small modification of these assumptions may radically alter the class of solvable problems. Some problems are absolutely impossible to solve, while others are unsolvable in practice due to high lower bounds on either space or time in certain environments.
If we can solve a problem in a restricted model, then we can solve the same problem in more relaxed models. Understanding models helps in understanding the solvability of, and the comparisons between, certain problems in different models or environments[8].

1.1 Models

A distributed system is composed of several processes, each of which executes a sequential algorithm. These processes communicate with each other in two different ways. In message passing models, processes send messages to each other via communication channels. This can be modeled by a graph, where nodes are processes and edges are channels. A correct channel behaves as a FIFO queue, with the sender enqueueing data and the receiver dequeuing it. If the queue is empty, the receiver gets a special empty-queue message[8].

In shared memory models, processes communicate by performing operations on shared data structures called objects. Each shared object has a type. A type specifies a set of states of the object, the allowed operations on the object, and the possible outputs of the object. At any time, an object has a single state, and when a process performs an operation on it, it might change the state and return an output to the process. For example, a stack object stores a sequence of values in its state and supports push and pop operations. The most basic object type is the register, which stores a value in its state and supports read/write operations from all processes. A stack could also be built from another type, such as the single-reader/single-writer register, which allows only one process to read/write its value instantaneously.
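As an illustrative sketch (the class names are ours, not taken from any particular framework), the object types described above can be modeled as small Python classes, each defining a state, the allowed operations, and the outputs they return:

```python
# Illustrative sketch of shared object types: each type defines a state,
# the operations allowed on it, and the outputs those operations return.
# (Hypothetical names; real shared-memory objects also need concurrency
# control, which is omitted here.)

class Register:
    """Basic register type: stores one value; read/write from all processes."""
    def __init__(self, initial=None):
        self._state = initial

    def write(self, value):
        self._state = value          # the operation changes the state

    def read(self):
        return self._state           # the operation returns an output

class Stack:
    """Stack type: the state is a sequence of values; supports push and pop."""
    def __init__(self):
        self._state = []

    def push(self, value):
        self._state.append(value)

    def pop(self):
        # Non-deterministic variants could return several possible outputs
        # for the same state; this deterministic sketch returns the top value.
        return self._state.pop() if self._state else None

r = Register()
r.write(42)
s = Stack()
s.push(1)
s.push(2)
```

In this deterministic sketch, the outcome of each operation is uniquely determined by the object's current state, matching the deterministic object types discussed in the next paragraph.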
Consistency conditions specify how an object behaves when accessed by several concurrent operations. For example, the linearizability condition states that operations appear to take effect at distinct points in time, and the order in which operations take effect must be consistent with real time (i.e., if operation A terminates before B starts, then A must take effect before B). Moreover, a linearizable object type can be either deterministic, where the outcome of each operation is uniquely determined by the object's current state, or non-deterministic, where an object type may have more than one possible outcome for an operation on the same state[8].

In randomized algorithms, a process may have many choices for its next step, but the choice is made according to some probability distribution. For randomized algorithms, the termination condition is required only with high probability, and one considers the worst-case expected time. Non-determinism in the shared objects makes problems harder to solve, while allowing randomization in the algorithm can make a problem easier to solve[8].

A process can have a unique identifier, or be anonymous: all processes are similar and run identical code. This can be an important issue when dealing with comparison-based algorithms, which may depend on identifiers for comparing their values.

Timing assumptions are a critical part of models. When the system is synchronous, all processes take steps at exactly the same speed. In asynchronous systems, processes take steps at different speeds. In synchronous message passing models, messages sent in one round are available to be received in the next round; at each step, a process enqueues at most one message to send and receives (dequeues) at most one message. In asynchronous systems, however, in one step a process can either send at most one message, receive at most one message, or access a single shared object.
In partially synchronous models, processes run at different speeds, but there are bounds on the relative speeds of processes, and on message-delivery times for message passing systems[8]. In synchronous systems, time is measured by the number of rounds. In asynchronous and partially synchronous systems, there are different ways to measure time. Step complexity is the maximum number of steps taken by a single process. Work is the total number of steps taken by all processes. Asynchronous computations can be divided into asynchronous rounds, where a round ends when every process has taken at least one step since the beginning of that round[8]. In message passing systems, the total number of messages sent is an important measure of an algorithm's complexity, called the message complexity; bit complexity counts the total number of bits in these messages[8].

1.2 Fault models

A crash failure is when a process halts permanently. However, communication channels may also fail; a way to model such a failure is to consider that a process either fails to send a message or fails to receive a message at some endpoint, which is called an omission failure. An arbitrary process fault, or byzantine fault, is when a process fails and then perhaps recovers, its state becomes corrupted, or it behaves arbitrarily. Such faults are used to model malicious attacks, because we don't know anything about their behavior[8].

We call a process a correct process when it never fails during the execution. In an f-faulty system, we have at most f faulty processes. An algorithm that works in an f-faulty system is called f-resilient (i.e., it can tolerate up to f faults). Moreover, an f-resilient algorithm is also f'-resilient, for all f' ≤ f. A wait-free algorithm ensures that all non-faulty processes will correctly complete their tasks, taking only a finite number of steps, even if any number of other processes crash.
For randomized algorithms, wait-freedom means that the expected number of steps needed by a process to complete its task is finite[8].

1.3 Distributed problems

Here we define a set of problems of main concern in distributed systems, and discuss the impossibility of solving them in certain models.

1.3.1 Consensus

This problem is used as a primitive building block for several distributed problems. Consensus is an example of a decision task, with three conditions:
• Each process gets a private input value from a set.
• Each process produces an output (the task specification describes which outputs are valid for a given input).
• Each process then terminates.

For consensus, two correctness properties must be satisfied:
• Agreement: the output values of all processes must be identical.
• Validity: the output value of each process is the input value of some process.

In models where arbitrary faults are allowed, these properties are weakened and apply only to correct processes.

1.3.2 Byzantine agreement

This problem is also called terminating reliable broadcast, and is a version of consensus. The differences are:
• One process, the sender, has an input that it must send to all other processes.
• The sender receives outputs from the other processes, if the sender is correct.
• The agreement property is the same as for consensus (the outputs of all correct processes are identical).

In byzantine agreement, each process has a priori knowledge that the sender s is going to send a message. The goal is to transfer data (the input) from the sender to the set of receiving processes. A process may perform several I/O operations during the execution, but eventually it must deliver a message, or it may deliver the special message "sender failure"[9]. To be more precise, this problem must satisfy four properties[9]:
• Termination: every correct process outputs a value.
• Validity: if the sender s is correct and broadcasts a message m, then every correct process delivers m.
• Integrity: a correct process delivers a message m at most once, and only if m was previously broadcast by s.
• Agreement: if a correct process delivers a message m, then all correct processes deliver m.

Another restricted version of byzantine agreement is simultaneous consensus, or coordinated attack, where all processes must output in the same round[8].

1.3.3 Resource allocation problems

Mutual exclusion is the problem of sharing resources (e.g., a printer), where there are several processes and a process wants exclusive access to a resource, called the critical section. There are three main properties that any correct algorithm must assure:
• Safety: at most one process accesses the critical section at any time.
• Liveness (deadlock freedom): if some process wants to access the critical section and no other process is inside, then eventually some process will get access.
• Fairness (lockout freedom): if a process wants to access the critical section, then eventually it will be given permission.

The dining philosophers problem is another resource allocation problem, where processes are organized in a ring, each two adjacent processes share a resource, and each needs exclusive access to it. Renaming is a problem where all processes have initial unique identifiers from a large set, and they all want to rename to other unique identifiers from a smaller set[8].

1.4 Impossibility of consensus and byzantine agreement in some models

Consensus is unsolvable in message passing models if messages can be lost. Indeed, if all messages are lost, the validity condition can no longer hold. Even if only a few messages are lost, the lost messages can be scheduled by an adversary in such a way that the consensus conditions no longer hold, for any correct algorithm[8].
Byzantine agreement is unsolvable if even one crash failure can occur in the model. The agreement condition fails if the sender crashes after sending the message to some processes, but before sending it to the others: the first set of correct processes will deliver the message, while the other correct processes will deliver the "sender failure" message[8, 9].

1.5 Algorithm-based fault tolerance (ABFT)

Here we briefly show an example of ABFT, simply to give an overall idea of how this technique works[10]. We also show a simple way to perform the modular computations of an integer function in parallel, and then to use the Chinese Remainder Theorem (CRT) to lift them up to the original result.

1.5.1 Fault-tolerant polynomial interpolation

Let F be a field. Let (x_i, y_i), i = 0, ..., n−1, be n points in F², with all x_i distinct. We want to compute a polynomial P ∈ F[X] of degree at most k−1 such that P(x_i) = y_i for at least n−t indices i.

• Input: three integers n, k, t and n points (x_i, y_i), i = 0, ..., n−1, in F², with all x_i distinct.
• Output: a polynomial P of degree at most k−1 such that #{i : P(x_i) = y_i} ≥ n − t.

It is proved in [10] (with constructive algorithms) that there is a unique solution P = Σ_{i=0}^{k−1} a_i x^i iff n ≥ k + 2t. Indeed, k evaluations at distinct points are necessary and sufficient to characterize a polynomial of degree k−1. This means that if n distinct evaluations of a polynomial of degree k−1 are provided, among which at most t = (n−k)/2 are erroneous, it is possible to recover the polynomial P, correcting the (at most) t erroneous evaluations[10]. This technique can be used to send data (i.e., polynomial evaluations) over the network; eventually, a decoder computes the original polynomial in the presence of t byzantine faults.

1.5.2 Error correction by Chinese Remainder Theorem

Theorem 1 (Chinese Remainder Theorem). Let p_1, p_2, ..., p_n be pairwise coprime natural numbers ≥ 2, and x_1, x_2, ..., x_n ∈ Z. Then the system of simultaneous congruences

x ≡ x_1 mod p_1, x ≡ x_2 mod p_2, ..., x ≡ x_n mod p_n

has integral solutions. If x, x' are two solutions, then x ≡ x' mod π, where π = p_1 p_2 ··· p_n. Conversely, if x is a solution and x' ≡ x mod π, then x' is also a solution.

The CRT algorithm that computes a solution x from x_1, x_2, ..., x_n is shown in Algorithm 1.1.

Algorithm 1.1 CRT
  π = p_1 p_2 ··· p_n
  π_i = π / p_i
  y_i = π_i^{−1} mod p_i
  x = (Σ_{i=1}^{n} x_i π_i y_i) mod π
  return x

The correction part is illustrated in Mandelbaum's paper[11] and was developed by Thomas Stalinski in his master thesis; we provide here only a brief description (see the master thesis of Thomas Stalinski) of the error correction. We add some redundancy to the residue computations, since we expect some of these residues to be faulty. Assume that a minimum of k coprime numbers is needed, so that M_k = p_1 p_2 ··· p_k ≥ x. Similarly to polynomial interpolation, if we use n ≥ k + 2t residues, then it is possible to correct up to t = r/2 errors, where r = n − k is the number of redundant congruences added to the system.

Let F = {i = 1, 2, ..., n : x mod p_i ≠ x_i} be the set of faulty indices, and let C = Π_{i∈F} p_i be the product of the p_i corresponding to errors. Then

x = x' − e = x' − (Σ_{i∈F} x_i π_i y_i mod M_n) = x' − B·(M_n / C)

for some 0 ≤ B ≤ C, where x is the correct result, x' is the value lifted from all congruences, including the faulty ones, and e is the contribution lifted from the faulty coprimes only.

Theorem 2 (Mandelbaum). If the number of errors satisfies |F| ≤ (n − k)·(log p_min − log 2)/(log p_max + log p_min), then

|x'/M_n − B/C| ≤ 1/(2C²),

where M_n = p_1 p_2 ··· p_n.

Now we can use continued fractions to approximate x'/M_n. Let p_i/q_i be the i-th convergent of the continued fraction of x'/M_n. If M_n/q_i ∈ Z and x = x' − e ≤ M_k, where M_k = p_1 ··· p_k, then we have corrected x.
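Algorithm 1.1 can be sketched directly in Python; the modular inverse y_i = π_i^{−1} mod p_i is computed with Python's built-in three-argument `pow` (available since Python 3.8):

```python
# Sketch of Algorithm 1.1 (CRT lifting): given residues x_i modulo pairwise
# coprime moduli p_i, reconstruct the unique solution x modulo pi = p_1...p_n.

def crt(residues, moduli):
    pi = 1
    for p in moduli:
        pi *= p                      # pi = p_1 * p_2 * ... * p_n
    x = 0
    for x_i, p_i in zip(residues, moduli):
        pi_i = pi // p_i             # pi_i = pi / p_i
        y_i = pow(pi_i, -1, p_i)     # y_i = pi_i^{-1} mod p_i
        x += x_i * pi_i * y_i
    return x % pi

# Example: lift 38 back from its residues modulo 3, 5 and 7.
print(crt([38 % 3, 38 % 5, 38 % 7], [3, 5, 7]))  # → 38
```

Without redundancy this decoding has no correction capability: a single corrupted residue yields a wrong lifted value, which is precisely what the redundant residues and the correction procedure above address.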
To identify the faulty residues, note that:
• if i ∉ F, then x mod p_i = x' mod p_i and e mod p_i = 0 (e is a multiple of p_i);
• if i ∈ F, then x mod p_i ≠ x' mod p_i, and e mod p_i = x' mod p_i − x mod p_i = x_i − x mod p_i ≠ 0 (if p_i is prime, then e is coprime to p_i).

As a conclusion, the product of the correct primes (those not in F) is M_n/C = gcd(M_n, x' − x). Using this code, we can simply add more redundancy by appending more residues to the system; we can therefore increase the correction capability dynamically.

Chapter 2
Related works & positioning our contribution

2.1 Related works

Several works study techniques to assure the correctness of global computations, or at least to reduce the chance of obtaining incorrect results in the presence of faults.

2.1.1 Checkpoint and restart

Several research works study methods to efficiently break the execution of a distributed program into several consistent checkpoints. If we detect an error at any stage, we can simply roll back to the previous checkpoint and redo the computations until we reach the next checkpoint correctly. For instance, H. Higaki, K. Shima, T. Tachikawa and M. Takizawa[12] propose an algorithm for taking checkpoints efficiently.

2.1.2 Replication based (voting)

This technique is used by several global computing systems such as SETI@home, Mersenne and BOINC[3, 14, 13]. The idea is to replicate one computation to several nodes, let them all perform the same computation, and have the system vote among the replicas, accepting only the majority result. This technique reduces the error rate exponentially, but requires all work to be done at least twice[16]. It works much better when the base error rate is small: for example, if the error rate is high, f = 20%, doing all the work 6 times still leaves an error rate larger than 1%, whereas when f = 0.001%, repeating the work 6 times yields an error rate of about 10^−21, as shown in figure 2.1[16]. There are several ways to optimize these computations, some of which are presented by James Cowling[15].
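Assuming independent faults with rate f and a strict-majority vote over m replicas (a simplification of the analysis in [16], where colluding faulty replicas are pessimistically assumed to agree on the same wrong value), the residual error rate of voting can be estimated as:

```python
# Probability that a majority vote over m independent replicas is wrong,
# assuming each replica is faulty with probability f and all faulty replicas
# agree on the same wrong value (a worst-case simplification of [16]).
from math import comb

def majority_error(f, m):
    need = m // 2 + 1                # replicas needed for a strict majority
    return sum(comb(m, k) * f**k * (1 - f)**(m - k)
               for k in range(need, m + 1))

print(majority_error(0.20, 6))       # high fault rate: still above 1%
print(majority_error(1e-5, 6))       # low fault rate: astronomically small
```

This illustrates the exponential decrease with m, and why the payoff is much better at small base error rates.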
Figure 2.1: Error rate of majority voting for various values of m and f [16]

2.1.3 Spot-checking & Blacklisting

In spot-checking, the master node randomly gives a worker a spotter work object whose result is already known to the master. If the output of the worker doesn't match the known result, the master invalidates all the results given by that node. The master may blacklist this node so it can't participate in any work in the future, or perhaps only for the current batch job. Spot-checking reduces the error rate linearly, while only costing an extra fraction of the original time [16].

Sarmenta [16] presents the idea of credibility-based fault tolerance, where voting and spot-checking are used together to achieve a better error rate (i.e., to reduce it below the maximum accepted error rate) with less redundancy than voting alone. The scheme automatically trades off performance for correctness. It is similar to voting, except that the number of replications m is determined dynamically at runtime, based on credibility values given to the different entities of the system: workers, result groups (a table of results for a specific job), and the job itself. These credibility values are assigned mainly based on the spotters given to workers and on their history. The system accepts the final voted result only if the work group credibility reaches a threshold defined by the maximum accepted error for the result and the actual error rate in the network.

2.1.4 ABFT and GCS defects

Germain-Renaud and Monnier-Ragaigne [4] model the distribution of defects among the jobs of a global computing system as a Bernoulli distribution (0 = correct, 1 = defect), with an unknown probability p of a job being defective. They define a test τ which decides whether a batch (a set of jobs) has an error rate p ≤ p0 or p ≥ p1, with p1 ≥ p0.
They also define two confidence parameters α and β:
• if p ≤ p0, τ accepts with probability ≥ 1 − α;
• if p ≥ p1, τ rejects with probability ≥ 1 − β.

They perform a sequential test based on Wald's sequential test, which provides adaptivity (i.e., the test sample size is not fixed, but a random variable known during the execution); this is shown to cost much less than classical tests where the sample size is fixed. The model can be tuned so that the end user provides two parameters: pa, the maximum accepted error rate, and ε, the acceptable risk (i.e., the probability of test failure). If the test succeeds, the user may use an ABFT that can correct up to pa. They also show a weak form of blacklisting of bad workers when they can be identified (using a class of algorithms that can reject bad workers), where only workers who produce a high error rate are eliminated, rather than workers who produce a single error. Finally, they introduce a resource allocation function based on the credibility of resources on a GCS and the user quality requirement (i.e., error acceptance).

Jean-Louis Roch and Sebastien Varrette [6] present a probabilistic certification approach for detecting massive attacks on a set of globally computed results using the Extended Monte Carlo Test (EMCT). The motivation is that jobs are performed by a global computing platform which is resilient to a small number of errors, but not to massive attacks. The technique works as follows: all job executions are stored on a secure checkpoint server as a data-flow graph. Using the EMCT algorithm, a set of verifiers randomly select some tasks to be re-executed securely on a reliable resource. The verifiers perform N_{ε,q} calls to the EMCT algorithm, where

N_{ε,q} = log ε / log (1 − q)

If one of the EMCT tests fails, then there is a massive attack of ratio ≥ q, with a probability of certification failure equal to ε [7, 6].
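The number of re-executions is cheap to compute (a small sketch of ours, rounding the formula above up to an integer):

```python
from math import ceil, log

def emct_calls(eps, q):
    """Number of randomly chosen tasks to re-execute so that an attack
    of ratio >= q escapes detection with probability at most eps:
    N = ceil(log(eps) / log(1 - q))."""
    return ceil(log(eps) / log(1.0 - q))

# e.g. certification risk 0.1% against an attack ratio of 5%
print(emct_calls(0.001, 0.05))  # → 135
```

Notably, the number of checks depends only on ε and q, not on the total number of jobs, which is what makes the approach attractive for large batches.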
The work is extended with a certification cost analysis using the work-stealing technique, and applied to an exact matrix-vector product application. The computation scheme works as follows:

• Pick an attack ratio 0 ≤ q ≤ 1/2 and accordingly construct a code with the same correction rate (i.e., an [n, k] code where n = k/(1 − 2q)).
• Perform the computations on unreliable resources.
• Perform the EMCT test N_{ε,q} times to detect a massive attack of ratio q. If the test succeeds (i.e., we have a massive attack), redo the actual computations.
• Otherwise, decode with the constructed code.

2.2 Our contribution

2.2.1 Current limitations

Checkpointing techniques require the system to save its state and to restart from the last checkpoint if an error occurs before the next one. These procedures can cost a lot of time (e.g., one hour to save the state of a large cluster), and synchronization between several asynchronous processes is very costly to achieve. The technique is convenient in controlled systems, where errors are few and due to hardware or software failures; however, in GCS and volunteer computing, error rates are much higher, and errors could even be malicious.

Blacklisting is usually difficult to achieve, since anonymity is easy to obtain nowadays. In some cases it can waste resources, if a weak blacklisting algorithm blacklists a correct worker, or even a byzantine worker that rarely generates errors. However, it can be used as a complement to other techniques to enhance performance, as done by Sarmenta [16, 4]. Therefore, it should be used wisely and in appropriate situations.

Replication is very common in big GCS due to its simplicity, but it is usually very costly, since it requires the work to be done at least twice, and it reduces the efficiency of a GCS. ABFT usually gives a better and less costly alternative for tolerating errors¹.
2.2.2 Contribution rationale

There are protocols that allow the end user to develop distributed applications, such as OGSA or JXTA for p2p systems [18, 17], and APIs such as MPI [19] or Athapascan [20]. These protocols and APIs reduce the underlying complexity of developing a distributed application for the end user, and some of them let the middleware handle most of the complexity. However, obtaining a certified computation remains a cumbersome duty for the end user, especially in the presence of massive attacks. Even though a GCS is internally resilient to some ratio of faults, massive attacks are still possible. Several works discuss techniques for reducing or checking errors in the computations, but the choice of the algorithm and of the error correction remains the duty of the end user. The end user might not be interested in correcting and certifying computations; he simply needs a big computational power, which is not secure in most cases, to execute a user function in parallel and return a secure result.

In this work we present a high-level model that takes all the unwanted complexities off the user's shoulders, such as error correction, result certification, parallelizing computations, and dealing with the type of the underlying network; and, most importantly, reduces the use of trusted resources as much as possible. We would like to provide a service in which the end user can simply invoke a function from a set of functions, provide its input, and eventually get a certified output. We would like a model that is:

¹ We will show briefly in the next chapter why ABFT is better than replication in usual scenarios.

• Independent of the underlying network; that is, applicable to any type of GCS (e.g., p2p, classical grid, etc.).
• Free of assumptions about faults or attacks (i.e., the model shall adapt to massive attacks dynamically), with a flexible error correction capability.
• Online: performs job computations, decoding, and result certification simultaneously (i.e., as soon as inputs are available, components produce outputs on the fly).
• Keeps the use of trusted resources to a minimum.

We restrict our theoretical study, model specification, and implementation to independent integer functions (i.e., the function itself is a work unit with no dependencies).

2.2.3 General assumptions

Throughout this work, we assume asynchronous message-passing models with no assumptions about the attack (i.e., all types of faults are allowed), unless stated otherwise. We assume that the error ratio in the GCS is 0 ≤ e < 1/2; otherwise, we can never correct. Whenever we say a massive attack, we mean a high error ratio q such that 0 ≤ q ≤ 1 and q > e, among a set of computations. We assume that an attack is temporary (i.e., asymptotically we will observe the ratio e, if we repeat computations many times). We will use prime integer modular computations to achieve parallelism for user function computations, the results of which are eventually lifted using the CRT algorithm. Recall that CRT only needs coprime integers to build the solution; since primes are also pairwise coprime, we will use primes for simplicity. The CRT lifting and error correction work is done by Thomas Stalinski and we will not discuss it thoroughly.

Chapter 3
Theoretical study: efficient modular computations using minimum reliable resources

3.1 Types of resources

We shall categorize resources into three types¹:

3.1.1 Untrusted resources

This type represents resources on the public GCS whose behavior is not predictable (i.e., they could be massively attacked). We shall assume the following:

• All types of faults are allowed, namely crash faults, omission faults, and arbitrary faults.
• These resources are very cheap, and have a big computational power.
• Machines could have unique IDs, or at least a weak form of identification (e.g., an email address or an account).

3.1.2 Semi-trusted resources

These are controlled resources, operated either by us or by a trusted service provider. This type is segregated from trusted resources and cannot access them. We assume the following about this type:

• These resources have limited computational power, and are more expensive than untrusted resources.
• This type is accessible by untrusted resources directly (or indirectly, through other semi-trusted resources).
• All types of faults are very unlikely to occur (almost impossible), since these resources are controlled by us or by other trusted providers. Throughout this work, we consider them reliable resources.

¹ The rationale behind this categorization will be shown later in this work.

3.1.3 Trusted resources

These are resources fully controlled by us (or by a very trusted partner). We assume the following:

• No faults can occur in these resources.
• These resources are limited and very expensive (more so than semi-trusted resources).
• These resources are not accessible by untrusted or semi-trusted resources; they are only accessible by trusted resources.

3.2 Why ABFT instead of replication?

Assume we need to send n data items D1, D2, . . . , Dn in order to retrieve a result R of b bytes, R = r1, r2, . . . , rb (each ri is one byte long). Let t be the total number of faulty data items Di.

Using replication. We replicate R n times, such that ∀i, Di = R. In order to tolerate t faults among the Di, we need the majority of the Di to be correct. Therefore we need n = 2t + 1, which means a total of (2t + 1)b bytes.
Using ABFT. Let P = Σ_{i=0}^{b−1} bi x^i, where the bi are the bytes of R, be a polynomial in F_{2^8}[x] of degree at most b − 1. Using polynomial interpolation [10] (discussed in chapter 1), in order to tolerate t faults we need only n = b + 2t evaluations at distinct points. Therefore, each Di is an evaluation of P at a distinct point. This works iff b + 2t ≤ 2^8 (since the evaluation points lie in F_{2^8}). Assuming distinct evaluations are sufficient to tolerate t faults, the total number of bytes needed is only b + 2t.

As we can see, ABFT requires b + 2t bytes whereas replication requires (2t + 1)b bytes, assuming t ≤ b (see [10] for details).

3.3 Modular function parallelism

Let f be any user function which takes an input from a set and outputs Y ∈ Z, where Y is unknown. We compute f mod p1, f mod p2, . . . , f mod pn, such that Π_{i=1}^{n} pi ≥ Y, and obtain a set of residues y1, y2, . . . , yn respectively. Then we reconstruct the solution Y by lifting from the n primes using CRT algorithm 1.1. The main advantage of using modular computations is that the cost of operations in F_{pi} is much lower than in Z, given that Y ≫ pi. We use this advantage to perform the modular computations in parallel on different machines, each performing the computation modulo a reasonably sized pi, in order to achieve a fine-grained job size.

3.4 Impossibility of getting a lifted result from modular computations using only untrusted resources

In this section, we show the impossibility of obtaining an agreed lifted result, based on the byzantine agreement problem. The proof reduces the byzantine agreement problem to the modular agreement problem.

3.4.1 Modular agreement problem

Modular computation decision task:
• gets an input m ∈ {0, 1};
• computes x ≡ m mod pi, where pi is a random prime number;
• outputs x ∈ {0, 1} (x is valid iff x ≡ m mod pi);
• terminates.

There is one process, a sender, that has input m and delivers m to all other processes.
The other correct processes receive m and output x (as specified in the decision task) to the sender, if the sender is correct. All properties are similar to the byzantine agreement problem discussed in chapter 1.

3.4.2 Byzantine reduction

Algorithm 3.1 Byzantine agreement reduction
Require: a value m ∈ {0, 1}
Ensure: byzantine agreement on value m
return x = OracleModularAgreement(m)

As shown in algorithm 3.1, if an oracle can solve the modular agreement problem (i.e., all correct nodes output m to the sender, if the sender is correct), then byzantine agreement can be solved. By contraposition, since byzantine agreement is known to be unsolvable in our model (see chapter 1 for details), modular agreement is unsolvable too. This indicates that it is impossible to obtain an agreed lifted result using only untrusted resources, since even a simple agreement among the resources is impossible to achieve in our model.

3.5 Possibility of getting a lifted result from modular computations using one trusted resource

Now consider the modular agreement problem when the sender S is a known, fixed, correct process (it never fails), which all correct processes can identify. Then, obviously, all correct processes can deliver the same result m back to it, and eventually we reach an agreement. Byzantine agreement is also solvable in such a model, for the same reason. In other words, assume that the correct process sends the inputs of the function f to all processes. The other processes return result pairs (yi, pi). From these pairs, we can construct Y by CRT, given sufficiently many primes (Y ≤ Π_i pi). However, there are three bad possibilities: 1) some residues yi are faulty; 2) some pi are not prime (or not even coprime); or 3) some yi are faulty and some pi are not prime.
For the first case, the sender S may run the Mandelbaum error correction algorithm, provided Ymax²: the maximum possible result Y of f, assuming we don't have a massive attack³. For the second case, we can build a table of prime numbers (with the prime range known by all correct processes in the untrusted network), and reject any pair whose pi is not in the table. The third case is covered by the second: since pi is not prime, the result pair is rejected.

3.6 Inefficiency of using untrusted resources for decoding and trusted ones for certifying

Here we separate two things: decoding, and checking the validity of the decoded result. Whenever we decode a result on a trusted resource from n result pairs (yi, pi), with t faulty yi, knowing the maximum bound Ymax, there are three possibilities: 1) we correct the errors and return a result; 2) we merely detect an error and return "no result"; 3) we return a wrong result. The first case is obvious, due to the correction capability of the algorithm. The second case is due to too many errors and insufficient redundancy. The third case occurs when the t faulty yi form the majority of the computations, all of them are simultaneous congruences (see 1.5.2), and Π_{i=1}^{n} pi ≤ Ymax; the CRT decoding algorithm will then consider correct results as faulty and accordingly "correct" them⁴. This may happen under a malicious massive attack in which the attacker crafts many homogeneous result pairs. This implies that the decoding process itself cannot correct all errors, even if it is a correct process.

² This will be used by the Mandelbaum algorithm to know when to stop finding convergents in the continued fraction, as explained in chapter 1.
³ We can solve the case of a massive attack, as will be shown later in this section.
Therefore, we need some extra tests to assure the correctness of the decoded result. There are generally two methods of testing the correctness of results: either performing a probabilistic verification (e.g., using MCT) [7, 6], or testing only the final decoded result (i.e., the corrected result), which we call post certification (algorithm 3.2). The first method requires repeating several computations, which usually costs more than performing one or a few checks on the final decoded result, as done by post certification. Post certification does not require the certifier to access the results of the modular computations; only the decoder accesses them, which simplifies the role of the certifier and reduces the overall message complexity in the system. We shall therefore use the second type, post certification, in our computational scheme.

The post certification

Algorithm 3.2 Certification algorithm
Require: the decoder result R
Ensure: SUCCESS/FAILURE
1: pick a fresh random prime pi
2: r ← f mod pi
3: r′ ← R mod pi
4: if r ≠ r′ then
5:   return FAILURE
6: else
7:   return SUCCESS
8: end if

As we can see from algorithm 3.2, the prime pi must never have been used in the CRT decoding; otherwise the certification will always return "SUCCESS", because the CRT has already used that prime to construct some result Yf, which could be faulty even when the residue computation modulo pi was correct. Another issue is that the algorithm might return "SUCCESS" erroneously due to an unlucky choice of pi (when f mod pi = Yf mod pi although Yf is wrong).
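Algorithm 3.2 can be sketched in Python (illustrative only; `f` stands for the user function re-executed modulo the fresh prime on the trusted resource, and `used_primes` for the primes consumed by the decoder, both names are ours):

```python
import random

def post_certify(f, R, primes, used_primes):
    """Algorithm 3.2: certify a decoded result R by one modular check
    with a prime that the CRT decoder has never used."""
    fresh = [p for p in primes if p not in used_primes]
    p = random.choice(fresh)   # pick a fresh random prime
    r = f(p)                   # r  = f mod p, recomputed on a trusted resource
    r_prime = R % p            # r' = R mod p
    return "SUCCESS" if r == r_prime else "FAILURE"

# toy example: the true result of f is 123456
f = lambda p: 123456 % p
primes = [101, 103, 107, 109, 113]
print(post_certify(f, 123456, primes, used_primes={101, 103}))  # → SUCCESS
print(post_certify(f, 123999, primes, used_primes={101, 103}))  # → FAILURE
```

In the toy example the wrong result 123999 always fails, because 123999 − 123456 = 543 is not divisible by any of the fresh primes; in general a wrong result can slip through exactly when the fresh prime divides the difference, which is the error case discussed next.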
A rough estimate of the certification error is 1/pi, which is not always accurate, since a modular function's results are not necessarily uniformly distributed over the residues mod pi⁵.

Since the decoder could return faulty results even when run on a trusted resource, we would like to run it on an untrusted resource: a correct process S picks a process sd randomly from the untrusted resources to run the decoding algorithm. S sends the inputs of the function f, and the address of sd, to a set of processes s1, s2, . . . , sn within the untrusted resources. Each correct si returns a pair of residue and prime (yi, pi) to sd⁶. We have two cases for sd:

• If sd is correct, then it delivers a result Y to S (the result could be either a corrected result, "no result", or a wrong result).
• If sd is faulty, it either crashes (i.e., doesn't send anything⁷), or it returns a byzantine result to S.

Assume the decoding algorithm is fast and S has an upper bound on the execution time of sd. If sd doesn't deliver any result within the bound, then S randomly picks another sd as decoder. So we only focus on the byzantine case, where sd always delivers a result to S. Whenever S gets a result from sd, S applies the certification process to the decoded result. If sd is correct, then S must respond appropriately:

• if sd returns a corrected result Y: S's certification succeeds and the system terminates;
• if sd returns "error": S must increase the redundancy r in the computation by asking s1, . . . , sr for more residues;
• if sd returns a wrong result: the certification process fails, and S must increase the redundancy as in the previous case.

⁴ In other words, the errors exceed the detection capability of the code.
⁵ The actual error probability could be studied in future work.
⁶ For simplicity we assume all si return unique primes, and errors occur only within residues.
⁷ This could be either an omission fault or a crash failure.
If sd is byzantine, it always returns a wrong result (including "error"), so S's certification always fails. As we can see, there are several cases S needs to deal with, and it is difficult for S to distinguish between them. One solution is that, whenever S gets a certification failure, it performs all n + r computations again and then picks a new sd randomly. However, this solution is inefficient, since a single byzantine error requires redoing all n + r computations. Another solution is to pick several decoders sd1, sd2, . . . , sdt and send all their addresses to the si processes. S can then check all the results returned by the sdi decoders. This solution reduces the probability of byzantine decoder faults by a factor of 1/t; however, it increases the message complexity in the network by at least a factor of t.

Assuming the decoding algorithm is not costly⁸, we could use a trusted resource to run it. However, the decoder might require untrusted resources to connect to it directly (e.g., a server listening to untrusted connections), which could be a security breach if the trusted resources are not robust enough⁹. Therefore, in order to give the system administrator a better view of the risks involved, and to reduce the risks on the trusted resources (where we run the post certification algorithm), we propose a third solution: run the decoder on a semi-trusted resource, which is in a network segregated from the trusted resources and cannot access them.

⁸ A detailed performance analysis is conducted by Thomas Stalinski in his thesis.
⁹ For example, a simple buffer overflow could be exploited by a malicious user to access all trusted resources.
Hence, the decoder could run on a machine within the GCS, but one that satisfies some credibility criteria which qualify it to be trusted [4].

Chapter 4
Proposed online post certification model for modular computation

As discussed in the previous chapter, the most convenient solution for obtaining certified results is to run the post certifier on a trusted resource and the decoder on a semi-trusted resource. In this chapter we extend this solution into a more detailed and realistic computational scheme.

4.1 Why an online model?

By online we mean that all computations, such as job execution (i.e., the modular computations), decoding, and certifying, happen simultaneously and on the fly: whenever there are sufficient inputs, we produce outputs without waiting for all inputs or outputs to complete. One advantage of such a scheme is that it allows early termination for some applications. For example, for computing a matrix determinant, the best known bound is the Hadamard bound [22]:

|det(A)| ≤ n^{n/2} (max_{i,j} |a_{i,j}|)^n

where A is a square matrix of size n, and a_{i,j} is an element of A. In the case of modular computations, this bound requires Π pi ≥ bound; hence we need enough modular computations to be sure that we can decode (assuming the error rate is very small). However, for sparse matrices, or matrices with certain properties, this bound is very pessimistic and causes a huge waste of computational power. With an on-the-fly algorithm, the bound can instead be discovered during the execution. Another advantage of online schemes is a lower space cost: online techniques generally do not require storing all intermediate results at the same time; as soon as we produce an output, we can release some resources. This requirement imposes a constraint on the decoding algorithm (based on Mandelbaum), which requires a given maximum bound. The decoding algorithm has been modified by Thomas Stalinski to be online; later in this chapter, we briefly discuss how the new decoder works.

4.2 Model architecture

In this part, we extend our previous solution gradually until we characterize the model components. In the previous chapter, we decided to run the certifying process on a trusted resource, the decoding process on a semi-trusted resource, and several workers on untrusted resources in a global computing system (GCS). The decoding process, the decoder, needs to read the outputs from the workers (residue/prime pairs). However, we want the decoder to be independent of the underlying network (e.g., p2p, classical grid, etc.). Therefore, we need another component, the data proxy, to handle the underlying network details. The data proxy gets the data from the workers and provides a static interface to the decoder. Since our decoder tolerates errors in the residues only (not in the prime moduli), we also want the data proxy to reject faulty (or non-unique) primes. The data proxy must run on a semi-trusted resource; otherwise it could forge all results.

Previously, we saw that the certifier performs three tasks: reading the decoder result, verifying the result, and assigning jobs to workers. We want to limit the logical task of the certifier to certification only, and introduce another component, the front-end, to manage the other details, such as dispatching jobs and reading values from the decoder. This gives the system better modularity, and components can be independently improved and maintained over time. Dispatching jobs requires a direct connection to the workers in the GCS, which is logically the task of the GCS itself, as it interacts with its computing nodes. Therefore, we introduce another component, the public GCS, that assigns jobs to workers. The public GCS provides a static interface to the front-end for commands (e.g., execute function f with inputs x1, x2, . . .
, xk, using primes with indices i1 to in). The public GCS must run on at least a semi-trusted resource, because otherwise it could control all the jobs. Figure 4.1 shows the overall architecture of the system, with the following components:

• Front-end: manages the system interactions.
• Public GCS: responsible for assigning modular computation jobs to workers.
• Data proxy: the decoder's data access point.
• Decoder: error correction.
• Certifier: decoded result verification.

Figure 4.1: System architecture on different types of resources

4.3 Component details

4.3.1 Front-end

This component is responsible for the main component interactions. It also interacts directly with the end user: it accepts a function call from a set of predefined functions. Accordingly, this component initiates the computation by asking the public GCS to execute the given function n times. Meanwhile, it asks the decoder to start online decoding (since we expect data to be available by then). Whenever the front-end receives a result from the decoder, it immediately asks the certifier to verify whether it is correct, and so on until we get a certified result (algorithm 4.1).

We would like the front-end to have more control over which primes are chosen, in order to coordinate between the certifier and the public GCS, so that the certifier won't use an already used prime¹. One prospective advantage is that, if the front-end deals with several GCS (p2p, volunteer computing, etc.) to perform computations,

¹ We showed in the previous chapter that, otherwise, the certifier will always return SUCCESS.
Algorithm 4.1 Basic front-end algorithm
Require: user function f and its inputs
Ensure: f(inputs)
1: while true do
2:   n ← initial number of jobs
3:   ask Public GCS: exec(f(inputs), n)
4:   ask Decoder: start decoding
5:   repeat
6:     res ← Decoder output
7:     cert ← ask Certifier: certify(res)
8:     if cert = SUCCESS then
9:       return res
10:    end if
11:  until res = "END OF DECODING"
12:  increment n
13: end while

deciding which prime to use is important to prevent overlapped computations. The solution is to give each of the certifier and the public GCS two values representing the range of primes each one may use. The front-end may also require the decoder to return the number of corrected errors coupled with the decoded result; this provides valuable information about the underlying network errors.

Online schemes promote early termination. Perfect early termination can be achieved by increasing the redundancy r slowly (i.e., one by one), but this solution requires the system to perform a lot of certification tests (r tests, assuming the decoder returns one result each time we add redundancy) until we get the correct result; the tests therefore cause a high overhead, resulting in more iterations of the scheme and more certifying. In order to reduce the number of tests, one can add redundancy by doubling, as shown in figure 4.2 (where n is the unknown required redundancy). Using this technique we perform k tests while dispatching 2^k jobs. This reduces the number of certification tests dramatically, but greatly increases the number of jobs, which can waste computational resources. In the worst case, if n = 2^{k−1} + 1, the technique executes 2^k = 2n − 2 jobs, i.e., n − 2 extra jobs. As we can see, we have a trade-off between the number of jobs and the number of certification tests, or model iterations: the number of times we need to add extra redundancy as certification fails.
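The doubling schedule can be sketched with a toy simulation (ours; `enough(r)` stands for the unknown predicate "redundancy r suffices to decode and certify"):

```python
def doubling_schedule(enough):
    """Grow redundancy by doubling: r = 1, 2, 4, 8, ...
    Returns (certification tests performed, jobs dispatched)."""
    tests, r = 0, 1
    while True:
        tests += 1        # one decode + certification round per step
        if enough(r):
            return tests, r
        r *= 2

# worst case from the text: required redundancy n = 9 = 2**3 + 1
print(doubling_schedule(lambda r: r >= 9))  # → (5, 16): 16 = 2n - 2 jobs
```

Only 5 tests are needed, but 16 jobs are dispatched where 9 would have sufficed, which is exactly the trade-off discussed above.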
One solution is to use an amortized control technique: we perform tests at steps ρ^{f(1)}, ρ^{f(2)}, . . . , ρ^{f(k)}, where 1 < ρ < 2 and f(i) = i^a with 0 < a ≤ 1, or f(i) = i/log i. This technique is proved to reach n with around log n tests and asymptotically n + o(n) total redundancy [23]. The choice of ρ and a controls the speed of the growth.

The amortized technique matters because the public GCS could be provided by a service provider who charges for the resources used. One business model could associate a price with the number of floating point operations per hour (op/hour); with amortization, we try to keep this cost as low as possible. Hence, the speed of the amortized growth can be determined from the trade-off between the provider's pricing and our quality of service: how fast we want to deliver computations to end users. Another business model could associate a cost with machine/hours, regardless of the number of operations. This would require using the full potential of each computing node during the execution; we would then use the amortized technique to control the data rate sent to the decoder: by adding computing nodes, the data rate increases accordingly, and our pricing trade-off would be based on this rate instead². We shall consider the first model for simplicity. By using amortization, we can dynamically increase the number of resources: assuming we want each machine to perform a grain of at most n computations, the number of machines would be r/n, where r is the amortized redundancy (i.e., the extra jobs).

² This may be a prospective extension of this project.

4.3.2 Decoder

Stalinski showed in his master thesis that the decoder works efficiently with 21-bit prime numbers, but not so efficiently with larger primes; therefore we consider primes of this size in this work.
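The supply of such primes can be checked with a short sieve (a sanity check of ours; it counts the primes in [2^20, 2^21), i.e., the 21-bit primes):

```python
def count_primes_in(lo, hi):
    """Count primes p with lo <= p < hi using a plain sieve of Eratosthenes."""
    sieve = bytearray([1]) * hi
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(hi ** 0.5) + 1):
        if sieve[i]:
            # cross out all multiples of i starting from i*i
            sieve[i * i:hi:i] = bytearray(len(range(i * i, hi, i)))
    return sum(sieve[lo:hi])

print(count_primes_in(2 ** 20, 2 ** 21))  # → 73586
```

This confirms the count of 73586 used in the next paragraph.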
There are 73586 21-bit primes that we can use for decoding, which is enough for reconstructing very large data (i.e., we can build a result of at least (2^20)^{73586} = 2^{1471720}, assuming few errors). The Mandelbaum decoder requires a maximum value in order to know when to stop finding convergents, as shown before. Thus, the front-end must provide this maximum, which can be either hard to define or very pessimistic, as in the matrix determinant case. The solution is to let the decoder send a stream of candidate results and let the certifier discriminate. But there are two problems: how many primes shall we use for the CRT computation before decoding, and when do we stop looking for convergents?

A proposed solution is for the front-end to tell the decoder two things: the number of extra amortized jobs, and the maximum expected number of errors. The number of amortized jobs is used by the decoder to know how many result pairs to read from the data proxy in order to build the CRT solution. The maximum number of errors is used to decide when to stop finding convergents (computing the maximum Mk automatically; see chapter 1 for details). Algorithm 4.2 shows the internal interactions of the new decoder.

Algorithm 4.2 Decoder manager algorithm
Require: stream of (xi, pi), stream of (kj, errorsj)
Ensure: decoded result stream from all (xi, pi)
1: repeat
2:   for i = 0 to kj do
3:     CRT ← Lifter.add(xi, pi)
4:   end for
5:   repeat
6:     result, errs ← Decoder(CRT, errorsj) {errs: #corrected errors}
7:     send result and errs to Front-end
8:   until result = "FINISH"
9: until k = "TERMINATE"

Instead of the maximum number of errors, one could check whether the corrected result stabilizes after k convergents; this returns a corrected result with an error probability of 1/pk.
However, deciding k could be difficult for the decoder; for instance, choosing a large k could slow down the computations (finding many convergents) for no reason when we have few errors in the computations. As the front-end knows more about the type of GCS being used, providing the number of errors is more reasonable. The front-end can give an error rate close to 1/2, in order to assure correction capability even if we have a massive attack with a ratio close to 1/2. However, this again will slow the decoder down, because it always checks many convergents, even when we don't have a massive attack.
We assumed previously that we expect all types of faults in untrusted resources, including omission faults and crash faults, and a massive attack could be close to 1 (but not asymptotically). Therefore the front end must consider asking for more jobs than the number reported to the decoder; otherwise, the decoder might keep waiting for result pairs while there are no more results available. For example, assume the network default error rate is e and the number of amortized jobs is r at some time. The front end must tell the decoder to decode r jobs while it tells the public GCS to dispatch r + e·r jobs3.
On the other hand, if we have a massive attack q > e, the previous solution no longer holds and the decoder will keep waiting. A naive solution to this issue is to let the decoder send a signal whenever it has read all the results (i.e., built the CRT solution out of the result pairs read from the data proxy). This signal could be the basic CRT solution, without error correction. Then we use a timeout based on our control risk assessment, or in other words: how much we can afford to wait for slow workers, how we handle a massive omission attack (a type of byzantine attack), or even how we handle a temporary network crash failure.
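The dispatch rule above is simple arithmetic; as a minimal sketch (the function name is ours, and rounding up is our own choice):

```cpp
#include <cmath>

// Sketch (our naming): how many jobs the front end should ask the public GCS
// to dispatch so that, with default error rate e, about r usable result
// pairs reach the decoder. The decoder itself is told to decode only r jobs.
long jobsToDispatch(long r, double e) {
    return r + static_cast<long>(std::ceil(e * r));  // r + e*r, rounded up
}
```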
The timeout need not be fixed; it should be a function of the number of extra amortized jobs r, and triggered whenever the front-end gets a task termination signal from the GCS (if possible). However, timeouts are always a problematic solution, especially when dealing with asynchronous systems. A better solution, introduced in [4], is to let the public GCS test the workers whenever the front end asks for redundancy, and update the front end with the expected error rate. The decoder assumes that the front end always knows the maximum number of errors.
4.3.3 Data-proxy
The internal structure of the data proxy is totally implementation dependent. One implementation can propose a server listening for workers' data; another can assume an NFS system that stores results from workers, etc. Either way, the data proxy should assure two conditions on the data submitted by workers:
• All primes used are unique.
• A worker can only submit a few results.
For the first condition, one solution is that the data proxy creates a table of all 21-bit primes (which is not very costly), and whenever it receives a prime, it flags the corresponding entry as used. This solution requires a binary search in the table (given the table is sorted) for each prime received from a worker. So, if we have n primes from workers, in the worst case we need fewer than n·log2 73586 ≈ 16n table lookups. We also need space of size 21 × 73586 = 1545306 bits.
Another solution is to create a table of prime indices from 0 to 73585 (or a simple array whose contents are only flags), and to require workers to inform the data proxy which indices have been used. Obviously, this gives us O(1) table lookup and a smaller space cost, since we need only 1 bit to represent a binary flag (a bitmap of size 73586). This requires workers to send a few more messages about indices. However, there is a problem: a worker may provide a faulty index for its prime, or even a non-prime number.
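The bookkeeping for the first condition can be sketched as below. Class and member names are our own: a sieve builds the sorted table of 21-bit primes once, an index bitmap flags used entries, and each submitted (index, prime) pair is validated in O(1) against the table, which also catches faulty indices and non-primes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch (our naming) of the data-proxy prime bookkeeping: a sorted table of
// all 21-bit primes plus one used-flag per prime index. A submission is a
// prime index and the claimed prime; we verify the claim against the table
// in O(1) and reject duplicates, bad indices, and non-primes.
struct PrimeRegistry {
    std::vector<std::uint32_t> primes;  // all 21-bit primes, sorted (73586 of them)
    std::vector<bool> used;             // one flag per prime index (a bitmap)

    explicit PrimeRegistry(std::uint32_t lo = 1u << 20, std::uint32_t hi = 1u << 21) {
        // simple sieve of Eratosthenes up to hi; keep the primes in [lo, hi)
        std::vector<bool> comp(hi, false);
        for (std::uint32_t p = 2; (std::uint64_t)p * p < hi; ++p)
            if (!comp[p])
                for (std::uint64_t m = (std::uint64_t)p * p; m < hi; m += p)
                    comp[m] = true;
        for (std::uint32_t n = lo; n < hi; ++n)
            if (!comp[n]) primes.push_back(n);
        used.assign(primes.size(), false);
    }

    // returns true iff (idx, p) names a real, not-yet-used 21-bit prime
    bool accept(std::size_t idx, std::uint32_t p) {
        if (idx >= primes.size() || primes[idx] != p || used[idx]) return false;
        used[idx] = true;
        return true;
    }
};
```

The sieve confirms the count quoted above: there are exactly 73586 primes between 2^20 and 2^21.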
The solution is to construct a prime table, and each time, given the prime index, we check the prime provided by the worker against the one in the table. This requires O(1) table lookup and the same space complexity as the first solution, but a few more messages from workers.
3. In a real GCS, e is small, so we compute only a few more redundant jobs in addition to the amortized redundancy, which is usually much larger.
For the second condition, there is a possibility that a malicious worker fills the data proxy table with faulty results; as a result, one attacker could shut down the whole computation. Therefore we need the GCS to be able to identify workers, so that a worker cannot submit more than its share of results. The attacker's power is restricted by the GCS identification algorithm, which the data proxy needs to use. There is one issue if each worker has the right to submit at most m results: for this case, we need the front-end to provide m to the data proxy, so that whenever a worker submits a number of results ≥ m, the data proxy shall reject the results and may flag the worker as malicious or faulty.
A malicious worker could use a prime index used by the certifier to generate a result. As a consequence, the certification would succeed even if the decoded result is faulty. A solution is to let the front end send the certifier's prime ranges to the data proxy; however, the range may not be known at the beginning of the computation, and would have to be sent gradually during the execution. We propose a simpler solution in which the certifier uses primes other than 21-bit primes.
4.3.4 Certifier
Following the previous part, we propose that the certifier use 20-bit primes instead; however, this may increase the certification error probability. We will not consider this issue in this work.
As we can see in certification algorithm 3.2, if we repeat the certification with new, different primes, we obviously get a lower certification error probability ε. As we have not studied the exact value of ε in this work, we assume that the current certification result is enough to verify the correctness of the result.
4.3.5 Public GCS
We propose, for several reasons, that the front end ask the public GCS to execute globally a function f with its inputs and two indices representing a prime range. As we discussed before, a GCS could have either a centralized or a decentralized architecture. Either way, we don't want a single point to be responsible for prime generation, for three reasons:
1. It could introduce a big load to let one process communicate with all other processes and send primes (especially in a p2p architecture).
2. One component failure (the prime generator) would lead to failure of the whole computation.
3. Some work stealing algorithms work better when jobs are split recursively (e.g., as a binary tree) [24].
We propose that the public GCS give each worker two indices for the range of primes to be used, so that each worker generates its own primes. This solution increases the overall computational cost, since each node needs to generate primes up to the second prime index. Generally, generating 21-bit primes is not costly4.
The third reason raises a new problem: what happens if the public GCS gives half of the work load to a faulty worker? The answer is that we get a fault ratio q ≥ 1/2. Usually, with work stealing algorithms, jobs are split (as a binary tree), and the actual computations occur at the leaf nodes. Generally, work stealing algorithms are efficient, and work very well when we have small error rates, which is very likely in a GCS.
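The binary-tree job splitting behind reason 3 can be sketched as follows (the function name is ours): each leaf receives a sub-range [first, last) of prime indices and generates its own primes, so no single process has to generate and distribute them all.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch (our naming) of recursive binary-tree job splitting: the prime index
// range [first, last) is halved until each leaf holds at most `grain` jobs.
// Actual computation happens at the leaves; in a work-stealing runtime the
// two recursive halves are what idle workers steal.
void splitJobs(std::size_t first, std::size_t last, std::size_t grain,
               std::vector<std::pair<std::size_t, std::size_t>>& leaves) {
    if (last - first <= grain) {          // leaf: one worker handles [first, last)
        leaves.emplace_back(first, last);
        return;
    }
    std::size_t mid = first + (last - first) / 2;
    splitJobs(first, mid, grain, leaves); // left half
    splitJobs(mid, last, grain, leaves);  // right half (stealable in parallel)
}
```

Note how a single failed subtree near the root loses up to half of the leaves, which is exactly the fault-ratio concern raised above.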
There are actually two ways to know the average error rate e in a GCS:
• Experimentally, by running the overall model and computing the average number of corrected errors. This can be achieved by the front end giving the decoder a high maximum number of errors, as we discussed before. After several experiments, we can estimate e.
• Or by performing Germain-Renaud [4] sequential tests to find the GCS error rate e with a bounded test failure probability. We assume that the public GCS performs such a test whenever the front end asks for more jobs. The public GCS shall use the job results for the test batch sequentially5.
In any case, if we get a massive attack at any time, the model can adapt to the new error rate, and the front end will provide a higher expected number of errors to the decoder.
4.4 Components Interactions
We shall show the overall interactions between the components, as well as the type of data exchanged. Figure 4.3 shows a sequence diagram of a classical component interaction. The grain represents the maximum number of jobs we expect each worker to produce. first and last are used to control prime indices: for example, to fork 100 jobs starting from index 0, first will be 0 and last will be 100. The public GCS may return "FALSE" for some technical reason; this is useful since we don't expect the decoder to be able to decode when the GCS fails to dispatch jobs. In that case, we need either to study the reason or just retry. Also, the public GCS performs some tests on the job results sequentially, so it may return a new error rate if it differs from the old one.
4. Rough experiments show generally less than half a second on normal machines.
5. We will not discuss the details of such a test in this work; further details can be found in [4].
4.4.1 Control flow
From the sequence diagram in figure 4.3, we can see that the front end doesn't give control to the decoder. In other words, we have a stream of commands
from the front end, and a stream of results from the decoder. This is an important aspect of the online system: we don't wait for function calls to finish. We actually don't have function calls for some components, but rather data streams. For the decoder and the data proxy, whenever there is a result, it is immediately sent; a received data stream is buffered until the component reads the values. The front end also doesn't give execution control to the public GCS, even though we have one result value (not a stream), because the front end needs to deal with the decoder and the certifier in the meantime. On the other hand, the front end should give execution control to the certifier, since it cannot decide anything until it gets a return value. During this control blockage, data received from the decoder is buffered until the control is released (as is the value returned by the public GCS).
Figure 4.3: Classical components interaction
4.4.2 Data buffering
Since components are sent data asynchronously, we need to buffer received results until we process them. The size of each buffer depends on the size of the expected received data. We expect the decoder to have a relatively large buffer for reading many results. The front end should also have a large buffer to be able to read large results from the decoder.
Chapter 5
Model Analysis
We assume that communications are bi-directional, and that authentication and encryption techniques are unbreakable. For the communication links, we assume two types of attacks:
• Channel breaking (may be physical).
• Man-in-the-middle attack (impersonation).
When we say impersonation, we mean the attacker can control both the identities and the data in the channel. We have two degrees of attack severity:
• Result forgery
• Denial of service (DOS)
We assume the first one is much more dangerous.
We also assume that an adversary can completely control all resources and communication channels. We assume also that the certifier error bound is negligible.
5.1 Communication channels risk analysis & man-in-the-middle attacks
5.1.1 Front-end – Decoder
If this connection is broken, then the system will never terminate (i.e., DOS) and won't be able to produce any output to the user. If an adversary can impersonate both the decoder and the front end, then the system either won't terminate or the adversary can give faulty results to the front end; in this case, the certifier will capture the error. The adversary can also cause DOS on both sides by filling the buffers on each side, or by sending a termination signal to the decoder. Therefore, we assume this link is encrypted, and both sides won't communicate until they are authenticated to each other.
5.1.2 Front-end – Certifier
This is the most critical communication link in the system. If it breaks, then we have DOS, since the front end's control blocks until the certifier returns a result. On the other hand, if an attacker can impersonate both sides, he can give a faulty certification, and eventually the system will return a forged result, which is the most dangerous system fault. This communication must be encrypted, and both sides authenticated to each other. It is possible that both are in one private trusted network; in such a case, we might not need encryption or authentication.
5.1.3 Front-end – Public GCS
If this connection is broken, we will have DOS. If an attacker can impersonate both sides, then the system might either never terminate (if he denies the commands), or he can give faulty computations. This could be very useful for the attacker, since he could control the GCS for his own benefit, and a big economic problem if the GCS charges money for resource usage. Therefore, the corrective act is to use authentication and channel encryption.
5.1.4 Front-end – Data-proxy
If the connection is broken, the front end won't be able to provide the grain size of jobs. As we explained before, one faulty worker can then fill the prime table with faulty residues, and accordingly we will have a DOS. A corrective act might be to use a default grain size if the link breaks. However, if an attacker impersonates the front end, then he might provide a large grain (e.g., the number of 21-bit primes), and one faulty worker can fill up the table with faulty results as we explained. The corrective act is to use authentication and encryption to protect this channel.
5.1.5 Decoder – Data-proxy
If the link breaks, then we will have DOS. On the other hand, if an attacker can impersonate both, a whole set of result pairs can be forged, and therefore we will have DOS. The attacker can also cause DOS by filling decoder buffers. Another obvious DOS attack is to send a terminate signal to the data proxy.
5.1.6 Within GCS
We have three main different links in the GCS:
• Public GCS – workers
• Worker – Worker
• Worker – Data-proxy
We assumed that all communication channels can break. If we have a permanent massive network failure (or Internet crash), then obviously we will have DOS. The impact of link failures really depends on the structure of the GCS network and the algorithm we are using (e.g., work stealing). If the public GCS communicates with all workers directly, then one link break is like one worker crash failure (or omission failure); the system should correct such an error. The same holds for the link between each worker and the data proxy. However, when we use work stealing algorithms recursively (as a binary tree execution), then one link crash could cost a fraction of errors e such that 1/log n ≤ e ≤ 1/2, where n is the number of jobs. If a worker can impersonate the public GCS, then he or she can control the whole GCS computation.
We assume that workers cannot initialize the global computations, and that the public GCS is known to all workers.
5.2 Impacts of resource failures
Assuming an adversary can control any type of resource, we would like to see, according to our model, what impacts we could have.
5.2.1 Untrusted resources failure
The system can correct errors e < 1/2, since as we add more redundancy, the decoding correction capability increases. As long as we have n ≥ k + 2t, where k is the number of required primes and t the number of faulty computations, we can always correct the results. If the adversary manages to cause asymptotically an error rate e ≥ 1/2, the system will never terminate. We assume that the sequential test computes e with a bounded test error probability; if we are unlucky, then we will have DOS.
5.2.2 Semi-trusted resources failure
The decoder, data proxy, and public GCS are semi-trusted resources. The adversary can cause a DOS to the system. The system might return a faulty result only due to certification failure; otherwise the system will never return a wrong function result. Also, for the public GCS, the adversary may utilize the GCS system for his own benefit.
5.2.3 Trusted resources failure
If either the front end or the certifier is compromised, then the adversary can cause result forgery, which is the most critical error, and/or DOS. If he controls the front end, then he can control the system completely. These resources are the most critical part of the model, and the whole system's security is based on their security.
Chapter 6
Design specifications
To reduce the design complexity, we require the system to accept only one user function at a time1. However, in order to achieve multiple function calls, we can replicate the whole system. We impose the following constraints on the design of the system:
• User functions must be independent from other components. User functions could be developed by the deployment team and should evolve easily over time, without rewriting other components.
• User functions are given at run time as an argument (e.g., execute function1 arg1 arg2 ...).
• Inputs of a user function must not be limited in number or data type (e.g., they could be a mixture of files and string values). The system needs to deal with serialization/deserialization of user functions and data.
• User functions could return different types of results. The basic type t is a fixed 21-bit integer. However, the system should support other types based on t, such as a vector [t_0, t_1, ..., t_n] or a matrix of the basic type t. The system should be designed in a way that enables flexibility in adding new result types in the future.
• The design of the core decoder algorithm must be independent of the system.
1. We don't want to handle multiple clients at this stage.
We propose an object-oriented programming design for the system for the following reasons:
• Polymorphism: by using polymorphism, we can use one abstract class pointer to refer to several child implementations. This concept helps us get user functions at run time, since we need one place to resolve a string name to a child polymorphic class, and the whole system uses the abstract interface to deal with user functions.
• Modularity: OO gives better modularity for the system; through the concept of encapsulation, implementation details are hidden from objects' users.
• Maintainability: since objects are independent logical entities with clear interfaces, internal implementations can be improved independently and at lower cost.
In the design diagrams, we focus on the case where we have only the integer return type, for the sake of simplicity.
6.1 Components structure
Communications now happen through communicators, which hide complexity details from the main components, as shown in figure 6.1. The certifier and public GCS don't need communicators, since they implement remote functions which are invoked only through the front end communicator.
The CRT lifter and decoder were designed by Thomas Stalinski, who used this communicator to send results. The main component contains some glue implementation code for the decoder, and is also responsible for managing communications with the front end and data proxy.
6.2 Class diagrams
Here we show the class diagrams of the system. Some data types shown in the diagrams are not language dependent and need to be defined; however, we named these types in a way that describes their actual data. Figure 6.2 shows the communicators class diagram. All communicators use an instance of communicatorUtil, which provides low-level communication handling. The current communicator shows only the integer return type for user functions (for the sake of simplicity); for other types, one needs to add the corresponding overloaded members.
According to the design constraints, we designed polymorphic user functions, shown in figure 6.3. userModularFunctionBase is an abstract class, which is later used as a pointer to its polymorphic implementations (user-defined functions). This class has a static member initializeByName, which creates polymorphic user functions by their string name. In the diagram, the functions in italic font are abstract functions which children are required to implement. The second level of inheritance specifies the user function return type: userModularFunctionInteger is the parent of all functions returning one integer value, such as Fibonacci and determinant, while userModularFunctionVector represents functions returning vector types. This scheme enables us to add new types such as matrices or vector spaces. The second level mainly implements outputResult, which prints the resultPairs (to a stream or to stdout, based on the implementation). The third level of inheritance represents the classes from which we can create objects (not abstract).
Figure 6.1: Deployment diagram
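A minimal sketch of this hierarchy might look like the following. The class and member names (userModularFunctionBase, userModularFunction, returnResult, initializeByName, resultPairs) follow the design above, but the registry mechanics, member signatures, and the fibo body are our own illustration, not the project's exact code.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Sketch of the polymorphic user-function hierarchy. The abstract base
// collects (residue, prime) result pairs; children implement the modular
// computation for one prime and report results through returnResult.
class userModularFunctionBase {
public:
    virtual ~userModularFunctionBase() = default;
    // the user's modular computation for one prime p; void, so results
    // flow through returnResult into resultPairs
    virtual void userModularFunction(std::uint32_t p) = 0;
    // resolve a string name to a child polymorphic class (factory)
    static userModularFunctionBase* initializeByName(
        const std::string& name, const std::vector<std::string>& args);
    std::vector<std::pair<std::uint32_t, std::uint32_t>> resultPairs;  // (residue, prime)
protected:
    void returnResult(std::uint32_t residue, std::uint32_t p) {
        resultPairs.emplace_back(residue, p);
    }
};

// Example child: Fibonacci mod p (an integer-returning user function).
class fibo : public userModularFunctionBase {
    std::uint64_t n_;
public:
    explicit fibo(const std::vector<std::string>& args)
        : n_(std::stoull(args.at(0))) {}
    void userModularFunction(std::uint32_t p) override {
        std::uint64_t a = 0, b = 1;                 // F(0), F(1)
        for (std::uint64_t i = 0; i < n_; ++i) {
            std::uint64_t c = (a + b) % p;
            a = b; b = c;
        }
        returnResult(static_cast<std::uint32_t>(a), p);  // F(n) mod p
    }
};

// Minimal name-to-class resolution; a real system would use a registration table.
userModularFunctionBase* userModularFunctionBase::initializeByName(
        const std::string& name, const std::vector<std::string>& args) {
    if (name == "fibo") return new fibo(args);
    return nullptr;  // unknown user function
}
```

Here the whole system only ever holds a userModularFunctionBase pointer, so adding a new user function does not touch the other components, as the design constraints require.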
These functions are designed to be easily added by the deployer of the system. In order to add a new function, one must simply extend the appropriate abstract class (e.g., userModularFunctionInteger) and implement two things: userModularFunction, and a constructor which receives a list of arguments, arg_list. The first function is void; therefore, in order to return a value, the developer must invoke the parent's returnResult member, which handles storing it in a list of results, resultPairs. If one or more arguments are files, then the developer should inform the system by calling setFile with the index of that argument. We imposed this solution to simplify the user code. An alternative solution could be classical marshalling/demarshalling; however, this would require the user to write extra members to support these operations. As we store the arguments and file order, we can send this information to reconstruct the object elsewhere in the system.
The certifier interface is shown in figure 6.4, and the public GCS interface is shown in figure 6.5. publicGCS may return a new error value if it performed sequential tests. We use the type SUCCESS_or_FAILURE for simplicity, and for implementations that would like to get the error rate from the end user.
Figure 6.2: Communicators class diagram
6.3 Interface interactions
Figure 6.6 shows the component interactions through interfaces controlled by communicators. The diagram mainly shows the type of the data used for online components and the ordering of calls. This diagram is similar to figure 4.3, but it focuses more on the type of the data being sent. In addition, the front end now tells both the decoder and the data proxy the type of the data (i.e., integer, vector, matrix, etc.), so each one can deal with each type differently. The current communicators diagram shows only integer types.
Figure 6.3: User functions class diagram
Figure 6.4: Certifier
Figure 6.5: Public GCS
Figure 6.6: Interface interactions through communicators and the data types exchanged
Chapter 7
Implementation & Analysis
We implemented the specification to work on top of classical grid systems (i.e., consisting of controlled clusters). The implementation is written in C++ and uses the LinBox library [27] for the fast linear algebra computations used in computing the matrix determinant. LinBox requires GMP and Givaro [28, 29]; we also used GMP for the certification computation and for the decoding1. For distributed computations, we used the KAAPI middleware [20] through its Athapascan API [25]. The grid experiments are conducted on Grid5000 [26]. The implementation uses the ssh protocol to communicate with the grid being used, and oarsh [30] to connect to Grid5000 job machines. Throughout this implementation, we force the certifier and the front end to run on the same machine, for the sake of simplicity. Also, since the grid is controlled, we relax the prime uniqueness condition on the data proxy, since errors are very low in controlled systems and are more likely to occur in the function computation than in prime generation2. On the other hand, if a worker provides incomplete data (e.g., a result but not a prime), then this submission will be rejected at the decoder communicator side. Throughout this implementation, for the sake of simplicity, we don't compute error rates dynamically by a sequential test; we simply accept the error rate as a user input. We shall now briefly describe both Kaapi and Grid5000.
7.1 Kaapi Library
KAAPI means Kernel for Adaptative, Asynchronous Parallel and Interactive programming. It is a C++ library that allows executing multithreaded computations with data flow synchronization between threads. The library is able to schedule fine/medium grain programs on distributed machines.
The data flow graph is dynamic (unfolded at runtime). Target architectures are clusters of SMP machines. KAAPI is based on work-stealing algorithms which can run on various processors and various architectures (clusters or grids), and it contains non-blocking and scalable algorithms [20].
1. Developed by Stalinski.
2. However, if there is an error in the prime value, then there could be a problem with the system. This case should be handled in future updates.
7.2 Grid5000
Grid5000 is an instrument for computer science research aimed at gathering 5000 processors at the national scale in France (currently 9 sites are involved). The project is dedicated to research in large-scale parallel and distributed systems. Each site comprises several clusters of machines (2-3 clusters of 60-600 machines). It has a heterogeneous architecture of 3202 CPUs and 5714 cores. Within each site there is an NFS server which is only accessible by the site's machines. Each user of Grid5000 has a home directory in each of the 9 sites (i.e., 9 home directories). A user connects to the grid by ssh access to the site front access, as shown in figure 7.1. For resource management, Grid5000 uses the OAR tool [30]. In order to reserve resources, the user must connect through ssh to the access machine, then ssh to the frontend of the site. From there, he can use the OAR tool to reserve resources; he/she will get a list of machines and may get an ssh key to access these machines. The oarsh command is an alternative to ssh in the sense that it hides key management details for accessing resources: one can easily connect between reserved nodes without the need to specify where the job key is stored, since this tool finds the key transparently whenever a user moves between OAR-reserved machines.
7.3 Online Certifier
We created an open source project called online-certifier, licensed under the GNU General Public License v3 [31] (figure 7.2 shows the project logo).
It acts as middleware for executing user functions on classical grid systems. The application is compatible with Unix-like systems3. The application is developed to work in three modes: Grid, Grid5000, and Local (for testing).
3. For Windows, one would need to modify the source, since we use Unix system calls for socket communications and some file descriptors.
7.3.1 Deployment over Grid5000
Grid5000 imposes some constraints on component connectivity. Both the data proxy and the public GCS run on resources within a reserved job, which are not directly accessible by the front end and the decoder (figure 7.3 shows the deployment of the system). The data proxy and public GCS could be on two different machines or share a single machine. We use ssh port forwarding to secure the communications and to redirect them between components. The workers in Grid5000 store their outputs (residues and primes) in a directory on the site's NFS. The data proxy reads the files in that directory and sends a result stream to the decoder.
Figure 7.1: Grid5000 ssh access
The sun symbol represents the power of distributed computing while preserving control over resources against attack. It is a symbol of stability and continuity.
Figure 7.2: Online certifier logo
7.3.2 Package structure
The package tarball has the following structure (compliant with a subset of the specification)4:
4. The package is still not finalized. We need to use GNU automake and autoconf to package it in a better way, and to simplify the installation and usage process to be POSIX compliant.
Figure 7.3: Online certifier deployment on Grid5000
• communicator/:
  – communicatorUtil.(hh/cpp): provides utility functions for the other communicators
  – frontendCommunicator.(hh/cpp)
  – decoderCommunicator.(hh/cpp)
  – dataproxyCommunicator.(hh/cpp)
  – Makefile
• decoder/:
  – mandelbaum.(hh/cpp): contains two classes, lifter and decoder5
  – main.cpp: controls the execution of the whole decoder (deals with decoderCommunicator, lifter, and decoder) to interact with the other components, namely the front end and data proxy
  – Makefile
  – decoder.conf: configuration file (port number, data proxy address, etc.)
• front-end/:
  – certifier.(hh/cpp)
  – main.(hh/cpp): deals with the certifier and frontendCommunicator to control system operations
  – Makefile
  – compile.sh: exports LinBox flags and compiles the frontend binary
• grid5000/:
  – kaapiTask.cpp: a Kaapi-compatible application representing the public GCS component
  – dataProxy.cpp: uses dataproxyCommunicator
  – Makefile
  – compile.sh: exports Kaapi and LinBox flags for compilation
  – dataProxy.conf: contains data proxy configuration such as the server port number
• user-functions/:
  – userModularFunctionBase.(hh/cpp)
  – userModularFunctionInteger.(hh/cpp)
  – determinant.(hh/cpp): a user function for modular matrix determinant computation
  – fibo.(hh/cpp): a user function for computing Fibonacci numbers
  – userIncludes.hh: contains the headers of all user functions; used by frontend and kaapiTask to recognize user functions
  – test.cpp: an application used to test user functions before using them in the system; accepts a function name and a prime value, and outputs a result and function information
  – Makefile
  – compile.sh: exports LinBox flags and compiles the source
• util/:
  – generatePrime.inl: 21-bit prime generation code
  – typeDefines.hh: contains the global decoded result types and return types used across the system
5. Developed by Stalinski.
• doc/: doxygen documentation6
• bin/: an empty directory that will contain the executables and config files after compilation
• data/: contains dense and sparse matrix generation code
After compiling the system, we get four executables: frontend, decoder, kaapiTask, and dataProxy. kaapiTask is a Kaapi application which can run independently using the karun command; this command mainly receives a machine file representing the machine pool on which we run our application. kaapiTask receives the following arguments:
• output directory: the directory where we store outputs. This should be a mounted NFS directory, otherwise each process will store results locally
• first: first prime index
• last: last prime index - 1
• grain size: number of jobs to be grouped on one machine
• user function name
• function arguments
The dataProxy server is in charge of reading from a given data directory (the same one used for the kaapiTask output) containing files with each computing machine's results, and then sending their contents to a connected client (the decoder). Each time the data proxy reads a file, it immediately deletes it after finishing reading, to release the space7. The data proxy accepts the following arguments:
• port: data proxy server listen port number
• dir: the directory from which we need to read results
The dataProxy accepts either arguments or a config file located in the same directory where we run the executable. The frontend uses frontendCommunicator, as specified in the specification, to connect to the decoder through a port. However, it uses an ssh connection to connect to the grid and execute the kaapiTask program using karun. The frontend usage format is "frontend function_name grain_size args...".
More configuration details are in the frontend.conf file, which contains component addresses and ports, the execution mode, the path of a file containing the addresses of the reserved machines on the grid, the output dir, a task name (used to create a subdirectory under the output dir to store kaapiTask outputs), the maximum expected error rate, and the path of an ssh key used to access the reserved Grid5000 job (used when the mode is GRID5000, see footnote 8). The configuration file must be located where we execute the frontend binary.

Footnote 6: this stage is not completely finished, but will be available soon.
Footnote 7: in the future, it could be useful to provide an option to archive results for later use.

The decoder uses decoderCommunicator to communicate with both the frontend and the dataProxy through ports. The usage format is "decoder port dataProxy_addr dataproxy_port". It either accepts arguments or reads from decoder.conf, which must be located where the binary is executed.

7.4 Installing & using online-certifier

Both the frontend and kaapiTask require the Linbox library (for the matrix determinant user function). kaapiTask requires Kaapi to be installed (with the Kaapi commands globally visible, by adding their binaries directory to the PATH variable). The frontend also requires the GMP library for the certifier; GMP is in any case required to install Linbox (footnote 9). Briefly, installing Linbox requires the following steps:
• Download the Linbox source and untar it.
• Install GMP (with the option --enable-cxx passed to the configure script), then Givaro (with --with-gmp=<GMP-path>).
• You also need a BLAS library. If none is available (libblas or libcblas), install ATLAS.
• Configure Linbox with the options "--with-blas=<blas-path> --with-gmp=<gmp-path> --with-givaro=<Givaro-path>", then install it.
If we want to compile the system locally (for testing), we first need to install Linbox with GMP and then perform the following steps, in order:
1. Go to the communicator directory and type make
2. Go to the user-functions directory and run ./compile.sh
3. Go to the front-end directory and run ./compile.sh
4. Go to the decoder directory and type make
5. Go to the grid5000 directory and run ./compile.sh
Now all the executables and default config files are available in the bin directory.

Footnote 8: the configuration file contains comments about each required argument.
Footnote 9: there is a Linbox 1.1.6 bug with Kaapi 2.4. After installing Linbox 1.1.6, you might need to modify linbox/solutions/det.h in the Linbox include directory by commenting out lines 328 and 335.

To run the system, we need the following steps, in order:
1. Run: dataProxy <server_port> <kaapiTask_output_directory> (or put the corresponding values in dataProxy.conf and then run without arguments)
2. Run: decoder <server_port> localhost <dataproxy_server_port> (or provide the corresponding values in decoder.conf)
3. Open frontend.conf and check that the decoder port is correct. Also, change the error rate to 0 (since we don't expect errors locally), and make sure the mode is LOCAL. Make sure the directory of the kaapiTask binary is correct, and that "grid_data_dir/task_name" matches the directory given to the dataProxy (kaapiTask_output_directory); modify grid_data_dir and task_name accordingly.
4. Run: frontend <user_function_name> <grain_size> (e.g., ./frontend determinant 8)

For Grid5000, we need a few more compilation steps. To install the frontend on a machine, install Linbox with GMP and compile communicator, user-functions, and front-end. For the decoder (whether on the same or a different machine), you need the GMP library; then simply compile communicator, then decoder.
For the dataProxy and kaapiTask on the grid, we first need to install Linbox with GMP, then reserve a machine (using the oarsub -I command on the site frontend). Then we compile communicator, user-functions, and grid5000. We also need to give our ssh public key to the grid access machine, so that we won't need to type a password when connecting to Grid5000 (for instance using ssh-copy-id myuser@mygridaccess).

In order to run online-certifier on Grid5000, we need the following preparations:
1. Reserve a number of machines (using oarsub -I -l hosts=xxx -e job_key_file, or the oargridsub command).
2. Store the machine addresses in a file (e.g., oargridstat -l grid_job_number > machines, or, on a reserved machine, cat $OAR_FILE_NODE > machines).
3. On our front end machine (not the site frontend), modify the configuration file: change the grid address (of the form user@access.site.grid5000.fr; you may instead define a shorthand name in your ~/.ssh/config file and use it in place of user@address). Set job_key to the path of the reserved Grid5000 job key. Set the mode to GRID5000, and change the public GCS machine address (one of the reserved machines). The last option is whether or not to use the amortize technique (if you set it to "no", the frontend lets kaapiTask run for all primes and provides amortize data to the decoder only, see footnote 11). Make sure the decoder address is the one you want (only an IP address is accepted for now). Finally, make sure grid_data_dir and task_name are what you want them to be on Grid5000. Give the whole configuration a final check, and make sure it is exactly what you want.
4. Now, on the decoder machine, we need to perform ssh local port forwarding three times to reach the dataProxy (a machine from the reserved grid job). For example, run: ssh -L 9999:localhost:9999 grid. Then, on the grid access machine, run: ssh -L 9999:localhost:9999 frontend.
Finally, run oarsh -i <grid_ssh_job_key> -L 9999:localhost:9999 <one_machine_from_machines_file>, assuming we want the dataProxy server to run on port 9999. Since we did local port forwarding, the dataProxy behaves as if it were running on the decoder machine.
5. In another terminal, from the front end machine, do ssh local port forwarding to the decoder in order to secure the channel (e.g., ssh -L 5555:localhost:5555 decoder).

Steps to launch the computations:
1. Run the dataProxy server on the grid machine (the configuration should be correct, as described for local execution).
2. In another terminal, on the decoder machine, run the decoder (make sure the configuration is correct).
3. Finally, run the frontend with the appropriate user function name, grain size, and arguments.

To run online-certifier on any other type of cluster, or on a grid with an NFS server accessible by all machines, do all the steps described for Grid5000 except the port forwarding and providing the job_key in frontend.conf; the frontend mode should now be GRID. The other steps remain similar. The user also has the option to simply run kaapiTask on a set of machines using the karun command with a machines file; this lets us run experiments without using the model.

Footnote 11: this option is not recommended, since it wastes a lot of space. It is sometimes useful when karun does not behave properly; it sometimes requires manual termination ("^C"), which is painful for an iterative model.

7.5 How to write a user function (Fibonacci function example)

In order to add a new user function, one should do four steps:
• Write the class header and body of the function (with a few constraints) and put them in the user-functions directory
• Put your class header in the user-functions/userIncludes.hh file
• Add two lines of code in user-functions/userModularFunctionBase.cpp (to resolve the string name to a user function object)
• Add an entry in the Makefile

Since we want to create a function with an integer return type, we need to extend userModularFunctionInteger. The fibo header is the following:

#include "userModularFunctionInteger.hh"
#include <vector>

class fibo : public userModularFunctionInteger {
public:
    void userModularFunction(fixedBitSizeInteger p);
    fibo(std::vector<std::string>& args);
private:
    int n;
};

As we can see, we simply extend userModularFunctionInteger and implement one abstract member, userModularFunction. We also need to implement a constructor taking an argument of type vector<string>, representing the list of user function inputs. The user then has to perform his modular function using the prime p given as an argument. The body is:

#include "stdlib.h"
#include "userModularFunctionInteger.hh"
#include "fibo.hh"

using namespace std;

void fibo::userModularFunction(fixedBitSizeInteger p) {
    if (n < 2) {
        storeResult((fixedBitSizeInteger)n, p); // return the result
    } else {
        unsigned long long fibo = 1;
        unsigned long long fibo_p = 1;
        unsigned long long tmp = 0;
        unsigned long long i = 0;
        for (i = 0; i < n - 2; i++) {
            tmp = (fibo + fibo_p) % p;
            fibo_p = fibo;
            fibo = tmp;
        }
        storeResult((fixedBitSizeInteger)fibo, p); // return the result
    }
}

fibo::fibo(vector<string>& args) : userModularFunctionInteger(args) {
    if (args.size() < 1)
        throw "no argument for fibonacci!";
    n = (fixedBitSizeInteger)atol(args[0].c_str());
}

The user reads the arguments in the constructor and stores them in whatever class attributes he wants, then uses those attributes in userModularFunction. The most important point about userModularFunction is that whenever the user wants to return a result, he must use the storeResult function with the result and the prime p that was used. The user must use fixedBitSizeInteger for results of integer type. Other types of functions may need to read a file. In that case, the user needs to mark an argument as a file, so that the frontend can copy the input file to the grid: in the constructor, the user must call setFile(index), where index is the index of the argument corresponding to a file. Finally, we need to edit userModularFunctionBase.cpp and, in the initializeByName function, add:

else if (name == "fibo")
    tmp = new fibo(args);

Now add an entry in the Makefile and compile. To test the Fibonacci function (here fib(100) modulo 7), you may run: ./test fibo 7 100.

7.6 Read/Write atomicity

Between the dataProxy and Kaapi we have a problem of result-write atomicity: kaapiTask on a machine might write incomplete content while, meanwhile, the dataProxy reads part of it and tries to delete the rest before the write completes. One solution is to use file locks. However, file locks would force the dataProxy, each time it reads a file, to wait until kaapiTask finishes, resulting in a huge slowdown if we have very slow workers or heavy computations on each node. We propose a simpler solution: kaapiTask writes a special character "$" at the beginning of the file, then writes the outputs, and when it finishes, it replaces the first character with "\n".
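Both sides of this marker protocol can be sketched as follows (a minimal sketch using plain C++ file streams; the function names are illustrative, not the actual kaapiTask/dataProxy code):

```cpp
#include <fstream>
#include <string>

// Writer side (kaapiTask): write a '$' marker first, then the results,
// and only once writing has finished replace the marker with '\n'.
void writeResults(const std::string& path, const std::string& results) {
    std::ofstream out(path, std::ios::binary);
    out << '$' << results;
    out.close();
    // "Unlock" by overwriting the first byte in place (no truncation).
    std::fstream fix(path, std::ios::in | std::ios::out | std::ios::binary);
    fix.put('\n');
}

// Reader side (dataProxy): treat an empty/missing file or a leading '$'
// as still locked, so it is closed without being deleted.
bool isLocked(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    int first = in.get();
    return first == EOF || first == '$';
}
```

The single-byte in-place overwrite is the key point: replacing one character is as close to an atomic "commit" as a plain file API offers, so the reader never observes a half-written result as finished.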
The dataProxy reads each file and, whenever the file is empty or has the "$" character on its first line, considers the file locked and simply closes it without deleting it.

7.7 Prime generation

As this scheme requires unique primes for decoding, we provide a simple prime generation scheme suitable for Grid5000. kaapiTask is responsible for this task, since the frontend gives only two values describing the required prime range, first and last, where last − 1 is the index of the last prime. kaapiTask recursively splits the prime range in two until it meets the condition n = last − first ≤ grain; each node then computes a prime table of size n containing the primes of indices i0 = first to i(n−1) = last − 1. We want the nodes to compute the primes themselves to prevent bottlenecks, as discussed in chapter 4. One solution was to store all the 21-bit primes in a file on the NFS and let the machines read primes from it; however, this would cause many read operations and significant load on the NFS. Another solution was to store these values locally, which is difficult to achieve on Grid5000, since the home directory is mounted on the NFS. We could store them in the /tmp directory, but this would require sending the prime files to all machines after reserving a job. Therefore, the best solution is to compute them at each node, which is not a costly operation: generating the whole 21-bit prime table takes less than a second on a machine of the genepi cluster at the Grenoble site of Grid5000.

7.8 Implementation of the amortize technique

Figure 7.4: Different amortize functions

For our implementation, we used the amortize function f(i) = f(i−1) + f(i−1)/log2(f(i−1)) instead of doubling the redundancy. The other amortize functions discussed in chapter 3 increase a bit too slowly; in our case, the overhead of launching new jobs with karun is high, and the resources are already booked with no payment associated (figures 7.4 & 7.5 show the growth of the different functions).
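The growth of this amortize function can be sketched as follows (a minimal sketch; the starting value 32 matches the implementation described here, while the function name and the rounding down to whole jobs are illustrative assumptions):

```cpp
#include <cmath>
#include <vector>

// Redundancy schedule f(i) = f(i-1) + f(i-1)/log2(f(i-1)),
// truncated to whole jobs; started at 32 for faster initial growth.
std::vector<long> amortizeSchedule(int steps, long start) {
    std::vector<long> f;
    f.push_back(start);
    for (int i = 1; i < steps; ++i) {
        long prev = f.back();
        f.push_back(prev + (long)(prev / std::log2((double)prev)));
    }
    return f;
}
```

Starting from 32, the first values are 32, 38, 45, 53, ...: the schedule grows faster than the ρ-amortize variants of chapter 3, yet stays well below doubling, so few extra jobs are wasted per round.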
We start the amortize function at 32 instead of 2, to allow even faster growth. The frontend applies this function to compute the extra redundancy (on the grid) whenever the decoder fails to decode. More precisely, the frontend asks kaapiTask for f(i) + errors, where errors = f(i) × t and t is the error rate given as input to the frontend; this suppresses the case where we might have omission faults, as discussed before.

Figure 7.5: ρ-amortize ρ(i) = 1.9 log i vs f(i) = f(i−1) + f(i−1)/log f(i−1)

7.9 An application of matrix determinant computations

In this part we discuss an effective application: the matrix determinant. For large matrices, computing the determinant can be very costly, since operations in Z are expensive. Computing the determinant many times over F_pi, for pairwise coprime p_i, reduces the cost of the operations, since the p_i are much smaller than the values handled in Z. However, for large matrices (e.g., 1000 x 1000), each modular computation still takes considerable time (of course much less than in Z); therefore, performing the computations in parallel is important. The number of modular computations depends on the size n of the matrix A and on the size of the elements of A. The Hadamard bound requires |det(A)| ≤ n^(n/2) (max|a_i,j|)^n ≤ ∏_i p_i, where a_i,j is an element of A. Let k be the number of primes p_i required to reach the Hadamard bound, that is, k ≥ ((n/2) log n + n log(max|a_i,j|)) / log p_i.

We tested a simple dense matrix of size 100 with elements of at most 30 bits, using 21-bit primes. The Hadamard bound requires 167 primes, whereas computing this determinant in an online manner, using a linear amortize technique (incrementing the redundancy by 1 each time) with a 0 error rate in the network, used on average 156 primes over different random matrices of the same size. Using the amortize technique, it takes around 168 primes, which is very little overhead in this case.
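The prime counts quoted above can be checked with a short computation (a sketch; the function name is illustrative, and it assumes roughly 20 effective bits per prime, since the 21-bit primes used here lie between 2^20 and 2^21):

```cpp
#include <cmath>

// Number of primes needed so that their product exceeds the Hadamard
// bound |det(A)| <= n^(n/2) * M^n for an n x n matrix with |a_ij| <= M:
// k >= ((n/2)*log2(n) + n*log2(M)) / log2(p).
long hadamardPrimeCount(int n, double elemBits, double primeBits) {
    double detBits = (n / 2.0) * std::log2((double)n) + n * elemBits;
    return (long)std::ceil(detBits / primeBits);
}
```

For the 100 x 100 test with 30-bit elements this gives (50·log2(100) + 3000)/20 ≈ 166.6, i.e. 167 primes, matching the figure above; for a 1000 x 1000 matrix with 100-bit elements it gives 5250, the bound quoted in the next section.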
Using the online certifier, we computed the determinant of a dense matrix of size 1000 with 100-bit elements, using 160 cores. The determinant was about 30849 digits long (with a 0 error rate for the grid). The system used a total of 7903 primes while the Hadamard bound requires 5250 primes (an overhead of 0.336), which is good since we have a dense matrix with all elements of 100 bits (almost the worst case). The total execution time was around 10 minutes (footnote 12).

7.10 Current limitations

The grid implementation is currently limited to a single cluster, where all machines are connected to one NFS. To allow multiple clusters, we would first need the front end to synchronize the user input files to all the NFS servers, and the data proxy to read from all the NFS servers involved. Another limitation is that the Linbox executables are quite large (e.g., kaapiTask is around 5 MB), which slows down deployment over a distributed environment. Another issue is that Linbox is not thread-safe: to compute the matrix determinant we use the function det from the Linbox library, which is a global function. For the time being, we run the karun command with the option -t 1 to have one thread per process; a wrapper could be developed in the future to synchronize access to det. The current implementation on Grid5000 requires 3 levels of ssh port forwarding, which takes some time to get used to. Also, the current implementation does not allocate resources dynamically at run time; the Grid API (footnote 13) could be used for this purpose in the future.

Footnote 12: the implementation needs some optimizations to get better timing performance. Due to lack of time, we could not provide a detailed performance analysis.
Footnote 13: the API is still under development for most of the Grid5000 sites.

Conclusion

Throughout this work we showed how we can efficiently use untrusted resources for modular computation, using few trusted resources (mainly a certifier) to verify the computations.
We provided a simple yet strong scheme to perform a sequential function in parallel, and showed in detail several types of risks and possible attacks, and whether the system is resilient to each type. This scheme could be extended to more functions with integer return types, and even to vector or matrix return types. We showed the example of computing a matrix determinant as an effective application that requires huge computing power for certain inputs. More applications of this scheme could be discovered in the future: we believe the proposed scheme could be used in several fields and applications, because it covers a wide space of user requirements. An enhanced implementation could be a library of user functions provided as a service to third parties, who may use it to perform heavy computations. This service could either be provided freely to researchers, or turned into a business, of course with a set of user-demanded functions.

Prospectives

This work could be improved in several ways and directions. First, we would like the decoder to use the certification computations in its CRT and treat them in a special way to enhance performance. This requires a change in the model, where we store the certifier's results in secure storage and let the decoder read from it (footnote 14). We would also like to improve the decoder to handle vectors and matrices, as user functions may return these types. Moreover, we think it is important to study the certification error probability and provide an accurate bound for it, because a user may really need a clear bound for his application of our functions. We may incorporate a weak form of blacklisting to reduce error rates in a public GCS, as studied by Germain-Renaud [4]. A prospective work could improve the implementation to handle multiple clients and several decoders, using a load balancer to reduce the load on a single machine.
The same might apply to the certifier. The front end could be improved to manage several GCS systems and, according to each system's cost and efficiency criteria and to the user's priority or quality of service, choose an appropriate GCS system for each user function call. This would add a lot more complexity to the system, but it would increase efficiency and service quality.

Footnote 14: more details are in Thomas's master thesis.

The amortize technique could be used differently in models where machine cost is associated only with usage time. One could study how to control the data rate from the data proxy to the decoder by adding more machines dynamically, and control the increase of this rate using the amortize control technique, so as to reach the unknown bound efficiently while using the full potential of the booked machines. This could be useful for clusters and classical grid systems.

Another direction to improve this work is to introduce dependencies for certain user functions. For example, a user may want to write a function for matrix-matrix product and fork each vector-vector multiplication onto a different node. We want the system to handle this requirement transparently for the end user, providing simple libraries to control the execution tree.

Throughout this work, we provided proofs and a theoretical study for the case of modular computations. However, the same scheme may apply to floating-point functions, with a different way of distributing tasks and decoding. This would make the scheme much more effective, since dozens of applications in statistics and other fields rely on floating-point functions. The vision of this work is an online library of different types of functions (exact and floating point), accessible remotely through APIs and interactive user interfaces.
A public cloud would hide all complexity from the user and provide on-the-fly computations with certified results to the public and to other business parties.

Bibliography

[1] A. L. Beberg, J. Lawson, D. McNett. distributed.net home page. http://www.distributed.net
[2] Folding@Home website. http://folding.stanford.edu
[3] SETI@home website. http://setiathome.ssl.berkeley.edu
[4] C. Germain-Renaud and D. Monnier-Ragaigne. Grid result checking. In Proceedings of the 2nd Conference on Computing Frontiers (CF '05), ACM, New York, NY, pp. 87-96. DOI: http://doi.acm.org/10.1145/1062261.1062280, May 2005.
[5] TCPA Main Specification 1.1b. http://www.trustedcomputing.org
[6] Jean-Louis Roch and Sébastien Varrette. Probabilistic certification of divide & conquer algorithms on global computing platforms: application to fault-tolerant exact matrix-vector product. pp. 88-92, London, Ontario, Canada, ACM. DOI: http://doi.acm.org/10.1145/1278177.1278191, 2007. ISBN 978-1-59593-741-4.
[7] Jean-Louis Roch, Samir Jafar and Sébastien Varrette. A Probabilistic Approach for Task and Result Certification of Large-Scale Distributed Applications in Hostile Environments. ISBN 978-3-540-26918-2, 2005.
[8] Faith Fich and Eric Ruppert. Hundreds of Impossibility Results for Distributed Computing. Distributed Computing, volume 16, pp. 121-163, 2003.
[9] Xavier Défago. Agreement-related problems: from semi-passive replication to totally ordered broadcast, 2000.
[10] Clément Pernet, Jean-Louis Roch, Thomas Roche. Fault-tolerant Polynomial Interpolation, 2009.
[11] D. Mandelbaum. On a class of arithmetic codes and a decoding algorithm, 1976.
[12] H. Higaki, K. Shima, T. Tachikawa, M. Takizawa. Checkpoint and Rollback in Asynchronous Distributed Systems. In INFOCOM '97: Sixteenth Annual Joint Conference of the IEEE Computer and Communications Societies, p. 998, 1997.
[13] BOINC website. http://boinc.berkeley.edu
[14] Mersenne website. http://www.mersenne.org
[15] James Cowling. HQ Replication, 2007.
[16] Luis F. G. Sarmenta. Sabotage-tolerance mechanisms for volunteer computing systems. Future Generation Computer Systems, 18(4):561-572, 2002.
[17] JXTA website. https://jxta.dev.java.net
[18] OGSA home page. http://www.globus.org/ogsa
[19] MPI home page. http://www.mcs.anl.gov/research/projects/mpi
[20] Kaapi website. http://kaapi.gforge.inria.fr
[21] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing, 2009.
[22] Serdar Boztaş, Hsiao-Feng Lu (eds.). Applied algebra, algebraic algorithms and error-correcting codes: 17th international symposium, AAECC-17, Bangalore, India, December 16-20, 2007.
[23] O. Beaumont, E. M. Daoudi, N. Maillard, P. Manneback, and J.-L. Roch. Tradeoff to minimize extra-computations and stopping criterion tests for parallel iterative schemes. In PMAA'04, 2004.
[24] Luciano Soares, Clément Ménier, Bruno Raffin, and Jean-Louis Roch. Work Stealing for Time-constrained Octree Exploration: Application to Real-time 3D Modeling.
[25] François Galilée, Gerson G. H. Cavalheiro, Jean-Louis Roch, Mathias Doreille. Athapascan-1: On-Line Building Data Flow Graph in a Parallel Language.
[26] Grid5000 website. https://www.grid5000.fr
[27] Linbox website. http://www.linalg.org
[28] GMP website. http://gmplib.org
[29] Givaro home page. http://www-lmc.imag.fr/CASYS/LOGICIELS/givaro
[30] OAR description page. https://www.grid5000.fr/mediawiki/index.php/OAR2 (retrieved 1 September 2009)
[31] Online-certifier project site. http://code.google.com/p/online-certifier
