					 Fault tolerant distributed global computations against
                    malicious attacks

                                 Majid Khonji

                             September 10, 2009


                                 Master Thesis

In partial fulfillment of the requirements for the degree of Master of Security,
          Cryptology and Coding of Information Systems, Ensimag.

           Supervised by Jean-Louis Roch & Clément Pernet

                                    Abstract

         Global computing platforms provide huge computing power that can
     be used for several applications. However, the platform has some fault
     ratio and may be targeted by attacks. Throughout this work, we study
     different ways of performing secure parallel computations over untrusted
      resources using algorithm-based fault tolerance (ABFT) for correcting
      erroneous results. We focus our study on the case of modular computa-
      tions in a Global Computing System (GCS). We may compute a function
      modulo several primes in parallel and then lift to the original result.
      However, the final result could be corrupted due to errors or byzantine
      behavior, therefore we use a CRT-based ABFT. In order to tolerate byzantine faults, we need
     to add some redundancy to the computation in an efficient way to assure
     result correctness without wasting too many resources. In this work, we
     study different possibilities of performing these computations on differ-
     ent types of resources and study various risks involved. We propose an
     online scheme for modular computation which computes certified results
     of user functions using minimum trusted resources. The model is sup-
     ported with detailed analysis and risk study for several types of failure
     and attacks. Finally, we provide a design specification and an implemen-
     tation of the proposed model with an example of computing the matrix
     determinant.




Preface

This paper is the result of our master thesis project carried out at LIG lab in
Montbonnot in partial fulfillment of the requirements for the degree of Master
of Security, Cryptology and Coding of Information Systems at Ensimag, France.
    Laboratoire d’Informatique de Grenoble (LIG) is a research center in parallel
and distributed computing. The master thesis is a joint project with Thomas
Stalinski, who studied and developed ABFT during his master thesis, and his
algorithms are used throughout this project.
    I would like to thank my supervisors Jean-Louis Roch and Clément Pernet
for their invaluable assistance and feedback throughout the project. A great
deal of thanks is extended to Salim Ouari for the fruitful discussions about this
work. Many thanks to Thomas Stalinski and all members of LIG for their
ultimate cooperation and support.




Contents

Contents                                                                                        3

List of Figures                                                                                 6

1 Background on distributed systems problems & ABFT                                             11
  1.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                    11
  1.2 Fault models . . . . . . . . . . . . . . . . . . . . . . . . . . . .                      12
  1.3 Distributed problems . . . . . . . . . . . . . . . . . . . . . . . .                      13
      1.3.1 Consensus . . . . . . . . . . . . . . . . . . . . . . . . . .                       13
      1.3.2 Byzantine agreement . . . . . . . . . . . . . . . . . . . .                         13
      1.3.3 Resource allocation problems . . . . . . . . . . . . . . .                          14
  1.4 Impossibility of consensus and byzantine agreement in some models                          15
  1.5 Algorithm-based fault-tolerant (ABFT) . . . . . . . . . . . . .                           15
      1.5.1 Fault-tolerant Polynomial Interpolation . . . . . . . . .                           15
      1.5.2 Error correction by the Chinese Remainder Theorem . . . .                           15

2 Related works & positioning our contribution                                                  18
  2.1 Related works . . . . . . . . . . . . . . . . . . . . . . . . . .                         18
      2.1.1 Checkpoint and restart . . . . . . . . . . . . . . . . . .                          18
      2.1.2 Replication based (voting) . . . . . . . . . . . . . . . .                          18
      2.1.3 Spot-checking & Blacklisting . . . . . . . . . . . . . . .                          19
      2.1.4 ABFT and GCS defects . . . . . . . . . . . . . . . . . .                            19
  2.2 Our contribution . . . . . . . . . . . . . . . . . . . . . . . .                          21
      2.2.1 Current limitations . . . . . . . . . . . . . . . . . . . .                         21
      2.2.2 Contribution rationale . . . . . . . . . . . . . . . . . .                          21
      2.2.3 General assumptions . . . . . . . . . . . . . . . . . . . .                         22

3 Theoretical study: efficient modular computations using min-
  imum reliable resources                                                                       23
  3.1 Types of resources . . . . . . . . . . . . . . . . . . . . . . . . .                      23
      3.1.1 Untrusted resource . . . . . . . . . . . . . . . . . . . . .                        23
      3.1.2 Semi-trusted resources . . . . . . . . . . . . . . . . . . .                        23
      3.1.3 Trusted resources . . . . . . . . . . . . . . . . . . . . . .                       24
  3.2 Why ABFT instead of Replication? . . . . . . . . . . . . . . . .                          24
  3.3 Modular function parallelism . . . . . . . . . . . . . . . . . . .                        25




   3.4   Impossibility of getting a lifted result from modular computa-
         tions using only untrusted resources . . . . . . . . . . . . . . .                              25
         3.4.1 Modular agreement problem . . . . . . . . . . . . . . . .                                 25
         3.4.2 Byzantine Reduction . . . . . . . . . . . . . . . . . . . .                               25
   3.5   Possibility of getting a lifted result from modular computations
         using one trusted resource . . . . . . . . . . . . . . . . . . . . .                            26
   3.6   Inefficiency of using untrusted resources for decoding and trusted
         for certifying . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          26

4 Proposed online post certification model for modular computation                              30
  4.1 Why online model? . . . . . . . . . . . . . . . . . . . . . . . .                         30
  4.2 Model Architecture . . . . . . . . . . . . . . . . . . . . . . .                          31
  4.3 Components details . . . . . . . . . . . . . . . . . . . . . . .                          32
      4.3.1 Front-end . . . . . . . . . . . . . . . . . . . . . . . . .                         32
      4.3.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . .                         34
      4.3.3 Data-proxy . . . . . . . . . . . . . . . . . . . . . . . .                          36
      4.3.4 Certifier . . . . . . . . . . . . . . . . . . . . . . . . .                         37
      4.3.5 Public GCS . . . . . . . . . . . . . . . . . . . . . . . .                          37
  4.4 Components Interactions . . . . . . . . . . . . . . . . . . . . .                         38
      4.4.1 Control flow . . . . . . . . . . . . . . . . . . . . . . .                          38
      4.4.2 Data buffering . . . . . . . . . . . . . . . . . . . . . .                          40

5 Model Analysis                                                                                         41
  5.1 Communication channels risk analysis & Man-in-the-middle at-
      tacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                            41
      5.1.1 Front-end – Decoder . . . . . . . . . . . . . . . . . . . .                                  41
      5.1.2 front-end – Certifier . . . . . . . . . . . . . . . . . . . .                                 42
      5.1.3 front-end – Public GCS . . . . . . . . . . . . . . . . . .                                   42
      5.1.4 front-end – Data-proxy . . . . . . . . . . . . . . . . . .                                   42
      5.1.5 Decoder – Data-proxy . . . . . . . . . . . . . . . . . . .                                   42
      5.1.6 Within GCS . . . . . . . . . . . . . . . . . . . . . . . .                                   42
  5.2 Impacts of resources failure . . . . . . . . . . . . . . . . . . . .                               43
      5.2.1 Untrusted resources failure . . . . . . . . . . . . . . . .                                  43
      5.2.2 Semi-trusted resources failure . . . . . . . . . . . . . . .                                 43
      5.2.3 Trusted resources failure . . . . . . . . . . . . . . . . . .                                43

6 Design specifications                                                                                   44
  6.1 Components structure . . . . . . . . . . . . . . . . . . . . . . .                                 45
  6.2 Class diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . .                               45
  6.3 Interface interactions . . . . . . . . . . . . . . . . . . . . . . . .                             47

7 Implementation & Analysis                                                                     50
  7.1 Kaapi Library . . . . . . . . . . . . . . . . . . . . . . . . . .                         50
  7.2 Grid5000 . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          51
  7.3 Online Certifier . . . . . . . . . . . . . . . . . . . . . . . .                          51
      7.3.1 Deployment over grid5000 . . . . . . . . . . . . . . . . .                          51
      7.3.2 Package structure . . . . . . . . . . . . . . . . . . . . .                         52
  7.4 Installing & using online-certifier . . . . . . . . . . . . . . .                         56
  7.5 How to write a user function (Fibonacci function example) . . . .                         58
  7.6 Read/Write atomicity . . . . . . . . . . . . . . . . . . . . . .                          60
  7.7 Prime generation . . . . . . . . . . . . . . . . . . . . . . . .                          60
  7.8 Implementation Amortize Technique . . . . . . . . . . . . . . . .                         61
  7.9 An application of matrix determinant computations . . . . . . . .                         62
  7.10 Current limitations . . . . . . . . . . . . . . . . . . . . . .                          63

Bibliography                                                                       66
List of Figures

0.1   Overview of cloud computing . . . . . . . . . . . . . . . . . . . . .                                       9

2.1   Error rate of majority voting for various values of m and f [16] . .                                       19

4.1   System Architecture on different types of resources . . . . . . . . .                                       32
4.2   Job growth by doubling . . . . . . . . . . . . . . . . . . . . . . . .                                     34
4.3   Classical components interaction . . . . . . . . . . . . . . . . . . .                                     39

6.1   Deployment diagram . . . . . . . . . . . . . . . . . . . . . . . .                        46
6.2   Communicators class diagram . . . . . . . . . . . . . . . . . . .                         47
6.3   user functions class diagram . . . . . . . . . . . . . . . . . . .                        48
6.4   certifier . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         48
6.5   public GCS . . . . . . . . . . . . . . . . . . . . . . . . . . .                          48
6.6   interface interactions through communicators and the data types
      exchanged . . . . . . . . . . . . . . . . . . . . . . . . . . . .                         49

7.1   Grid5000 ssh access . . . . . . . . . . . . . . . . . . . . . . .                         52
7.2   Online certifier logo . . . . . . . . . . . . . . . . . . . . . .                         52
7.3   Online certifier deployment on Grid5000 . . . . . . . . . . . . .                         53
7.4   Different amortize functions . . . . . . . . . . . . . . . . . .                          61
7.5   ρ-amortize vs f . . . . . . . . . . . . . . . . . . . . . . . . .                         62




List of Algorithms

 1.1   CRT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                          16
 3.1   Byzantine Agreement reduction . . . . . . . . . . . . . . . . .                          25
 3.2   Certification algorithm . . . . . . . . . . . . . . . . . . . .                          27
 4.1   Basic Front-end algorithm . . . . . . . . . . . . . . . . . . .                          33
 4.2   Decoder manager algorithm . . . . . . . . . . . . . . . . . . .                          35




Introduction

Beyond the "More than Moore" law, parallelism has become a standard: per-
sonal computers use several cores to increase computation power rather
than just speeding up the clock as in the past. On a larger scale, we
have clusters and grids, where multiple machines cooperatively produce an
even bigger computational power. Global Computing Systems (GCS) consist of
several geographically scattered computational nodes that exhibit a single compu-
tational entity. These nodes could be volunteers, who donate spare CPU
cycles to be added to the overall GCS power.
    Global Computing Systems (GCS) have become more and more effective since they
can achieve huge and relatively cheap computing power which can be utilized to
solve different problems related to several fields such as mathematical problems
and cryptographic challenges. One of the popular examples is distributed.net,
which solved the RSA RC5-56 challenge in 1997 using thousands of volunteers’
personal computers[1]. Another example is the Folding@home project which
aims to solve problems related to protein folding behavior in order to cure and
understand several human diseases. This network utilizes volunteer machines such as
PS3 nodes, PCs and GPUs running a simple client application which harvests
only spare processing power. The PS3 network had reached 1 PetaFlop with
about 799103 PS3 running the Folding@home client. The overall network has
achieved around 4.6 PetaFlops[2]. Such a power can provide a good business
demand and supply chain as well. For example, one can rent 100 machines
to operate for one hour on some heavy computations, rather than buying 1
machine to run for 100 hours. Another business model is to pay volunteers for
giving their CPU time as it is done in several startup companies[16].
    Cloud computing is a new trend of computing industry, where software will
be given as a service. End users will no longer need to worry about the internal
structure of the hardware, or the way the software is deployed; they would
rather simply use the service and pay according to usage only. The term
cloud includes both the software provided and the underlying hardware. Figure
0.1¹ shows an overview of cloud computing. A cloud is a Public Cloud when it
is made available in a pay-as-you-go manner to the general public. The term
Private Cloud refers to an internal data-center (hardware & software) or other
organizations’ services which are not available to the general public. From the
   1 The image is taken from http://en.wikipedia.org/wiki/Cloud_computing [30, august

2009]




hardware point of view, cloud providers have the ability to pay for using the
computing power on a short-term basis (i.e., CPU by the hour and storage by the
day). They can rent resources when they need them, and release them when they
are no longer useful[21].




                   Figure 0.1: Overview of cloud computing

    Global Computing systems have many forms and structures, from a master-
slave structure such as SETI@home[3], to fully decentralized p2p as mostly seen
in file sharing systems. Moreover, some grids consist of fully controlled com-
puting nodes, whereas others consist of uncontrolled nodes. The uncontrolled
environment raises an important issue of reliability. Such a powerful computing
platform could produce bad computations due to malicious attacks, which must be
handled by such a system. These attacks cannot be completely contained using
traditional means of cryptography and network security, whereas in traditional
controlled grid systems, these techniques can be sufficient. In order to enhance
reliability in such uncontrolled environment, there are basically two strategies,
a priori prevention and a posteriori verification.
    Prevention forces the use of proper software and files. At the user level,
one can use code encryption to track individual executions for instance by
computing checksums at various levels of the code[4]. However, code encryp-
tion cannot protect against the use of correct code with faulty inputs, which can
be crafted smartly by an attacker; therefore, it should be used judiciously for
each specific environment. At the system level, prevention could be achieved
by using some embedded digital rights management (DRM) technology inside
machines’ hardware, so that application vendors can lock users out upon any
inappropriate usage violation. An example of system prevention is the TCPA /
Palladium initiative[5]. However, a lot of people are against system prevention
as they consider it a threat to their freedom. The other type, a posteriori veri-
fication, which is the focus of this work, can be described as a set of
tests on the final computations that decides whether the global computations are
correct or not within an acceptable error rate, with a bounded error probability.


    ABFT (Algorithm-based fault-tolerant) consists in introducing redundancy
in computations in a tricky way, avoiding brute force replications: the objective
is to tolerate byzantine, possibly malicious, errors. This technique is especially
well suited to large scale parallel and distributed computations that run on
resources that cannot be blindly trusted, such as peer-to-peer or cloud com-
puting. In particular, ABFT is well suited to compute intensive applications
in arithmetic, cryptology or linear algebra, where ABFT relies on the algebraic
properties of some codes, like Reed-Solomon for instance.
Global computing systems should always tolerate an error margin in
the global results, which might be due to network failures or disconnected
nodes. However, the results could be massively attacked, and there must be an
a posteriori check to assure that an ABFT can correct and obtain
the final computation. On the other hand, these tests have some cost and
should be performed on trusted resources, which are known to be limited and
costly. Therefore, a broad question is how we can use most of the power
of untrusted resources to perform certified computations regardless of the type
of fault or attack, while keeping the use of trusted resources as limited as
possible. In this work, we want to construct a model which uses as few trusted
resources as possible and enables the end user to perform certified computations
transparently, without the need to worry about internal details, the underlying
network, or error correction.
    This thesis consists of 7 chapters. Chapter 1 briefly provides the required
background information for this thesis, such as distributed system models and
problems, and also examples of ABFT and how they work. In chapter 2, we
present related works in the field, and position our work among them.
Chapter 3 is a theoretical study of several possibilities of performing trusted
computation over different types of resources. In chapter 4, we propose an
online model for modular computations, and discuss the logic behind the model
components and several other choices. In chapter 5, we study the several risks
involved and the impacts of resource failures and attacks. Chapter 6 provides the
design specification of the model. Finally, chapter 7 provides implementation
details and analysis.
Chapter 1

Background on distributed systems
problems & ABFT

In distributed systems, the types of solvable and unsolvable problems are mainly
related to the assumptions we make about the distributed environment or the
distributed model. A small modification in these assumptions may radically
alter the class of solvable problems. Some problems are absolutely impossible
to solve, while others are unsolvable due to high lower bounds on either
space or time in certain environments. If we can solve a problem in a restricted
model, then we can solve the same problem in more relaxed models. Un-
derstanding models helps in understanding the solvability and comparability
of certain problems in different models or environments [8].

1.1    Models
A distributed system is composed of several processes, each of which executes a
sequential algorithm. These processes communicate with each other in
two different ways. In message passing models, processes send messages to each
other via a communication channel. This can be modeled by a graph, where
nodes are processes and edges are channels. A correct channel behaves as a
(FIFO) queue with the sender enqueueing data and receiver dequeuing them.
If the queue is empty, the receiver will get a special empty queue message[8].
    In shared memory models, processes communicate by performing opera-
tions on shared data structures called objects. Shared objects come in
several types. Each type specifies the set of states of the object, the allowed op-
erations on the object, and possible outputs of the object. At any time, an
object has a single state, and when a process performs an operation on it, it
might change the state, and return an output to the process. For example,
a stack object stores a series of values in its state and supports push and pop
operations. A basic object type is the register, which stores a value in
its state and supports read/write operations from all processes. We could also
have a more restricted type such as the single-reader/single-writer register, which



allows only one process to read/write its value instantaneously. Consistency
conditions show how an object behaves when accessed by several concurrent
operations; for example, the linearizability condition is that operations appear to
happen at distinct points in time, and the order in which operations happen must be
consistent with real time (i.e., if operation A terminates before B starts, then A
must be ordered before B). Moreover, a linearizable object type could be either deterministic:
the outcome of each operation is uniquely determined by the object’s current
state, or non-deterministic: an object type may have more than one possible
outcome for an operation on the same state[8].
    In randomized algorithms, a process may have many choices for its next
step, but the choice is made according to some probability distribution. For
randomized algorithms, termination condition is required only with a high
probability, and one considers the worst case expected time. Non-determinism
in the shared object makes problems harder to solve, while allowing random-
ization in the algorithm can make a problem easier to solve[8].
    A process could have a unique identifier, or be anonymous: all processes
are similar and run identical code. This can be an important issue when
dealing with comparison-based algorithms, which may depend on identifiers for
comparing their values.
    Timing assumptions are a critical part of models. When the system is syn-
chronous, all processes take steps at exactly the same speed. In asynchronous
systems, processes take steps at different speeds. In synchronous message passing
models, messages sent in one round are available to be received in the next round.
At each step, a process enqueues at most one message to send and receives
(dequeues) at most one message at a time. However, in asynchronous systems, in
one step, a process can either send at most one message, receive at most one
message, or access a single shared object. In partially synchronous models,
processes run at different speeds, but there are bounds on the relative speeds of
processes and on message-delivery times for message passing systems[8]. In
synchronous systems, time is measured by the number of rounds. However, in
asynchronous and partially synchronous systems, there are different ways to
measure time. Step complexity is the maximum number of steps taken by a single
process. Work is the total number of steps taken by all processes. Asynchronous
computations can be divided into asynchronous rounds, where a round ends when
every process has taken at least one step since the beginning of that round[8].
    In message passing systems, the total number of messages sent is an important
measure of the algorithm's complexity, called the message complexity.
Bit complexity counts the total number of bits in these messages[8].

1.2    Fault models
Crash failure is when a process halts permanently. However, communication
channels may also fail. A way to model such a failure is to consider that a process
either fails to send a message or fails to receive a message at some endpoint,
which is called an omission failure. An arbitrary process fault, or Byzantine fault, is

when a process fails and then perhaps recovers, its state becomes corrupted, or
it behaves arbitrarily. Such a fault is used to model malicious attacks because
we do not know anything about their behavior[8]. We call a process a correct
process when it never fails during the execution.
    In an f-faulty system, we have at most f faulty processes. An algorithm
that works for an f-faulty system is called f-resilient (i.e., it can tolerate up to
f faults). Moreover, an f-resilient algorithm is also f'-resilient, for all f' ≤
f. A wait-free algorithm ensures that all non-faulty processes will correctly
complete their task, taking only a finite number of steps, even if any number of
other processes crash. For randomized algorithms, wait-freedom means that the
expected number of steps needed by a process to complete its task is finite[8].

1.3     Distributed problems
Here we define a set of problems that are of main concern in distributed
systems, and discuss the impossibility of solving these problems in certain models.

1.3.1    Consensus
This problem is used as a primitive building block of several distributed problems. Con-
sensus is an example of a decision task with three conditions:

   • Each process gets a private input value from a set.
   • Produces an output (task specification describes which output is valid for
     a given input).
   • and then terminates.
For consensus, there are two correctness properties that must be satisfied:
   • Agreement: The output values for all processes must be identical.
   • Validity: The output value of each process is the input value of some
     process (i.e., the output must have been the input of some process).
In models where arbitrary faults are allowed, these properties are weakened
and applied only to correct processes.

1.3.2    Byzantine agreement
It is also called the terminating reliable broadcast problem, which is a version of
consensus. The difference is that:
   • There is one process, the sender, which has an input that it must send to all
     other processes.
   • The sender receives outputs from other processes, if the sender is correct.

   • The agreement property is the same as for consensus (all correct processes'
     outputs are identical).
In Byzantine agreement, each process has a priori knowledge that the sender
s is going to send a message. The goal is to transfer data (the input) from
the sender to the set of receiving processes. A process may perform several
I/O operations during the execution, but eventually must deliver a message to
the sender or it may deliver a special message “sender failure”[9]. To be more
precise, this problem must satisfy four properties[9]:
   • Termination: every correct process outputs a value.
   • Validity: if the sender s is correct and broadcasts a message m, then
     every correct process delivers back m.
   • Integrity: a correct process delivers a message m at most once, and only
     if m was previously broadcasted by s.

   • Agreement: if a correct process delivers a message m, then all correct
     processes deliver m.
Another restricted version of Byzantine agreement is simultaneous consensus
or coordinated attacks where all processes must output in the same round[8].

1.3.3    Resource allocation problems
Mutual exclusion is the problem of sharing resources (e.g., a printer), where
there are several processes and a process wants exclusive access to a
resource called the critical section. There are three main properties for any correct
algorithm to assure:
   • At most one process accesses the critical section at any time.
   • Liveness property (deadlock freedom): If some process wants to access
     the critical section and no other processes are in, then eventually, some
     process will get the access.
   • Fairness condition (lockout freedom): If a process wants to access the
     critical section, then eventually, it will be given the permission.
The dining philosophers problem is another resource allocation problem, where processes
are organized in a ring, and each two adjacent processes have a shared resource
and need to have exclusive access to it. Renaming is a problem where all
processes have initial unique identifiers from a large set, and they all want to
rename to another unique identifier but from a smaller set[8].

1.4     Impossibility of consensus and byzantine agreement
        in some models
Consensus is unsolvable in message passing models if messages can be lost:
if all messages are lost, the validity condition will no longer hold.
Even if only a few messages are lost, the lost messages can be scheduled by an
adversary in a way that the consensus conditions no longer hold for any correct
algorithm[8].
    Byzantine agreement is unsolvable if even one crash failure can occur in the
model. The agreement condition fails if the sender crashes after sending the
message to some processes but before sending it to others: the first set of
correct processes will deliver the message back, but the other correct processes
will deliver the “sender failure” message[8, 9].

1.5     Algorithm-based fault-tolerant (ABFT)
Here we shall briefly show an example of ABFT to give an overall idea
of how this technique works[10]. We also show a simple way to perform
modular computations for an integer function in parallel and then use the Chinese
Remainder Theorem (CRT) to lift them up to the original result.

1.5.1    Fault-tolerant Polynomial Interpolation
Let F be a field. Let (xi, yi)_{i=0,...,n−1} be n points in F², with all xi distinct. We
need to compute a polynomial P ∈ F[X] of degree at most k − 1 such that
P(xi) = yi for at least n − t indices i.
   • Input: three integers n, k, t and n points (xi, yi)_{i=0,...,n−1} in F², with all xi
     distinct.
   • Output: a polynomial P of degree at most k − 1, such that #{i :
     P(xi) = yi} ≥ n − t.
It is proved in [10] (with constructive algorithms) that there is a unique solution
P = Σ_{i=0}^{k−1} a_i x^i iff n ≥ k + 2t. k evaluations at distinct points are necessary
and sufficient to characterize a polynomial of degree k − 1. This means that
if n distinct evaluations of a polynomial of degree k − 1 are provided, among
which at most t = (n − k)/2 are erroneous, it is possible to recover the polynomial
P, correcting the (at most) t erroneous evaluations [10]. This technique can
be used to send data (i.e., polynomial evaluations) over the network, and
eventually, a decoder computes the original polynomial in the presence of at most
t byzantine faults.

1.5.2    Error correction by the Chinese Remainder Theorem
Theorem 1. Chinese Remainder Theorem:

    Let p1, p2, ..., pn be pairwise coprime natural numbers ≥ 2, and x1, x2, ..., xn ∈
Z. Then there are integer solutions x of the set of simultaneous con-
gruences x ≡ x1 mod p1, x ≡ x2 mod p2, ..., x ≡ xn mod pn. If x, x' are two
solutions, then x ≡ x' mod π, where π = p1 × p2 × ... × pn. Conversely, if x is a
solution and x' ≡ x mod π, then x' is a solution.
    The CRT algorithm that enables one to compute a solution x from x1, x2, ..., xn
is shown in Algorithm 1.1.

Algorithm 1.1 CRT
  π = p1 p2 ... pn
  πi = π / pi
  yi = πi^(−1) mod pi
  x = Σ_{i=0}^{n−1} xi πi yi
  return x
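
As an illustration, the following is a small Python transcription of Algorithm 1.1 (a sketch only: Python's arbitrary-precision integers and pow(·, −1, ·) stand in for the big-integer arithmetic of the actual implementation; the result is reduced modulo π to return the canonical solution).

from math import prod

def crt(residues, moduli):
    """Algorithm 1.1: combine x_i = x mod p_i into x modulo pi = p_1 p_2 ... p_n."""
    M = prod(moduli)                       # pi
    x = 0
    for x_i, p_i in zip(residues, moduli):
        M_i = M // p_i                     # pi_i = pi / p_i
        y_i = pow(M_i, -1, p_i)            # y_i = pi_i^{-1} mod p_i
        x += x_i * M_i * y_i
    return x % M

# Example: x = 103 recovered from its residues modulo 5, 7 and 11.
print(crt([103 % 5, 103 % 7, 103 % 11], [5, 7, 11]))   # prints 103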




    The correction part is illustrated in Mandelbaum's paper[11] and developed
by Thomas Stalinski in his master thesis. We provide here only a brief descrip-
tion (see the master thesis of Thomas Stalinski) of the error correction. We
add some redundancy to the residue computations, since we expect some of these
residues to be faulty. Assume we need a minimum of k coprime numbers such
that Mk = p1 × p2 × ... × pk ≥ x. Similarly to polynomial interpolation, if we
use n ≥ k + 2t residues, then it is possible to correct up to t = r/2 errors.
We need to add r = n − k redundant congruences to the system.
    Let F = {i = 1, 2, ..., n : x mod pi ≠ xi } be the set of faulty indices. Let
C = Π_{i∈F} pi be the product of the pi corresponding to errors. Then

    x = x' − e
      = x' − Σ_{i∈F} xi πi yi
      = x' − (Mn / C) · B mod π

for some 0 ≤ B ≤ C, where x is the correct result, x' is lifted from all
congruences including the faulty ones, and e is the lifted solution from only the faulty
coprimes.
Theorem 2. Mandelbaum theorem: if e ≤ (n − k) · (log pmin − log 2) / (log pmax + log pmin), then

    | x'/Mn − B/C | ≤ 1/C²

    where Mn = p1 × p2 × ... × pn.
    Now, we can use continued fractions to approximate x'/Mn. Let pi/qi be the i-th
convergent of the continued fraction of x'/Mn. If Mn · pi/qi ∈ Z and x = x' − e ≤ Mk,
where Mk = p1 ... pk, then we have corrected x.
    If i ∉ F then x' mod pi = x mod pi, so e mod pi = 0 (e is a multiple of pi).
    If i ∈ F then x' mod pi ≠ x mod pi, so e mod pi = x' mod pi − x mod pi =
xi − x mod pi (if pi is prime then e is prime to pi).
    As a conclusion, the product of the correct primes (those not in F) is Mn/C = gcd(Mn, x' − x).
    Using this code, we can simply add more redundancy by adding more
residues to the system, therefore, we can increase the correction capability
dynamically.
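
For illustration only, the sketch below lifts redundant residues with CRT and attempts the continued-fraction correction described above. It is a simplified reading of the Mandelbaum procedure, not the algorithm developed by Thomas Stalinski, and the example values are hypothetical.

from math import prod

def crt(residues, moduli):
    """CRT lifting as in Algorithm 1.1."""
    M = prod(moduli)
    return sum(x * (M // p) * pow(M // p, -1, p)
               for x, p in zip(residues, moduli)) % M

def convergents(num, den):
    """Successive convergents b/c of the continued fraction of num/den."""
    h0, h1, k0, k1 = 0, 1, 1, 0
    while den:
        a, (num, den) = num // den, (den, num % den)
        h0, h1 = h1, a * h1 + h0
        k0, k1 = k1, a * k1 + k0
        yield h1, k1

def correct(residues, moduli, Mk):
    """Try to recover the true result x <= Mk from possibly faulty residues."""
    Mn = prod(moduli)
    x_lifted = crt(residues, moduli)
    if x_lifted <= Mk:
        return x_lifted                        # consistent: nothing to correct
    for b, c in convergents(x_lifted, Mn):
        if c and Mn % c == 0:                  # candidate product C of faulty moduli
            x = x_lifted - (Mn // c) * b       # subtract e = (Mn / C) * B
            if 0 <= x <= Mk:
                return x                       # corrected result
    return None                                # too many errors: "no result"

# Example: x = 30, k = 2 primes would suffice, three redundant residues, one corrupted.
primes = [5, 7, 11, 13, 17]
residues = [30 % p for p in primes]
residues[2] = 3                                # a byzantine worker
print(correct(residues, primes, Mk=5 * 7))     # prints 30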
Chapter 2

Related works & positioning our
contribution

2.1     Related works
There are several works studying different techniques to assure the
correctness of global computations, or at least to reduce the possibility of having
incorrect results in the presence of faults.

2.1.1    Checkpoint and restart
Several works study methods to efficiently break the execution of a dis-
tributed program into several consistent checkpoints. If we detect an error at
any stage, we can simply roll back to the previous checkpoint, and redo the
computations until we reach the next checkpoint correctly. For instance, H. Hi-
gaki, K. Shima, T. Tachikawa and M. Takizawa[12] propose an algorithm for taking
checkpoints efficiently.

2.1.2    Replication based (voting)
This technique is used by several global computing systems such as SETI@home,
Mersenne and BOINC[3, 14, 13]. The idea is to replicate one computation to sev-
eral nodes, let them perform the same computation, and have the system vote
between the replicas, accepting only the majority result. This technique
reduces the error rate exponentially, but requires all work to be done at least
twice[16]. It works better when the error rate is small. For example, if the error
rate is high, f = 20%, doing all the work 6 times still leaves the error rate larger
than 1%, whereas when the error rate is f = 0.001%, repeating the work 6 times
results in an error rate of about 10^−21, as shown in Figure 2.1[16]. There are
several ways to optimize the computations, some of them presented by James
Cowling[15].
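
As a rough illustration of this trade-off, the following snippet estimates the residual error of majority voting with a simple binomial model (an assumption made for illustration; the exact formula behind Figure 2.1 in [16] may differ).

from math import comb

def majority_error(f, m):
    """Probability that at least a strict majority of m independent replicas
    is faulty, given a per-replica error rate f."""
    return sum(comb(m, i) * f**i * (1 - f)**(m - i)
               for i in range(m // 2 + 1, m + 1))

print(majority_error(0.20, 6))      # about 1.7e-2: still above 1%
print(majority_error(0.00001, 6))   # extremely small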








 Figure 2.1: Error rate of majority voting for various values of m and f [16]


2.1.3    Spot-checking & Blacklisting
In spot-checking, the master node randomly gives a worker a spotter work object
whose result is known by the master node. If the output of the worker doesn’t
match the known result, the master invalidates all the results given by that
node. The master may blacklist this node so it can’t participate in any work in
the future, or perhaps for the current batch job only. Spot-checking reduces the
error rate linearly, while only costing an extra fraction of the original time[16].
    Sarmenta[16] presents a new idea of Credibility-based Fault-Tolerance, which
uses both voting and spot-checking techniques together to achieve a better
error rate (i.e., reduce it to the maximum accepted error rate), with less redun-
dancy than voting alone. The scheme automatically trades off performance for
correctness. It is similar to voting except that the number of replicas, m, is
dynamically determined at runtime based on credibility values given
to different entities of the system: workers, groups (a table of results for a specific
job), and the job. These credibility values are given mainly based on spotters
given to workers, and on their history. The system accepts the final voted results
only if the work group credibility reaches a threshold defined by the maximum
accepted error for the result and the actual error rate in the network.

2.1.4    ABFT and GCS defects
Germain-Renaud and Monnier-Ragaigne[4] model the distribution of de-
fects in global computing system jobs as a Bernoulli distribution (0 = correct,
1 = defect), with an unknown probability p of a job being defective. They define a test τ which

decides if a batch (a set of jobs) has an error rate p ≤ p0 or p ≥ p1 with p1 ≥ p0 .
Also they define two confidence parameters α and β:
    if p ≤ p0 , τ accepts with probability ≥ 1 − α
    if p ≥ p1 , τ rejects with probability ≥ 1 − β
    They perform a sequential test based on Wald’s sequential test that pro-
vides adaptivity (i.e., the test sample size is not fixed, but a random variable
known during the execution), which is shown to provide much less cost than
the classical tests where the sample size is fixed. The model can be tuned a bit
so that the end user may provide two parameters: pa, the maximum accepted
error rate, and ε, the acceptable risk (i.e., the probability of test failure). If the
test succeeds, the user may use an ABFT that can correct up to pa. They
also show a weak form of blacklisting of bad workers, if they can be identified
(using a class of algorithms that can reject bad workers), where they eliminate
only workers who produce a high error rate rather than a single error.
Finally, they introduce a resource allocation function based on the credibility
of resource on a GCS and the user quality requirement (i.e., error acceptance).
    Jean-Louis Roch and Sebastien Varrette[6] present a probabilistic certifica-
tion approach for massive attack detection on a set of globally computed results
using the Extended Monte Carlo Test (EMCT). This is needed because jobs are
performed by a global computing platform which is resilient to a small number
of errors, but not resilient to massive attacks. The technique works as follows:
all job executions are stored in a secure checkpoint server as a data flow graph.
Using the EMCT algorithm, a set of verifiers randomly select some tasks to be
re-executed securely on a reliable resource. Verifiers perform N_{ε,q} calls to the
EMCT algorithm, where

                                N_{ε,q} = log ε / log(1 − q)

   If one of the EMCT tests fails, then it indicates that there is a massive attack
≥ q, with a probability of certification failure equal to ε [7, 6]. The work is
extended with certification cost-analysis using work-stealing technique, and
applied to an application of exact matrix vector product. The computation
scheme works as follows:
   • Pick an attack ratio 0 ≤ q ≤ 1/2 and accordingly construct a code with
     the same correction rate (i.e., an [n, k] code where n = k/(1 − 2q)).
   • Perform the computations on unreliable resources.
   • Perform the EMCT test N_{ε,q} times to detect a massive attack of ratio q.
     If the test detects a massive attack, redo the actual computations again.
   • If the test detects no massive attack, decode with the constructed code.
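
A direct reading of the N_{ε,q} formula above, assuming ε is the acceptable probability of certification failure and q the attack ratio:

from math import ceil, log

def emct_calls(eps, q):
    """Number of EMCT re-executions so that a massive attack of ratio >= q
    escapes detection with probability at most eps."""
    return ceil(log(eps) / log(1 - q))

print(emct_calls(0.001, 0.10))   # 66 verifications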

2.2       Our contribution
2.2.1      Current limitations
Checkpointing techniques require the system to save its state, and to restart
if an error occurs before the next checkpoint. These procedures might cost a
lot of time (e.g., one hour to save the state of a large cluster), and synchroniza-
tion between several asynchronous processes is very costly to achieve. This
technique is convenient in controlled systems, where we have few errors due
to hardware or software failures; however, in GCS and volunteer computing,
error rates are much higher, and errors could even be malicious.
    Blacklisting techniques are usually difficult to apply since anonymity is
easily achievable nowadays. In some cases, blacklisting could be a waste of re-
sources, if a weak blacklisting algorithm blacklists a correct worker, or even
a byzantine worker that rarely generates errors. However, it can be used as a
complement to other techniques to enhance performance, as is done
by Sarmenta[16, 4]. Therefore, it should be used wisely and in appropriate
situations.
    Replication is very common in big GCS due to its simplicity, but it is usually
very costly since it requires the work to be done at least twice, and it reduces the
efficiency of a GCS. ABFT usually gives a better alternative, and a less costly
way to tolerate errors¹.

2.2.2      Contribution rationale
There are some protocols allowing the end user to develop distributed applications,
such as OGSA or JXTA used for p2p systems[18, 17], and APIs such as MPI[19]
or athapascan[20]. These protocols and APIs reduce the underlying complex-
ity of developing distributed applications for the end user, and some of them
let the middleware handle most of the complexity. However, it remains a
cumbersome duty for the end user to obtain a certified computation, especially
in the presence of massive attacks. Even though a GCS is internally resilient to some
ratio of faults, massive attacks are still possible. Several works discuss tech-
niques for reducing or checking errors in the computations, but the choice
of the algorithm and the error correction remain the duty of the end user. The
end user might not be interested in correcting and certifying computations; he
simply needs a big computational power, which is not secure in most cases, to
execute a user function in parallel and return a secure result.
    In this work we present a high-level model that takes all the unwanted
complexities off the user's shoulders, such as error correction, result certifi-
cation, parallelizing computations, and dealing with the type of the underlying
network; and most importantly, it reduces the use of trusted resources as much as
possible. We would like to provide a service in which the end user can simply
   1 We  will show in the next chapter briefly why ABFT is better than replication in usual
scenarios

invoke a function from a set of functions and provide its input and eventually
get a certified output. We would like to have a model:
   • Independent of the underlying network, that is, it can be applied to any
     type of GCS (e.g., p2p, classical grid, etc.).
   • With no assumptions about faults or attack (i.e., the model shall adapt to
     massive attacks dynamically) with a flexible error correction capability.
   • Online: performs job computations, decoding, and result certification
     simultaneously (i.e., as soon as inputs are available, components produce
     outputs on the fly).
   • Keeps the use of trusted resources to a minimum.
We restrict our theoretical study and model specification and implementation
to independent integer functions (i.e., the function is itself a work unit with no
dependencies).

2.2.3    General assumptions
Throughout this work, we will assume asynchronous message
passing models with no assumptions about the attack (i.e., all types of faults
are allowed) unless we state otherwise. We assume that the error ratio in the
GCS is 0 ≤ e < 1/2, otherwise we can never correct. Whenever we say
massive attack, we mean a high error ratio q such that 0 ≤ q ≤ 1 and q > e,
among a set of computations. We assume that an attack is temporary (i.e.,
asymptotically we will observe the ratio e, if we repeat the computations many times).
    We will use modular computations with prime moduli to achieve parallelism
for user functions' computations, which will eventually be lifted using the CRT
algorithm. Recall that for CRT we only need coprime integers to build the solution.
Since distinct primes are also coprime to each other, we will use primes for
simplicity. The work on CRT lifting and error correction was done by Thomas
Stalinski and we will not discuss it thoroughly here.
Chapter 3

Theoretical study: efficient
modular computations using
minimum reliable resources

3.1     Types of resources
We shall categorize resources into three types¹:

3.1.1      Untrusted resource
This type represents resources on the public GCS, in which the
behavior of the machines cannot be predicted (i.e., they could be massively attacked).
We shall assume the following:
   • All types of faults are allowed, namely, crash faults, omission faults, and
     arbitrary faults.
   • These resources are very cheap, and have a big computational power.
   • Machines could have unique IDs, or at least a weak form of identification
     (e.g., an email or an account).

3.1.2      Semi-trusted resources
These are controlled resources, controlled either by us or by a
trusted service provider. This type is segregated from trusted resources and
cannot access them. We assume the following about this type:
   • These resources have limited computational power, and are more expensive
     than untrusted resources.
   1 The rationale behind this categorization will be shown later in this work.





   • This type is accessible by untrusted resources directly (or indirectly, by
     other semi-trusted resources)
   • All types of faults are very unlikely to occur (almost impossible), since
     these resources could be controlled by us, or by other trusted providers.
     Throughout this work, we consider them reliable resources.

3.1.3    Trusted resources
They are resources fully controlled by us (or by a very trusted partner). We assume
the following:
   • No faults can occur in these resources.
   • We also assume that these resources are limited and very expensive (more
     than Semi-Trusted resources).
   • These resources are not accessible by untrusted or semi-trusted resources.
     They are only accessible by other trusted resources.

3.2     Why ABFT instead of Replication?
Assume we need to send n data items D1, D2, ..., Dn to retrieve a result R of b
bytes, R = r1, r2, ..., rb (each ri is one byte long). Let t be the total number of
faulty data items Di.

Using Replication
We need to replicate R n times such that Di = R for all i. In order to tolerate t faults
among the Di, we need the majority of the Di to be correct.
  Therefore we need n = 2t + 1, which means a total of (2t + 1) · b bytes.

Using ABFT
Let P = Σ_{i=0}^{b−1} b_i x^i be a polynomial in F_{2^8}[x] of degree at most b − 1.
    Using polynomial interpolation[10] (discussed in chapter 1), in order to tol-
erate t faults, we need only n = b + 2t evaluations at distinct points. Therefore,
each Di is a distinct evaluation of P. This works iff b + 2t ≤ 2^8 (since P ∈ F_{2^8}[x]
and we need n distinct evaluation points in F_{2^8}).
    Assuming that n distinct evaluations are sufficient to tolerate t faults, the total
number of bytes needed is only b + 2t.
    As we can see, ABFT requires b + 2t bytes whereas replication requires (2t + 1) · b
bytes, assuming t ≤ b (see [10] for details).
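
A toy sketch of this byte-count comparison, worked over the prime field F_257 instead of F_{2^8} purely to keep the arithmetic simple (an assumption of this sketch; the thesis argument is over F_{2^8}):

def replicate(R, t):
    """Replication: (2t + 1) full copies of the b-byte result R."""
    return [list(R) for _ in range(2 * t + 1)]

def abft_encode(R, t, p=257):
    """ABFT: b + 2t evaluations of the degree-(b-1) polynomial whose
    coefficients are the bytes of R (requires b + 2t <= p distinct points)."""
    b = len(R)
    assert b + 2 * t <= p
    return [sum(c * pow(x, i, p) for i, c in enumerate(R)) % p
            for x in range(b + 2 * t)]

R = list(b"fault tolerant")                 # b = 14 bytes
t = 3
print(len(sum(replicate(R, t), [])))        # (2t+1)*b = 98 values
print(len(abft_encode(R, t)))               # b + 2t  = 20 evaluations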

3.3     Modular function parallelism
Let f be any user function which takes an input from a set and outputs Y ∈ Z,
which is unknown. We compute f mod p1, f mod p2, ..., f mod pn, such that
Π_{i=0}^{n−1} pi ≥ Y, and obtain a set of residues y1, y2, ..., yn respectively. Then we
reconstruct the solution Y by lifting from the n primes using the CRT algorithm 1.1.
    The main advantage of using modular computations is that the cost of
operations in F_{pi} is much less than in Z, given that Y ≫ pi. We will use this
advantage to perform the modular computations in parallel on different machines,
each one performing its computation with a reasonably sized pi in order to achieve a
fine-grained job size.
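
A minimal sequential sketch of this scheme (in the real system each residue computation is an independent job dispatched to the GCS; the function f, its inputs and the prime list below are hypothetical placeholders):

from math import prod

def crt(residues, moduli):
    """CRT lifting as in Algorithm 1.1."""
    M = prod(moduli)
    return sum(x * (M // p) * pow(M // p, -1, p)
               for x, p in zip(residues, moduli)) % M

def f(a, b):                      # placeholder user function, Y = f(a, b) in Z
    return a ** 5 + b ** 3

primes = [101, 103, 107, 109]     # prod(primes) must exceed the unknown Y
inputs = (12, 7)
residues = [f(*inputs) % p for p in primes]   # each entry is one parallel job;
                                              # a real worker evaluates f over Z_p
print(crt(residues, primes), f(*inputs))      # both print Y = 249175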

3.4     Impossibility of getting a lifted result from modular
        computations using only untrusted resources
In this section, we want to show the impossibility of obtaining an agreed lifted
result, based on the byzantine agreement problem. The proof is based on reducing
the byzantine agreement problem to the modular agreement problem.

3.4.1    Modular agreement problem
Modular computation decision task:
   • gets an input m ∈ {0, 1}.

   • computes x ≡ m mod pi where pi is a random prime number.
   • outputs x ∈ {0, 1} (x is valid iff x ≡ m mod pi )
   • terminates
There is one process, a sender, that has input m and delivers m to all other
processes. The other correct processes receive m and output x (as specified in the
decision task) to the sender, if the sender is correct. All properties are similar
to the byzantine agreement problem discussed in chapter 1.

3.4.2    Byzantine Reduction

Algorithm 3.1 Byzantine Agreement reduction
Require: a value m ∈ {0, 1}
Ensure: byzantine agreement value m
  return x = OracleModularAgreement(m)

   As shown in algorithm 3.1, if an oracle can solve the modular agreement prob-
lem (i.e., all correct nodes output m to the sender, if the sender is correct),

then Byzantine Agreement can be solved. By contraposition, since
Byzantine Agreement is known to be unsolvable (see chapter 1 for details),
modular agreement is unsolvable too. This indicates that it is impos-
sible to obtain an agreed lifted result using only untrusted resources, since even a
simple agreement among resources is impossible to achieve in our model.

3.5    Possibility of getting a lifted result from modular
       computations using one trusted resource
Now, for the modular agreement problem, suppose the sender S is always a known,
fixed, correct process (it never fails) that all correct processes can identify. Then,
obviously, all correct processes can deliver the same result m back to it, and
eventually we reach an agreement. Byzantine agreement is also solvable
in such a model, for the same reason.
    In other words, assume that the correct process sends the inputs of the
function f to all processes. The other processes can each return a result pair yi and
pi. From these pairs, we can construct Y by CRT, having sufficient primes
(Y ≤ Π_i pi). However, there are three bad possibilities: 1) some residues
yi are faulty, 2) some pi are not prime (or not coprime), or 3) some yi
are faulty and some pi are not prime. For the first case, the sender S may
perform the Mandelbaum algorithm for error correction, providing Ymax²:
the maximum possible result Y of f, assuming we don't have a massive attack³.
For the second case, we can build up a table of prime numbers (with the prime
range known by all correct processes in the untrusted network), and reject any
pair whose pi is not in the table. The third case is covered by the second, since
pi is not prime, and the result pair is rejected.

3.6    Inefficiency of using untrusted resources for
       decoding and trusted for certifying
Here we shall distinguish between two things: decoding, and checking the
validity of the decoded result. Whenever we decode a result on a trusted
resource from n result pairs yi and pi, with t faulty yi, knowing the maximum
bound Ymax, we have 3 possibilities: 1) we can correct the errors and return a
result, 2) we can only detect an error and return “no result”, 3) we return a wrong
result. The first case is obvious, due to the correction capability of the algorithm.
The second case is due to too many errors and insufficient redundancy. In
the third case, the t faulty yi are the majority of the computations, all of them
are simultaneous congruences (see 1.5.2), and Π_{i=1}^{n} pi ≤ Ymax; therefore the
CRT decoding algorithm will consider correct results as faulty and accordingly
   2 This will be used by the Mandelbaum algorithm to know when to stop finding convergents
in the continued fraction, as explained in chapter 1.
   3 We can handle the case of a massive attack, as will be shown later in this section.

This may happen due to a malicious massive attack, in which the attacker
crafts many homogeneous result pairs.
     This implies that the decoding process itself cannot correct all errors, even
if it runs as a correct process. Therefore we need extra tests to assure the cor-
rectness of the decoded result. There are generally two methods of testing the
correctness of results: either performing a probabilistic verification (e.g., using
MCT) [7, 6], or testing only the final decoded result (i.e., the corrected result),
which we call post certification (algorithm 3.2). The first method requires
repeating several computations, which usually costs more than performing one
or a few checks on the final decoded result, as done by post certification. Post
certification does not require the certifier to be able to access the results of the
modular computations; only the decoder needs to access them, which simplifies
the role of the certifier and reduces the overall message complexity in the
system. We shall consider the second type, post certification, for our
computational scheme.

The Post Certification

Algorithm 3.2 Certification algorithm
Require: the decoder result R
Ensure: SUCCESS/FAILURE
 1:    pick a new random prime pi
 2:    r ← f computed over Zpi
 3:    r′ ← R mod pi
 4:    if r ≠ r′ then
 5:      return FAILURE
 6:    else
 7:      return SUCCESS
 8:    end if

     As we can see from algorithm 3.2, the prime pi must never have been used in
the CRT decoding; otherwise the certification will always return "SUCCESS",
because the CRT has already used that prime to construct some result Yf
which could be faulty, given that the first residue computation modulo pi was
correct. Another issue is that the algorithm might return "SUCCESS" due to
an unlucky choice of pi (∃i : f mod pi = Yf mod pi). A rough estimate of
the certification error is 1/pi, which is not always accurate since the results of
a modular function are not always uniformly distributed modulo pi.5
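A minimal sketch of this post certification test, assuming the user function can be
(re)evaluated modulo an arbitrary prime; the names certify and eval_f_mod are
hypothetical and do not correspond to the actual implementation.

    #include <gmpxx.h>
    #include <functional>

    // Post certification (algorithm 3.2): evaluate f modulo a fresh prime p that
    // was NOT used during CRT decoding, and compare with the decoded result R mod p.
    bool certify(const mpz_class& R,
                 const std::function<mpz_class(const mpz_class&)>& eval_f_mod,
                 const mpz_class& fresh_prime) {
        mpz_class r  = eval_f_mod(fresh_prime) % fresh_prime;  // f computed over Z_p
        mpz_class r2 = R % fresh_prime;                        // decoded result reduced mod p
        return r == r2;                                        // SUCCESS iff the residues agree
    }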
     Since the decoder can return faulty results even if we run it on a trusted
resource, we may as well run it on an untrusted resource: a correct process
S picks a process sd randomly from the untrusted resources in order to run the
      4 In   other words, errors exceed the detection capability of the code
       5 The actual error probability could be studied in future work.

decoding algorithm. S sends the inputs of the function f and the address of sd
to a set of processes s1, s2, . . . , sn within the untrusted resources. Each si, if
correct, returns a pair of residue and prime (yi, pi) to sd 6 . We have two cases for sd :
    • If sd is correct, then it delivers a result Y to S (the result could be either
      a corrected result, "no result", or a wrong result).
    • If sd is faulty, it either crashes (i.e., does not send anything7 ), or it returns
      a byzantine result to S.
Assume that the decoding algorithm is fast and that S has an upper bound on the
execution time of sd . If sd does not deliver any result within this bound, then S
randomly picks another sd as a decoder. So, we only focus on the case of
byzantine faults, where sd always delivers a result to S. Whenever S
gets a result from sd , S applies the certification process to the decoded result.
If sd is correct, then S must respond appropriately:
    • if sd returns a corrected result Y : the certification process of S succeeds
      and the system terminates
    • if sd returns "error": S must increase the redundancy r in the computation
      by asking s1 , . . . , sr for more residues
    • if sd returns a wrong result: the certification process fails, and S must
      increase the redundancy as in the previous case.
If sd is byzantine, it always returns a wrong result (including "error"), therefore
the certification by S always fails.
    As we can see, there are several cases that S needs to deal with, and it is
difficult for S to distinguish between them. One solution is that, whenever
S gets a certification failure, it performs all n + r computations again and then
picks a new sd randomly. However, this solution is inefficient, since a single
byzantine error requires redoing all n + r computations. Another solution
is to pick several decoders sd1 , sd2 , . . . , sdt and send their addresses to all si
processes. S can then check all the results returned by the sdi decoders. This
solution reduces the probability of byzantine decoder faults by a factor of 1/t,
but it increases the message complexity in the network by at least a factor of t.
    Assuming that the decoding algorithm is not costly8 , we can use a
trusted resource to run it. However, the decoder might require untrusted
resources to connect to it directly (e.g., a server listening to untrusted
connections), which could be a security breach if the trusted resources are not
robust enough9 . Therefore, in order to give the system administrator a
   6 For simplicity we assume all si return unique primes, and errors only occur within
residues.
     7 This could be either an omission fault or a crash failure.
     8 Detailed performance analysis is conducted by Thomas Stalinski in his thesis.
     9 For example, a simple buffer overflow could be exploited by a malicious user to access
all trusted resources.

better view of the risks involved, and to reduce the risks on the trusted
resources (where we run the post certification algorithm), we propose a third
solution in which we run the decoder on a semi-trusted resource: a network
segregated from the trusted resources, which cannot access them. Hence, the
decoder could run on a machine within the GCS, but with some credibility
criteria that make it qualified to be trusted [4].
Chapter 4

Proposed online post certification
model for modular computation

As we discussed in the previous chapter, the most convenient solution for having
certified results is to run the post certifier on a trusted resource, and run the
decoder on a semi-trusted resource. In this chapter we would like to extend
this solution to a more detailed and realistic computational scheme.

4.1    Why online model?
By online we mean that all computations, such as job execution (i.e., modular
computations), decoding, and certifying, happen simultaneously and on the fly:
whenever there are sufficient inputs, we produce outputs without waiting for all
inputs or outputs to complete.
    One advantage of such a scheme is that it allows early termination for some
applications. For example, for computing a matrix determinant, the best known
a priori bound is the Hadamard bound [22]:

                    |det(A)| ≤ n^(n/2) · (max_{i,j} |a_{i,j}|)^n

where A is a square matrix of size n and a_{i,j} is an element of A. In the case of
modular computations, this bound requires ∏_{i=1..n} p_i ≥ bound; therefore
we need n modular computations in order to be sure that we can decode (as-
suming the error rate is very small). However, for sparse matrices, or matrices
with certain properties, this bound is very pessimistic and causes a huge waste
of computational power. Therefore, with an on-the-fly algorithm, the bound can
be discovered during the execution. Another advantage of online schemes is a
lower space cost: online techniques generally do not require storing all inter-
mediate results at the same time; as soon as we produce an output, we can
release some resources.
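As an illustration of how the a priori bound translates into a number of jobs, the
sketch below estimates how many 21 bit primes the Hadamard bound would require;
the helper name and the assumption that each prime contributes roughly 20 bits to
the product are ours, not part of the actual decoder.

    #include <cmath>
    #include <cstddef>

    // Estimate the number of 21-bit primes needed so that prod(p_i) exceeds the
    // Hadamard bound |det(A)| <= n^(n/2) * (max|a_ij|)^n, working in bits.
    std::size_t primes_needed_for_det(std::size_t n, double max_abs_entry) {
        double bound_bits = (n / 2.0) * std::log2(static_cast<double>(n))
                          + n * std::log2(max_abs_entry)
                          + 1.0;                     // one extra bit for the sign
        const double bits_per_prime = 20.0;          // every 21-bit prime exceeds 2^20
        return static_cast<std::size_t>(std::ceil(bound_bits / bits_per_prime));
    }

For example, a 1000 × 1000 matrix with entries bounded by 2^10 gives roughly 15000
bits, i.e., around 750 modular computations, even if the actual determinant is much
smaller.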
   This requirement imposes a constraint on the decoding algorithm (based
on Mandelbaum), which requires a given maximum bound.



The decoding algorithm has been modified by Thomas Stalinski to be online. Later
in this chapter, we briefly discuss how the new decoder works.

4.2     Model Architecture
In this part, we extend our previous solution gradually until we characterize
the model components. In the previous chapter, we decided that the certification
process runs on a trusted resource, the decoding process on a semi-trusted
resource, and several workers on untrusted resources in a global computing
system (GCS). The decoding process, the decoder, needs to read the outputs
from the workers (residue, prime pairs). However, we want the decoder to be
independent of the underlying network (e.g., p2p, classical grid, etc.). Therefore,
we need another component, the data proxy, to handle the underlying network
details. The data proxy gets the data from the workers and provides a static
interface to the decoder. Since our decoder tolerates errors only in the residues
(not in the prime moduli), we also want the data proxy to reject faulty (or
non-unique) primes. The data proxy must run on a semi-trusted resource;
otherwise it could forge all results.
    Previously, we showed that the certifier performs three tasks: reading the
decoder result, verifying the result, and assigning jobs to workers. We want to
limit the logical task of the certifier to certification only, and introduce an-
other component, the front-end, to manage the other details such as dispatching
jobs and reading values from the decoder. This gives the system better modu-
larity, and components can be improved and maintained independently over
time.
    Dispatching jobs requires a direct connection to the workers in the GCS, which
is logically the task of the GCS itself, interacting with its computing nodes.
Therefore, we introduce another component, the public GCS, which assigns jobs
to workers. The public GCS provides a static interface to the front end for
commands (e.g., execute function f with inputs x1 , x2 , . . . , xk , using primes
with index i1 to in ). The public GCS must run on at least a semi-trusted
resource, because otherwise it could control all jobs.
    Figure 4.1 shows the overall architecture of the system with the following
components (a sketch of possible component interfaces follows the list):
components:
   • Front-end: Managing system interactions.
   • Public GCS : Responsible for assigning modular computation jobs to
     workers.
   • Data proxy: Decoder data access point.
   • Decoder : Error correction.
   • Certifier : Decoded result verification.
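The following declarations are one possible way to express these roles in code;
they are illustrative assumptions only and do not reproduce the class diagram of
chapter 6.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative interfaces for the components of figure 4.1.
    struct PublicGCS {                 // dispatches modular jobs to workers
        virtual bool exec(const std::string& function,
                          const std::vector<std::string>& args,
                          std::size_t first_prime_index,
                          std::size_t last_prime_index) = 0;
        virtual ~PublicGCS() = default;
    };
    struct DataProxy {                 // decoder's data access point; filters bad primes
        virtual bool nextPair(std::uint32_t& residue, std::uint32_t& prime) = 0;
        virtual ~DataProxy() = default;
    };
    struct Decoder {                   // online CRT lifting and error correction
        virtual bool nextCandidate(std::string& decoded, std::size_t& corrected) = 0;
        virtual ~Decoder() = default;
    };
    struct Certifier {                 // decoded result verification on a trusted resource
        virtual bool certify(const std::string& decoded) = 0;
        virtual ~Certifier() = default;
    };
    // The front-end orchestrates the four interfaces above.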




        Figure 4.1: System Architecture on different types of resources


4.3     Components details
4.3.1    Front-end
This component is responsible for the interactions between the main components.
It also interacts directly with the end user: it accepts a function call from a
set of predefined functions. Accordingly, this component initiates the computation
by asking the public GCS to execute the given function n times. Meanwhile, it asks
the decoder to start online decoding (since we expect data to become available
now). Whenever the front-end receives a result from the decoder, it immediately
asks the certifier to verify whether it is correct, and so on until we get a
certified result (algorithm 4.1).
   We would like the front-end to have more control over which primes are
chosen, in order to coordinate between the certifier and the public GCS, so that
the certifier never uses an already used prime1 . One prospective advantage is
that, if the front-end deals with several GCSs (p2p, volunteer computing, etc.)
to perform computations,
  1 We showed in the previous chapter that, otherwise, the certifier would always return
SUCCESS.

Algorithm 4.1 Basic Front-end algorithm
Require: user function f and its inputs
Ensure: f (inputs)
 1:   n ← initial number of jobs
 2:   while true do
 3:     ask Public GCS: exec(f (inputs), n)
 4:     ask Decoder: start decoding
 5:     repeat
 6:       res ← Decoder output
 7:       cert ← ask Certifier: certify(res)
 8:       if cert = SUCCESS then
 9:          return res
10:       end if
11:     until res = "END OF DECODING"
12:     increase n
13:   end while


deciding which primes to use is important to prevent overlapping computations.
The solution is to give two values to each of the certifier and the public GCS,
representing the range of primes each one can use. The front end may also
require the decoder to return the number of corrected errors together with the
decoded result. This can provide valuable information about the errors in the
underlying network.
    Online schemes promote early termination. Perfect early termination can
be achieved by increasing the redundancy r slowly (i.e., one by one), but this
solution requires the system to perform many certification tests (r tests,
assuming the decoder returns one result each time we add redundancy) until
we get the correct result; therefore we would have a high testing overhead,
which results in more iterations of the scheme and more certifying.
    In order to reduce the number of tests, one can add redundancy by doubling,
as shown in figure 4.2 (where n is the unknown required redundancy). Using
this technique we perform k tests while dispatching 2^k jobs. This solution
reduces the number of certification tests dramatically, but it significantly
increases the number of jobs, which can waste computational resources. In
the worst case, if n = 2^(k-1) + 1, the technique executes 2^k = 2n − 2 jobs,
i.e., n − 2 extra jobs.
    As we can see, there is a trade-off between the number of jobs and the number
of certification tests, or model iterations: the number of times we need to add
extra redundancy when the certification fails. One solution is to use an amortized
control technique: we perform tests at steps ρ^f(1), ρ^f(2), . . . , ρ^f(k), where
1 < ρ < 2, and f(i) = i^α with 0 < α ≤ 1, or f(i) = i/log i. This technique is
proved to reach n with around log n tests and asymptotically n + o(n) total
redundancy [23]. The choice of ρ and α controls the speed of the growth.
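A possible sketch of this schedule, with f(i) = i^alpha; the function name and the
way the steps are rounded are our own choices for illustration.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Amortized growth: the k-th certification attempt is triggered once roughly
    // rho^f(k) extra jobs are available, with f(i) = i^alpha, 1 < rho < 2, 0 < alpha <= 1.
    std::vector<std::size_t> test_points(std::size_t k, double rho, double alpha) {
        std::vector<std::size_t> points;
        for (std::size_t i = 1; i <= k; ++i) {
            double step = std::pow(rho, std::pow(static_cast<double>(i), alpha));
            points.push_back(static_cast<std::size_t>(std::ceil(step)));
        }
        return points;    // e.g. rho = 1.5, alpha = 1 gives 2, 3, 4, 6, 8, 12, 18, ...
    }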
    The amortized technique is important because the public GCS could be
provided by a service provider who charges based on the number of resources used.




                          Figure 4.2: Job growth by doubling


A business model could associate a price with the number of floating point
operations per hour (op/hour). Therefore, with the amortized technique, we try
to keep the cost as low as possible. Hence, the speed of the amortized growth can
be determined based on the trade-off between the provider's pricing and our quality
of service: how fast we want to deliver computations to end users.
    Another business model could associate a cost with machine hours, regardless
of the number of operations. This would require using the full potential of each
computing node during the execution. We would then use the amortized technique
to control the data rate sent to the decoder: by adding computing nodes, the
data rate increases accordingly, and our pricing trade-off would be based on this
rate instead2 . We shall consider the first model for simplicity.
    By using the amortized technique, we can dynamically increase the number of
resources. Assuming we want each machine to perform a grain of at most n
computations, the number of machines would be r/n, where r is the amortized
redundancy (i.e., the extra jobs).

4.3.2       Decoder
Stalinski showed in his master thesis that the decoder works efficiently with
21 bit prime numbers, but not so efficiently with larger primes; therefore we
will consider primes of this size in this work. There are 73586 21 bit primes
that we can use for decoding, which is enough for reconstructing very large data
(i.e., we can build a result of at least (2^20)^73586 = 2^1471720,
   2 This   may be a prospective work of this project

assuming few errors). The Mandelbaum decoder requires a maximum value in
order to know when to stop finding convergents, as we have shown before. Thus,
the front end must provide this maximum, which can be either hard to define
or very pessimistic, as for the matrix determinant. The solution is to let the
decoder send a stream of candidate results and let the certifier discriminate. But
there are two problems: how many primes should we use for the CRT computation
before decoding, and when should we stop looking for convergents?
    A proposed solution is that the front end tells the decoder two things: the
number of extra amortized jobs and the maximum expected number of errors. The
number of amortized jobs is used by the decoder to know how many result pairs
it expects to read from the data proxy in order to build the CRT solution. The
maximum number of errors is used to decide when to stop finding convergents
(computing the maximum Mk automatically; see chapter 1 for details). Algorithm
4.2 shows the internal interactions of the new decoder.

Algorithm 4.2 Decoder manager algorithm
Require: stream of (xi , pi ), stream of (kj , errorsj )
Ensure: decoded result stream from all (xi , pi )
 1:   repeat
 2:     for i = 0 to kj do
 3:       CRT ← Lifter.add(xi , pi )
 4:     end for
 5:     repeat
 6:       result, errs ← Decoder(CRT, errorsj ) {errs: #corrected errors}
 7:       send result and errs to Front end
 8:     until result = "FINISH"
 9:   until kj = "TERMINATE"

    Instead of the maximum errors, one can check whether the corrected result
stabilizes after k convergents. This returns a corrected result with an error
probability of roughly 1/p^k. However, deciding k could be difficult for the
decoder; for instance, choosing a large k could slow down the computations
(finding many convergents) for no reason when there are few errors in the
computations.
    As the front-end knows more about the types of GCS being used, providing
the number of errors is more reasonable. The front-end can give an error rate
close to 1/2, in order to assure the correction capability even if we have a
massive attack with a ratio close to 1/2. However, this again slows the decoder
down, because it always checks many convergents, even when there is no massive
attack.
    We assumed previously that we expect all types of faults in the untrusted
resources, including omission faults and crash faults, and that a massive attack
could be close to 1 (but not asymptotically). Therefore the front end must
consider asking for more jobs than the number reported to the decoder; otherwise,
the decoder might keep waiting for result pairs while no more results are
available. For example, assume the default network error rate is e and the number

of amortized jobs is r at some time. The front end must then tell the decoder to
decode for r jobs while it tells the public GCS to dispatch r + e·r jobs3 .
    On the other hand, if we have a massive attack with rate q > e, the previous
solution no longer holds and the decoder will keep waiting. A naive solution to
this issue is to let the decoder send a signal whenever it has read all the results
(i.e., built the CRT solution out of the result pairs read from the data proxy).
This signal could be the basic CRT solution, without error correction. Then we
use a timeout based on our risk assessment, or in other words, on how long we
can afford to wait for slow workers, how we can handle a massive omission
attack (a type of byzantine attack), or even how we handle a temporary network
crash failure. The timeout need not be fixed; it should be a function of the
number of extra amortized jobs r, and it is triggered whenever the front-end gets
a task termination signal from the GCS (if possible).
    However, timeouts are always a problematic solution, especially when dealing
with asynchronous systems. A better solution, introduced in [4], is to let the
public GCS test the workers whenever the front end asks for more redundancy,
and to update the front end with the expected error rate. The decoder assumes
that the front end always knows the maximum number of errors.

4.3.3    Data-proxy
The internal structure of the data proxy is totally implementation dependent. One
implementation can propose a server listening for worker data; another can
assume an NFS system that stores the results from workers, etc. Either way, the
data proxy should ensure two conditions on the data submitted by workers:
   • All primes used are unique.
   • A worker can only submit a limited number of results.
For the first condition, one solution is that the data proxy creates a table
of all 21 bit primes (which is not very costly) and, whenever it receives a
prime, flags the corresponding entry as used. This solution requires a binary
search in the table (given that the table is sorted) for each prime received from
a worker. So, if we have n primes from workers, in the worst case we
need fewer than n·log2(73586) ≈ 16n table look-ups. We also need space of size
21 × 73586 = 1545306 bits. Another solution is to create a table of prime
indices from 0 to 73585 (a simple array whose contents are only flags)
and require the workers to inform the data proxy which indices have been used.
Obviously, this gives us O(1) table look-up and a smaller space cost, since
we need 1 bit to represent a binary flag (a bitmap of size 73586). It
requires the workers to send a few more messages about indices. However, there is
a problem: a worker may provide a faulty prime index, or even a
non-prime number. The solution is to construct a prime table and, each time,
    3 In a real GCS, the error rate e is small, so we compute only a few more redundant jobs,
plus the amortized redundancy, which is usually much larger.

given a prime index, to check the prime provided by the worker against the one
in the table. This requires O(1) table lookup and the same space complexity as
the first solution, but a few more messages from the workers.
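A minimal sketch of this check (bitmap of used indices plus a table of the 21 bit
primes); the class name and method are assumptions for illustration, not the
actual data proxy code.

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Data-proxy-side validation: a worker submits (index, prime, residue); we verify
    // the prime against a precomputed table of all 21-bit primes and reject reused
    // indices, in O(1) per submission.
    class PrimeRegistry {
    public:
        explicit PrimeRegistry(std::vector<std::uint32_t> primes21)  // all 21-bit primes, by index
            : table_(std::move(primes21)), used_(table_.size(), false) {}

        bool accept(std::size_t index, std::uint32_t prime) {
            if (index >= table_.size()) return false;   // out-of-range index
            if (table_[index] != prime) return false;   // claimed prime does not match the table
            if (used_[index]) return false;             // prime already submitted
            used_[index] = true;
            return true;
        }
    private:
        std::vector<std::uint32_t> table_;
        std::vector<bool> used_;                        // bitmap of used prime indices (73586 bits)
    };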
    For the second condition, there is a possibility that a malicious worker fills
the data proxy table with faulty results; as a result, one attacker could shut
down the whole computation. Therefore we need the GCS to be able to identify
workers, so that a worker cannot submit more than one result. The attacker's
power is then restricted by the GCS identification algorithm, which the data
proxy needs to use. One remaining issue is that each worker may have the right
to submit at most m results. In this case, the front-end must provide m to the
data proxy; whoever submits more than m results has its results rejected by the
data proxy and may be flagged as malicious or faulty.
    A malicious worker could use a prime index that the certifier will use, in
order to forge a matching result. As a consequence, the certification would
succeed even though the decoded result is faulty. A solution is to let the front
end send the certifier's prime ranges to the data proxy; however, the range may
not be known at the beginning of the computation and would have to be sent
gradually during the execution. We propose a simpler solution in which the
certifier uses primes of a size different from 21 bits.

4.3.4    Certifier
Following the previous part, we propose that the certifier uses 20 bit primes
instead; however, this may increase the certification error probability ε. We
will not consider this issue in this work.
    As we can see in the certification algorithm 3.2, if we repeat the certification
with new, different primes, we obviously get a lower certification error
probability ε. As we have not studied the exact value of ε in this work, we assume
that the current certification result is enough to verify the correctness of the
result.

4.3.5    Public GCS
We proposed, for several reasons, that the front end asks the public GCS to
execute a function f globally, given its inputs and two indices representing a
prime range. As discussed before, a GCS can have either a centralized or a
decentralized architecture. Either way, we do not want a single point to be
responsible for prime generation, for three reasons:

  1. It could introduce a heavy load to let one process communicate with all
     other processes and send primes (especially in a p2p architecture).
  2. A failure of this one component (the prime generator) would lead to the
     failure of the whole computation.
  3. Some work stealing algorithms work better when jobs are split recur-
     sively (e.g., as a binary tree) [24].

We propose that the public GCS gives each worker two indices for the range
of primes to be used, so that each worker generates its own primes. This
solution increases the overall computational cost, since each node needs to
generate primes up to the second prime index. Generating 21 bit primes is,
however, not costly4 .
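A worker receiving the two indices could regenerate its primes locally with a
simple sieve, as sketched below; this only illustrates the idea and is not the
generatePrime.inl code of the implementation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Return the 21-bit primes (those in [2^20, 2^21)) whose index lies in [first, last),
    // so each worker can rebuild its own primes from the two indices it receives.
    std::vector<std::uint32_t> primes_in_range(std::size_t first, std::size_t last) {
        const std::uint32_t lo = 1u << 20, hi = 1u << 21;
        std::vector<bool> composite(hi, false);
        for (std::uint32_t i = 2; static_cast<std::uint64_t>(i) * i < hi; ++i)
            if (!composite[i])
                for (std::uint64_t j = static_cast<std::uint64_t>(i) * i; j < hi; j += i)
                    composite[j] = true;

        std::vector<std::uint32_t> out;
        std::size_t index = 0;
        for (std::uint32_t n = lo; n < hi && index < last; ++n)
            if (!composite[n]) {
                if (index >= first) out.push_back(n);   // keep primes with index in [first, last)
                ++index;
            }
        return out;
    }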
    The third reason raises a new problem: what happens if the public GCS
gives half of the workload to a faulty worker? The answer is that we get a fault
ratio q ≥ 1/2. Usually, with work stealing algorithms, jobs are split (as a
binary tree) and the actual computations occur at the leaf nodes. Generally,
work stealing algorithms are efficient and work very well when we have small
error rates, which is very likely in a GCS.
    There are actually two ways to estimate the average error rate e in a GCS:
       • Experimentally, by running the overall model and computing the average
         number of corrected errors. This can be achieved by the front end giving
         the decoder a high maximum number of errors, as discussed before.
         After several experiments, we can estimate e.
       • By performing the Germain-Renaud [4] sequential tests to find the GCS
         error rate e with a bounded test failure probability. We assume that the
         public GCS performs such a test whenever the front end asks for more jobs.
         The public GCS shall use the job results for the test batch sequentially5 .
In any case, if we get a massive attack at any time, the model can adapt to the
new error rate, and the front end will provide a higher expected number of errors
to the decoder.

4.4           Components Interactions
We now show the overall interactions between the components, as well as the
type of data exchanged. Figure 4.3 shows a sequence diagram for a classical
component interaction. The grain represents the maximum number of jobs we
expect each worker to produce. first and last are used to control prime indices;
for example, to fork 100 jobs starting from index 0, first will be 0 and last will
be 100. The public GCS may return "FALSE" for some technical reason; this is
useful since we do not expect the decoder to be able to decode when the GCS fails
to dispatch jobs, so we need either to investigate the reason or simply retry.
Also, since the public GCS performs some tests on the job results sequentially,
it may return a new error rate if it differs from the old one.

4.4.1          Control flow
From the sequence diagram in figure 4.3, we can see that the front end does not
hand control to the decoder. In other words, we have a stream of commands
     4 Rough experiments generally show less than half a second on normal machines.
     5 We will not discuss the details of such a test in this work. Further details can be found
in [4].




           Figure 4.3: Classical components interaction

from the front end and a stream of results from the decoder. This is an
important aspect of the online system: we do not wait for function calls to
finish. In fact, for some components we do not have function calls at all, but
rather data streams. The decoder and the data proxy send results immediately
whenever they are available. The received data stream is buffered until each
component reads the values.
    The front end also does not give execution control to the public GCS, even
though it returns a single result value (not a stream), because the front end
needs to deal with the decoder and the certifier in the meantime.
    On the other hand, the front end must give execution control to the
certifier, since it cannot decide anything until it gets a return value. During
this blocking call, data received from the decoder is buffered until control is
released (as is the value returned by the public GCS).

4.4.2    Data buffering
Since components send data to each other asynchronously, we need to buffer
received results until we process them. The size of a buffer depends on the size
of the data it is expected to receive. We expect the decoder to have a relatively
large buffer for reading many results. The front end should also have a large
buffer to be able to read large results from the decoder.
Chapter 5

Model Analysis

We assume that communications are bi-directional, and that authentication and
encryption techniques are unbreakable. For the communication links, we assume
two types of attacks:
   • Channel breaking (possibly physical).
   • Man-in-the-middle attack (impersonation).
By impersonation we mean that the attacker can control both the identities
and the data in the channel. We consider two degrees of attack severity:
   • Result forgery
   • Denial of service (DOS)

We assume the first one is much more dangerous. We also assume that an
adversary can completely control all resources and communication channels, and
that the certifier error bound is negligible.

5.1     Communication channels risk analysis &
        Man-in-the-middle attacks
5.1.1    Front-end – Decoder
If this connection is broken, then the system never terminates (i.e., DOS) and
cannot produce any output to the user. If an adversary can impersonate both the
decoder and the front end, then the system either does not terminate, or the
adversary can give faulty results to the front end; in this case, the certifier
will catch the error. The adversary can also cause DOS on both sides by filling
the buffers at each side, or by sending a termination signal to the decoder.
Therefore, we assume this link is encrypted and that both sides do not
communicate until they are authenticated to each other.





5.1.2    front-end – Certifier
This is the most critical communication link in the system. If it breaks, we have
a DOS, since the front end blocks until the certifier returns a result. On the
other hand, if an attacker can impersonate both sides, he can give a faulty
certification, and eventually the system returns a forged result, which is the
most dangerous system fault. This communication must be encrypted and both
sides authenticated to each other. It is possible that both are in one private
trusted network; in such a case, we might not need encryption or authentication.

5.1.3    front-end – Public GCS
If this connection is broken, we have a DOS. If an attacker can impersonate
both sides, then the system might either never terminate (if he denies the
commands), or he can produce faulty computations. This could be very useful
for the attacker, since he can control the GCS for his own benefit, and it can
cause big economic problems if the GCS charges money for resource usage.
Therefore, the corrective action is to use authentication and channel encryption.

5.1.4    front-end – Data-proxy
If the connection is broken, the front end cannot provide the grain size of jobs.
As explained before, if we have one faulty worker, it can fill the prime table
with faulty residues, and accordingly we have a DOS. A corrective action might
be to use a default grain size if the link breaks. However, if an attacker
impersonates the front end, he might provide a large grain (e.g., the number of
21 bit primes), and one faulty worker can fill up the table with faulty results
as explained. The corrective action is to use authentication and encryption to
protect this channel.

5.1.5    Decoder – Data-proxy
If the link breaks, we have a DOS. On the other hand, if an attacker can
impersonate both sides, a whole set of result pairs can be forged, and therefore
we have a DOS. The attacker can also cause a DOS by filling the decoder
buffers. Another obvious DOS attack is to send a terminate signal to the data
proxy.

5.1.6    Within GCS
We have three main different links in the GCS:
   • Public GCS – workers
   • Worker – Worker
   • Worker – Data-proxy


We assumed that all communication channels can break. If we have a permanent
massive network failure (or Internet crash), then obviously we have a DOS. The
impact of link failures really depends on the structure of the GCS network and
on the algorithm we are using (e.g., work stealing).
   If the public GCS only communicates with all workers directly, then one link
break acts like one worker crash failure (or omission failure). The system
should correct such an error. The same holds for the link between each worker
and the data proxy. However, when we use work stealing algorithms recursively
(as a binary tree execution), one link crash could cost a fraction of errors e
such that 1/log n ≤ e ≤ 1/2, where n is the number of jobs. If a worker can
impersonate the public GCS, then it can control the whole GCS computation. We
assume that workers cannot initiate the global computations, and that the public
GCS is known by all workers.

5.2     Impacts of resources failure
Assuming an adversary can control any type of resource, we would like to see,
according to our model, what impacts this could have.

5.2.1    Untrusted resources failure
The system can correct errors of rate e < 1/2, since as we add more redundancy,
the correction capability of the decoding increases. As long as we have
n ≥ k + 2t, where k is the number of required primes and t the number of faulty
computations, we can always correct the results. If the adversary manages to
asymptotically cause an error rate e ≥ 1/2, the system never terminates. We
assume that the sequential test computes e with a bounded test error probability;
if we are unlucky, we get a DOS.
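As an illustration, with numbers chosen purely for exposition: if decoding needs
k = 100 primes and we expect at most t = 25 faulty residues, then
n ≥ k + 2t = 150 result pairs are sufficient, i.e., 50% redundancy, and the
tolerated error rate is t/n = 25/150 ≈ 0.17 < 1/2.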

5.2.2    Semi-trusted resources failure
The decoder, the data proxy, and the public GCS run on semi-trusted resources.
An adversary controlling them can cause a DOS. The system might return a faulty
result only due to a certification failure; otherwise, the system never returns
a wrong function result. In the case of the public GCS, the adversary may also
utilize the GCS for his own benefit.

5.2.3    Trusted resources failure
If either the front end or the certifier is compromised, then the adversary can
cause result forgery, which is the most critical error, and/or a DOS. If he
controls the front end, he can control the system completely. These resources
are the most critical part of the model, and the security of the whole system
rests on their security.
Chapter 6

Design specifications

To reduce the design complexity, we require that only one user function is
accepted at a time1. In order to serve multiple function calls, the whole system
has to be replicated.
   We impose the following constraints on the design of the system:
   • User functions must be independent of the other components. User func-
     tions can be developed by the deployment team and should evolve easily
     over time, without rewriting other components.
   • User functions are given at run time as an argument (e.g., execute
     function1 arg1 arg2 ...).
   • The inputs of a user function must not be limited in number or data type
     (e.g., they could be a mixture of files and string values). The system needs
     to deal with the serialization/deserialization of user functions and data.
   • User functions can return different types of results. The basic type t
     is a fixed-size 21 bit integer. However, the system should support other
     types built on t, such as a vector [t0 , t1 , . . . , tn ] or a matrix of the basic
     type t. The system should be designed in a way that allows new result
     types to be added easily in the future.
   • The design of the core decoder algorithm must be independent of the
     system.
We propose an Object Oriented Programming design for the system, for the fol-
lowing reasons:
   • Polymorphism: using polymorphism, we can use an abstract class
     pointer to refer to several child implementations. This concept helps us
     resolve user functions at run time, since we need one place to resolve a
     string name into a child polymorphic class, and the whole system will use
     the abstract interface to deal with user functions.
   1 We do not want to handle multiple clients at this stage.




   • Modularity: OO gives the system better modularity; through the concept
     of encapsulation, implementation details are hidden from the objects'
     users.
   • Maintainability: since objects are independent logical entities with clear
     interfaces, their internal implementations can be improved independently
     and at lower cost.
In the design diagrams, we focus on the case where we have only integer return
type for the sake of simplicity.

6.1    Components structure
Communications now happen through communicators, which hide the complexity
details from the main components, as shown in figure 6.1. The certifier and the
public GCS do not need communicators, since they implement remote functions
which are invoked only through the front-end communicator. The CRT lifter and
decoder are designed by Thomas Stalinski, who uses this communicator to send
results. The main component contains some glue code for the decoder, and is
also responsible for managing communications with the front end and the data
proxy.

6.2    Class diagrams
Here we show the class diagram of the system. Some data types shown in
the diagrams are not language dependent and need to be defined; however, we
named these types in a way that describes their actual data. Figure 6.2
shows the communicators class diagram. All communicators use an instance
of communicatorUtil, which provides low-level communication handling. The
current communicators show only the integer return type for user functions (for
the sake of simplicity). For other types, one needs to add the corresponding
overloaded members.
    According to the design constraints, we designed polymorphic user func-
tions as shown in figure 6.3. userModularFunctionBase is an abstract class, which
will later be used as a pointer to its polymorphic implementations (user de-
fined functions). This class has a static member initializeByName, which cre-
ates polymorphic user functions from their string name. In the diagram, the
functions in italic font are abstract functions which children must imple-
ment. The second level of inheritance specifies the user function return type:
userModularFunctionInteger is the parent of all functions returning one
integer value, such as Fibonacci and determinant, while userModular-
FunctionVector represents functions returning vector types. This scheme
enables us to add new types such as matrices or vector spaces. The second level
mainly implements outputResult, which prints the resultPairs (to a stream or
to stdout, depending on the implementation). The third level of inheritance



                       Figure 6.1: Deployment diagram


represents the classes that can be instantiated (not abstract). These functions
are designed to be easily added by the deployer of the system.
    In order to add a new function, one simply extends an appropriate abstract
class (e.g., userModularFunctionInteger) and implements two things:
userModularFunction, and a constructor which receives a list of arguments,
arg_list. The first function is void; therefore, in order to return a value, the
developer must invoke the parent's returnResult member, which handles
storing it in a list of results, resultPairs. If one or more arguments are files,
then the developer should inform the system by calling setFile with the index of
that argument.
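Following this convention, a hypothetical user function might look as follows; the
member signatures of the base class are assumptions reconstructed from the
description above, not the actual headers of the package.

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Minimal stand-in for the abstract base of figure 6.3 (the real class lives
    // in user-functions/; the exact signatures here are assumptions).
    class userModularFunctionInteger {
    public:
        virtual ~userModularFunctionInteger() = default;
        virtual void userModularFunction(std::uint32_t prime) = 0;   // implemented by the user
    protected:
        void returnResult(std::uint32_t residue, std::uint32_t prime) {
            resultPairs.emplace_back(residue, prime);                // stores the (residue, prime) pair
        }
        std::vector<std::pair<std::uint32_t, std::uint32_t>> resultPairs;
    };

    // Hypothetical user function: sum of the integer arguments modulo the given prime.
    class sumMod : public userModularFunctionInteger {
    public:
        explicit sumMod(const std::vector<std::string>& arg_list) : args_(arg_list) {}
        void userModularFunction(std::uint32_t prime) override {
            std::uint64_t acc = 0;
            for (const std::string& a : args_)
                acc = (acc + std::stoull(a)) % prime;
            returnResult(static_cast<std::uint32_t>(acc), prime);
        }
    private:
        std::vector<std::string> args_;
    };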
    We imposed this solution to simplify the user code. An alternative solution
could be classical marshalling/demarshalling; however, this would require
the user to write extra members to support these operations. As we store the
arguments and the file order, we can send this information to reconstruct the
object elsewhere in the system.
    The certifier interface is shown in figure 6.4, and the public GCS interface is
shown in figure 6.5. publicGCS may return a new error value if it has performed
sequential tests. We use the type SUCCESS_or_FAILURE for simplicity and
for implementations that would like to get the error rate from the end




                  Figure 6.2: Communicators class diagram


user.

6.3     Interface interactions
Figure 6.6 shows the component interactions through the interfaces controlled by
the communicators. The diagram mainly shows the type of data used by the online
components and the ordering of the calls. This diagram is similar to figure 4.3,
but it focuses more on the type of data being sent. In addition, the front
end now tells both the decoder and the data proxy the type of the data (i.e.,
integer, vector, matrix, etc.), so each one can handle each type differently.
The current communicators diagram shows only integer types.




             Figure 6.3: user functions class diagram




                      Figure 6.4: certifier




                     Figure 6.5: public GCS




Figure 6.6: interface interactions through communicators and the data types
exchanged
Chapter 7

Implementation & Analysis

We implemented the specification to work on top of classical grid systems (i.e.,
consisting of controlled clusters). The implementation is written in C++ and uses
the Linbox library [27] for the fast linear algebra computations needed to com-
pute the matrix determinant. Linbox requires GMP and Givaro [28, 29]; we
also use GMP for the certification computation and for decoding1. For
distributed computations, we use the Kaapi middleware [20] through its Athapascan
API [25]. The grid experiments are conducted on Grid5000 [26]. The imple-
mentation uses the ssh protocol to communicate with the grid being used, and
oarsh [30] to connect to Grid5000 job machines. Throughout this implemen-
tation, we force the certifier and the front end to run on the same machine,
for the sake of simplicity. Also, since the grid is controlled, we relax the prime
uniqueness condition on the data proxy, since errors are very low in controlled
systems and are more likely to occur in the function computation than in the
prime generation2. On the other hand, if a worker provides incomplete
data (e.g., a result but not a prime), the pair is rejected at the decoder
communicator side. Throughout this implementation, for the sake of simplicity,
we do not compute error rates dynamically with a sequential test; we simply
accept the error rate as a user input. We shall now briefly describe both Kaapi
and Grid5000.

7.1     Kaapi Library
KAAPI stands for Kernel for Adaptative, Asynchronous Parallel and Interactive
programming. It is a C++ library that allows executing multithreaded compu-
tations with data flow synchronization between threads. The library is able
to schedule fine/medium grain size programs on distributed machines. The data
flow graph is dynamic (unfolded at runtime). The target architectures are clusters
of SMP machines. It is based on work-stealing algorithms which can run on
   1 developed by Stalinski
   2 However, if there is an error in the prime value, then there could be a problem with the
system. This case should be handled in future updates.




various processors and various architectures (clusters or grids), and it contains
non-blocking and scalable algorithms [20].

7.2     Grid5000
Grid5000 is an instrumental computer science project aimed at gathering 5000
processors at the national scale in France (currently 9 sites are involved). The
project is dedicated to research in large-scale parallel and distributed sys-
tems.
    Each site comprises several clusters of machines (2-3 clusters of 60-600
machines). It has a heterogeneous architecture of 3202 CPUs and 5714 cores.
Within each site, there is an NFS server which is only accessible by the site's
machines. Each user of Grid5000 has a home directory in each of the 9 sites (i.e.,
9 home directories). A user connects to the grid by ssh access to the site front
access, as shown in figure 7.1. For resource management, it uses the OAR
tool [30]. In order to reserve some resources, the user must connect through ssh
to the access machine, then ssh to the frontend of the site. From there, he can
use the OAR tool to reserve resources; he/she will get a list of machines and
may get an ssh key to access these machines. The oarsh command is an alternative
to ssh in the sense that it hides the key management details for accessing
resources. One can easily connect between reserved nodes without the need to
specify where the job key is stored; this tool finds the key transparently
whenever a user moves between OAR reserved machines.

7.3     Online Certifier
We created an open source project called online-certifier, licensed under the GNU
General Public License v3 [31] (figure 7.2 shows the project logo). It acts
as middleware for executing user functions on classical grid systems. The
application is compatible with Unix-like systems3.
   The application is developed to work in three modes: Grid, Grid5000,
and Local (for testing).

7.3.1    Deployment over grid5000
Grid5000 imposes some constraints on the connectivity of the components. Both
the data proxy and the public GCS run on resources within a reserved job, which
are not directly accessible by the front end and the decoder (figure 7.3 shows
the deployment of the system). The data proxy and the public GCS can either run
on two different machines or share a single machine. We use ssh port forwarding
to secure the communications and to route them between the components.
The workers in Grid5000 store their outputs (residues
    3 For Windows, one would need to modify the source, since we use Unix system calls for
socket communications and some file descriptors.




                         Figure 7.1: Grid5000 ssh access




 The sun symbol represents the power of distributed computing while preserving
   control over resources against attacks. It is a symbol of stability and
                                   continuity.
                         Figure 7.2: Online certifier logo


and primes) to a directory on the site's NFS. The data proxy reads the files in
that directory and sends a result stream to the decoder.

7.3.2    Package structure
The package tarball has the following structure (compliant with a subset of the
specification)4 :
   4 The package is still not finalized. We need to use GNU automake and autoconf to
package it in a better way, and to simplify the installation and usage process to be POSIX
compliant.




              Figure 7.3: Online certifier deployment on Grid5000


   • communicator/:

      – communicatorUtil.(hh/cpp): Provides utility functions for the other
        communicators
         – frontendCommunicator.(hh/cpp)
         – decoderCommunicator.(hh/cpp)
         – dataproxyCommunicator.(hh/cpp)
         – Makefile

   • decoder/

         – mandelbaum.(hh/cpp): contains two classes, lifter and decoder 5
         – main.cpp: controls the execution of the whole decoder (deals with
           decoderCommunicator, lifter, and decoder) to interact with other
           components, namely, front end and data proxy.
   5 Developed by Stalinski


      – Makefile
       – decoder.conf: configuration file (port number, data proxy address,
         etc.)

  • front-end/

      – certifier.(hh/cpp)
      – main.(hh/cpp): deals with certifier, and frontendCommunicator to
        control system operations.
      – Makefile
      – compile.sh: exports Linbox flags and compiles the frontend binary.

  • grid5000/

      – kaapiTask.cpp: Kaapi compatible application represents the public
        GCS component.
      – dataProxy.cpp: uses dataproxyCommunicator
      – Makefile
      – compile.sh: exports Kaapi and Linbox flags for compilation
       – dataProxy.conf: contains the data proxy configuration, such as the
         server port number

  • user-functions/

      – userModularFunctionBase.(hh/cpp)
      – userModularFunctionInteger.(hh/cpp)
      – determinant.(hh/cpp): a user function for modular matrix determi-
        nant computation
      – fibo.(hh/cpp): a user function for computing Fibonacci numbers
      – userIncludes.hh: contains the headers of all user functions. Used by
        frontend and kaapiTask to recognize user functions.
       – test.cpp: application used to test user functions before using them
         in the system. Accepts a function name and a prime value, and
         outputs a result and function information.
      – Makefile
      – compile.sh: exports Linbox flags and compiles the source

  • util/

      – generatePrime.inl: 21 bit prime generation code
      – typeDefines.hh: contains global decoded result types and return
        types used over the system.


   • doc/: doxygen documentation6
   • bin/: An empty directory that will contain the executables and config
     files after compilation.
   • data/: contains dense and sparse matrix generation codes.
After compiling the system, we get four executables: frontend, decoder,
kaapiTask, and dataProxy.
    The kaapiTask is a Kaapi application which can run independently using the
karun command. This command mainly receives a machine file representing the
machine pool on which we run the application. kaapiTask receives the
following arguments:
   • output directory: the directory where we store outputs. This should
     be on a mounted NFS directory, otherwise each process will store results
     locally
   • first: first prime index
   • last: last prime index - 1
   • grain size: number of jobs to be grouped in one machine
   • user function name
   • function arguments
The dataProxy server is in charge of reading from a given data directory (the
same one used for the kaapiTask output), which contains a result file for each
computing machine, and then sending these contents to a connected client (the
decoder). Each time the data proxy reads a file, it deletes it immediately after
reading in order to release the space7. The data proxy accepts the following
arguments:
   • port: the data proxy server listen port number
   • dir: the directory from which we need to read results
The dataProxy accepts either arguments or a config file located in the same
directory where we run the executable.
    The frontend uses frontendCommunicator, as specified in the specification,
to connect to the decoder through a port. However, it uses an ssh connection to
connect to the grid and execute the kaapiTask program using karun.
    The frontend usage format is "frontend function_name grain_size args...".
More configuration details are in the frontend.conf file, which contains the
component addresses and ports, the execution mode, the path of a file containing
the reserved machine addresses on the grid, the output dir, a task name (used to
create a subdirectory under the output dir to store kaapiTask outputs), the maximum
   6 This stage is not completely finished; the documentation will be available soon.
   7 In   future, it could be useful to provide an option to archive results for future use.


expected error rate, and an ssh key path used to access the Grid5000 reserved
job (used only if the mode is GRID5000)8. The configuration file must be
located where we execute the frontend binary.
    The decoder uses decoderCommunicator to communicate with both the frontend
and the dataProxy through ports. The usage format is "decoder port
dataProxy_addr dataproxy_port". It either accepts arguments or reads from
decoder.conf, which must be located where the binary is executed.

7.4     Installing & using online-certifier
Both the frontend and kaapiTask require the Linbox library (for the matrix
determinant user function). kaapiTask requires Kaapi to be installed (with the
Kaapi commands globally visible, by adding the binaries directory to the PATH
variable). The frontend also requires the GMP library for the certifier; GMP is
required anyway when we want to install Linbox9.
    To install Linbox, you briefly need the following steps:
    • Download the source of Linbox and untar it.
    • You will also need to install GMP and Givaro: first install GMP (with
      the option --enable-cxx passed to the configure script), then Givaro
      (with --with-GMP=<GMP-path>).
    • You also need a BLAS library. If none is available (libblas or libcblas),
      then install ATLAS.
    • Configure Linbox with the options "--with-blas=<blas-path>
      --with-gmp=<gmp-path> --with-Givaro=<Givaro-path>", and then install it.
If we want to compile the system locally (for testing), we first need to install
Linbox with GMP and then perform the following steps (in order):
   1. Go to the communicator directory and type make
   2. Go to the user-functions directory and run ./compile.sh
   3. Go to the front-end directory and run ./compile.sh
   4. Go to the decoder directory and type make
   5. Go to the grid5000 directory and run ./compile.sh
Now, all executable and default config files are available in bin directory. To
run the system, we need the following steps in order:
   8 The configuration file contains comments about each of the required arguments.
   9 There is a Linbox 1.1.6 bug with Kaapi 2.4. After the Linbox 1.1.6 installation, you might
need to modify /linbox/include/dir/linbox/solutions/det.h by commenting out lines 328 and 335.


  1. Run: dataProxy <server_port> <kaapiTask_output_directory> (or put
     the corresponding inputs in dataProxy.conf and then run without argu-
     ments).
  2. Run: decoder <server_port> localhost <dataproxy_server_port> (or
     provide the corresponding inputs in decoder.conf).
  3. Open frontend.conf and check that the decoder port is correct. Also,
     change the error rate to 0 (since we do not expect errors locally), and
     make sure the mode is LOCAL. Make sure the directory of the kaapiTask
     binary is correct. Also, make sure "grid_data_dir/task_name" is
     the same directory given to dataProxy (kaapiTask_output_directory);
     hence, modify grid_data_dir and task_name accordingly.
  4. Run: frontend <user_function_name> <grain_size> (e.g., ./frontend
     determinant 8)
For Grid5000, we need a few more steps for compilation. In order to install
the frontend on a machine, you will need to install Linbox with GMP, and
compile: communicator, user-functions, and front-end. For the decoder (whether
on the same or a different machine), you need the GMP library; then simply
compile communicator, then decoder.
    For the dataProxy and kaapiTask on the grid, we need to first install Linbox
with GMP, then reserve a machine (using the oarsub -I command on the frontend).
Now we need to compile: communicator, user-functions, grid5000. We also
need to give our ssh public key to the grid access machine, so we won’t need to
type a password when we connect to Grid5000 (probably using ssh-copy-id
myuser@mygridaccess).
    In order to run online-certifier on Grid5000, we need the following prepa-
rations:
   1. Reserve a number of machines (using oarsub -I -l hosts=xxx -e job_key_file,
      or the oargridsub command).
   2. Store the machines’ addresses inside a file (e.g., oargridstat -l grid_job_number
      > machines, or, on a reserved machine, cat $OAR_FILE_NODE > machines).
   3. On our front end machine (not the site frontend), modify the configuration
      file: change the grid address (a combination of the form user@access.site.grid5000.fr)10.
      Modify the job_key, and set it to the address of the reserved Grid5000 job
      key. Set the mode to GRID5000, and change the public GCS machine
      address (one machine from the set of reserved machines). The last option
      is whether to use the amortize technique or not (if you set it to “no”, then the
   10 You may modify the ~/.ssh/config file and add a shorthand name, then put this name
instead of user@address.
      frontend will let kaapiTask run for all the primes, and provide the amortize
      data to the decoder only11). Make sure the decoder address is what you
      want it to be (only an IP address is accepted for now). Finally, make sure
      that grid_data_dir and task_name are what you want them to be on Grid5000.
      Give all the configuration a final check, and make sure it is exactly what
      you want.
   4. Now, on the decoder machine, we need to perform ssh local port forwarding
      three times to reach the dataProxy (a machine from the reserved grid job),
      assuming we want the dataProxy server to run on port 9999. For example,
      run: ssh -L 9999:localhost:9999 grid. Then, on the grid access machine,
      run: ssh -L 9999:localhost:9999 frontend. Finally, run: oarsh
      -i <grid_ssh_job_key> -L 9999:localhost:9999 <one_machine_from_machines_file>.
      This will make the dataProxy appear as if it runs on the decoder machine,
      since we did local port forwarding.
   5. On another terminal, from the front end machine, do ssh local port forwarding
      to the decoder in order to secure the channel (e.g., ssh -L 5555:localhost:5555
      decoder).
Steps to launch the computations:
   1. Run the dataProxy server on the grid machine (the configuration should
      be correct, as described for local execution).
   2. On another terminal, on the decoder machine, run the decoder (make sure
      the configuration is correct).
   3. Finally, run the frontend with the appropriate user function name, grain
      size, and arguments.
In order to run online-certifier on any type of cluster, or on grids with an NFS
server accessible by all machines, do all the steps we did for Grid5000, except
the port forwarding and providing the job_key in frontend.conf. The frontend
mode should now be GRID. Other steps remain similar.
    The user also has the option to simply run kaapiTask on a set of machines
using the karun command, providing the machines file. This lets us do experiments
without using the model.

7.5     How to write a user function (Fibonacci function
        example)
In order to add a new user function, one should do four steps:
   11 This option is not recommended, since it wastes a lot of space. It is useful sometimes
when karun doesn’t behave properly: sometimes it requires manual termination (“^C”), which
is painful for an iterative model.
   • Write the class header and the body of the function (with a few constraints)
     and put them in the user-functions directory
   • Include your class header in the user-functions/userIncludes.hh file
   • Add two lines of code in user-functions/userModularFunctionBase.cpp
     (to resolve the string name to the user function object)
   • Add an entry in the Makefile
Since we want to create a function with an integer return type, we need to
extend userModularFunctionInteger. The fibo header should be as follows:
#include "userModularFunctionInteger.hh"
#include <string>
#include <vector>

class fibo : public userModularFunctionInteger {
    public:
        void userModularFunction(fixedBitSizeInteger p);
        fibo(std::vector<std::string>& args);
    private:
        int n;
};
    As we can see, we simply need to extend userModularFunctionInteger and
implement one abstract member, userModularFunction. We also need to implement
a constructor taking an argument of type vector<string>, representing the list
of user function inputs. The user then needs to perform his modular function
using the prime p given as an argument. The body would be:
#include "stdlib.h"
#include "userModularFunctionInteger.hh"
#include "fibo.hh"

using namespace std;

void fibo::userModularFunction(fixedBitSizeInteger p) {
    if (n < 2) {
        storeResult((fixedBitSizeInteger) n, p);    // return the result
    } else {
        unsigned long long fibo = 1;
        unsigned long long fibo_p = 1;
        unsigned long long tmp = 0;
        unsigned long long i = 0;
        for (i = 0; i < n - 2; i++) {
            tmp = (fibo + fibo_p) % p;
            fibo_p = fibo;
            fibo = tmp;
        }
        storeResult((fixedBitSizeInteger) fibo, p); // return the result
    }
}

fibo::fibo(vector<string>& args) : userModularFunctionInteger(args) {
    if (args.size() < 1)
        throw "no argument for fibonacci!";
    n = (fixedBitSizeInteger) atol(args[0].c_str());
}
    The user reads the arguments in the constructor and stores them in any
class attributes he wants, then uses these attributes in userModularFunction.
The most important part of userModularFunction is that whenever the user
wants to return a result, he uses the storeResult function, passing the result
and the prime p that was used. The user must use fixedBitSizeInteger for a
result of integer type.
    In other types of functions, the user may need to read a file. In this case
he needs to declare an argument as a file, so that the frontend can copy the
input file to the grid. In the constructor, the user must call setFile(index),
where index is the argument index corresponding to a file.
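    For illustration, a hypothetical user function (the class name matrixDet and the
attribute matrixPath are made up for this sketch and are not part of the actual code)
whose first argument is an input file could declare itself as follows:

#include <string>
#include <vector>
#include "userModularFunctionInteger.hh"

// Hypothetical user function whose single argument is an input file.
class matrixDet : public userModularFunctionInteger {
    public:
        matrixDet(std::vector<std::string>& args)
            : userModularFunctionInteger(args) {
            if (args.empty())
                throw "no matrix file given!";
            matrixPath = args[0];
            setFile(0);   // argument 0 is a file: the frontend copies it to the grid
        }
        void userModularFunction(fixedBitSizeInteger p);   // defined elsewhere
    private:
        std::string matrixPath;
};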
    Finally, we need to edit userModularFunctionBase.cpp: in the initializeByName
function, add:
else if (name == "fibo")
    tmp = new fibo(args);
    Now you should add an entry in the Makefile, and compile.
    In order to test the Fibonacci function modulo 7, you may run: ./test fibo
7 100.

     7.6     Read/Write atomicity
Between the dataProxy and Kaapi we have the problem of result write atomicity.
The kaapiTask on a machine might have written only part of its output when
the dataProxy reads some of it and tries to delete the rest, before the writing
has completed. One solution is to use file locks. However, file locks would force
the dataProxy, each time it reads a file, to wait until the kaapiTask finishes.
This would result in a huge slowdown if we have very slow workers, or heavy
computations on each node. We propose a simple solution, where kaapiTask
writes a special character “$” at the beginning of a file, then writes the outputs,
and when it finishes, replaces the first character by another character “\n”. The
dataProxy reads each file, and whenever the file is either empty or has the “$”
character on its first line, it considers the file locked and simply closes the file
without deleting it, as sketched below.
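    A minimal sketch of this marker protocol (the function names are illustrative and
not the actual kaapiTask/dataProxy code):

#include <cstdio>
#include <string>

// Writer side (kaapiTask): mark the file as "in progress" with a leading '$',
// write the results, then flip the marker to '\n' once everything is on disk.
void writeResultFile(const std::string& path, const std::string& results) {
    std::FILE* f = std::fopen(path.c_str(), "w");
    if (!f) return;
    std::fputc('$', f);                 // lock marker
    std::fputs(results.c_str(), f);
    std::fflush(f);
    std::fseek(f, 0, SEEK_SET);
    std::fputc('\n', f);                // unlock: the file is now complete
    std::fclose(f);
}

// Reader side (dataProxy): an empty file or a file starting with '$' is
// still being written, so it is left untouched.
bool resultFileIsReady(const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "r");
    if (!f) return false;
    int first = std::fgetc(f);
    std::fclose(f);
    return first != EOF && first != '$';
}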

     7.7     Prime generation
As this scheme requires unique primes for decoding, we provided a simple
scheme for prime generation suitable for Grid5000. kaapiTask is responsible
for this task, since the frontend gives only two values representing the required
prime range, first and last, where last − 1 is the index of the last prime.

                    Figure 7.4: Different amortize functions

kaapiTask recursively splits the prime range in two until it meets a condition:
if n = last − first ≤ grain, then each node computes a prime table of size n
that contains the primes of indices i_0 = first to i_{n−1} = last − 1. We want each
node to compute its primes to prevent bottlenecks, as we discussed in chapter 4.
One solution was to store all 21-bit primes in a file on the NFS and let machines
read the primes from it. However, this would cause a lot of read operations and
some load on the NFS. Another solution was to store these values locally, which
is difficult to achieve on Grid5000, since the home directory is mounted on the
NFS. We could store them in the /tmp directory, but this requires sending the
prime files to all machines after reserving a job. Therefore, the best solution is
to compute the primes at each node, which is not a costly operation: generating
the whole 21-bit prime table takes less than a second on a machine of the genepi
cluster at the Grenoble site of Grid5000.
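    A minimal sketch of how a node could build its local prime table by index (the
helper names and the convention that index 0 is the smallest 21-bit prime are
assumptions of this sketch, not the actual kaapiTask code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Trial division is cheap enough for 21-bit values.
static bool isPrime(uint32_t x) {
    if (x < 2) return false;
    for (uint32_t d = 2; d * d <= x; ++d)
        if (x % d == 0) return false;
    return true;
}

// Table of the primes of indices [first, last-1] among the 21-bit primes,
// counting from the smallest 21-bit prime upwards.
std::vector<uint32_t> primeTable(std::size_t first, std::size_t last) {
    std::vector<uint32_t> table;
    std::size_t index = 0;
    for (uint32_t p = (1u << 20) + 1; p < (1u << 21) && index < last; p += 2) {
        if (!isPrime(p)) continue;
        if (index >= first) table.push_back(p);
        ++index;
    }
    return table;
}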

7.8    Implementation of the amortize technique
For our implementation, we used the amortize function f(i) = f(i−1) +
f(i−1)/log2(f(i−1)) instead of doubling the redundancy. The other amortize
functions discussed in chapter 3 increase a bit too slowly for our case: the
overhead of launching new jobs using karun is high, and the resources are
already booked with no payment associated (figures 7.4 & 7.5 show the growth
of the different functions). We start the amortize function at 32 instead of 2,
to allow even faster growth.
    The frontend applies this function to compute the extra redundancy (in the
grid) whenever the decoder fails to decode. However, the frontend asks kaapiTask
for f(i) + errors, where errors = f(i) × t and t is the error rate given as an
input to the frontend. The reason for this is to suppress the case where we
might have omission faults, as discussed before. A sketch of this schedule is
given below.
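    A small sketch (with illustrative function names, not the actual frontend code) of
how the next batch size could be computed under this schedule:

#include <cmath>
#include <cstddef>

// Amortize schedule used in the implementation:
// f(i) = f(i-1) + f(i-1) / log2(f(i-1)), starting at f(0) = 32.
std::size_t nextAmortize(std::size_t previous) {
    double p = static_cast<double>(previous);
    return static_cast<std::size_t>(p + p / std::log2(p));
}

// Number of modular computations requested from kaapiTask in one round:
// f(i) plus an allowance of f(i) * t for expected erroneous or omitted results.
std::size_t requestSize(std::size_t fi, double errorRate) {
    return fi + static_cast<std::size_t>(std::ceil(fi * errorRate));
}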


        Figure 7.5: ρ-amortize, ρ(i) = 1.9·i/log(i), vs f(i) = f(i−1) + f(i−1)/log(f(i−1))


7.9    An application of matrix determinant computations
In this part we discuss an effective application: the matrix determinant. For
large matrices, the determinant computation can be very costly, since operations
in Z are costly. Computing the determinant many times over F_{p_i}, for coprime
integers p_i, reduces the cost of the operations, since each p_i is much smaller
than the values arising in Z. However, in some cases, when we have large matrices
(e.g., 1000 x 1000), each modular computation also takes a considerable time
(of course much less than in Z); therefore, performing the computations in
parallel is important.
    The number of modular computations depends on the size n of the matrix A
and on the size of the elements of A. The Hadamard bound requires |det(A)| ≤
n^(n/2) · (max|a_{i,j}|)^n ≤ ∏_i p_i, where a_{i,j} is an element of A.
    Let k be the number of primes p_i required for the Hadamard bound, such
that k ≥ ((n/2)·log(n) + n·log(max|a_{i,j}|)) / log(p_i). We tested a simple dense matrix of size 100
with a maximum of 30-bit elements, using 21-bit primes. The Hadamard bound
requires 167 primes, whereas computing this matrix in an online manner, with
a linear amortize technique (i.e., incrementing the redundancy by 1 each time)
and a 0 error rate in the network, gave us an average of 156 primes for different
random matrices of the same size. Using the amortize technique, it takes
around 168 primes, which is very little overhead for this case.
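    A small sketch of how the Hadamard-based prime count can be computed (the
function name is illustrative; a b-bit prime is lower-bounded by 2^(b−1)):

#include <cmath>
#include <cstddef>

// Number of primeBits-bit primes needed so that their product exceeds the
// Hadamard bound of an n x n matrix whose entries are bounded by maxEntry:
// k >= ( (n/2) * log(n) + n * log(maxEntry) ) / log(p_i).
std::size_t primesForHadamard(std::size_t n, double maxEntry, unsigned primeBits) {
    double logBound = 0.5 * n * std::log(static_cast<double>(n))
                    + n * std::log(maxEntry);
    double logPrime = (primeBits - 1) * std::log(2.0);   // p_i > 2^(primeBits-1)
    return static_cast<std::size_t>(std::ceil(logBound / logPrime));
}

With n = 100, 30-bit entries (maxEntry = 2^30) and 21-bit primes, this gives the
167 primes mentioned above; with n = 1000 and 100-bit entries, it gives the 5250
primes used below.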
    Using the online certifier, we computed the determinant of a dense matrix of
size 1000, with 100-bit elements, using 160 cores and an error rate of 0 for the
grid. The determinant was about 30849 digits long. The system used a total of
7903 primes to compute the determinant while the Hadamard bound requires
5250 primes (an overhead of 0.336), which is good since we have a dense matrix
with all elements of 100 bits (which is almost the worst case). The total execution
time was around 10 minutes12.

7.10     Current limitations
The grid implementation is currently limited to a single cluster, where all
machines are connected to a single NFS. In order to allow multiple clusters, we
first need to let the front end synchronize the user input files to all the NFS
servers, and also let the dataProxy read from all the NFS servers involved.
    Another limitation is that the Linbox executable is quite large (e.g., kaapiTask
is around 5 MB), which slows down the execution over a distributed environment.
Another issue is that Linbox is not thread-safe. For computing the matrix
determinant, we use the function det from the Linbox library, which is a global
function. For the time being, we use the karun command with the option -t 1
to run one thread per process. We could develop wrapper code in the future to
synchronize access to det, for instance along the lines of the sketch below. The
current implementation on Grid5000 requires 3 levels of ssh port forwarding,
which takes some time to get used to. Also, the current implementation doesn’t
dynamically allocate resources at run time. The Grid API13 could be used in
the future for such a purpose.
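    A possible sketch of such a wrapper, assuming nothing about the Linbox det
signature (the guarded call is wrapped in a caller-provided function object):

#include <mutex>
#include <utility>

// One global lock guarding every call that goes through serialized().
static std::mutex detMutex;

// Run any non-thread-safe routine (e.g., a call to Linbox's det) under the
// global lock, so that several threads never enter it concurrently.
template <typename Function, typename... Args>
auto serialized(Function&& f, Args&&... args)
        -> decltype(f(std::forward<Args>(args)...)) {
    std::lock_guard<std::mutex> lock(detMutex);
    return f(std::forward<Args>(args)...);      // the guarded call
}

A call such as serialized([&] { return det(result, A); }) (with det, result and A
being whatever the user function already manipulates) would then replace the direct
call.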




  12 The implementation requires some optimizations to get better timing performance. Due
to lack of time, we couldn’t provide a detailed performance analysis.
  13 The API is still under development for most of the Grid5000 sites.
Conclusion

Throughout this work, we showed how we can efficiently use untrusted resources
for modular computation, using few trusted resources (mainly a certifier) to
verify the computations. We provided a simple yet strong scheme to perform
a sequential function in parallel, and showed in detail several types of risks
and possible attacks, and whether or not the system is resilient to each type.
This scheme could be extended to include more functions with an integer return
type, or even with vector or matrix return types. We showed the example of
computing a matrix determinant as an effective application which requires huge
computing power for certain inputs. More applications of this scheme could
be discovered in the future: we believe that our proposed scheme could be
utilized in several fields and several applications, because it covers a wide space
of user requirements. A possible enhanced implementation could be a library of
user functions provided as a service to third parties, who may use it for performing
heavy computations. This service could either be provided freely to researchers
only, or one could make a business out of it, of course with a set of user-demanded
user functions.

Prospectives
This work could be improved in several ways and directions. First, we would
like to let the decoder use the certification computations in its CRT and treat
them in a special way to enhance the performance. This will require a change
in the model, where we store the certifier’s results in secure storage and let
the decoder read from it14. We would also like to improve the decoder to
handle vectors and matrices, as the user function may return these types.
    Moreover, we think it is important to study the certification error probability
and provide an accurate bound for it, because a user may really need
a clear bound for his application when using our functions. We may incorporate a
weak form of blacklisting to reduce error rates in a public GCS, as studied
by Germain-Renaud and Monnier-Ragaigne [4]. A prospective work could improve
the implementation to handle multiple clients, and several decoders behind a
load balancer to reduce the load on a single machine. The same might be
applied to the certifier. The front end could be improved to manage several
GCS systems and, according to each
  14 More details are in Thomas’s master thesis.



system’s cost and efficiency criteria, and based on the user priority or quality
of service, the front end may choose an appropriate GCS system for each user
function call. This will add a lot more complexity to the system, but it would
increase the efficiency and the quality of service. The amortize technique could
be used differently in models where the machine cost is associated only with
usage time. One could study how to control the data rate from the data proxy
to the decoder by adding more machines dynamically, and control the increase
of this rate using the amortize control technique, so as to reach the unknown
bound efficiently while using the full potential of the booked machines. This
could be useful for clusters and classical grid systems. Another direction to
improve this work is to introduce dependencies for certain user functions. For
example, a user may want to write a function for a matrix-matrix product, and
fork each vector-vector multiplication on a different node. We want the system
to handle this requirement transparently for the end user, providing simple
libraries to control the execution tree.
    Throughout this work, we provided proofs and a theoretical study for the
case of modular computations. However, the same scheme may apply to floating
point functions, with a different way of distributing tasks and decoding. This
would make the scheme much more effective, since dozens of applications in
statistics and other fields rely on floating point functions. The vision of this
work is to have an online library of different types of functions (exact and
floating point), accessible remotely through APIs and interactive user interfaces.
A public cloud would hide all complexities from the user and provide on-the-fly
computations with certified results to the public and other business parties.
Bibliography

[1] A. L. Beberg, J. Lawason, D. MacNett, distributed.net home page.
    http://www.distributed.net
[2] Folding@Home website. http://folding.stanford.edu
[3] SETI@home website. http://setiathome.ssl.berkeley.edu
[4] Germain-Renaud, C. and Monnier-Ragaigne. Grid result checking. In
    Proceedings of the 2nd Conference on Computing Frontiers (CF ’05), ACM,
    New York, NY, pages 87-96, May 2005.
    DOI: http://doi.acm.org/10.1145/1062261.1062280
[5] TCPA Main Specification 1.1b. http://www.trustedcomputing.org
[6] Roch, Jean-Louis and Varrette, Sébastien. Probabilistic certification of
    divide & conquer algorithms on global computing platforms: application
    to fault-tolerant exact matrix-vector product. Pages 88-92, London,
    Ontario, Canada, ACM, 2007. ISBN 978-1-59593-741-4.
    DOI: http://doi.acm.org/10.1145/1278177.1278191
[7] Jean-Louis Roch, Samir Jafar and Sébastien Varrette. A Probabilistic
    Approach for Task and Result Certification of Large-Scale Distributed
    Applications in Hostile Environments. ISBN 978-3-540-26918-2, 2005
[8] Faith Fich and Eric Ruppert. Hundreds of Impossibility Results for Dis-
    tributed Computing. Distributed Computing. Volume 16. Pages 121-163,
    2003
[9] Xavier Défago. Agreement-related problems: from semi-passive replication
    to totally ordered broadcast, 2000
[10] Clément Pernet, Jean-Louis Roch, Thomas Roche. Fault-tolerant
     Polynomial Interpolation, 2009
[11] Mandelbaum, D. On a class of arithmetic codes and a decoding algorithm,
     1976




[12] H. Higaki, K. Shima, T. Tachikawa, M. Takizawa. "Checkpoint and
     Rollback in Asynchronous Distributed Systems," in INFOCOM ’97:
     Sixteenth Annual Joint Conference of the IEEE Computer and
     Communications Societies, p. 998, 1997.
[13] BOINC website. http://boinc.berkeley.edu
[14] Mersenne website. http://www.mersenne.org
[15] James Cowling. HQ Replication, 2007
[16] Luis F. G. Sarmenta. Sabotage-tolerance mechanisms for volunteer com-
     puting systems. Future Generation Computer Systems, 18(4):561–572,
     2002
[17] JXTA website. https://jxta.dev.java.net
[18] OGSA home page. http://www.globus.org/ogsa
[19] MPI home page. http://www.mcs.anl.gov/research/projects/mpi
[20] Kaapi website. http://kaapi.gforge.inria.fr
[21] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph,
     Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin,
     Ion Stoica, and Matei Zaharia, Above the Clouds: A Berkeley View of
     Cloud Computing, 2009
[22] Serdar Boztaş, Hsiao-Feng Lu. Applied algebra, algebraic algorithms and
     error-correcting codes: 17th international symposium, AAECC-17, Ban-
     galore, India, December 16-20, 2007
[23] O. Beaumont, E.M. Daoudi, N. Maillard, P. Manneback, and J.-L. Roch.
     Tradeoff to minimize extra-computations and stopping criterion tests for
     parallel iterative schemes. In PMAA’04, 2004.
[24] Luciano Soares, Clément Ménier, Bruno Raffin, and Jean-Louis Roch.
     Work Stealing for Time-constrained Octree Exploration: Application to
     Real-time 3D Modeling
[25] Francois Galilee, Gerson G. H. Cavalheiro, Jean-Louis Roch, Mathias Dor-
     eille. Athapascan-1: On-Line Building Data Flow Graph in a Parallel Lan-
     guage
[26] Grid5000 website. https://www.grid5000.fr
[27] Linbox website. http://www.linalg.org
[28] GMP website. http://gmplib.org
[29] Givaro home page. http://www-lmc.imag.fr/CASYS/LOGICIELS/givaro
[30] OAR description page. https://www.grid5000.fr/mediawiki/index.php/OAR2
     (retrieved on 1 September 2009)
[31] Online-certifier project site. http://code.google.com/p/online-certifier.