									                           DEPARTAMENTO DE ENGENHARIA INFORMÁTICA
                              FACULDADE DE CIÊNCIAS E TECNOLOGIA
                                   UNIVERSIDADE DE COIMBRA

  Extension of BOINC middleware
   to a Peer-to-Peer Architecture
                                              July 2007

              Fernando Luís Todo-Bom Ferreira da Costa
                                     Under the supervision of

                                 Professor Luís Moura e Silva

This thesis was submitted to the University of Coimbra in partial fulfilment of the requirements for
                                     the Master of Science degree.

Abstract

    This dissertation presents two new models for data distribution in BOINC: an approach based on the
popular BitTorrent protocol, and a Content Delivery Network (CDN) based approach, where mirrors act as
Content Delivery Servers. We design and implement a prototype for both scenarios, to allow for a
comparison between two different data distribution paradigms when applied to Volunteer Computing, or
more specifically, BOINC.
    The BitTorrent implementation, named BT BOINC, was extensively tested in a medium-scale
environment and provided interesting results. We measured the impact of the BitTorrent components on
both the BOINC client and server, and compared it with the original BOINC. This allowed us to discover
an abnormal result in the server's network output that should be further analyzed. We showed that the
BitTorrent client has a negligible influence on the client's computation time, even when it is seeding
during half that time, and determined the specific overhead in clients and servers.
    The CDN BOINC prototype is still at an early stage: for lack of time and machines, not enough
experiments were run on this first version to evaluate its performance before moving on to more
ambitious ideas, such as using a content distribution mechanism like FastReplica to distribute and
monitor files throughout the network.


Acknowledgements

    I would like to thank Professor Luís Silva for his support and help throughout this thesis, and
especially for the constructive criticism that helped me clarify my ideas and structure my work.

    Secondly, I want to thank all my colleagues in the Department, and in particular my “lab partners”,
who shared my frustrations, helped me with my mistakes and pushed me forward when I lacked the
motivation. My thanks to them.

    I would like to address a special thank you to Gilles Fedak for allowing me to have an account on the
Grid’5000 network. This work would not have happened were it not for that opportunity.
    Experiments presented in this thesis were carried out using the Grid'5000 experimental testbed, an
initiative from the French Ministry of Research through the ACI GRID incentive action, INRIA, CNRS,
RENATER and other contributing partners (see https://www.grid5000.fr).

    Outside the working environment, and because there is life beyond informatics, I would like to
express my gratitude towards all my friends, who have put up with me and honored me with their
friendship and support.

    Finally, I want to thank my sister Sara, and my parents, Fernando and Ana, for their unconditional
support, for the opportunities they have given me, and mostly for their unrelenting and unfailing trust in
me. I am forever in your debt.

Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
INTRODUCTION
    1.1 MOTIVATION
        1.1.1  Two New Scenarios
    1.2 OBJECTIVES
    1.3 STRUCTURE
STATE OF THE ART
    2.1 EXISTING VOLUNTEER AND GRID COMPUTING PROJECTS
        2.1.1  Grid Computing vs. Volunteer Computing
        2.1.2  The Potential of Volunteer Computing
        2.1.3  BOINC
        2.1.4  Volunteer Computing Projects
        2.1.5  Grid Computing Projects
    2.2 DATA STORAGE SYSTEMS AND CONTENT DELIVERY NETWORKS
        2.2.1  Data Storage Systems
        2.2.2  Content Delivery Networks
    2.3 PEER-TO-PEER AND NAT TRAVERSAL
        2.3.1  NAT Traversal
        2.3.2  P2P
        2.3.3  Application of Super Peer Protocol on Grid
BITTORRENT ON BOINC
    3.1 CONCERNS
        3.1.1  Firewall & Router Configuration
        3.1.2  Data Integrity
        3.1.3  Adaptable Network Topology
        3.1.4  BOINC Integration
    3.2 BITTORRENT SCENARIO
        3.2.1  Server
        3.2.2  Client
        3.2.3  BT BOINC File Transfer
    3.3 EXPERIMENTAL RESULTS
        3.3.1  Testbed
        3.3.2  BOINC Project
        3.3.3  Monitoring Tools
        3.3.4  Test Cases
    3.4 PROBLEMS AND SHORTCOMINGS
    3.5 FUTURE WORK
CDN ON BOINC
    4.1 CLIENT REDIRECTION
    4.2 HEALTH MONITORING
        4.2.1  Local Monitoring
        4.2.2  Peer Monitoring
    4.3 CONTENT DISTRIBUTION
    4.4 EXPERIMENTAL RESULTS
    4.5 FUTURE WORK
CONCLUSION
    5.1 LESSONS LEARNED
    5.2 CONTRIBUTION
REFERENCES
PUBLICATIONS

Chapter 1
  Introduction
    The use of the computational power of personal computers distributed all over the world has been
steadily increasing in popularity. Desktop Grids have been extremely successful in bringing large
numbers of donated computing systems together to form computing communities with vast resource
pools. These types of systems are well suited to perform highly parallel computations that do not require
any interaction between network participants.
    One of the first and most successful projects of this kind was SETI@Home, which has gathered
nearly 5 million participants over the internet, and presently provides a steady processing rate of almost
100 TFLOPS. After the success of SETI@Home, dozens of other similar initiatives executing intensive
computation applications were developed, like Einstein@Home, Folding@Home, Climateprediction.net,
or Rosetta@Home. All these projects use a middleware platform developed by the same team as
SETI@Home, called BOINC (Berkeley Open Infrastructure for Network Computing).
    Volunteer computing platforms such as BOINC, which rely on computer cycles donated by
communities of ordinary citizens, are currently the most successful Desktop Grid systems. BOINC is currently
being successfully used by many projects to analyze data, and with a supportive user community can
provide compute power to rival that of the world’s supercomputers. In the current implementation of
these systems, network topology is restricted to a strict master/worker scheme, generally with a fixed set
of centrally managed project computers distributing and retrieving results from network participants. The
potentially large user communities that become involved in volunteer computing initiatives can easily
result in large network requirements for host projects, forcing them to upgrade their computer hardware
and network availability as their projects rise in popularity.
    The centralized data architectures currently employed by BOINC and other Desktop Grid systems
can become a bottleneck when tasks share large input files or the central server has limited
bandwidth. With new data management technologies, Desktop Grid users will be able to explore new
types of data-intensive application scenarios – ones that are currently prohibitive given their large
data transfer needs. This lack of a robust data solution often discourages application developers from
embracing a Desktop Grid environment, or forces users to scale back their applications to only problems
that do not rely upon large data sets. There are many applications that, given more robust data
capabilities, could either expand their current problem scope, or migrate to a Desktop Grid environment.

 1.1 Motivation

    The BOINC platform follows a simple model: each project runs a central server hosting a Master
application. Applications are divided into thousands of smaller tasks that are sent to machines spread
over the internet, where they execute Worker type applications. There is no communication between
workers and all communication must be from the Worker to the Master, to enable the traversal of NATs
and firewalls.
    BOINC applications are therefore limited to this Master/Worker model, with a central server
responsible for distributing work to BOINC clients. It follows a simple network protocol, shown in Figure
1, that requires clients to initiate all communications (because of NAT/firewall problems), and to
contact the server every time they require more work.

                                   Figure 1 – Network Protocol Overview

    Every time a client is idle and wishes to execute more work, it has to contact the main server three
times (the first connection, to get the master file, is normally only needed once, when the client first
attaches to a project), following these steps:
         •    Contact the Scheduler to ask for a new result to execute. The scheduler reply contains
              information on the result: the input files and executables the client needs to download, and
              the URLs of the data servers it should download files from and upload files to;
         •    Download the input files and executables from the data servers indicated by the scheduler;
         •    Upload the output files to the data servers indicated by the scheduler.
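    The three-step cycle above can be sketched in miniature. Everything in this sketch (the class names,
the reply fields, the file names) is an illustrative stand-in for BOINC's actual XML-over-HTTP exchange,
not its real API:

```python
class Scheduler:
    """Stand-in for a project scheduler: hands out one result descriptor.

    A real scheduler reply is XML and lists input files, executables,
    and data-server URLs; this dict merely models that shape."""
    def assign_result(self):
        return {"downloads": [("ds1", "app.exe"), ("ds1", "input.dat")],
                "upload_url": "ds1"}

class DataServer:
    """Stand-in for a BOINC data server: stores files by name."""
    def __init__(self, files):
        self.files = dict(files)
    def get(self, name):
        return self.files[name]
    def put(self, name, data):
        self.files[name] = data

def run_work_cycle(scheduler, data_servers):
    # Step 1: contact the scheduler to ask for a result to execute.
    reply = scheduler.assign_result()
    # Step 2: download inputs/executables from the indicated data servers.
    inputs = [data_servers[ds].get(f) for ds, f in reply["downloads"]]
    # (the Worker application would run here; we fake its output)
    output = "processed:" + "+".join(inputs)
    # Step 3: upload the output file to the indicated data server.
    data_servers[reply["upload_url"]].put("out.dat", output)
    return output
```

Note that every call is initiated by the client, which is what lets the protocol work through NATs and
firewalls.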

    BOINC’s architecture, shown in Figure 2, together with the protocol described above, reveals a
heavy dependence on each project’s central server, which suggests a potential bottleneck.

                                Figure 2 – BOINC’s Project components

    There are two interfaces between the server and each client: the scheduler and the data server.
When considering how to remove the central point of failure, one must study the organization of the
central server to identify which components can be transferred to the client or distributed among many
machines.
                                  Figure 3 – Main Server Components

    Figure 3 shows the organization of the main server, with a central coordinating point: the database.
The heavy dependence of all the other components (transitioner, assimilator, etc.) on the MySQL database
limits the decentralization of the main server, since distributing the DB would bring more
problems (security, reliability, intrusiveness) than advantages.

    As the DB would remain centralized, the scheduling of tasks would still have to go through the main
server, and would not allow for a completely independent, distributed scheduling mechanism in clients.
Therefore, if we were to consider a P2P distributed architecture to distribute responsibility among clients,
we could, for instance, implement a Super Node architecture, where clients would be divided into two classes:
    Super Nodes (SNs) – clients that met special requirements would be given more responsibilities;
    Ordinary Nodes (ONs) – normal clients that would be connected to SNs.

    Super Nodes would act as both data servers and schedulers, and Ordinary Nodes would be unaware
of the new architecture and would continue to act as usual. The implementation of the scheduler in a SN
would involve pre-fetching of Work Units, to decrease the number of connections to the main server’s
scheduler. However, as stated before, there are many limitations and disadvantages to the implementation
of a scheduler in a client, such as:
         •    The communication would still have to go through the main server’s database;
         •    Each SN would be in possession of several Work Units and could alter their descriptions,
              which would weaken security for the end-user;
         •    The failure of a SN would mean the loss of its Work Units, so redundancy would have to be
              increased (more unnecessary computation, slower project progress).

    Taking this into account, and considering that the main bottlenecks occur in data transfers (schedulers
only need to handle small, XML-encoded HTTP messages, and no file transfers), the
implementation of a Scheduler in clients would not pay off.

    The distribution of data through the clients could, on the other hand, bring a real advantage to
the BOINC architecture. The bottleneck of the main server is most noticeable when a new application is
released, because the application is the largest file in most projects (SETI@home, for instance, has 350KB
work units, while its application is around 2MB). An application update or introduction happens only every
few months, however, so this alone would not be reason enough to introduce data distribution in BOINC.
    However, there are projects whose work units need large input files, such as
Climateprediction.net and Einstein@home, with input files around 15MB. In these cases, file
replication and distribution throughout the clients could decrease the bandwidth used on the main server.
    Einstein@home’s work units need a data file that can go up to 25MB in size, and each data file can
be used by several work units, decreasing the bandwidth used per work unit. The computation of a work
unit takes 8 hours on a 1.8GHz machine, and each client can compute up to 12 work units with a single
data file, the average being 4 for fast connections (dial-up clients have a higher average). On average,
every 5 days, a client has to download a new data file that, in the worst case scenario, can be 20MB. This
means that the bottleneck in the central server would happen much more frequently for Einstein@home.
This kind of project would benefit the most from data distribution, and will serve as the basis for this
work.
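    Taking the figures above at face value, a back-of-the-envelope estimate of the data-file load on the
central server is straightforward (the client count below is an arbitrary illustrative number, not a
measured value):

```python
# Per-client data-file bandwidth for an Einstein@home-like project,
# using the worst-case figures quoted above: a 20MB file every ~5 days.
file_size_mb = 20.0       # worst-case data file size
days_per_file = 5.0       # a client fetches a new data file every ~5 days

mb_per_client_per_day = file_size_mb / days_per_file   # 4 MB/day per client

# With e.g. 100,000 active clients (illustrative only), the central
# server would have to ship on the order of hundreds of GB per day
# in data files alone:
clients = 100_000
server_gb_per_day = clients * mb_per_client_per_day / 1024
print(round(server_gb_per_day, 1))  # prints 390.6
```

Even at only 4MB per client per day, the aggregate load scales linearly with the number of participants,
which is exactly the cost that client-side distribution would offload from the project server.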

    Therefore, projects and participants are limited to the bandwidth of the project servers. This
limitation becomes more noticeable when a project has large files or limited bandwidth. The use of
mirrors lessens this effect, but it is more desirable to explore a less expensive solution that does not
require additional machines.

 1.1.1            Two New Scenarios

    The centralized architecture of BOINC not only creates a single point of failure (or, with mirrored
servers, a small number of failure points) and potential bottlenecks, but it also fails to take advantage of
client-side network bandwidth and capabilities. If client-side network bandwidth could be successfully
utilized to distribute data sets, not only would it allow for larger data files to be distributed, but it would
also minimize the network capabilities needed by BOINC projects, thereby substantially lowering
operational costs.
    Peer-to-Peer (P2P) data sharing techniques can be used to introduce a new kind of data distribution
system for volunteer and Desktop Grid projects – one that takes advantage of client-side network
capabilities. This functionality could be implemented in a variety of forms, ranging from BitTorrent-style
networks where all participants share equally, to more constrained and customizable unstructured P2P
networks where certain groups are in charge of data distribution and discovery. These approaches,
although similar in nature, each have their own distinct advantages and disadvantages, especially when
considered in relation to a scientific research community utilizing volunteer resources.
    We have chosen to use BitTorrent because it has proven to be scalable and efficient and would be
especially beneficial to projects that:
         •    Have large input files;
         •    Use the same input file for several work units.

    Furthermore, it would be advantageous for projects that have limited or slow outbound connections
from the central project server to use BitTorrent, since it would significantly reduce the bandwidth
demands on that server.

    The current distribution of data throughout the mirrors is also rudimentary. The choice of which
mirror serves which client request is made either with a round-robin/random algorithm or based on the
client’s timezone: the server closest to the client serves the request. This causes various problems, since
timezones can only distinguish servers that are far apart, and neither approach takes server load or
response time into account. Furthermore, the mirror servers are chosen from a static file, and are not
checked or pinged for liveness, so if a server goes down, this goes unnoticed.
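    The two selection policies just described can be sketched as follows. The mirror list is invented for
illustration (BOINC reads its list from a static file), and note that nothing in either policy checks whether
the chosen mirror is actually alive, loaded, or slow:

```python
import itertools

# Static mirror list, as read from a configuration file:
# (URL, UTC timezone offset of the server). Entries are illustrative.
MIRRORS = [("http://us.example.org",   -5),
           ("http://eu.example.org",    1),
           ("http://asia.example.org",  8)]

_rr = itertools.cycle(MIRRORS)

def pick_round_robin():
    # Round-robin: ignore the client entirely, just rotate.
    return next(_rr)[0]

def pick_by_timezone(client_tz):
    # Timezone-based: pick the mirror whose timezone offset is closest
    # to the client's. Load, response time, and liveness never enter
    # the decision, and nearby timezones are indistinguishable.
    return min(MIRRORS, key=lambda m: abs(m[1] - client_tz))[0]
```

A client in UTC+0 is sent to the European mirror even if that mirror is overloaded or down, which is
precisely the weakness the CDN model of this thesis targets.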

    Considering the potential of Volunteer Computing and the reputation of SETI@Home and BOINC, it
is reasonable to expect a growth in the number of participants, and in the complexity of
applications. The decentralization of BOINC’s architecture, or the use of a CDN, is a step towards this
growth.
    To date, Desktop Grid systems have focused primarily on utilizing spare CPU cycles, yet have
neglected to take advantage of client network capabilities. Leveraging client bandwidth will not only
benefit current projects by lowering their overheads, but will also facilitate Desktop Grid adoption by
data-heavy applications. On the other hand, current mirroring in BOINC is very limited and takes neither
server downtime nor load into account, which can hurt performance. The goal of
this thesis is to compare two new models for data distribution:
        •    A BitTorrent model, where file distribution is achieved through the highly successful
             BitTorrent protocol.
        •    A CDN model, where the project data servers form a Content Delivery Network, like
             CoDeeN [*] and Coral [*]. The objective is to extend BOINC's software into this
             organization, maintaining a central model, but organizing Data Servers as a Collaborative
             Content Delivery Network (CCDN).

1.2 Objectives

    The main goal of this thesis is to distribute the data layer on BOINC, and test these new architectures.
This can be achieved by focusing on the following objectives:
        •    Change the data distribution layer by applying two new models: a BitTorrent model and a
             CDN model;
        •    Evaluate the BitTorrent scenario by measuring file distribution times, bandwidth and CPU
             usage on the central server, and compare it to the original BOINC;
        •    Evaluate the CDN scenario by testing new features, such as identifying dead Content
             Delivery Servers, while checking for overhead.

1.3 Structure

    The rest of this document is organized as follows:
        •    State of the Art: in this section, a review of related research is presented. It covers
             existing techniques and tools that can help with the project, and shows that the problem at
             hand has not been solved yet. It is divided into three subsections: Existing Volunteer and Grid
             Computing Projects; Data Storage Systems and Content Delivery Networks; and Peer-to-
             Peer and NAT Traversal.
        •    BitTorrent on BOINC: the BitTorrent model is presented here, in two subsections: its
             architecture, the changes to BOINC and the new procedures are described in the first
             subsection; results are presented and discussed in the second.
        •    CDN on BOINC: following the lines of the previous chapter, here the second model is
             presented and discussed.

        •    Conclusion: in this chapter, this thesis’ main conclusions are presented, and future
             directions of research are described.

Chapter 2
  State of the Art
    The state of the art is divided into three main subsections:
         •   Existing Volunteer and Grid Computing Projects: an overview of the most significant
             projects in these areas.
         •   Data Storage Systems and Content Delivery Networks: a description of the existing
             alternatives for data distribution in CDNs and Data Storage Systems.
         •   Peer-to-Peer and NAT Traversal: a survey of the most significant P2P projects in existence,
             and of NAT/firewall traversal techniques.

    The analysis of such systems and projects will help us determine the shortcomings and strengths of
the existing work, and show that our work not only builds upon these foundations, but extends what has
been done so far.

 2.1 Existing Volunteer and Grid Computing Projects

    The creation of Condor [*], as one of the first Grid Computing middleware projects, paved the way
for numerous Desktop Grid projects that, instead of harnessing computational power from clusters within
organizations, sought to take advantage of the internet and distributed desktop users.
    An analysis of both Grid and Volunteer Computing projects is essential to understand the state of the
art on this subject, and to identify the ground covered so far and the opportunities worth exploring.

 2.1.1              Grid Computing vs. Volunteer Computing

    Grid Computing involves organizationally-owned resources: supercomputers, clusters, and PCs
owned by universities, research labs, and companies. These resources are centrally managed by IT
professionals and are connected by full-time, high-bandwidth network links. There is a symmetric
relationship between organizations: each one can either provide or use resources. Malicious behavior such
as intentional falsification of results would be handled outside the system, e.g. by firing the perpetrator.
Redundant computing, cheat-resistant accounting, and support for user-configurable application graphics
are not necessary in a Grid system.

    Public-Resource Computing (or Volunteer Computing) runs on open internet resources. Any
user can volunteer their PC to participate in a public computation. It represents an asymmetric
relationship between projects and participants. Projects are typically small academic research groups with
a grand-challenge problem that requires high-turnaround computing, and without enough
computing resources of their own. Most participants are individuals whose desktop PCs are connected to the
Internet by telephone or cable modems or DSL, and are often behind network-address translators (NATs)
or firewalls. The computers are frequently turned off or disconnected from the Internet. Participants join a
project only if they are interested in it and receive “incentives” such as credit and screensaver graphics.
Projects have no control over participants, and cannot prevent malicious behavior. The platform must
accommodate many existing commercial and research-oriented academic systems, and must provide a general
mechanism for resource discovery and access. In fact, it must address all the issues of dynamic
heterogeneous distributed systems, an active area of Computer Science research for several decades.

    Despite their differences, Grid and Volunteer Computing have the same objective: gathering
resources for large-scale distributed computing and/or storage. Volunteer Computing has greater
potential, especially considering it can be applied to every computer in the world, and without the
financial drawback. However, it is important to analyze the projects in both fields to develop a broader
view and understanding of distributed systems.

 2.1.2            The Potential of Volunteer Computing

    A study about volunteer computing is presented in [1]; it was carried out using BOINC [2] on the
SETI@home [3] project.
    Of the participating hosts, 25% had 2 or more CPUs, and the average memory was 819MB of RAM
and 2.03GB of swap. The average network throughput was 289Kbps (download only; upload was negligible).
The BOINC client measures the amount of total and free disk space on the volume where it is installed:
these average 63GB and 36GB, respectively. The total free space is 12 Petabytes. At the time the article was
written, there were 1 million participants (a few hundred thousand for each project). The average host
lifetime is 91 days. The mean on-fraction (fraction of real time during which the BOINC client is running
on the host) was 0.81, connected-fraction (fraction of the time that BOINC is running that a physical
network connection exists) was close to 1 for hosts with LAN and DSL connections and the mean was
0.83. The average active-fraction (fraction of the time that BOINC is running when BOINC is allowed to
compute and communicate) was 0.84. SETI@home, at the time of the study, had a potential processing
rate of 149.8 TeraFLOPS. The host pool provides processing at a sustained rate of 95.9 TFLOPS. It also
has the potential to provide 7.74 Petabytes of storage, with an access rate of 5.27 Terabytes per second.

    These numbers prove that Volunteer Computing already offers incredible computational power, even
though there is still much ground to cover in the future (only a small percentage of the world’s computers
are currently participating).

 2.1.3            BOINC

    In SETI@home [4], the client program computes a result (a set of candidate signals), returns it to the
server, then gets another work unit. There is no communication between clients. SETI@home does
redundant computation: each work unit is processed multiple times, allowing it to detect and discard
results from faulty processors and from malicious users. The task of creating and distributing work units
is done by a server complex. The resulting work units are 350KB (keeps a typical computer busy for
about a day). There is a relational database to store information and a multithreaded data/result server to
distribute work units to clients. The server uses a HTTP-based protocol so clients inside firewalls are able

to contact it. A “garbage collector” program removes work units from disk, clearing an on-disk flag in
their database records; it deletes work units that have been sent M times, where M is slightly more than the
redundancy level (so some work units never produce results). The client can be configured to
compute only when its host is idle or to compute constantly at a low priority. The program periodically
writes its state to a disk file, reading the file on startup. The client can run as a background process, as a
GUI application, or as a screensaver. Upon receiving a result, the data server writes the result to a
disk file. A program reads the files, creating result and signal records in the database. For each result, the
server writes a log entry describing the result’s user, its CPU time, and more. A program reads these log
files, accumulating in a memory cache (flushed to the DB every few minutes) the updates to all relevant
database records. A “redundancy elimination” program examines each group of redundant results and
uses an approximate consensus policy to choose a representative result for that work unit. These results
are copied to a separate database.
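The “approximate consensus” step can be sketched as follows. This is an illustrative reconstruction, not SETI@home's actual code: since candidate signals contain floating-point fields, redundant results are matched within a tolerance rather than bit-for-bit, and the result that agrees with the most other results is chosen as the representative.

```python
def signals_match(a, b, tol=1e-3):
    """Two candidate signals agree if all their fields differ within a tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def choose_representative(results, tol=1e-3):
    """Pick the result that (approximately) agrees with the most other results.

    Each result is a list of signals; each signal a tuple of floats.
    Returns None when no result agrees with any other (no consensus).
    """
    def results_agree(r1, r2):
        return len(r1) == len(r2) and all(
            signals_match(s1, s2, tol) for s1, s2 in zip(r1, r2))

    best, best_votes = None, 0
    for i, r in enumerate(results):
        votes = sum(results_agree(r, other)
                    for j, other in enumerate(results) if i != j)
        if votes > best_votes:
            best, best_votes = r, votes
    return best
```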
     SETI@home is the most popular and widely embraced project in Volunteer Computing. It is
therefore an excellent starting point for understanding the main advantages as well as the main
shortcomings of this paradigm. It has not found signs of extraterrestrial life, but together with related
distributed computing and storage projects, it has established the viability of public-resource computing
in which computing resources are provided by the general public. The team that developed SETI@home
(at the U.C. Berkeley Space Sciences Laboratory) advanced even further by creating a platform that could
support various Internet-scale grid-computing projects like SETI@home.

     This platform was named BOINC [5] (Berkeley Open Infrastructure for Network Computing). The
server complex of a BOINC project is centered on a relational database that stores descriptions of
applications, platforms, versions, workunits, results, accounts, teams, and so on. Server functions are
performed by a set of web services and daemon processes:
     Scheduling servers handle RPCs from clients; they issue work and handle reports of completed results.
     Data servers handle file uploads using a certificate-based mechanism to ensure that only legitimate
files, with prescribed size limits, can be uploaded. File downloads are handled by plain HTTP.

     BOINC uses a set of abstractions to describe the files, applications and data. A workunit (WU)
represents the input to a computation: the application, a set of references to input files, and sets of
command-line arguments and environment variables. A result consists of a reference to a WU and a list of
references to output files. Files have project-wide unique names and are immutable but can be replicated:
the description of a file includes a list of URLs from which it may be downloaded or uploaded. Files can
have associated attributes indicating, for example, that they should remain resident on a host after their
initial use. When the BOINC client communicates with a scheduling server it reports completed work,
and receives an XML document describing a collection of the above entities. BOINC’s computational
system also provides a distributed storage facility (of computational inputs or results) as a byproduct
(much different from P2P storage systems like Gnutella). BOINC provides support for redundant
computing, a mechanism for identifying and rejecting erroneous results. A project can specify that N

results should be created for each workunit. Once M ≤ N of these have been distributed and completed, an
application-specific function is called to compare the results and possibly select a canonical result. If no
consensus is found, or if results fail, BOINC creates new results for the workunit, and continues this
process until either a maximum result count or a timeout limit is reached. BOINC uses a work-distribution
policy that sends at most one result of a given workunit to a given user. Redundant
computing is implemented using several server daemon processes: the transitioner implements the
redundant computing logic: it generates new results as needed and identifies error conditions; the
validator examines sets of results and selects canonical results. It includes an application-specific result-
comparison function; the assimilator handles newly-found canonical results. It includes an application-
specific function which typically parses the result and inserts it into a science database; the file deleter
deletes input and output files from data servers when they are no longer needed.
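The transitioner's redundant computing logic described above can be sketched as follows. This is an illustrative sketch, not BOINC's server code; the dict keys and the example majority-vote validator are assumptions made for the illustration.

```python
from collections import Counter

def majority(results):
    """Example app-specific validator: a result is canonical when at
    least two completed results agree exactly."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None

def transition(wu):
    """One simplified transitioner pass over a workunit, represented as a dict:
      quorum_m     -- results needed before validation is attempted
      max_results  -- give up once this many results have been created
      issued       -- results created so far
      completed    -- results returned so far
      compare      -- app-specific validator; returns the canonical
                      result, or None when there is no consensus
    """
    if len(wu['completed']) >= wu['quorum_m']:
        canonical = wu['compare'](wu['completed'])
        if canonical is not None:
            return ('assimilate', canonical)   # hand off to the assimilator
        if wu['issued'] >= wu['max_results']:
            return ('error', None)             # result limit reached, give up
        wu['issued'] += 1                      # no consensus: one more result
        return ('issue', wu['issued'])
    return ('wait', None)                      # quorum not yet reached
```

For example, a workunit with quorum 2 whose two completed results agree would be assimilated; one whose results disagree would get a third result issued, up to the maximum.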

    In this architecture, servers and daemons can run on different hosts and be replicated (scalability), and
availability is enhanced because some daemons can run even while parts of the project are down. BOINC
provides a feature called homogeneous redundancy: when it is enabled, the scheduler sends
results for a given workunit only to hosts with the same operating system and CPU vendor. To prevent
overload, all client/server communication uses exponential backoff in case of failure. BOINC provides an
accounting system in which there is a single unit of “credit”. It also provides a cross-project identification
mechanism using the email address. BOINC offers a trickle messages mechanism, providing
bidirectional, asynchronous, reliable, ordered messages, piggybacked onto the regular client/server RPC
traffic. This can be used to convey credit or to report a summary of computational state. The BOINC
client software consists of several components: the core client performs network communication with
scheduling and data servers, executes and monitors applications, and enforces preferences (implemented
as a hierarchy of interacting finite-state machines); a client GUI provides a spreadsheet-type view of the
projects, the work and file transfers in progress, and the disk usage, and communicates with the core
client via XML/RPC over a local TCP connection; an API, which interacts
with the core client to report CPU usage and fraction done, to handle requests to provide graphics, and to
provide heartbeat functionality; a screensaver program which, when activated, instructs the core client to
provide screensaver graphics. BOINC also provides a framework for project preferences. The core client
implements a scheduling policy, based on a dynamic “resource debt” to each project.
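The exponential backoff used on failed client/server RPCs can be sketched as follows; the constants and the jitter range are illustrative, not BOINC's actual limits.

```python
import random

def backoff_delay(failures, base=60.0, cap=4 * 3600.0):
    """Seconds to wait before the next RPC attempt after `failures`
    consecutive failures: the delay doubles with each failure, is capped,
    and is randomized so that clients recovering from a server outage do
    not all retry at the same instant."""
    delay = min(cap, base * (2 ** failures))
    return delay * random.uniform(0.5, 1.0)
```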
    Despite its widespread use, BOINC has limited scalability because of its dependence on a central
server. A more decentralized approach would potentially solve this problem.

    Unlike most BOINC projects, Einstein@home [25] and Climateprediction.net [26] use large data
files in their computations [27], which would justify the use of data distribution.
    Climateprediction.net has an application with an average size of 15 MB, but each climate model is
only around 20KB. This means that a client would only need to download 15MB on their first connection,
and whenever the application changed, which happens quite rarely (every few months). The computation
of a work unit (SLAB model) takes around 4 weeks on a 1.4GHz machine, and needs a climate model to analyse

[28]. Therefore, the most frequent downloads for clients of this project are monthly 20KB files (assuming
no error on the computation), and every few months a 15MB file.
    Einstein@home has work units that need data files with an average size of 20MB. Each client can
execute a few work units using just one data file, and each computation is around 8 hours long. On
average, each client will download a new data file every 5 days.
    Taking this into account, projects resembling Einstein@home would benefit the most from a faster
data distribution system.
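A rough per-client comparison of the sustained download demand, using the figures quoted above (application downloads excluded, since they happen only every few months, and assuming one climate model per month):

```python
# Back-of-the-envelope figures from the paragraphs above.
cpdn_kb_per_day = 20 / 30               # one ~20KB climate model per month
einstein_kb_per_day = 20 * 1024 / 5     # one ~20MB data file every 5 days

ratio = einstein_kb_per_day / cpdn_kb_per_day
print(round(ratio))                     # -> 6144: an Einstein@home client
                                        # downloads thousands of times more
                                        # data per day than a CPDN client
```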

 2.1.4            Volunteer Computing Projects

    P3 (Personal Power Plant) [6] is another middleware for desktop internet computing. It enables
mutual and equal transfer of computing power between individuals, and makes use of a general-purpose
P2P library, JXTA, providing a network overlay on which any computer can initiate communication to
other computers even though those computers are behind firewalls and NAT. P3 consists of a job
management subsystem, a job monitor, and parallel programming libraries. The web-based job monitor
shows the progress of a job and the computers participating in the job group (the group of users allocated to that
job). A P3 user manages a job using: a Host, a daemon program which a resource provider runs on his/her
computer. It discovers a job group, receives a parallel application representing the job, and hosts the
application; and a Controller, a tool which a resource user uses to submit and control jobs to a computer
pool running Hosts. The Host first verifies the digital signature of a discovered job. Secondly, if the Host
is running in non-GUI mode, it decides whether to accept the job or not autonomously, according to a
policy supplied by the user. A P3 application is written in the Java language, then compiled and packed
into a JAR (Java Archive) file. P3 provides an object passing library to support multiple parallel
programming models with less development work. It is implemented directly on JXTA, hides the
complexity of the JXTA API, and presents a simple set of APIs to libraries relying on it. The master-
worker library and the message passing library could be easily implemented because they rely on the
object passing library. A communication target is specified using its peer ID. The message passing library
also sends and receives objects, but the communication target is distinguished using a nonnegative integer
rank, not a peer ID. The master-worker library supports master-worker-style parallel processing. An
application developer writes programs for the master-side program, the worker-side program, and the
workunit, respectively. Workunit delivery and scheduling are the charge of the master-worker library, not
an application. A worker can join or leave a job at any time, even though the job has started running or
the worker is processing a workunit. This feature, ad-hoc joining and leaving, is implemented entirely by
the library, and an application developer does not need to take care of it. The lazy worker problem
(workers that receive workunits but do not process them) is addressed by timeout-based redistribution of workunits. A
master distributes a single workunit M times to workers and compares N returned calculation results. The
master accepts a result as the correct one if N results agree (a developer can provide code performing a
custom matching process instead of the default exact matching). A Host, not a Controller, takes the role of
master. In P3, a Controller does only job management work. A Host taking the master’s role is chosen by
the Controller submitting the job. It is possible that a Host stands as a candidate for master. If a Controller
finds multiple standing Hosts, it chooses one standing Host randomly. If there is no standing Host, a Host

is chosen randomly out of all Hosts in the job group. The Controller announces the chosen Host as the
master and all Hosts recognize it.
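The timeout-based redistribution that addresses the lazy worker problem can be sketched as follows; this is a hypothetical illustration of the idea, not P3's actual bookkeeping or API.

```python
import time

def timed_out_workunits(outstanding, timeout, now=None):
    """Workunit ids whose assignment has exceeded `timeout` seconds and
    should be redistributed to other workers. `outstanding` maps each
    workunit id to the time it was handed out."""
    now = time.time() if now is None else now
    return [wid for wid, sent in outstanding.items() if now - sent > timeout]
```

The master would periodically re-send the returned ids to idle workers and accept whichever valid result arrives first.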
    P3 allows bidirectional access to the resources on the net, as opposed to BOINC, which specifies a
project that requires resources, and users who provide them by joining a project. In P3, any user
can be the provider or the user of resources. This is achieved thanks to JXTA, which allows for a true
P2P network. The higher number of applications in P3 is balanced by its lower complexity and lower
number of users per job (there is a maximum per group).

    JNGI [7] is a framework that also uses JXTA as a base for P2P distributed computing. One
advantage of building the framework utilizing the JXTA protocols is that the concept of peer groups can
be leveraged. The framework contains the following peer groups: the monitor group, the worker group,
the task dispatcher group, and the repository group. The monitor group coordinates the overall activity of
the framework, including handling requests for peers to join the framework and their subsequent
assignment of the node to peer groups, and high-level aspects of the job submission process. The worker
group is responsible for performing the computations, while the task dispatcher group distributes
individual tasks to workers. The repository group serves as a cache for code and data. Within each worker
group there is one task dispatcher. Idle workers regularly poll the task dispatcher relaying information
regarding resources available. Based on this information, the task dispatcher polls the repository for tasks
to be performed on available codes, or for codes to be downloaded to the workers. Upon distribution of
code and tasks, the worker performs the task and returns the result to the task dispatcher. The task
dispatcher does not keep track of the job submitters. It is therefore up to the job submitter to initiate the
result retrieval process. The job ID (unique) is sent to the job submitter when the task repository is
created, and is used to request the results. The task dispatcher relays this request to the repository which
returns with the tasks if the job has completed. As workers are added to a work group, the communication
bandwidth between workers and task dispatchers may become a bottleneck. To prevent this, another role
is introduced, the monitor. The main function of the monitor is to intercept requests from peers which do
not belong to any peer group yet. Monitors free task dispatchers from direct communication with the
outside world. There are also monitor peer groups to provide redundancy. With monitors, job submitters
make requests to the monitor peer group. Monitors within that peer group redirect these requests to a
work group. The work group replies directly to the job submitter. Monitors can also request a worker to
become a monitor in case of a monitor failure. To avoid a bottleneck in case too many groups are
associated to the monitor, the model also enables one to have a hierarchy of monitor peer groups, with
each monitor peer group monitoring a combination of work groups and monitor groups. To submit a job,
the job submitter or worker contacts the top level monitor group. To avoid a bottleneck in the top level
monitor group, when a new peer contacts it, all the monitors within this peer group receive the message.
Each monitor in the monitor peer group has a subset of requests to which it replies. These subsets do not
overlap and put together compose the entire possible set of requests that exist.
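The non-overlapping request subsets handled by the monitors can be realized with a deterministic partitioning rule, for example hashing the request id over the monitor list. This is a hash-based sketch of the idea; JNGI does not specify the exact rule used.

```python
import hashlib

def responsible_monitor(request_id, monitors):
    """Map a request to exactly one monitor in the peer group, so that the
    monitors' request subsets are disjoint and together cover the whole
    request space."""
    digest = hashlib.sha1(request_id.encode()).digest()
    return monitors[int.from_bytes(digest, 'big') % len(monitors)]
```

Every monitor can apply this rule locally to the broadcast message and reply only to the requests it owns, with no coordination traffic.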
    The use of groups by JNGI allows a clear definition of responsibilities, which can then be distributed
among the users. This could be a promising concept to apply to BOINC, allowing users to share more of

the work management (as opposed to the central control station, in the project back-end). The hierarchical
organization among clients is also a good idea to improve scalability.
    As in P3, the P2P network is created using JXTA. The development of such an overlay in BOINC
would increase its clients’ usefulness and offer a number of new possibilities for developing distributed
programs, or even altering BOINC’s architecture itself.

    WOW [8] (Self-Organizing Wide Area Overlay Networks of Virtual Workstations) is a distributed
system that combines virtual machine, overlay networking and P2P techniques. It extended the Brunet
P2P protocol and the IPOP virtual network system to support on-demand establishment of direct overlay
links between communicating WOW nodes. Such direct connections allow nodes to bypass intermediate
overlay routers and communicate directly with other nodes if IP packet exchanges between nodes are
detected by the overlay. It is capable of traversing NAT/firewall routers using hole-punching techniques
in a decentralized manner. Each node is an independent computer which has its own IP address on a
private network. A virtual disk is configured and copied to all hosts, and nodes across a WAN are
interconnected by a software overlay. The system works with VMs that provide a NAT-based virtual
network interface, such as VMware and Xen, and does not require the allocation of a physical IP address
to the VM. The only software needed within the VM that is additional to a typical cluster environment is
the IPOP virtual network. At the core of the WOW architecture is the ability to self-configure overlay
links for nodes that join the distributed system. Its adaptive algorithm supports direct P2P connection
establishment between communicating nodes so that they can communicate over a single overlay hop.
Such direct connections are referred to as shortcut connections. Brunet maintains a structured ring of P2P
nodes ordered by 160-bit Brunet addresses. Each node maintains connections to its nearest neighbors in
the P2P address space called structured near connections. When a new node joins the P2P network, it
must find its right position on the existing P2P ring and form structured near connections with its nearest
neighbors in the P2P address space. Each node also maintains k connections to distant nodes in the P2P
address space called structured far connections. The information about transport protocol and the
physical endpoint (e.g. IP address and port number) is contained inside a Uniform Resource Indicator
(URI), such as one with the brunet.tcp: scheme. Note that a P2P node may have multiple URIs, if it has
multiple network interfaces or if it is behind one or more levels of NAT. The mechanism for connection
setup between nodes consists of conveying the intent to connect, and resolution of P2P addresses to URIs
followed by the linking handshake. Nodes keep an idle connection state alive by periodically exchanging
ping messages. A new P2P node is initialized with URIs of a few nodes already in the network. The new
node creates what we call a leaf connection with one of these initial nodes by directly using the linking
protocol. These initial nodes are typically public and if the new node is behind a NAT, it discovers and
records its own URIs corresponding to the NAT assigned IP/port. Once the leaf connection is established,
the leaf target acts as forwarding agent for the new node. The new node must now identify its correct
position in the ring, and form structured near connections with its left and right neighbors to become
fully routable. Initially, nodes behind NATs only know their private IP/port, and during connection setup
with public nodes they also learn their NAT-assigned IP/port. The bi-directionality of the
connection/linking protocols is what enables the NAT hole-punching technique to succeed. During the

linking protocol, nodes try each other's URIs one at a time, until they find one over which they can send
and receive handshake messages. In Brunet, for each connection type, each P2P node has a connection
overlord which ensures the node has the right number of connections. To support shortcut P2P
connections, a ShortcutConnectionOverlord was implemented within the Brunet library. The SCO at a
node tracks communication with other nodes using a metric called score (amount of remaining work left
in the node’s virtual queue). The higher the score of a destination node, the more communication there
has been with it. The nodes for which the virtual queue is the longest are the ones a node connects to. The
SCO establishes and maintains shortcut connections with nodes whose scores exceed a certain threshold.
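The structured near connections on Brunet's ring of 160-bit addresses can be sketched as follows: distance is measured around the ring in whichever direction is shorter, and each node connects to its closest peers in address space. This is an illustrative sketch, not Brunet's implementation.

```python
ADDRESS_SPACE = 2 ** 160   # Brunet addresses are 160-bit

def ring_distance(a, b):
    """Distance between two addresses on the ring, whichever way round
    is shorter."""
    d = abs(a - b) % ADDRESS_SPACE
    return min(d, ADDRESS_SPACE - d)

def near_neighbors(node, peers, k=2):
    """The k peers closest to `node` on the address ring -- the ones it
    would hold structured near connections with."""
    return sorted(peers, key=lambda p: ring_distance(node, p))[:k]
```

Note how a peer just below address 0 is "near" a peer just above it, since the address space wraps around.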
    The use of structured near and far connections offered by Brunet is a good idea to apply to a BOINC
overlay, but with a different definition of what is a near connection (addresses in Brunet). Keeping
connections between nodes alive with periodic pings is also a good idea, as is establishing a list of initial
public nodes to which a new node can connect on its first execution.
    To enable communication between users in BOINC, there must be a way to bypass firewalls and
NATs, and a hole-punching technique (like the one described here) is a good solution.

    Entropia [9] is an enterprise desktop grid system and one of the first commercial middleware platforms for Global
Computing. To provide rapid application integration, Entropia uses binary modification technology that
obviates the need for access to the application's source code while providing strong security guarantees and ensuring
unobtrusive application execution. To support the execution of a large number of applications, and to
support the execution in a secure manner, Entropia employs proprietary binary sandboxing techniques
that enable any Win32 application (and even third-party shrink-wrapped software and common scripting
languages) to be deployed in the Entropia system with no modifications and no special system support.
Sandboxing automatically wraps an application in Entropia’s virtual machine technology. An application
on the Entropia system executes within this sandbox and is not allowed to access or modify resources
outside of the sandbox. The Entropia system architecture is composed of three separate layers. At the
bottom is the Physical Node Management layer, and on top of it is the Resource Scheduling layer. Users can
interact directly with the Resource Scheduling layer through the available APIs; alternatively, users can
access the system through the Job Management layer, which provides management facilities for handling
large numbers of computations and files. The security services employ a range of encryption and binary
sandboxing technologies. The Resource Scheduling layer accepts units of computation from the user or
job management system, matches them to appropriate client resources, and schedules them for execution.
The Job Management layer of the Entropia system is responsible for decomposing a single job into
many subjobs, managing the overall progress of the job, providing access to the status of each of the
generated subjobs, and aggregating the results of the subjobs. The priority for a subjob is increased if it is
not assigned to a client in a reasonable amount of time. When the Entropia client is installed on a machine
it registers itself with a specified node manager. This registration includes providing a list of all of the
client’s attributes. The node manager provides a centralized interface to manage all of the clients. The
goal of the Desktop Client is to harvest unused computing resources by running subjobs unobtrusively on
the machine. First, subjobs are run at a low process and thread priority. The sandbox enforces this low
priority on all processes and threads created. Second, the Desktop Client monitors desktop usage of the

machine and resources used by the Entropia system. If desktop usage is high, the client will pause the
subjob’s execution, avoiding possible resource contention. Third, the Desktop Client provides security for
the client machine by mediating subjob access to the file system, registry, and graphical user interface.
The Entropia system automatically monitors and limits application usage of a variety of key resources
including CPU, memory, disk, threads, processes, etc. The Entropia sandbox isolates the grid application
and ensures that it cannot invoke inappropriate system calls, nor inappropriately modify the desktop disk,
registry, and other system resources. Entropia guarantees that the state of the desktop machine remains
unchanged after executing an application. The Entropia sandbox keeps all data files encrypted on disk,
and monitors and checks data integrity of grid applications and their data and result files.
    WOW is based on virtual machines, and therefore has a high level of security and fully decouples the
execution environment exposed to applications within a “guest” VM from that of its “host”. However,
Entropia increases the security through sandboxing (BOINC has not implemented that, but there are plans
to do so). The extra level of control over programs executed on the client's machine increases the options
for programmers, and gives more confidence to the user when choosing whether or not to use the system –
in BOINC, participants have to trust the projects, since there is almost no way of preventing malicious
use (the BOINC API is more of a facilitator). The programmer's source code is also protected,
since Entropia only needs binaries (Win32 executables).

    Cluster Computing on the Fly [10] seeks to harvest cycles from ordinary users in an open access,
non-institutional environment. In the CCOF architecture, hosts join a variety of community-based overlay
networks, depending on how they would like to donate their idle cycles. Clients then form a compute
cluster on the fly by discovering and scheduling sets of machines from these overlays. Host communities
are organized through the creation of overlay networks based on factors such as interest, geography,
performance, trust, institutional affiliation, or generic willingness to share cycles. There was a
comprehensive study of generic searching methods in a highly dynamic environment for workpile
applications (described in a later paragraph). Preliminary results show that, under light workloads, the
search method “rendezvous point” performs best with respect to job completion, while under heavy
workloads its performance falls below the other techniques. For application scheduling, strategies such as
oversubscription and duplication scheduling may be used for maximum flexibility. Furthermore, if
coordination across a set of host nodes is required, it may be desirable to organize the selected hosts into a
new overlay to support interprocess communication. The job of a local scheduler can be viewed as an
admission control problem. Some hosts may provide guaranteed service by accepting only CCOF jobs.
The CCOF application scheduler probes host nodes using undetectable quiz codes to develop a trust
rating for each host, as well as to validate returned results. It is assumed that hosts protect themselves from
a variety of attacks by running guest code within a virtual machine monitor, creating a “sandbox” that
protects the host and controls resource usage. To prevent the improper use of their resources (useless
tasks), hosts could deny network access for untrusted clients, and users can give priority to projects they
have deemed trustworthy through any outside form of communication. CCOF’s Wave Scheduler
captures available night time cycles in timezones from east to west. Wave scheduling seeks to capture
cycles from the millions of machines that lie completely idle at night. By following night timezones

around the globe, it continuously gives workpile tasks dedicated access to cycles without interruption
from users reclaiming their machines. The CCOF Wave Scheduler uses a CAN-based DHT overlay to
organize nodes located in different time zones. The correctness of results returned by host nodes to the
workpile application is validated using a quiz mechanism. The application node sends a set of quizzes to
the hosts whose solutions are known beforehand. Based on the hosts’ performance on the quizzes, the
application can then decide whether to accept or reject the results. Quiz and application results are
periodically sent back from the host to the application. If the application node receives wrong quiz
answers, it can immediately reschedule the task on another host.
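The quiz mechanism can be sketched as a simple trust decision: quizzes with known solutions are interleaved with real work, and the host's answers determine whether its application results are accepted. The function name and threshold below are illustrative assumptions, not CCOF's implementation.

```python
def quiz_verdict(host_answers, known_answers, threshold=1.0):
    """Decide whether to trust a host's application results based on
    quizzes whose solutions are known in advance. With the default
    threshold, any wrong quiz answer causes the task to be rescheduled
    on another host."""
    correct = sum(a == e for a, e in zip(host_answers, known_answers))
    score = correct / len(known_answers)
    return 'accept' if score >= threshold else 'reschedule'
```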
    CCOF presents a new idea on how to form overlay networks, through factors like interests or
geography. In BOINC, this choice of “network” is made by the client when deciding which project to
join (a decision based on each user’s own factors – scientific interest, number of participants, etc). CCOF is
also the first system to use wave scheduling, a concept explained in the next paragraph.
    However, CCOF is just an architecture proposal, and the main concepts have only been studied
through simulations.

    The study of new scheduling methods led to the development of the Wave Scheduler [11], which has
two major components: a self-organized, timezone-aware overlay network and an efficient scheduling and
migration strategy. Wave scheduler can utilize any structured overlay network such as CAN, Pastry, and
Chord. The algorithm presented uses a CAN overlay to organize nodes located in different timezones.
The scheduling model is composed of the following four key components: host selection criteria, host
discovery strategy, local scheduling policy, and migration scheme. When a client wants to schedule a job,
the scheduler chooses the candidate host(s) satisfying the host selection criteria via host discovery. Then
it schedules the job on the candidate host. A migration scheme decides when and where to migrate the
job. It uses the host discovery and host selection strategies to decide where to migrate the job. A client uses its
host selection criteria to decide whether the host can be a candidate, and selects one of them. Simple
scheduling methods relax their host selection criteria to use any unclaimed (there is no foreign job on that
host) hosts, while fast turnaround scheduling methods try to schedule foreign jobs on available (there is no
foreign job on that host and the host is idle) hosts for instant execution. In this study, a low-complexity
host selection strategy is used (select the first discovered host that satisfies the particular host selection
criteria). The purpose of the host discovery scheme is to discover candidate hosts to accept the foreign
job. Two schemes are used: Label-based random discovery: when the client needs extra cycles, the client
randomly chooses a point in the CAN coordinate space and sends a request to that point. If the host does
not satisfy the host selection criteria, the client can repeatedly generate another random point and contact
another host. Expanding ring search: When the client needs extra cycles, the client sends out a request
with the host selection criteria to its direct neighbors. On receiving such request, if the criteria can be
satisfied, the neighbor acknowledges the client. If the request is not satisfied, the client increases the
search scope and forwards the request to its neighbors one hop farther away. This procedure is repeated
until the request is satisfied or the searching scope limit is reached. The local scheduling policy on a host
determines the type of service a host gives to a foreign job that it has accepted. The screensaver policy,
where foreign jobs can only run when there is no recent mouse/keyboard activity, was used. In the study,

it is assumed that there is no resource availability prediction and that migration is a simple best effort
decision based primarily on local information, e.g. when the host becomes unavailable due to user
activity. Several migration schemes are compared that differ regarding when to migrate and where to
migrate. The options for when to migrate include: Immediate migration: once the host is no longer
available, the foreign jobs are immediately migrated to another available host; Linger migration: allows
foreign jobs to linger on the host for a random amount of time after the host becomes unavailable. After
lingering, if the host becomes available again, the foreign job can continue execution on that host. There
are also two options for where to migrate the jobs: Random: The new host is selected in a random area in
the overlay network. Night-time machines: the night-time machines are assumed to be idle for a large
chunk of time. The Wave Scheduler uses the geographic CAN overlay to select a host in the night-time zone.
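The expanding ring search described above can be sketched as follows. Both callables are assumptions of this sketch, not Wave Scheduler APIs: `neighbors_at_hop(n)` stands in for contacting the overlay neighbors exactly n hops away, and `satisfies` for the host selection criteria.

```python
def expanding_ring_search(neighbors_at_hop, satisfies, max_scope=4):
    """Widen the search one overlay hop at a time until some host meets
    the selection criteria or the search scope limit is reached."""
    for scope in range(1, max_scope + 1):
        for host in neighbors_at_hop(scope):
            if satisfies(host):
                return host
    return None    # scope limit reached without a match
```

This trades latency for load: nearby hosts are asked first, and the request only floods farther out when they cannot satisfy it.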
    After running tests and simulations to evaluate the performance of different scheduling strategies
(based on the 4 migration options and on the host discovery method) there were some conclusions:
Compared with no-migration schemes, migration significantly reduces turnaround time; the adaptive
strategies (more persistent in their efforts to find an available host) perform best overall with respect to
both low turnaround time and low job failure rate; the Wave Scheduler performs better than the other
migration strategies when the free time during the day is limited; and wave-adaptive improves upon wave
because it reduces collisions on night-time hosts, performing best among all the scheduling strategies.
    Wave scheduling could possibly be applied to BOINC when deciding whether a host is capable of finishing
the execution of a result before the deadline. Hosts that were in a night zone would be given priority.

 2.1.5            Grid Computing Projects

    Condor [12] is the product of the Condor Project in the University of Wisconsin-Madison and one of
the first Grid Computing projects. It uses the Up-Down algorithm presented by Mutka and the Remote
UNIX Facility to execute remote jobs. It follows a hybrid approach that lies between the centralized and
the fully distributed extremes: a central node avoids the overhead of message exchanges by deciding
which workstation should be allocated available capacity. Each workstation keeps the state information of its
own jobs and has the responsibility of scheduling them – it has a local scheduler and a background job
queue. One workstation holds the central coordinator as well. Every 2 minutes the central coordinator
polls the stations to see which stations are available to serve as sources for remote cycles. Between
successive polls, each local scheduler monitors its station to see if it can serve as a source of remote
capacity. When local activity is detected, the local scheduler will immediately preempt the background
job (the job is kept there for 5 minutes; if the station does not become available, the job is checkpointed
and moved). Local schedulers are not affected if a remote site discontinues service. When Remote
Unix is explicitly invoked, a shadow process runs locally as the surrogate of the process running on the
remote machine. Any UNIX system call made by the program on the remote machine invokes a library
routine which communicates with the shadow process. When a job is removed from a remote location,
RU checkpoints it. The state of an RU program is the text, data, bss, and the stack segments of the
program, the registers, the status of open files, and any messages sent by the program to its shadow for
which a reply has not been received. The text of the program contains the executable code (a copy is kept
so that the user can alter the code without changing the filename, which would otherwise interfere with
the old version still running), the data segment contains the initialized variables of the program, and the bss segment holds the
uninitialized variables. To provide fair access to resources, it manages available capacity with the Up-
Down algorithm: trades off the remote cycles users have received with the time they have waited to
receive them, by maintaining a schedule index for each workstation (default 0). When remote capacity is
allocated to a workstation, its index is increased; when a workstation wants remote capacity but is denied
access to it, its index is decreased. Every 2 minutes the coordinator checks whether any stations have new
jobs to execute. If a station with higher priority has a job to execute, and there are no idle stations, the
coordinator preempts a remotely executing job from a station with lower priority.
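The Up-Down bookkeeping reduces to maintaining one schedule index per workstation. A minimal sketch follows; the class name and the ±1 step sizes are illustrative choices of ours, not Condor's actual constants.

```python
class UpDownIndex:
    """One schedule index per workstation; lower values mean higher priority.

    The +1/-1 adjustments are illustrative, not Condor's real parameters.
    """
    def __init__(self):
        self.index = {}    # workstation -> schedule index, defaulting to 0

    def granted(self, station):
        # Remote capacity was allocated: the index rises (priority drops).
        self.index[station] = self.index.get(station, 0) + 1

    def denied(self, station):
        # Wanted remote capacity but was denied: the index falls (priority rises).
        self.index[station] = self.index.get(station, 0) - 1

    def highest_priority(self, stations):
        # The coordinator favours the station with the lowest index, trading off
        # cycles received against time spent waiting.
        return min(stations, key=lambda s: self.index.get(s, 0))
```

This is what gives Condor its fairness: a station that keeps consuming remote cycles gradually loses priority to stations that have been waiting.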
    Condor is therefore a distributed batch system for sharing the workload of compute-intensive jobs in
a pool of UNIX workstations connected by a network. Although the design of Condor does not preclude
the merging of different pools into one pool with WAN connections, Condor was not designed to protect
the rights of an organization to its own cluster. It was aimed at single clusters/LANs, and the merging of
different pools was not handled well.

    In [13], a mechanism is presented that enables a controlled exchange of computing resources across
the boundaries of Condor pools. Using this so-called flocking mechanism, independent Condor pools
can be turned into a Condor flock where jobs submitted in one pool – the submission pool – may access
resources belonging to another pool – the execution pool (any two pools can be connected by a pair of
Gateway machines, one in either pool). The set of rules that govern the exchange of jobs between two
pools is referred to as a (resource-sharing) agreement. For example, job transfer can be restricted – job
transfers may only be allowed in one direction between two pools. In this distributed flock structure, the
decision making is distributed among the pools by having any pair of Condor pools negotiate the transfer
of jobs without any interference from other pools. The layered design allows the flocking mechanism to
be developed independently from the standard Condor. Each pool has at least one Gateway Machine
(GW). Each GW has a flock configuration file describing the subset of connections maintained by the
GW. For each connection, the file contains the name of the pool and the network address of the GW at
the other end, and whether the local pool is allowed to run Condor jobs in the remote pool and vice versa.
Connecting two pools is done by entering the appropriate information in the flock configuration files in a
GW in either pool. Periodically, the GWs exchange information on the availability of machines in their
pools. From time to time, the GW chooses a machine from the availability lists received (from the other
pools), and presents itself to the Central Manager with the characteristics of this machine (if there is no
information on idle machines in remote pools, it presents itself as a machine that is unavailable). The
flocking protocol works as follows. The matchmaking between a submission machine S and a GW
(posing as a remote machine) follows the standard Condor protocol. A connection is then established
between the submission pool and the execution (remote) pool (between their GWs), with the remote GW
receiving the job’s context. The matchmaking and the establishment of a connection within the execution
pool are similar to the same steps of the standard Condor protocol. Scheduling details: whenever a GW needs
to represent a machine to the CM, it chooses at random a machine from the availability list of a randomly
chosen pool to which it is connected. When a job is assigned to the GW, it will not necessarily be
executed on the machine advertised by the GW, and it may even be sent to an execution pool different
from the one containing the advertised machine. The GW scans the availability lists in random order until
it encounters a machine satisfying the job requirements and the job preferences (or just the requirements
if no machine is found at first). If still no machine is found, the job remains queued at the submission
machine and has to be rescheduled. When a flocked job is checkpointed in the execution pool, its
checkpoint file is sent back to the submission machine.
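The gateway's two-pass scan over the availability lists can be sketched as below. The data shapes are ours: plain dicts and predicate functions stand in for Condor's machine descriptions and the job's requirements/preferences expressions.

```python
import random

def pick_machine(availability_lists, job):
    """Mimic the flocking GW's scan over remote pools' availability lists.

    `availability_lists` maps pool name -> list of machine dicts;
    `job` carries `requirements` and `preferences` predicates.
    """
    pools = list(availability_lists)
    random.shuffle(pools)    # pools are scanned in random order
    machines = [m for p in pools for m in availability_lists[p]]
    # First pass: a machine satisfying both the requirements and preferences.
    for m in machines:
        if job["requirements"](m) and job["preferences"](m):
            return m
    # Second pass: settle for a machine meeting just the requirements.
    for m in machines:
        if job["requirements"](m):
            return m
    return None    # job stays queued at the submission machine, rescheduled later
```

Note how the advertised machine plays no role here: the job may end up in any connected pool, exactly as the text describes.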
    Since Condor pools distributed over a wide area can have dynamically changing availability and
sharing preferences, the flocking mechanism based on static configurations can limit the potential of
sharing resources across Condor pools.

    There is a technique for resource discovery in distributed Condor pools using peer-to-peer
mechanisms that are self-organizing, fault-tolerant, scalable, and locality-aware – a Self Organizing
Flock of Condors [14]. This technique uses Pastry, a P2P overlay network. Pastry arranges the pools on
a logical ring – the p2p overlay’s node identifier name space – and allows a Condor pool to join the ring
using only the knowledge about a single bootstrap pool that is already in the ring. Another advantage of
using Pastry is the automatic creation of the proximity-aware routing table that can be used to sort
available remote pools in order of the network proximity. The Condor pools that are interested in sharing
resources with other pools form a p2p overlay network, and each pool is issued a random node identifier
(nodeId) in the ring. Only the central manager needs to be part of this logical ring (other resources in a
pool are not aware of the p2p organization of the pool managers). Each pool that has resources available
sends a message announcing the available resources to all the pools specified in its routing table, starting
from the first row and going downwards (contacts nearby pools first). The dynamic resource pool
discovery is achieved via a software layer. The software runs on each central manager M and uses the
resource announcements from other managers to decide which resource pools to flock to. From this
information, M can create a list of resource pools that are available to it, ordered with respect to the
network proximity. One potential drawback is that a Condor pool will not be able to flock to other
resources that do not appear in the Pastry routing table. To address this problem, the p2p-based flocking
can be extended by introducing a time-to-live (TTL) field in the announcement message. All the
resources in a Condor pool can be arranged on a logical ring, with the nodeId of the central manager known
to every resource. This ring is local to a pool and does not interact with the logical ring for on-demand
flocking. The central manager is the only node that is on both the rings. The central manager periodically
informs everyone in the pool of its aliveness. In addition, replicas of the pool configuration and other
management information of the central manager are maintained on the K immediate neighbors of the
central manager in the node identifier space. In case the central manager fails, the clients detect its
absence and send messages with the central manager’s nodeId as the message key in the p2p overlay.
These messages are guaranteed by the p2p routing to arrive at one and only one of the K neighbors of the
failed manager, which then takes on the role of the central manager. In case of self-organized flocking,
the jobs from remote pools can also be sandboxed using either the Java Virtual Machine or system-call
tracing. To protect against a malicious remote Condor pool, the proposed approach uses a policy file,
which controls a pool’s interactions (e.g. interactions can be limited to only those remote pools that
have been pre-approved by the pool manager). The results obtained in the article show that p2p
technology offers a promising approach to the dynamic resource discovery essential to high-throughput
computing.
      The use of Pastry (like JXTA) allows for a P2P overlay to be formed between Condor pools. One of
its main advantages, compared to JXTA, is the proximity-aware routing tables, which order the entries by
network proximity. This could be applied in BOINC, to choose which “nodes” each client should connect to.
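The proximity-aware ordering that makes Pastry attractive here amounts to sorting the announcing pools by measured round-trip time. A toy sketch, where the announcement tuples are our own simplification of the resource announcements a central manager receives:

```python
def order_pools_by_proximity(announcements):
    """Order announcing pools by network proximity, as a central manager would.

    `announcements` is a list of (pool_name, rtt_ms, free_resources) tuples,
    a stand-in for the announcements received over the Pastry overlay.
    """
    usable = [a for a in announcements if a[2] > 0]    # keep pools with capacity
    # Nearest pools first: flock to them before the distant ones.
    return [name for name, rtt, _ in sorted(usable, key=lambda a: a[1])]
```

The same ordering is what a BOINC client could use to pick which nearby "nodes" to contact first.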

      The Sun Grid Engine [15] acts as the central nervous system of a cluster of networked computers.
Via daemons, the Grid Engine Master supervises all resources in the network to allow full control and
achieve optimum utilization of the resources available. Sun Grid Engine aggregates the compute power
available in dedicated compute farms, networked servers and desktop workstations, and presents a single
access point to users needing computer cycles. The WAN-oriented enhancement of Grid Engine, which is
currently restricted to managing local networked computer resources, is 'The Grid Broker'.

      Alchemi.NET [16] is another example of Internet-based clustering. It creates a federation of clusters
to create hierarchical, cooperative grids, and allows dedicated or non-dedicated (voluntary) execution. It
provides an object-oriented grid thread programming model (fine-grained abstraction), and a web services
interface supporting a grid job model (coarse-grained abstraction) for cross-platform interoperability.
Alchemi follows the master-worker parallel programming paradigm in which a central component
dispatches independent units (grid thread, part of a grid application) of parallel execution to workers and
manages them. Alchemi offers four distributed components, designed to operate under three usage
patterns. The Manager manages the execution of grid applications and provides services associated with
managing thread execution. The Executor accepts threads from the Manager and executes them. An
Executor can be configured to be dedicated (resource is centrally managed by the Manager), or non-
dedicated (resource is managed on a volunteer basis via a screen saver or by the user). For non-dedicated
execution, there is one-way communication between the Executor and the Manager. Where two-way
communication is possible and dedicated execution is desired the Executor exposes an interface so that
the Manager may communicate with it directly. Grid applications created using the Alchemi API are
executed on the Owner component. The Owner submits threads to the Manager and collects completed
threads on behalf of the application developer via the Alchemi API. The Cross-Platform Manager, an
optional sub-component of the Manager, is a generic web services interface that exposes a portion of the
functionality of the Manager in order to enable Alchemi to manage the execution of platform independent
grid jobs (as opposed to grid applications utilizing the Alchemi grid thread model). The components
discussed above allow Alchemi to be utilized to create different grid configurations: desktop cluster grid,
multi-cluster grid, and cross-platform grid (global grid). A multi-cluster environment is created by
connecting Managers in a hierarchical fashion. As in a single-cluster environment, any number of
Executors and Owners can connect to a Manager at any level in the hierarchy. The key to accomplishing
multi-clustering is that a Manager behaves like an Executor towards another Manager, since the Manager
implements the interface of the Executor. A Manager at each level except for the topmost level is
configured to connect to a higher level Manager as an “intermediate” Manager and is treated by the
higher level-Manager as an Executor. In case an intermediate Manager receives a thread, it is scheduled
locally with a priority reduced by one unit and is executed as normal by the Manager’s local ‘Executors’.
If, at some point, a Manager does not have local threads to allocate, it requests a thread from its higher-
level Manager. A grid middleware component such as a broker can use the Cross-Platform Manager web
service to execute cross-platform applications (jobs within tasks) on an Alchemi node (cluster or multi-
cluster) as well as resources grid-enabled using other technologies such as Globus. The .NET Framework
offers two mechanisms for execution across application domains – Remoting and web services. .NET
Remoting allows a .NET object to be “remoted” and expose its functionality across application domains.
Remoting is used for communication between the four Alchemi distributed grid components. Web
services are used for the Cross-Platform Manager’s public interface. Alchemi simplifies the development
of grid applications by providing a programming model that is object-oriented and that imitates traditional
multi-threaded programming. Developers deal only with application and thread objects and any other
custom objects. This approach allows development of grid applications where inter-thread
communication is required. Alchemi’s architecture also supports the "grid job" model (whose atomic
unit is a process) via the web services interface, for two reasons: grid-enabling existing applications,
and cross-platform interoperability with grid middleware that can leverage Alchemi.
    Alchemi presents a new possibility for dedicated execution: two-way communication between
master and worker. The worker exposes an interface that allows the manager to contact it. This could be
used in BOINC, to allow communications between clients. NAT and firewall bypassing would be the
problem to solve at that point.
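The intermediate-Manager behaviour described above (accept a thread, lower its priority by one unit, run it on local Executors) can be sketched as follows. The class names and data shapes are ours, not Alchemi's .NET API; the key point is that `Manager` implements the `Executor` interface.

```python
class Executor:
    """A leaf worker that simply runs a grid thread."""
    def execute(self, thread):
        return f"ran {thread['id']}"


class Manager(Executor):
    """A Manager implements the Executor interface, enabling multi-clustering:
    a higher-level Manager treats it exactly like any other Executor."""
    def __init__(self, executors):
        self.executors = executors    # local Executors (possibly sub-Managers)
        self.queue = []

    def execute(self, thread):
        # A thread received from a higher-level Manager is scheduled locally
        # with its priority reduced by one unit.
        self.queue.append(dict(thread, priority=thread["priority"] - 1))
        return self.drain()

    def drain(self):
        results = []
        while self.queue:
            # Dispatch highest-priority threads first (larger = higher here).
            self.queue.sort(key=lambda t: -t["priority"])
            results.append(self.executors[0].execute(self.queue.pop(0)))
        return results
```

Because a `Manager` is an `Executor`, hierarchies of arbitrary depth fall out for free, which is exactly how Alchemi builds multi-cluster grids.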

    XtremWeb [17] is another distributed platform, with the difference that it tries to address both Grid
Computing and Global (Volunteer) Computing issues. It follows the general vision of a Large Scale
Distributed System (LSDS) turning a set of non specific resources into a runtime environment executing
services (application modules, runtime modules or infrastructure modules) and providing volatility
management. A general architecture for LSDS considers four main layers representing a total of seven
sub-layers. The role of the first layer is to aggregate non-specific resources (clusters, home PCs, PCs in a
LAN, etc.) to build a full but unstable cluster. The second layer turns the unstable cluster into a
virtual stable cluster. The third layer creates a generic GC platform. The fourth layer deploys runtime
environments modules for parallel computing. The XtremWeb GC platform implements a subset of this
architecture. Its design follows a set of three main principles: 1) a three-tier coordination architecture
connecting client to workers through a Coordination service, 2) a set of security mechanisms based on
autonomic decisions, 3) a fault tolerance design allowing the mobility of clients, the volatility of workers
and failure of the Coordination service. The role of the third tier, called the coordinator, is a) to de-couple
clients from workers and b) to coordinate tasks execution on workers. In XtremWeb, the coordinator is
currently executed by a single machine that can be installed specifically for that purpose. For deployment
in a cluster, instead of accessing the cluster directly, the user logs onto the coordinator (which manages a
community of users) and submits jobs through it. All user tasks are submitted by the coordinator to the
cluster batch scheduler, as a representative of the user. Communications between the different parts of XtremWeb
include remote procedure call (RPC) messages and data transfers. XtremWeb communication architecture
relies on three protocol layers. The first level “connection" is dedicated to enable connection between the
entities possibly protected by firewall or behind a NAT or a proxy. The second level “transport" is
responsible for reliable and secure message transport. The third level “protocol" gathers several flavors of
RPC API. As XtremWeb is currently centralized, firewall bypassing can be done if the coordinator is
reachable by other parties. Communication channels are then never initiated by the coordinator. The
transport layer relies on TCP/IP. Security can then be achieved with standard SSL. A typical service call
follows several steps: the resource discovery engine is invoked to return the address of a factory able to
instantiate the service. When the service is instantiated, the factory returns the address of the hosting
machine where the service can be called. Finally, the service stays alive until it detects a termination
condition. The first implementation of XtremWeb considers three main services: the client, which submits
requests; the worker, which executes them; and the coordinator. In this version of XtremWeb, the coordinator
encapsulates several services (scheduler, results server, applications repository). In the RPC implementation for
XtremWeb, called XWRPC, the client automatically translates a RPC call into a task manageable by the
coordinator. When it gets the result file back, the client extracts from it the output parameters. XWRPC
provides blocking and non blocking RPC calls. RPC fault tolerance aims to certify that all RPC calls
succeed. In such a system, it is impossible to design a perfect failure detector, thus, the failure detector
may wrongly suspect some processes, resulting in over-submission of RPC calls. The coordinator
detects worker crashes/disconnections through a time-out mechanism and re-schedules their allocated tasks on
other available workers using logged messages. The high volatility of nodes in LSDS implies the use of a
fault-tolerant MPI (message passing library) implementation. To study several fault tolerance protocols
for MPI, the MPICH-V project, a research effort comprising theoretical studies, experimental
evaluations and pragmatic implementations of a fault-tolerant MPI, was launched. An MPICH-V
environment encompasses a communication library based on MPICH and a runtime environment. In case
of LSDS, the runtime (responsible for process distribution, fault detection, process restart, checkpoint
scheduling, etc.) occupies several layers of the software stack. The library implements all communication
subroutines provided by MPICH. To protect the integrity/privacy of the participating computers’
resources, XtremWeb uses sandboxing to confine code execution inside an unbreakable envelope
(filters the system calls). LSM (Linux Security Module) is a framework in the Linux kernel which allows
inserting security modules directly in codes of system calls. XtremWeb developed SBLSM, a module for
LSM dedicated to GC and P2P systems. Every time a sandboxed process issues a system call, the module
checks a dedicated variable which can take three different states: GRANT, where the specific controls are
called; DENY, where the call is denied and returns an error number; and ASK, where the module asks an
authority (i.e. an administrator) what to do via the security device. Currently, SBLSM provides three controls: File access
control; Network access control; Process Signal Control. It allows the kernel to ask the user (ASK mode)
for a decision.
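The per-syscall decision in SBLSM reduces to a three-state dispatch. A sketch in Python rather than kernel C, with a callback standing in for the security device through which the authority answers; the default-deny fallback is our assumption, not a documented SBLSM behaviour:

```python
GRANT, DENY, ASK = "GRANT", "DENY", "ASK"

def check_syscall(policy, syscall, ask_authority):
    """Mimic SBLSM's dispatch on the per-control state variable.

    `policy` maps syscall name -> state; `ask_authority` stands in for the
    security device used to query an administrator in ASK mode.
    """
    state = policy.get(syscall, DENY)    # default-deny is our assumption
    if state == GRANT:
        return True     # the specific controls are called; the syscall proceeds
    if state == DENY:
        return False    # the call is denied and returns with an error number
    # ASK: defer the decision to an authority via the security device.
    return ask_authority(syscall)
```

The ASK state is the interesting one: it moves policy decisions out of the kernel module and into user space, at the cost of a round trip per flagged call.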
    The 3-tier architecture by XtremWeb can be almost directly applied to BOINC: Client (Project) –
Coordinator (Chosen Client) – Worker (Normal Client). By initiating connections only from the Worker
and choosing a Coordinator with a public IP (no NAT/firewall), the clients (as Coordinators) could
communicate with the ordinary clients and share some of the responsibilities of the project back-end.

 2.2 Data Storage Systems and Content Delivery Networks

    To distribute data we have several alternatives, each with strengths and weaknesses that we can
analyze according to our goal: distributing large data files across several machines, from which many
users then have to download them.
    Both Data Storage Systems and Content Delivery Networks combine networking with storage and
content management techniques.

 2.2.1            Data Storage Systems

    Storage may support a wide range of requirements, from caching (expensive, volatile and fast) to
archival (inexpensive, persistent and slow). By combining networking and storage, numerous possibilities
arise, allowing Distributed Storage Systems (DSS) to adopt various roles that go beyond simple data storage.
    The evolution of networks facilitated the evolution of Distributed Storage Systems, which no longer
simply provide a means to store data, but also offer innovative services like publishing, federation,
anonymity and archival. Recent work on this area has focused on Peer-to-Peer and Data Grid systems
[44] [45], which makes this a relevant topic to discuss.

    The taxonomy of current Data Storage Systems is presented in [35]. Alternatives are characterized
according to predetermined topics such as System Function or Security, and a survey of several existing
and previous Distributed Storage Systems is presented. The survey covers a wide range of systems, from
systems which provide storage utility on a global scale (OceanStore [36]) to systems which offer high
level of accessibility to mobile users (Coda [37]). We will also present a summary on the most significant
projects that exist today.

    OceanStore [36] is a global, distributed, Internet-based storage infrastructure. It consists of
cooperating servers, which work as both server and client. The data is split up in fragments which are
stored redundantly on the servers. For search, OceanStore provides the Tapestry [38] subsystem, and
updates are performed using a Byzantine consensus protocol. This adds unnecessary overhead, since
file search is not a requirement for BOINC, and supporting replication implies the use of a distributed
locking service, which incurs further performance penalties.

    Farsite [39] aims to provide the user with persistent non-volatile storage with a filesystem like
interface, by utilizing unused storage from user workstations, whilst operating within the boundaries of an
institution. Like OceanStore, Farsite uses the Byzantine agreement protocol to establish trust within an
untrusted environment.

    Frangipani [40] is a performance oriented Distributed Storage System typically used by applications
which require a high level of performance. It follows a server-client architecture, and was implemented
on top of the Petal system, employing Petal’s low-level distributed storage services. It is designed to be
utilized within the bounds of an institution where servers are assumed to be connected by a secure high
bandwidth network, which goes against the global distribution of BOINC. Furthermore, like OceanStore,
Frangipani also implements a distributed locking service, causing a considerable performance drop when
servers access the same file.

    Freeloader [41] combines storage scavenging and striping, achieving good parallel bandwidth on
shared resources. It aggregates unused desktop storage space and I/O bandwidth into a shared
cache/scratch space, for hosting large, immutable datasets and exploiting data access locality. It is
designed for large scientific results (outputs of simulations), which are then studied numerous times for a
period of weeks, using visualization tools.

    The overall architecture of the Google File System (GFS) [42] shares many similarities with Freeloader. It
is a distributed storage solution which scales in performance and capacity whilst being resilient to
hardware failures. GFS was designed to operate in a trusted environment, where the application is the
main influence on usage patterns. The typical GFS file size was expected to be in the order of GBs and
the application workload would consist of large continuous reads and writes, which does not apply to the
BOINC environment.

    Gnutella [43] is a decentralized file-sharing system whose participants form a virtual network,
communicating via the Gnutella protocol, which is a simple protocol for distributed file search. To
participate in Gnutella a peer first must connect to a known Gnutella host (host lists are available on
specialized sites). Search queries are broadcast over the network, which causes very high bandwidth
consumption, and no reputation system exists, making it impossible to establish a trust-based mechanism.

    KaZaA [19] is another decentralized file-sharing system. However, since it shares many similarities
with Skype and is a reference for P2P systems, it will be discussed in the P2P subsection.

 2.2.2            Content Delivery Networks

    Content distribution on the Internet combines development of high-end computing technologies with
high performance networking infrastructure and distributed replica management techniques. A CDN
consists of a combination of content-delivery, request-routing, distribution and accounting infrastructure.
Content-delivery is achieved through the use of a set of edge servers (surrogates) that deliver copies of
content to end-users. Request-routing directs client requests to the appropriate edge server, and interacts
with the distribution infrastructure to keep an updated view of the content stored in the CDN caches. The
distribution infrastructure moves content from the origin server to the edge servers and guarantees
consistency of content in the caches. The accounting infrastructure keeps logs of client accesses and
records the usage of the CDN servers.

    Much like for Data Storage Systems, a survey has been done on Content Delivery Networks [46].
This paper describes and categorizes the existing CDNs, and explores uniqueness, weaknesses,
opportunities, and future directions in this field. A comprehensive taxonomy is provided with coverage of
CDNs in terms of organizational structure, content distribution mechanisms, request redirection
techniques, and performance measurement methodologies. The existing CDNs are studied in terms of
their infrastructure, request-routing mechanisms, content replication techniques, load balancing, and cache
management. A mapping of the taxonomy to the various CDNs helps in “gap” analysis in the content
networking domain and provides a means to identify present and future developments in this field.
    In this section, we will present a summary on CDN projects that exist today. Many commercial
CDNs (e.g. Akamai, Adero, Digital Island, Mirror Image, Inktomi, Limelight Networks etc.) as well as
academic CDNs (e.g. Coral, Codeen, Globule etc.) are present in the content distribution space, but we
will focus on the most significant ones.

    CoralCDN [47] is a peer-to-peer Content Distribution Network that allows users to run web sites
that act as web proxies, leveraging the aggregate bandwidth of volunteers running the software. To use
CoralCDN, a content publisher – or someone posting a link to a high-traffic portal – simply appends
“.nyud.net:8090” to the hostname in the URL. Through a P2P DNS layer redirection, oblivious clients
with unmodified browsers are transparently redirected to nearby Coral web caches.
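The URL rewriting a publisher performs is purely mechanical, and can be sketched as below. Note one simplification of ours: if the original URL carries an explicit port, this sketch drops it.

```python
from urllib.parse import urlsplit, urlunsplit

def coralize(url):
    """Append the Coral suffix to the hostname, as a content publisher would."""
    parts = urlsplit(url)
    # The publisher simply appends ".nyud.net:8090" to the hostname.
    host = parts.hostname + ".nyud.net:8090"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
```

Everything else (finding a nearby proxy, locating a cached copy) is handled transparently by CoralCDN's DNS and proxy layers.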
     CoralCDN is composed of three main parts: 1) a network of cooperative HTTP proxies that handle
users’ requests, 2) a network of DNS nameservers for nyucd.net that map clients to nearby Coral HTTP
proxies, and 3) the underlying Coral indexing infrastructure and clustering machinery on which the first
two applications are built.
    The novel key/value indexing structure, Coral, allows nodes to locate nearby copies of web objects
without querying more distant nodes, and prevents hotspots in the infrastructure. Coral exploits overlay
routing techniques similar to a number of Distributed Hash Tables (DHTs), but provides weaker
consistency than traditional DHTs. For this reason, its indexing abstraction is called a distributed sloppy
hash table, or DSHT. DSHTs are designed for applications storing soft-state key/value pairs, where
multiple values may be stored under the same key. CoralCDN uses this mechanism to map a variety of
types of keys onto addresses of CoralCDN nodes. Each Coral node belongs to several distinct DSHTs
called clusters. Each cluster is characterized by a maximum desired network round-trip-time (RTT) called
the diameter. The system is formed by a hierarchy of diameters known as levels. Every node is a member
of one DSHT at each level. Coral queries nodes in higher-level, fast clusters before those in lower-level,
slower clusters.
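The level hierarchy means a lookup walks clusters from the smallest RTT diameter outward. A toy sketch, where plain dicts stand in for the per-level DSHTs:

```python
def coral_lookup(key, clusters_by_level):
    """Query fast, small-diameter clusters before slow, wide ones.

    `clusters_by_level` is ordered from the highest level (smallest RTT
    diameter) to the lowest; each cluster is a dict standing in for a DSHT.
    """
    for cluster in clusters_by_level:
        values = cluster.get(key)
        if values:
            # A nearby copy was found: stop without querying distant nodes.
            return values
    return []    # not found at any level
```

This nearest-first traversal is what lets Coral locate nearby copies of web objects without bothering more distant nodes, and helps prevent hotspots.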
    The Coral DNS server, dnssrv, returns IP addresses of Coral HTTP proxies when browsers look up
the hostnames in “Coralized” URLs. Every instance of dnssrv is an authoritative nameserver for the
domain nyucd.net. dnssrv assumes that web browsers are generally close to their resolvers on the
network, so that the source address of a DNS query reflects the browser’s network location. To determine
locality, dnssrv measures its round-trip-time to the resolver and categorizes it by level.
    The Coral HTTP proxy, CoralProxy, satisfies HTTP requests for Coralized URLs. A CoralProxy
fetches web pages from other proxies whenever possible to minimize load on origin servers. When
a client requests a non-resident URL, CoralProxy first attempts to locate a cached copy of the referenced
resource using Coral, with the resource indexed by a SHA-1 hash of its URL. Once a CoralProxy obtains
a file, it inserts a reference to itself in its DSHTs, and periodically renews referrals to the resources in its cache.
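Indexing a resource by the SHA-1 hash of its URL is straightforward to reproduce; a sketch:

```python
import hashlib

def coral_key(url):
    """The DSHT key for a cached resource: the SHA-1 hash of its URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```

Hashing gives every proxy the same key for the same URL without coordination, which is what lets independent CoralProxies find each other's cached copies.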

       CoDeeN [48] is an academic testbed Content Distribution Network (CDN) built on top of PlanetLab
by the Network Systems Group at Princeton University. It consists of a network of high-performance
proxy servers, which act both as request redirectors and server surrogates. These servers cooperate with
each other and collectively provide a fast and robust web content delivery service to CoDeeN users.
       A number of projects are related to CoDeeN – CoBlitz (a scalable Web-based distribution system for
large files), CoDeploy (an efficient synchronization tool for PlanetLab slices), CoDNS (a fast and reliable
name lookup service), CoTop (a command line activity monitoring tool for PlanetLab), CoMon (a Web-
based slice monitor that monitors most PlanetLab nodes), and CoTest (a login debugging tool). CoDeeN
provides caching of Web content and redirection of HTTP requests.
       A client wishing to use CoDeeN only needs to add one of their proxies (listed on their site [1]) to his
web browser’s proxy configuration. Requests to that proxy are then forwarded to an appropriate member
of the system that has the file cached and that has sent recent updates showing that it is still alive. The file
is forwarded to the proxy and then to the client. Thus even if the server response time is slow, as long as
the content is cached on the system, serving requests to that file will be fast. It also means that the request
will not be satisfied by the original server.

       A significant application service running on top of CoDeeN is CoBlitz [49]. It is a file transfer
service which distributes large files without requiring any modifications to standard Web servers and
clients, since all the necessary support is located on CoDeeN itself. Its mechanism is very similar to that
of CoralCDN. Clients only need to prepend the original URL with “http://coblitz.codeen.org:3125” and
fetch it like any other URL. A customized DNS server maps the name coblitz.codeen.org to a nearby
PlanetLab node. In CoBlitz, a large file is considered as a set of small files (chunks) that can be spread
across the CDN. CoBlitz works if the chunks are fully cached, partially cached, or not at all cached,
fetching any missing chunks from the origin as needed. Thus, while transferring large files over CoBlitz,
no assumptions are made about the existence of the file on the peers.
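The chunking idea can be illustrated with a short sketch: a large file is treated as a sequence of fixed-size chunks, each of which can be requested and cached independently (the chunk size and the tuple format are assumptions for illustration, not CoBlitz's actual values):

```python
CHUNK_SIZE = 60 * 1024   # assumed chunk size; CoBlitz's real value may differ

def chunk_names(url: str, total_size: int):
    """Derive one sub-request per chunk of a large file, so each chunk
    can be fetched and cached independently by different CDN nodes.
    Each entry corresponds to an HTTP Range request (url, first, last)."""
    names = []
    offset = 0
    while offset < total_size:
        end = min(offset + CHUNK_SIZE, total_size) - 1
        names.append((url, offset, end))
        offset = end + 1
    return names
```

Missing chunks are simply those ranges not present in any cache; they are fetched from the origin as needed, which is why no assumption about prior replication is required.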

       Globule [50] is an open source collaborative content delivery network (CCDN) developed at the
Vrije Universiteit in Amsterdam. A CCDN is a network composed of end-user machines that operate in a
peer-to-peer fashion across a wide-area network. Globule is composed of Web servers that cooperate
across a wide-area network to provide performance and availability guarantees to the sites they host. It
provides replication of content, monitoring of servers, and redirection of client requests to available
replicas.

    [1] CoDeeN web proxies list: http://fall.cs.princeton.edu/codeen/

    In Globule, there is a strong distinction between a site and a server. A site is defined as a collection of
documents that belong to one specific user (the site’s owner) and a server is a process running on a
machine connected to a network, which executes an instance of the Globule software. Each server may
host one or more sites and deliver their content to clients. In Globule, internode latency is taken as a
proximity measure, and used to optimally place replicas close to clients, and to redirect clients to an
appropriate replica server. Globule estimates latencies by positioning nodes in an M-dimensional
geometric space. The latency between any pair of nodes is then estimated as the Euclidean distance
between their corresponding M-dimensional coordinates. To calculate the coordinates of node X, the
latencies between X and m designated landmarks (m slightly larger than M) are measured. These
measurements are transparent to the clients, and are based in three steps: when a browser accesses a page
1) it is requested to download a 1x1-pixel image from each of the landmarks; 2) the client-to-landmark
latency is measured passively by the landmark during the TCP connection phase, and 3) reported back to
the origin server where the node’s coordinates are computed. The metric estimation service used in
Globule is passive, and does not introduce any additional traffic to the network. However, results in
[51] show that the distance metric estimation procedure is not very accurate.
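The latency-estimation step described above reduces to a Euclidean distance computation over node coordinates. A minimal sketch (the coordinates and replica names here are made up for illustration):

```python
import math

def estimated_latency(coord_a, coord_b):
    """Globule-style estimate: the latency between two nodes is
    approximated by the Euclidean distance between their
    M-dimensional coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coord_a, coord_b)))

def closest_replica(client_coord, replicas):
    """Redirect the client to the replica with the smallest estimated
    latency. 'replicas' is a list of (name, coordinates) pairs."""
    return min(replicas, key=lambda r: estimated_latency(client_coord, r[1]))
```

For example, a client at coordinates (1, 1, 9) would be redirected to a replica at (0, 0, 10) rather than one at (50, 20, 5), since its estimated distance to the former is far smaller.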
    Globule supports HTTP as well as DNS redirection, and is implemented as a third-party module for
the Apache HTTP Server that allows any given server to replicate its documents to other Globule servers.
To replicate content, content providers only need to compile an extra module into their Apache server and
edit a simple configuration file. Globule automatically replicates content and redirects clients to a nearby
replica server.

    Akamai [52] [53] is the most used CDN in the world today. It is the market leader in providing
content delivery services and owns more than 25,000 servers over nearly 900 networks in 69 countries
[52]. Akamai servers deliver static, dynamic content and streaming audio and video.
    Flash crowds are handled by allocating more servers to sites experiencing high load, while serving all
clients from nearby servers. Client requests are directed to the nearest available surrogate likely to have
the requested content. Akamai uses a mapping system that resolves a hostname based on the service
requested, user location and network status. It also uses a dynamic, fault-tolerant DNS system for network
load-balancing. Akamai name servers resolve hostnames to IP addresses by mapping requests to a server,
using a complex adaptive request-routing algorithm. It takes into consideration a number of metrics such
as replica server load, the reliability of loads between the client and each of the replica servers, and the
bandwidth that is currently available to a replica server. This algorithm is proprietary to Akamai, and its
technical details are unavailable.
    Akamai’s DNS-based load balancing system continuously monitors the state of services and their
servers and networks. Akamai uses agents that simulate end-user behavior to monitor the system’s end-
to-end health, by downloading Web objects and measuring failure rates and download times. With this data,
it can detect and suspend problematic servers. Each of the content servers periodically reports its load to a
monitoring application, which then aggregates and sends load reports to the local DNS server. That DNS
server then determines which IP addresses (two or more) to return in the request routing phase. Akamai
uses two thresholds, one to determine when some of the server’s content should be allocated to additional
servers, and another that, once exceeded, makes the server’s IP address no longer available to clients.
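The two-threshold policy can be sketched as follows; the threshold values are pure assumptions, since Akamai's actual algorithm is proprietary:

```python
# Illustrative two-threshold policy (the values are assumptions;
# Akamai's real algorithm and thresholds are not public).
SPILL_THRESHOLD = 0.7   # above this, content is replicated to more servers
DROP_THRESHOLD = 0.9    # above this, the server is withheld from DNS replies

def routing_decision(load: float) -> str:
    """Decide what the DNS load balancer does with a server,
    given its reported load in [0, 1]."""
    if load > DROP_THRESHOLD:
        return "withhold"    # IP address no longer returned to clients
    if load > SPILL_THRESHOLD:
        return "replicate"   # allocate content to additional servers
    return "serve"
```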

    Limelight Networks [54] is a CDN that provides distributed on-demand and live delivery of video,
music, games and downloads. Limelight Networks has surrogate servers located in 72 locations around the
world, and has created a system for distributed digital media delivery to large audiences. It is built on a
CDN platform that allows it to shape the CDN to meet any content provider’s specific needs and
environment. Limelight Networks has the following products: Limelight ContentEdge for distributed
content delivery via HTTP, Limelight MediaEdge Streaming for distributed video and music delivery via
streaming, and Limelight Custom CDN for custom distributed delivery solutions. Limelight MediaEdge
Streaming is a distribution platform that provides high performance services for live and on-demand
streaming of audio and video content over the Internet. Content providers using Limelight’s streaming
services use Limelight User Exchange (LUX) to track end-users’ activity with real-time reporting, which
can be used to assess and improve their media streaming strategy.

 2.3 Peer-to-Peer and NAT Traversal

    To reduce the load on the central server, and remove the single point of failure, a P2P approach
would provide a viable solution if it were:
         •     Scalable – a BOINC project like Einstein@Home already has thousands of users, but there
               is a lot of potential in volunteer computing, with the increasing number and power of
               existing PCs, and the vast number of new projects that can appear (especially if it were
               possible to use more data-intensive applications),
         •     Distributed – it would not make much sense to use a centralized approach like Napster since
               we are trying to eliminate the single point of failure,
         •     Client-oriented – with the distributed design, the client would have to be able to make most
               of the decisions autonomously without the participation of an outside entity, and
         •     Efficient – with P2P, we would be giving the users more responsibility, but also an
               undesired overhead which should not influence the computation being performed by the
               BOINC client.

    Furthermore, involving users in data transfers would add an extra problem: firewall/NAT
circumvention. Many clients are behind a NAT/firewall, so it is important to discuss this issue before
presenting P2P systems.

 2.3.1             NAT Traversal

    In order to obtain a widespread virtual network composed of desktop computers, one must address
the problem of NAT (firewall) traversal, to reach all possible clients. Some approaches, like
SETI@home, rely on a unidirectional connection between the client and a server, with the client initiating
all the connections. Other approaches, like WOW or P3, extend or use P2P network overlays like Brunet
or JXTA to reach a more scalable structure. To reach a convergence between these two, it is important to
learn about the problems and solutions for NAT traversal.

    There are two basic methods for a client to determine the NAT-mapped public address:port pair. The
first is to ask the NAT; the second is to ask someone outside the NAT what the actual address:port pair
should be. Universal Plug and Play (UPnP) is a protocol that allows client applications to discover and
configure network components, including NATs and Firewalls, which are equipped with UPnP software.
Using this technology, a client queries the NAT via UPnP asking what mapping it should use if it wants
to receive on port x. One problem with UPnP is that it will not work in the case of cascading NATs (more
than one level of NAT). There are also security issues that have not yet been addressed with UPnP.
    In the absence of a method of communicating with the NAT device, the next best way for a client to
determine its external address:port pair is to ask a server sitting outside the NAT on the public Internet
how it sees the source of a packet coming from this client. In this scenario, a server sits listening for
packets (call this a NAT probe). When it receives a packet, it returns a message from the same port to the
source of the received packet containing the address:port pair that it sees as the source of that packet. The
client can then determine: a) if it is behind a NAT; b) the public address:port pair it should use in the SDP
message in order for the endpoint to reach it. This will not work in the case of symmetric NATs, since the
IP address of the NAT probe is different from that of the endpoint. Simple Traversal of UDP through
NATs (STUN) is a protocol for setting up the kind of NAT probe that was just described. It can also help
determine which kind of NAT the client is behind. Note that the STUN server does not sit in the signaling
or media data path. Neither a NAT probe nor a STUN server will work with symmetric NATs.
    One solution to the symmetric NAT problem is known as Connection Oriented Media. If an
endpoint is meant to speak both to clients that are behind NATs and clients on the open internet, then it
must know when it can trust the SDP message that it receives in the SIP message, and when it needs to
wait until it receives a packet directly from the client before it opens a channel back to the source
address:port pair of that packet. One proposal for informing the endpoint to wait for the incoming packet
is to add a line to the SDP message (coming from the client behind the NAT): a=direction:active. When
the endpoint reads this line, it understands that the initiating client will “actively” set up the address:port
pair to which the endpoint should return RTP, and that the address:port pair found in the SDP message
should be ignored. This approach is still problematic, as it depends on endpoint support of the
a=direction:active tag. Since there are not many endpoints to date supporting this tag it is pretty much
unusable. Using the concept behind Connection Oriented Media, one can simply “ignore” the SDP in all
cases and always respond to the port from where it receives RTP traffic. While this approach works well
for GW’s to the PSTN, it is not by any means a panacea as this solution breaks down if both endpoints are
behind symmetric NATs. For this latter case, the only possible remedy is some sort of packet relay
element that relays the packets between the two endpoints.
    Traversal Using Relay NATs (TURN) complements STUN and places the probe in the signaling
and media path. The probe in essence “terminates” the media for both ends so that vis-à-vis the client the
same probe that detected its address:port pair in the first place is also the probe that is sending the client
media so the symmetric problem is taken care of. QoS and Security requirements at the entrance to the
network limit using a TURN like approach since relevant SIP session information is not exposed in the
TURN protocol.
     If we combine the strengths of both “Symmetric RTP” and the “TURN server” we can design an
element named a media-relay that has the best of both worlds. The relay can send media packets to an
endpoint on a port previously used by that same endpoint to send a media packet to the relay. This applies
to both endpoints. If a mapping can be generated to match up the two endpoints, then the media relay can
be used as an intermediary to facilitate media packet transfer between the two endpoints. As opposed to
the TURN server, since the relay has access to the SIP message this media port manipulation is quite
trivial. The IETF has defined a standard named ICE (Interactive Connectivity Establishment), which
empowers the endpoints to determine the types of NATs that exist between them and come up with a list
of IP addresses through which the endpoints can communicate. The discovery process makes use of most
of the mechanisms discussed above including STUN, TURN and RSIP.

    BOINC only supports communications started by the client, so this is not a problem in the current
architecture.

 2.3.2           P2P

    Both in terms of number of participating users and in traffic volume, KaZaA [19] is one of the most
important applications in the Internet today. In fact, it can be argued that KaZaA has been so successful
that any new proposal for a P2P file sharing system should be compared with the KaZaA benchmark.
Nevertheless, because KaZaA is proprietary and uses encryption, little is understood about KaZaA's
overlay structure and dynamics, its messaging protocol, and its index management.
    KaZaA peers differ in availability, bandwidth connectivity, CPU power, and NATed access. KaZaA
was one of the first P2P systems to exploit this heterogeneity by organizing the peers into two classes,
Super Nodes (SNs) and Ordinary Nodes (ONs). SNs are generally more powerful in terms of
connectivity, bandwidth, processing, and non-NATed accessibility. Each ON has a parent SN. When an
ON launches the KaZaA application, the ON chooses a parent SN, maintains a semi-permanent TCP
connection with its parent SN, and uploads to this SN the metadata for the files it is sharing. As with most
other P2P file sharing systems, KaZaA maintains a file index that maps file identifiers to the IP addresses
of the peers sharing them.
This file index is distributed across the SNs. In particular, each SN maintains a local index for all of its
children ONs. When a user wants to locate files, the user's ON sends a query with keywords over the TCP
connection to its parent SN. For each match in its database, the SN returns the IP address, server port
number, and metadata corresponding to the match. Each SN also maintains long-lived TCP connections
with other SNs, creating an overlay network among the SNs. When a SN receives a query, it may forward
the query to one or more of the SNs to which it is connected. A given query will in general visit a small
subset of the SNs, and hence will obtain the metadata information of a small subset of all the ONs. A
KaZaA peer has the following software components: 1) The KaZaA Media Desktop (KMD); 2) Software
environment information stored in the Windows Registry. Included in this environment information is a
list of up to 200 SNs (referred to as the SN list cache); 3) DBB files, with each DBB file containing
metadata for the files that the peer is willing to share; 4) DAT files, with each file containing a partially
downloaded file. Once all the file data is retrieved, the DAT file is renamed to the original file which was
intended to be downloaded. The KaZaA ON-SN and SN-SN signalling messages are encrypted.
    In [20], to unravel the mysteries of the KaZaA overlay, two measurement apparatus were developed:
the KaZaA Sniffing Platform and the KaZaA Probing Tool. The KaZaA Sniffing Platform collects
KaZaA signaling traffic. The KaZaA Probing Tool is used for analyzing node availabilities and KaZaA
neighbor selection. KaZaA nodes frequently exchange with each other lists of SNs. In particular, when an
ON connects with a parent SN, the SN immediately pushes to the ON a SN refresh list, which consists of
the IP addresses, port numbers and workload values of up to 200 SNs. The first entry in the SN refresh
list is the parent SN that is sending the list. When an ON receives a SN refresh list from its parent SN, the
ON will typically purge some of the entries from its SN list cache and add entries sent by the parent SN.
Neighboring SNs in the overlay also exchange SN refresh lists. When a peer launches the KaZaA client,
the first task of the client is to choose a parent SN and establish an overlay link (i.e., TCP connection)
with it. To this end, the following steps are taken: the ON chooses several (typically 5) candidate SNs
from the list and probes the candidates by sending one UDP packet to each candidate. The ON then
receives UDP responses from a subset of these candidates; to each SN from which it receives a UDP
response, the ON attempts to establish a TCP connection. For each such connection, the SN and ON will
exchange encryption key material, the ON will send peer information, and the SN will send a SN refresh
list. Included in the peer information is the local IP address, service port number and username; the ON
will then select one of the SNs and disconnect from the other SNs. The one remaining SN becomes the
ON's parent SN. The structure of signaling messages is as follows: each message begins with the identifier “K”, which is
then followed by a message-type field (two bytes), a payload-length field (two bytes), and the payload
itself. When an ON establishes a link with a parent SN, it informs the parent SN of its port number.
Furthermore, the SN refresh lists sent among the peers also advertise the port numbers of the SNs. The
measurements indicate that roughly 30% of the KaZaA peers are behind NATs. KaZaA's two-tier
hierarchy provides a mechanism to partially solve this problem. In KaZaA, when peer A sees that peer B
has a private NAT address, instead of sending a request directly to peer B, it sends the request to peer B's
parent SN. The parent SN then sends a message to peer B, indicating that it should initiate a connection
directly back to peer A.
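The signaling-message layout described above – the identifier "K", a two-byte type, a two-byte payload length, then the payload – can be sketched as follows. The byte order is an assumption, since the real protocol is encrypted and undocumented:

```python
import struct

def frame_message(msg_type: int, payload: bytes) -> bytes:
    """Frame a signaling message as described in the text: identifier
    'K', a two-byte type, a two-byte payload length, then the payload.
    (Big-endian byte order is an assumption; KaZaA's actual wire
    format is encrypted and not publicly documented.)"""
    return b"K" + struct.pack(">HH", msg_type, len(payload)) + payload

def parse_message(buf: bytes):
    """Parse a framed message back into (type, payload)."""
    assert buf[0:1] == b"K", "bad identifier"
    msg_type, length = struct.unpack(">HH", buf[1:5])
    return msg_type, buf[5:5 + length]
```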

    The measurement results can be leveraged to set forth a number of key principles for the design of an
unstructured P2P overlay.
         •   Distributed Design: unlike Napster, KaZaA does not rely on infrastructure servers,
             essentially all of its nodes run on user peers.
         •   Exploiting Heterogeneity: peers differ in availability, bandwidth connectivity, CPU power,
             and NATed access. KaZaA was one of the first P2P systems to exploit this heterogeneity by
             organizing the peers into two classes, Super Nodes (SNs) and Ordinary Nodes (ONs). SNs
             are generally more powerful in terms of availability, bandwidth, processing, and non-
             NATed accessibility. The SNs process, distribute, and respond to query traffic; and the SNs
             process index-maintenance traffic and overlay maintenance traffic.

         •   Load Balancing: To achieve approximate balance, the overlay can be designed so that each
             SN has roughly the same degree in the overlay (that is, has roughly the same number of TCP
             connections to ON and SN neighbors).
         •   Locality in Neighbor Selection: locality, in the form of a common IP prefix and short RTTs,
             plays a role in determining an ON's parent SN as well as in the selection of SN-SN links. The
             advantages of locality have to be weighed against the need for high content availability.
         •   Connection Shuffling: By shuffling the links in the overlay, a larger set of SNs can be
             visited for an extended search period. Shuffling the links in the overlay helps the P2P
             system to find a replacement copy to complete a download.
         •   Efficient gossiping algorithms: In a two-tier distributed P2P system, it is critical that the
             SNs learn about the other SNs in the network, so that they can shuffle connections as well as
             find new SNs when existing connections leave. Thus the SNs need to gossip SN lists to each
              other. One of the fields in the SN refresh list is the “freshness” field. This value enables the
              peers (ONs and SNs) to estimate the freshness of the SN availability information.
         •   Firewall avoidance and NAT circumvention: KaZaA uses dynamic port numbers along
             with its hierarchical design to avoid firewall blocking. Furthermore, it uses connection
             reversal to allow NATed peers to share files.
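As an illustration of the gossiping principle above, merging two SN refresh lists by the freshness field and trimming the result to the 200-entry cache mentioned earlier might look like this (the entry format is hypothetical, chosen only for the sketch):

```python
def merge_sn_lists(local, incoming, cache_size=200):
    """Merge two SN lists keyed by (ip, port), keeping the entry with
    the higher 'freshness' value, then trim to the cache limit.
    Entry format here is hypothetical: {(ip, port): freshness}."""
    merged = dict(local)
    for addr, fresh in incoming.items():
        if fresh > merged.get(addr, -1):
            merged[addr] = fresh
    # keep only the freshest cache_size entries
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:cache_size]
    return dict(top)
```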

    BitTorrent [29] is a file-downloading protocol whose goal is to quickly replicate files to a set of
clients. A torrent consists of a central component, called the tracker, and all the currently active peers. In a
BitTorrent network, a peer that wants to download a file first connects to the tracker of the file. The
tracker returns a random list of peers that have the file, and the downloader establishes a connection to
these other peers. To initiate a new torrent, there must be at least a Web server that allows peers to
discover the tracker, and an initial seed (a peer with a complete copy of the file). Files are split into chunks and the
downloaders of a file barter for chunks of it by uploading and downloading them. A study of the protocol
[30] revealed that there is a positive correlation between download and upload rates, which means that a
user is rewarded for having a better upload.
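The chunk bartering relies on each piece being independently verifiable. Below is a sketch of the per-piece SHA-1 hashing that a BitTorrent metainfo (.torrent) file carries; the piece length shown is a common choice, not a fixed protocol constant:

```python
import hashlib

PIECE_LENGTH = 256 * 1024   # a common piece size; actual torrents vary

def piece_hashes(data: bytes, piece_length: int = PIECE_LENGTH):
    """Split a file into fixed-size pieces and hash each with SHA-1,
    as a .torrent metainfo file does. Peers then barter pieces and
    verify each one against its published hash on arrival."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

def verify_piece(piece: bytes, expected_hash: bytes) -> bool:
    """A downloader discards any piece whose hash does not match."""
    return hashlib.sha1(piece).digest() == expected_hash
```

Because corrupted pieces fail verification and are simply re-requested from another peer, the swarm tolerates unreliable or malicious uploaders without trusting any single one.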
    In [31], an evaluation of the BitTorrent Protocol for Computational Desktop Grids is presented. A
data management prototype is designed using the XtremWeb Desktop Grid as a reference architecture.
Based on this prototype, experiments are conducted to evaluate the potential of BitTorrent compared
to a classical approach based on an FTP data server. The new architecture enhances the middle tier with two
entities dedicated to data management: the Data Catalog and the Data Repository. The Data Catalog
keeps track of the data and of its location; the Data Repository stores the data and can be remotely
accessed by senders and receivers of files. It runs the necessary software to access the data, e.g. an FTP file
server or a BitTorrent tracker.
    The results obtained in the article referred to above were complemented in [32], reaching the
following conclusions:
          •    Basic performance – distributing files (1 to 250 MB) over 20 nodes:
               BitTorrent outperforms FTP when the file size is over 20MB, while FTP is more efficient
               for small files – BitTorrent has a 0.8s overhead due to protocol steps, against 0.1s for FTP.
          •    Scalability evaluation – a node pool varying from 1 to 64 nodes downloading a 50MB file:
               the BitTorrent download time remains stable as the number of nodes increases, while in FTP
               it increases linearly. With 50MB, there is a crossover point around 10 workers, below which
               FTP is more efficient than BitTorrent due to BitTorrent's overhead.

    The scenario where a set of workers start the file transfer at the same time is very unlikely. Therefore,
in a new experiment two files (10 and 100MB) are distributed to a set of 20 nodes, and workers start their
download one after the other, with a waiting time of 1 minute.
         •    For the 10 MB file, the time to complete the first transfer is a little higher for BitTorrent
              (20s) than FTP (18s) but as more and more copies of the file are distributed to other nodes,
               the download time for BitTorrent decreases by a factor of 1.9 while the download time for FTP
              stays the same;
         •    When considering a larger file (100MB), the time for the first download is also decreased
              compared to FTP.

    Evaluation on a synthetic multi-parametric application – an application is composed of a set of n
independent tasks, n being equal to the number of involved nodes. A task consists of two phases: a file
transfer (20MB) followed by an execution. The execution time of each task is t_communication + t_computation. The
reference t_communication is set to the time to transfer 20MB between 2 nodes using FTP, and the ratio
r = t_communication / t_computation varies from 0.1 to 10.
         •    Speed-up of BitTorrent increases with the number of nodes and the communication ratio to
              reach a factor of 2.5 when r is 10 and n is 70;
         •    When the number of nodes is small and the communication ratio is high, FTP outperforms
              BitTorrent due to the large overhead when transmitting small files.

    Like BOINC, XtremWeb relies on a centralized data service architecture (Coordinator in XtremWeb,
Project Server on BOINC), which means that these results should give us a clue on what to expect on
BOINC.

2.3.3              Application of Super Peer Protocol on Grid

    In [21], there is a proposal for the use of a super peer protocol for the submission of a very large
number of jobs in a Grid environment.
    In the scheme proposed there, the Job initiator is lightweight and sends the data once to the network,
which propagates it across the data nodes as and when required. This helps to distribute the data load
dynamically in a decentralized fashion, both in topology and administratively. The super peer job
submission protocol described in this paper enables caching of the input data files in multiple data
centers, i.e. in super-peers, which have sufficient data storage facilities. The job manager node (i) receives
data from a detector, (ii) produces the job description files (or job adverts), and (iii) collects output
results. Simple peers, or workers, are available for job execution: they issue a job query to get a job
description and then a data query to collect the corresponding input data file to be analyzed. Super-peer
interconnections are exploited to make job and data queries travel the network rapidly; super peers play
the role of rendezvous nodes, since they can store job and data adverts (and potentially the data files
themselves), and compare these files with queries issued to discover them; thereby acting as a meeting
place for both job or data providers and consumers. Only some of the peers in the network will cache
such files. Such peers are referred to as data centers (DC) nodes and can be located on super peers or
worker peers. Each user decides if they want to be a super peer and/or data center, as well as a worker.
The job submission protocol requires that job execution is preceded by two matching phases, the first one
for job assignment and the second one for downloading of input data. In the job-assignment phase the job
manager generates a number of job adverts, which are XML documents describing the properties of the
jobs to be executed, and sends them to the local rendezvous super-peer, which stores the adverts. Each
worker, when ready to offer a fraction of its CPU time, sends a job query that travels the Grid through the
super-peer interconnections – a query message is sent to the directly connected super-peer, which in turn
forwards it to its neighbor super-peers and so on, until the message TTL parameter is decremented to 0 or
the job query finds a matching job advert. Whenever the job query gets to a rendezvous super-peer that
maintains a matching job advert, such a rendezvous assigns the related job to the requesting worker by
directly sending it a job assignment message. In the data-download phase, the worker that has been
assigned a job inspects the job advert, which contains information about the job and the required input
data file. Then the worker sends a data query message to discover the input file. The data query travels the
super-peer network searching for a matching input data file stored by a data center. Since the same file
can be maintained by different data centers, a data center that receives a data query does not send data
directly to the worker, in order to avoid multiple transmissions of the same file. Instead, the data
center sends only a small data advert to the super peer connected to the worker and then to the worker
itself. The worker initiates the download operation after receiving the first data advert, and discards the
subsequent adverts. After receiving the input data, the worker executes the job, reports the results to the
job manager and immediately issues a query for another job.
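The hop-by-hop query forwarding with a TTL can be sketched as a bounded traversal of the super-peer overlay. All names and data structures below are illustrative, not taken from [21]:

```python
from collections import deque

def flood_query(start_sp, adverts, neighbors, ttl):
    """Sketch of the job-query forwarding described above: a query is
    forwarded super-peer to super-peer, decrementing a TTL, until it
    expires or reaches a super-peer holding a matching job advert.
    'adverts' maps super-peer -> set of job ids; 'neighbors' describes
    the super-peer overlay. Returns (rendezvous, job_id) or None."""
    queue = deque([(start_sp, ttl)])
    visited = {start_sp}
    while queue:
        sp, t = queue.popleft()
        if adverts.get(sp):
            return sp, next(iter(adverts[sp]))   # matching rendezvous found
        if t == 0:
            continue                             # TTL expired on this branch
        for nb in neighbors.get(sp, []):
            if nb not in visited:
                visited.add(nb)
                queue.append((nb, t - 1))
    return None
```

With the chain A–B–C and a job advert only at C, a query started at A needs a TTL of at least 2 to be matched, which illustrates how the TTL bounds the fraction of the overlay a single query visits.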
    In the job assignment phase the protocol works in a way similar to the BOINC software, except that
job queries are not sent directly to the job manager, as in BOINC, but travel the super-peer network hop
by hop. In contrast, the data download phase differs from BOINC in that it exploits the presence of
multiple data centers in order to replicate input data files across the Grid network.
    This is a preliminary study, as the protocol has not been applied in the real world, but the simulation
results show that the use of several data centers can bring benefits to Grid applications in terms of lower
total execution times, higher throughput and load balancing among worker nodes. As expected, the
super-peer architecture shows promise.

Chapter 3
BitTorrent on BOINC
    In this chapter, we will present the architecture of the prototype used to evaluate the use of BitTorrent
on BOINC, and the issues involved in implementing the BitTorrent protocol on BOINC.
    We will begin by addressing some of the concerns one must have when developing such an
architecture; then we will describe this new scenario, point out some of its shortcomings and problems,
present our experimental results, and finally propose possible future research directions.

 3.1 Concerns

    When considering the practical application of P2P technologies to the “production” BOINC
environment, several concerns must be adequately addressed if the solution is to be successful. For the
purposes of this thesis, we have chosen to focus on the following four:
         •   Router Configuration — a Peer-to-Peer infrastructure must have a way to automatically
             configure routers or somehow bypass NAT issues through the use of relaying;
         •   Data Integrity — mechanisms for identifying hosts that supply bad data, and subsequently
             banning them from the network or having ways to avoid using them;
         •   Adaptable Network Topology — the ability not only to adapt to the wide area network, but
             also to detect and exploit local area network topologies and relative proximity;
         •   BOINC Integration — any new technology must be easy to integrate with the current BOINC
             client software; in practice this means a C++ implementation or binding.

 3.1.1            Firewall & Router Configuration

    BitTorrent, like other P2P protocols, is based on two-way communication between peers. Every peer,
seed or not, is supposed to accept requests for chunks from other peers, and therefore must allow
incoming connections by opening the BitTorrent port (usually in the 6881–6889 range) in its
router/firewall. In a common BitTorrent usage scenario, there are always users behind routers/firewalls
without the necessary open ports. The file transfer is nevertheless usually completed, since clients with
open ports or public IPs are able to receive incoming connections. It is therefore possible for a client to
use BitTorrent while allowing only outgoing connections. In the worst-case scenario, should no peer in a
BitTorrent swarm accept incoming connections (including the initial seed), the system would not function
at all.
    There is no easy answer for this problem, faced by most P2P protocols. As presented in the State of
the Art, if both clients are behind symmetric NATs, the only solution is to use a relay server, possibly a
node with a public IP that would act as an intermediary between two clients. This methodology is used by
Skype, but it would prove disastrous in this case, given the size of the shared files, causing an excessive
overhead on the relay. For non-symmetric NATs, hole punching techniques could be used, but it would
involve changes in the BitTorrent core software layer, which is beyond the scope of this thesis.

    3.1.2           Data Integrity

       The integration of BitTorrent would bring new security issues to BOINC, and would create more
possibilities for malicious users to exploit the system. The BitTorrent protocol itself does not strictly
enforce fairness and exploits are possible, but the use of a central tracker decreases the danger of
malicious attacks. Hashing prevents bad data from being propagated across the network, and small chunk
sizes can be used to avoid downloading too much corrupted data. An additional level of security is
provided by certain BitTorrent clients, such as Azureus2, which ban peers that share bad data. The
“original” BitTorrent client by Bram Cohen [55] incorporates a similar mechanism by default, enabled by
the option --retaliate_to_garbled_data, which refuses further connections from the addresses of peers that
send broken or intentionally hostile data.
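As an illustration of the per-piece defence just described, the sketch below uses a placeholder digest (real clients hash each piece with SHA-1 against the digests stored in the .torrent metainfo) and a hypothetical ban threshold:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Placeholder digest: real BitTorrent clients check the SHA-1 of each piece
// against the hashes stored in the .torrent metainfo.
static std::string toy_digest(const std::string& piece) {
    return std::to_string(std::hash<std::string>{}(piece));
}

// A piece that fails verification is discarded, so bad data never propagates.
bool piece_ok(const std::string& piece, const std::string& expected_digest) {
    return toy_digest(piece) == expected_digest;
}

// Count verification failures per peer address and ban a peer once it crosses
// a threshold (the threshold of 3 is our assumption, not a protocol constant).
struct PeerBanList {
    std::vector<std::pair<std::string, int>> failures;  // address -> bad pieces
    int threshold = 3;

    // Returns true when the peer should now be refused further connections.
    bool record_bad_piece(const std::string& peer) {
        for (auto& f : failures)
            if (f.first == peer) return ++f.second >= threshold;
        failures.push_back({peer, 1});
        return 1 >= threshold;
    }
};
```

Small chunk sizes keep the cost of each discarded piece low, which is why the two mechanisms complement each other.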
       Therefore, the main problem with BitTorrent is not in the protocol itself, but rather in the peer
swarms which allow BOINC users to obtain a list of other users that are downloading the same file (and
possibly executing the same work unit). A client could send consecutive requests for peer lists to the
tracker, and build a comprehensive database of peers sharing a file. Should a user from the list answer the
attacker and agree to cooperate with him, or become compromised, several negative scenarios would be
possible. For example, both users could report bad results that would be marked as correct if there was
not enough replication (in practice, this number is not higher than three, so two users would form a
quorum), or they could report a much higher computation time than they actually used, in an attempt to
obtain more credits. A possible solution for this problem would be a trust-based system, where peers
would have a reputation based on their past actions. This was presented in [56], where weighted voting
mechanisms were proposed, and clients were classified according to the results of the computation.
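As an illustration, the weighted voting idea can be sketched as follows; the structure, names and quorum value below are our own assumptions, not details taken from [56]:

```cpp
#include <string>
#include <utility>
#include <vector>

// One reported result: the hash identifying it, and the reporting client's
// reputation (0..1, built from its past behaviour). Illustrative types only.
struct Vote {
    std::string result_hash;
    double reputation;
};

// Accept the result whose accumulated reputation weight reaches the quorum,
// or "" if no result is trusted enough yet. With plain replication every vote
// weighs 1; here two colluding low-reputation clients cannot outvote an
// established one.
std::string weighted_quorum(const std::vector<Vote>& votes, double quorum) {
    std::vector<std::pair<std::string, double>> tally;
    for (const auto& v : votes) {
        bool found = false;
        for (auto& t : tally)
            if (t.first == v.result_hash) { t.second += v.reputation; found = true; }
        if (!found) tally.push_back({v.result_hash, v.reputation});
    }
    for (const auto& t : tally)
        if (t.second >= quorum) return t.first;
    return "";
}
```

The design choice is that trust is earned slowly: freshly joined (or colluding) clients carry little weight, so a two-client quorum of strangers no longer suffices to validate a bad result.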

    3.1.3           Adaptable Network Topology

       An interesting advantage of the BitTorrent protocol would be the possibility of taking advantage of
the network topology. Clients could give a higher priority to peers on the same Local Area Network,
reducing the traffic generated to the outside. Bram Cohen’s BitTorrent client has an option enabled by
default, --use_local_discovery, which scans the local network for other clients with the desired content.
Connections to the outside of the local network are reduced, and outgoing bandwidth is saved. This can
be particularly useful for BOINC clients working inside organizations.
       Another possibility would be using an approach similar to the one used in the Julia Content
Distribution Network [57], in which nodes gather statistics about the network conditions as the download
progresses, and then contact closer nodes (in terms of latency and bandwidth).

    See project web site at: http://azureus.sourceforge.net/

 3.1.4            BOINC Integration

    To allow for an easy integration with BOINC, the current prototype implementation was written in
the same language as BOINC, C++. This minimized conflicts and the number of additional software
packages needed. Additionally, if the initial “seed” is left running on the network, failure of any or all
peer P2P data nodes to distribute data to a given client essentially causes a fallback to the standard
centralized behaviour that BOINC currently implements, greatly reducing the risk of data distribution
failure.
    Previous BOINC clients can keep using the original HTTP transfer mode, since both clients are
compatible with the current BT BOINC server.

 3.2 BitTorrent Scenario

    To apply this new scenario, both the BOINC server and the client had to be altered. The BitTorrent
protocol requires the integration of components such as a tracker and a BitTorrent client.
     This section is divided in three subsections: one to describe the changes made on the server; a
second one to present the alterations on the client side; and finally a description of the new process for file
transfers.
 3.2.1            Server

    As mentioned before, the current BOINC architecture relies on a central server to answer client
work requests and to distribute work. As shown in Figure 4, the project back-end incorporates a
scheduler and at least one data server (among other internal components).

                                        Figure 4 - Original BOINC

    This architecture had to be slightly changed in order to allow the BitTorrent protocol to be used.
BOINC clients with BT capability are able to download through the BT protocol by connecting to
a tracker, as shown in Figure 5, while others continue using HTTP downloads as usual.

                                        Figure 5 – BOINC with tracker

    In the integration of BitTorrent in BOINC, the main server code remains relatively unchanged, but a
tracker is needed to coordinate downloads.
    When the BitTorrent tracker is installed on the central server, a port is defined to receive client
requests (normally 6881). We decided to use a centralized tracker because the decentralized alternative is
very recent, and the maintenance and construction of the DHT requires each peer to maintain an
orthogonal set of neighbors within the DHT, and pay the communication costs of maintaining the DHT in
the face of high rates of churn [58].
    For every input file that should be downloaded through BitTorrent, a .torrent file is created,
pointing to the tracker in the central server. This can be done using the command
maketorrent-console, available from the BitTorrent package. The torrent file is named by adding
the extension .torrent to the original file’s name: file.data → file.data.torrent. Both files are hosted on a
project data server.
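The naming convention is mechanical; a trivial helper (illustrative, not from the BOINC source) makes it explicit:

```cpp
#include <string>

// file.data -> file.data.torrent: the convention pairing each input file
// with its metainfo file on the project data server.
std::string torrent_name(const std::string& input_file) {
    return input_file + ".torrent";
}
```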

    A tracker, however, is not enough to allow BitTorrent transfers. The original data file has to be
spread throughout the network from the central server. This means that a BitTorrent client has to be
running alongside the other components of the server, as shown in Figure 6. To start sharing the file,
the BOINC server must start the BitTorrent client to act as a seed and announce itself to the tracker.

                                Figure 6 – BitTorrent BOINC architecture

    The .torrent file is related to the data file through the work unit. When creating work, a tag
<bittorrent/> is added to the file info of the data file in the work unit template and the .torrent file itself is
added as an input file:

         •    Workunit template with input file input.data, which is downloaded, as usual, through
              HTTP:

                      [ ... ]

         •    Workunit template with input file input.data, which can be downloaded through
              BitTorrent:

                         [ ... ]
       The <bittorrent/> tag will tell the client which input files may be downloaded through
BitTorrent. The .torrent files are treated as normal input files, which means they are deleted from the
server at the same time as the input file itself, when the file deleter daemon finds it appropriate (normally
when the workunit has received enough successful result files, or the deadline has passed).
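The elided templates above follow BOINC's workunit template format. A hedged sketch of what the BitTorrent-enabled variant might contain is given below; only the <bittorrent/> tag comes from the text, the remaining element names follow standard BOINC templates, and the prototype's exact layout may differ:

```xml
<file_info>
    <number>0</number>
    <bittorrent/>  <!-- input.data may be fetched via BitTorrent -->
</file_info>
<file_info>
    <number>1</number>  <!-- input.data.torrent, a normal HTTP input file -->
</file_info>
<workunit>
    <file_ref>
        <file_number>0</file_number>
        <open_name>input.data</open_name>
    </file_ref>
    <file_ref>
        <file_number>1</file_number>
        <open_name>input.data.torrent</open_name>
    </file_ref>
</workunit>
```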

    3.2.2            Client

       For BitTorrent file transfers to be possible, the BOINC client must be integrated with a BitTorrent
client capability, as shown in Figure 7.

                                          Figure 7 – BT BOINC client

       To incorporate a BT client’s capability in BOINC, one could use one of the many BitTorrent clients
available3, or use a library to create a client. The idea in this research was to obtain a client that could be
used on any platform, with the original, unaltered version of the BitTorrent protocol, and that was not in
an experimental phase but had already “proven itself”. Experiments with BitTorrent had already been
conducted successfully on XtremWeb (see State of the Art – 2.3.2), using the BitTorrent client
Azureus, which further encouraged this approach.
       We decided to use the original BitTorrent client, from Bram Cohen, because it features the original
BitTorrent protocol (it is in fact managed by the BitTorrent protocol creator himself), and it can be used

    List of BitTorrent clients: http://en.wikipedia.org/wiki/Bittorrent_client

on all platforms (Linux, Mac and Windows). The Azureus client, on the other hand, is written in Java,
which would add an extra dependency on the JRE, and includes extra functionality beyond our needs.

    As mentioned in the server subsection, the BT BOINC client uses the <bittorrent/> tag to identify
input files that can be downloaded through BitTorrent: a parameter was added to the FILE_INFO class
that identifies files as possible BitTorrent downloads.
    After identifying the input file as a BitTorrent downloadable file, the client waits for the download of
the .torrent file to finish. The corresponding .torrent file is downloaded through normal HTTP since it is
considered as another input file (it has MD5 hashes and file integrity checks). When the download is
finished, the BitTorrent client is initiated using the .torrent file as a parameter.
    After finishing the BitTorrent download, the client proceeds to verify the file, checking its size
and MD5 hash, which must match the values provided by the server in the XML scheduler reply. In case
of error (file size does not match the expected value, for instance), a BitTorrent error is declared, and no
further attempts are made at a BitTorrent transfer. The client then tries to download the file through
HTTP. Errors caught from this point on are handled by BOINC, as in a normal transfer.
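The flow just described can be sketched as follows; the types and function names are illustrative stand-ins, not the actual BOINC interfaces:

```cpp
#include <functional>
#include <string>

// Illustrative stand-in for BOINC's FILE_INFO with the added flag.
struct FileInfo {
    std::string name;
    bool bittorrent = false;  // set when <bittorrent/> appears in the template
};

// Transfer operations injected as callables so the policy is testable; in the
// real client these are calls into BOINC's transfer code and the BT client.
struct Transfers {
    std::function<bool(const FileInfo&)> bittorrent_download;
    std::function<bool(const FileInfo&)> http_download;
    std::function<bool(const FileInfo&)> verify;  // size and MD5 hash check
};

// Try BitTorrent first; on any error (failed download or failed verification)
// declare a BitTorrent error, make no further BT attempts, and fall back to
// HTTP. Errors after the fallback are handled as normal BOINC transfers.
bool fetch_input_file(const FileInfo& fi, const Transfers& t) {
    if (fi.bittorrent && t.bittorrent_download(fi) && t.verify(fi))
        return true;  // verified: the client keeps seeding for a while
    return t.http_download(fi);
}
```

The important property is that the BitTorrent path can only improve matters: any failure degrades gracefully to the original centralized transfer.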
    If the download was successful and the file passes the tests, the BitTorrent client keeps on seeding
the file, to share it among the network. The seeding time is a variable defined in the client code, which
requires the recompilation of the client source code each time the value is changed. However, this can be
easily adapted to more user-friendly alternatives:
         •    The seeding time value could be read from a configuration file of the client. The client
              would be able to adapt this value to maximize its connection, while respecting a minimum
              value;
         •    There could be a relation between seeding time and file size: bigger files would have longer
              seeding times; smaller files shorter ones. This could help ensure files that needed more
              time to download had more peers sharing them at any given moment;
         •    A similar approach as the above could be followed, but between expected computation time
              and seeding time. Longer computation would mean less file transfer frequency, which
              would free the network for upload/seeding;
         •    The seeding time could be determined by each project, and communicated to the clients in
              scheduler replies.

    Combinations of the above possibilities could also be used; for example, a project-defined seeding
time could be overridden by a local client-defined value.
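As an illustration, several of the listed policies can be combined in a single helper; the precedence and all constants below are our own assumptions, not values from the prototype:

```cpp
#include <algorithm>

// Choose a seeding time in seconds. Precedence: a local client-defined value
// overrides the project-defined one, and the result is scaled with file size
// so larger files stay seeded longer. All constants are illustrative.
double seeding_time(double project_default_s,  // from scheduler reply, <= 0 if unset
                    double local_override_s,   // from client config, <= 0 if unset
                    double file_size_mb) {
    double base = local_override_s > 0 ? local_override_s
                : project_default_s > 0 ? project_default_s
                : 60.0;  // fallback: the fixed 1 minute used in the experiments
    return base * std::max(1.0, file_size_mb / 30.0);  // 30 MB reference size
}
```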
    One of the main defining characteristics of Desktop Grid Computing is volatility. Clients do not
necessarily keep a continuous stream of computation or connectivity. For this reason, whenever a client
quits or the client state changes (a finished download, for example), that information is written to the
state file. In this scenario, the client saves to the file each BitTorrent transfer that was in place at the
instant of the event. A BitTorrent transfer is characterized by the file it was downloading/seeding, the
time it had already been downloading (for statistics) and seeding (to control seeding time), and its state
(peering or seeding). This way, whenever a client restarts, it will start the BitTorrent client for the
specified files, and continue downloading/seeding from the point it was at before it stopped.
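The per-transfer record described above can be sketched as a small structure serialized into the client state file; the field and tag names are illustrative, not the prototype's actual format:

```cpp
#include <sstream>
#include <string>

// State of one BitTorrent transfer, saved whenever the client quits or the
// transfer changes state, so it can be resumed after a restart.
struct BtTransferState {
    std::string file_name;     // file being downloaded/seeded
    double download_time = 0;  // elapsed download time, kept for statistics
    double seeding_time = 0;   // elapsed seeding time, to enforce the limit
    bool seeding = false;      // false: peering (downloading); true: seeding

    std::string to_xml() const {
        std::ostringstream o;
        o << "<bt_transfer>\n"
          << "    <name>" << file_name << "</name>\n"
          << "    <download_time>" << download_time << "</download_time>\n"
          << "    <seeding_time>" << seeding_time << "</seeding_time>\n"
          << "    <state>" << (seeding ? "seeding" : "peering") << "</state>\n"
          << "</bt_transfer>\n";
        return o.str();
    }
};
```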

3.2.3             BT BOINC File Transfer

    This subsection summarizes the steps taken in a file transfer between server and client.

                                  Figure 8 - BT BOINC File Transfer

    Figure 8 shows the architecture and highlights the steps of a file transfer:
        (1). The client contacts the scheduler and asks for work. The scheduler replies with a given
             work unit and a reference to a .torrent file that represents an input file made available
             via BitTorrent;
        (2). The client then downloads the .torrent file through normal HTTP from the specified
             Data Server;
        (3). After downloading the .torrent file, the BOINC client initiates the local BitTorrent
             client with the .torrent as an argument. The BitTorrent library then contacts the tracker
             defined in the file and receives a list of peers;
        (4). The client contacts the chosen peers and the BitTorrent protocol is used to download the
             file chunks and re-assemble the input file for processing by the local BOINC client.
    The downloaded input file is then checked for integrity through its hash and size. After being
verified, it is used for the processing of its workunit. The rest of the process is unchanged from the
original BOINC.

    3.3 Experimental Results

      To test this new architecture, medium-scale experiments were performed, and various parameters
were considered when trying new scenarios. The results of these tests are presented in this section, along
with information on the testing infrastructure.

    3.3.1           Testbed

      The area of this research requires many machines to achieve meaningful results without having to
resort to a simulator. This requirement was met by the Grid'5000 project [59], which serves as an
experimental testbed for research in Grid Computing.
      The Grid'5000 project aims at building a highly reconfigurable, controllable and monitorable
experimental Grid platform gathering 9 sites geographically distributed in France, featuring a total of 5000
CPUs. It will reach 5000 CPUs in 2008, and counts around 3500 CPUs at the moment.
      Most of the experiments were conducted on the Orsay site, which was composed of 312 IBM eServer
326m machines, with dual-core AMD Opteron processors (246 or 250) and 2 GB of RAM. Nodes are
interconnected through PCI-X Gigabit Ethernet cards.
      The BitTorrent client used was, as mentioned before, Bram Cohen’s BitTorrent, version 5.0.7, as
well as its tracker (both available at http://www.bittorrent.com). The BOINC client version used was
5.8.8, and the server version 5.9.3 (available at http://boinc.berkeley.edu).
      In order to distribute the environment throughout the clients, Grid’5000 uses Kadeploy, which allows
a user to create a custom environment based on a default distribution (debian, fedora, ubuntu …). This
was particularly useful to share a BOINC client environment between the required nodes.

    3.3.2           BOINC Project

      To evaluate this new scenario against the original one, a base of comparison must be used. Therefore,
we had to develop a BOINC project that would be used by all the scenarios, creating a standard
environment, and reducing the potential disparities in tests.
      To create a BOINC project, several instructions are given at the BOINC site. After installing the
pre-requisite software and solving all the dependency problems, the BOINC source code was downloaded
and the server configured and compiled. To create a project, we used the make_project script, which:
            •   Creates the project directory and its subdirectories.
            •   Creates the project’s encryption keys.
            •   Creates and initializes the MySQL database.


            •   Copies source and executable files.
            •   Generates the project’s configuration file.

       The application used in the project was created based on example applications provided with
BOINC’s source code. The application has a single loop that performs a simple addition, to keep the
client busy for approximately 3 minutes on an Orsay node, using the original BOINC.
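A minimal sketch of such an application kernel is shown below; the iteration count is illustrative (in the experiments it was tuned so an Orsay node stays busy for roughly 3 minutes), and a real BOINC application would also link the BOINC API for checkpointing and credit reporting:

```cpp
// Minimal stand-in for the test application: a single loop performing a
// simple addition. The iteration count is tuned per machine; in the
// experiments it kept an Orsay node busy for roughly 3 minutes.
long long busy_work(long long iters) {
    volatile long long sum = 0;  // volatile keeps the loop from being optimized away
    for (long long i = 0; i < iters; ++i)
        sum = sum + 1;           // the simple addition that keeps the CPU busy
    return sum;
}
```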
       We then proceeded to create a work unit and a result template. The result template used was the
default provided with the code for an example application (Hello World). The work unit template,
however, was slightly changed, since we had to identify the input files there. The input file, whose size
was a variable, was created as a simple text file, which we extended or cut to obtain the desired size. It
was then referenced in the work unit template, along with the .torrent file, when testing the BitTorrent
scenario.
       Finally, we added the missing default back-end components to get the project running: a validator
and an assimilator. To start the project, it is only required to run bin/start in the project directory.

    3.3.3           Monitoring Tools

       To analyse the performance of this new scenario compared to the original BOINC, both BOINC logs
and an external monitoring tool, Ganglia, were used.
       BOINC already provides a few metrics, such as the weighted average of network throughput. It
also has a built-in logging system that saves transfer times, computation times, and other project-related
messages. We used some of that information to gather measures of the client’s performance.
       Ganglia [60] is a scalable distributed monitoring system for high-performance computing systems
such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies
on a multicast-based listen/announce protocol to monitor state within clusters, and uses a tree of point-to-
point connections amongst representative cluster nodes to federate clusters and aggregate their state.
       Ganglia is comprised of two components: the Gmon local-area monitoring system, and the Gmeta
wide-area system. The Gmon system operates at the cluster level and gathers metrics such as heartbeats,
hardware/operating system parameters, and user-defined key-value pairs from every node. Gmon uses
UDP multicast to exchange these metrics within a cluster. Therefore, each node has information on the
current values of every node in the cluster.
       This allowed us to use a simple Java program that contacts one of the nodes through ssh and
gathers the information into a temporary file, which is then parsed. By doing this, the Gmeta daemon did
not have to be used, and we still had access to all the information needed.

    3.3.4           Test Cases


    To evaluate the BitTorrent architecture, there were several measurements that needed to be taken,
such as:
            •   CPU usage on central server, to check for overhead
            •   Network Output on central server, to compare bandwidth used
            •   Computation times on clients, to measure overhead of BT client
            •   Distribution times of input files, to compare times against the original BOINC

    These measurements were made while considering two variables:
            •   Input file size
            •   Number of clients

    This could help us make a comparison between the original BOINC architecture and the new
BitTorrent scenario in terms of users and file sizes, the two most significant variables on BOINC data
transfer.

Server Load

    To evaluate the load on the central server we used two measures: CPU usage, and network output.
Both were obtained from Ganglia.
    CPU usage helps us measure the overhead caused by the BitTorrent tracker, while the server is
serving requests for a varying number of clients. Network Output shows us how much responsibility the
clients have in the distribution of the file.

                      Graph 1 – Server Network Output for 25 nodes and 30 MB file

    In our first experiment, using 25 nodes and a 30 MB file, we obtained very interesting results, shown
in Graph 1. The network output for BT BOINC was a little over 10% of the value for the original BOINC.
The time needed to distribute the file was also 3 seconds lower for BT BOINC. For the same
configuration, we analyzed the CPU usage – Graph 2.

                       Graph 2 - Server CPU usage for 25 nodes and 30 MB file

    In this test, we observed a spike in the server’s CPU usage just before the start of the upload of the
file. With the exception of that spike, the values were similar.
    We then proceeded to tests with different numbers of nodes and different file sizes.

                    Graph 3 - Server Network Output for 50 nodes and 30 MB file

    Tests with double the nodes (50), shown in Graph 3, reveal that there is still a considerable difference
in outgoing bandwidth on the central server, with BT BOINC using a little more than 5 MB/s whereas
normal BOINC uses almost 30 MB/s. In this case, BT BOINC finished distributing the files a little later.

                      Graph 4 - Server CPU usage for 50 nodes and 30 MB file

    Again, we witnessed a spike in CPU usage in the first part of the file transfer, corresponding to the
first phase, when transfer rates reached a 5 MB/s peak. CPU usage was a little over 50%, and not
much higher than in the case of 25 nodes. We decided to test this again in a scenario with bigger files,
while maintaining the same number of nodes.

                   Graph 5 – Server Network Output for 50 nodes and 40 MB file

    In the next experiment we kept using 50 nodes but increased the file size to 40 MB. Graph 5 shows
the network output for this case, where we can see a much lower network output from BT BOINC, after a
slow start probably due to the slower boot of some of the nodes. In the second phase, the difference in
bandwidth used is over 100%, which is surprisingly large, even considering the BitTorrent premise
of sharing the data distribution among clients. The server would be expected to use more of its
output bandwidth, so there may be a problem with the file transfer from the central server.

                        Graph 6 - Server CPU usage for 50 nodes and 40 MB file

    Here we can see that the peak is very similar regardless of the number of nodes or file size, with
values of up to 55%. It is related to the extra components on the server side: the BitTorrent client and the
tracker. Although not a dangerous value, it is nevertheless much higher than the original one.

    The server probes have shown that BT BOINC can distribute a file to all the nodes at the same speed
or faster than the original BOINC distributes it to its clients. It was not expected for the biggest gain in
distribution time to be achieved with the fewest nodes and a relatively large file: 25 nodes and a 30 MB
file. With a smaller file, the central server did not have to use all of its output bandwidth at the same time,
therefore reducing the difference from the bandwidth used by the server in BT BOINC.
    Another point worth noting: the server in BT BOINC never used more than 10 MB/s of its
output bandwidth. This throttle may be caused by the BitTorrent client itself, which can be extremely
limiting in the capacity to obtain better results. This pushes us to try new clients, or to use torrent libraries
to increase the server’s bandwidth contribution. If the problem is not caused by the client, but rather by
the protocol itself, the server should be stressed by placing it further away from the clients, to limit its
available bandwidth.
    In BitTorrent (and specifically in this scenario, as shown in the next subsection), the clients are
quick to start distributing the file amongst themselves, shortly after the server distributes its first pieces.
This also contributes to the lower server network output.

Client load

    In this subsection, the influence of this new architecture on the client is analyzed. We look at the
upload contribution of each node and its relation to the download, and compare CPU usage to determine
the overhead caused by the BitTorrent client.

                  Graph 7 – Download/Upload in a BT BOINC client with 30 MB file

    In the first experiment we measured the bandwidth contribution of a BT BOINC client with a 30 MB
file. We can see that the upload starts shortly after the download, and that it immediately reaches a peak
of 1 MB/s. Again, this suggests a limitation on the BitTorrent client’s part, since nodes in the Orsay
cluster of Grid’5000 are interconnected by Gigabit connections.

                             Graph 8 – CPU Usage in Client with 25 clients
    In the second experiment, we measured the CPU usage on a client from a few seconds before the data
transfer started until the computation was over. The CPU usage in the BT BOINC client is higher during
the data transfer, due to the BitTorrent client, as expected. It decreases slightly before the start of the
computation, characterized by the full use of the CPU.
    The computation is not significantly impaired by the BitTorrent client, which continues to run even
after the download has finished, seeding the file. On the contrary, it finishes before the unaltered client.
This may be a particular case, in which the BT BOINC client performed well and the normal BOINC
client was slower than average. For that reason, we decided to determine the average, minimum and
maximum computation times for different file sizes and numbers of nodes (see next subsection).
    In the following test, we decided to see whether there was a relation between CPU usage and the
number of clients sharing the file, so we increased the number of clients.

                             Graph 9 - CPU Usage in Client with 40 clients

                            Graph 10 – CPU Usage in Client with 50 clients

    Results after running experiments with 40 and 50 clients are presented in Graph 9 and Graph 10. In
both experiments, BT BOINC shows a more intensive use of the CPU, due to the BitTorrent client. These
values are in the 10–20% range, and there is no significant difference between results with fewer clients
and results using more concurrent clients.
    We can therefore conclude that there is no direct relation between number of nodes and CPU
overhead on the client.

Computation times

    Here, a graph with the computation time for various file sizes and number of clients is presented.
This allows us to identify more precisely just how much BOINC’s main objective (to perform
computation) is hindered by the BitTorrent client/protocol.
    The evaluation of this scenario was done with the help of the BOINC logs, which record the
beginning and end of each computation. A BT BOINC client keeps seeding the file for 1 minute after the
download, since computation time is normally slightly over 2 minutes. Seeding for half the computation
time seems reasonable, and a considerable value when one takes into account that some computations can
take many hours.

             Graph 11 – Computation time max, mean and min for BT and Normal BOINC

    Graph 11 presents the computation times observed after running experiments with 25 nodes and
varying file sizes. We can observe that BT BOINC has very similar computation times, with a slight
overhead in mean time of 1 to 2 seconds. We can deduce that, in this scenario, for every 2 minutes of
computation, with half that time spent seeding, there is an average overhead of 1.15 seconds, which is a
little under 1%. This is an acceptable value: in a two-hour computation with a seeding time of 1 hour, the
overhead would be 72 seconds.

3.4 Problems and Shortcomings

    There are some advantages and disadvantages to implementing a pure BitTorrent solution. The
advantages are many, for example, BitTorrent:
         •    Has proven itself to be an efficient and low-overhead means of distributing data;
         •    Can scale easily to large numbers of participants; and
         •    Has built-in functionality to ensure relatively equal sharing ratios [61].

    Some of these advantages, however, turn into disadvantages when trying to apply BitTorrent to a
volunteer computing platform. For example, because of its flat topology, BitTorrent only works if enough
nodes in its network are listening for incoming connections, which can prove problematic when
confronted with firewalls and NAT systems. Another potential disadvantage when applying BitTorrent to
a volunteer computing platform is its “tit-for-tat” sharing requirement, which forces most participants
to share on a scale relatively equal to what they are receiving. Although this proves quite effective at
preventing selfish file-sharing on traditional home networking systems, it is not necessarily a requirement
when applying P2P technologies to volunteer computing. For example, in the volunteer computing case,
not everyone may wish to be a BitTorrent node, yet they may still wish to offer their CPU time to a
project. In a pure tit-for-tat BitTorrent world, this would not be possible.

    This architecture helps reduce the load on the server and can improve distribution times in projects
with large data files. It can provide new opportunities for projects that were previously limited by
bandwidth issues on their server and, by improving the data distribution, speed up the scientific
research behind those projects.
    However, this approach is likely to be received with skepticism, if not resistance, for three main
    1.     Users are not willing to share their bandwidth when there is no direct benefit and the alternative
               works: clients are rewarded with credits by BOINC for completing computations of work
               units. If they can still receive those credits without having to share bandwidth, they may not
               wish to contribute and use this architecture. A possible solution would be to award credits
               for bandwidth uploaded, as it is done in other non-CPU intensive projects.
    2.     BitTorrent, like other P2P systems, is normally associated with piracy and illegal downloads,
               which taints its reputation; and
    3.     Besides motivation, security can also be an issue since, to operate in good conditions, ports must
               be opened, which increases users’ vulnerabilities (not necessarily because of the BitTorrent
               protocol itself).

 3.5 Future Work

    A possible and predictable next step would be the further decentralization of this model by
distributing the tracker itself, creating what is called a “trackerless” torrent, which uses Distributed Hash
Tables (DHTs) to share the tracking burden among the clients. As mentioned before, this is a very recent
research area, but one that, if well exploited, can eliminate the final central point of failure: with the
current architecture, ongoing BitTorrent transfers would not cease if the tracker went down, but clients
would no longer be able to discover new peers.
    As mentioned previously, new experiments should be run using a different BitTorrent client, since
the console version of the reference BitTorrent project may have an upload cap. It should not be too hard
to implement a version that uses Azureus, the Java client, for instance. Using a BitTorrent library like
libtorrent is another option, but it would require more changes to the BOINC client code.
    On the other hand, to limit the central server’s upload capacity and stress its network output, tests
should be run with the central server deployed in a different cluster. This would require the creation of a
new environment, which would entail installing the server software, resolving software dependencies and
creating a project. The clients would stay in the Orsay node. This would increase the stress on client-
server connections, but the inter-cluster Gigabit connection would not be easily saturated.

Chapter 4
       As stated and presented in the State of the Art, many CDNs exist today, both commercial and
academic. Commercial CDNs cannot be applied in this case, since their service would have to be paid for.

       Of all the academic CDNs presented, only two are open-source, Globule and Coral, and neither
could be directly applied to the BOINC scenario. There are two main reasons for this:
             •   They work on a pull-based approach: client requests are directed (either using DNS
                 redirection or URL rewriting) to their closest surrogate servers. If there is a cache miss,
                 surrogate servers pull content from the origin server.
             •   Request-routing algorithms are either DNS- or HTTP-based: DNS-based routing is done
                 per destination web site (the same routing is used for the whole site), based on modified
                 DNS servers; with HTTP redirection, a redirector adds an extra HTTP header that tells
                 the client to resubmit its request to another server.

       Furthermore, Globule is implemented as a third-party module for the Apache web server, which is
incompatible with BOINC’s architecture. Requests should be redirected by the scheduler since it is the
component responsible for choosing which data servers each client should download from.
       This does not mean, however, that certain mechanisms from each project cannot be applied in this
scenario. CoDeeN, for example, has a very interesting peer monitoring system, where two mechanisms
are used: a lightweight UDP-based heartbeat and a “heavier” HTTP/TCP-level “fetch” helper.

       The scenario presented here is therefore a simple implementation of a CDN on BOINC, with
limitations on certain components, such as the content distribution mechanisms. It uses, as mentioned
previously, ideas from other CDN projects, and adapts them to BOINC, in order to build a network where
surrogate servers are mirror data servers (in the current BOINC implementation). In future steps, it could
be applied to clients that share files through HTTP, with a few changes to the current system.

 4.1 Client Redirection

       In BOINC’s architecture, when a client requires more work, a request is issued and sent to the
scheduler, which then replies with the work units the client should execute. Each work unit is represented
in the scheduler reply (in XML) by a number of parameters: name, application, input files, and resource
estimates and bounds. Each input file has the URL of one or more (mirror) data servers from which it can
be downloaded. The scheduler thus works as the redirector of a CDN: it receives client requests and
redirects them to the appropriate server. This compelled us to use the scheduler as the redirector for this
scenario.
       In the current BOINC architecture, the use of mirrors is possible, and there is a very simple
mechanism to choose which mirror to use for the download of each file: the BOINC client reports its
machine’s timezone to the scheduler, which then chooses the server with the closest timezone. Should
two or more servers be equally close, a round-robin algorithm is used to choose one of them. We
improved upon this by using IP prefix matching to choose the closest mirror. The round-robin algorithm
is used in case two or more servers are equally close, and should there be no match at all between IPs, the
timezone mechanism is used as a last resort.
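The prefix-matching selection described above can be sketched as follows. This is a minimal illustration, not the actual scheduler code, and the function names are our own:

```python
import ipaddress

def common_prefix_len(ip_a, ip_b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    return 32 - diff.bit_length()

def pick_mirror(client_ip, mirror_ips):
    """Pick the mirror sharing the longest IP prefix with the client.
    Ties would fall through to round-robin (omitted here); a zero-bit
    match falls back to the timezone mechanism."""
    best = max(mirror_ips, key=lambda m: common_prefix_len(client_ip, m))
    if common_prefix_len(client_ip, best) == 0:
        return None  # no match at all: caller uses the timezone fallback
    return best
```

A client at 192.168.1.7 would thus be directed to a mirror at 192.168.1.200 (24 shared bits) rather than one at 10.0.0.1 (0 shared bits).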

    In BOINC, the list of mirrors is saved on the central server, in the project root directory, in a two-
column file called “download_servers”. The first column is an integer giving the server's offset in
seconds from UTC. The second column is the server URL, in a format such as
http://einstein.phys.uwm.edu. The download servers must have identical file hierarchies and contents,
and the path to files and executables must start with '/download/...', as in
'http://X/download/123/some_file_name'. For this scenario, we decided to take advantage of this
organization and use the file to identify the mirrors that would act as surrogates.
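As a sketch of how a scheduler component might load this two-column file (the parsing code is illustrative, not BOINC's own):

```python
def parse_download_servers(text):
    """Parse the download_servers file: each line holds an integer UTC
    offset in seconds and the mirror's base URL."""
    mirrors = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        offset, url = line.split(None, 1)
        mirrors.append({"utc_offset": int(offset), "url": url.strip()})
    return mirrors

def file_url(mirror, path):
    """Build a download URL; paths must start with /download/..."""
    return mirror["url"].rstrip("/") + path
```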

    A list of mirrors/surrogates is therefore initialized using this file, and saved in the scheduler. Each
mirror will be saved as an object with various attributes such as statistics and information on the server
status. This information will allow the scheduler to reply to each request with the best alternative to serve
the data at the moment of the request, as pictured in Figure 9.


                           Figure 9 – Scheduler as redirector on CDN BOINC

    To maintain reliable and smooth operation of this system, each mirror monitors its own health and
provides this data to the scheduler. The scheduler needs to continually know the state of the mirrors and
decide which of them should be used for request redirection. Therefore, this architecture includes mirror
health monitoring facilities.

 4.2 Health Monitoring

    To monitor the health and status of the mirrors, we used a mechanism similar to CoDeeN’s, with
both a UDP and an HTTP/TCP heartbeat. A program runs on each of the mirrors, gathering information
about the host’s state and environment to assess resource contention as well as external service
availability. This information is then collected through the heartbeats.

 4.2.1            Local Monitoring

    The local monitor examines the host’s resources, such as CPU usage and system load averages. From
the operating system and its utilities, the following values are gathered: node uptime and system load
averages (both via /proc), and system CPU usage (via “vmstat”). The load average is read every 30
seconds, whereas CPU usage is probed every 10 seconds and the 1-minute maximum is kept. Using the
maximum over 1 minute reduces fluctuations and, with 3 mirrors, exceeds the gap between successive
heartbeats from the scheduler (described below).
    CoDeeN avoids nodes reporting more than 95% system CPU time, which they found to be related to
kernel/scheduler problems, so we do the same here. Mirrors that exceed that threshold are not considered
for request redirection.
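A minimal sketch of such a local monitor's probing logic; the parsing targets the standard /proc/loadavg and vmstat output formats, and the class and function names are ours, not those of the actual monitor:

```python
import time
from collections import deque

def parse_loadavg(text):
    """Parse the 1-, 5- and 15-minute load averages from /proc/loadavg."""
    return tuple(float(x) for x in text.split()[:3])

def parse_vmstat_sy(vmstat_output):
    """Extract the system CPU time percentage ('sy' column) from one
    vmstat report (header line followed by a values line)."""
    lines = vmstat_output.strip().splitlines()
    header, values = lines[-2].split(), lines[-1].split()
    return int(values[header.index("sy")])

class RollingMax:
    """Keep the maximum of samples seen in the last `window` seconds,
    smoothing the 10-second CPU probes into a 1-minute figure."""
    def __init__(self, window=60.0):
        self.window = window
        self.samples = deque()  # (timestamp, value) pairs
    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.samples.append((now, value))
        while self.samples and self.samples[0][0] < now - self.window:
            self.samples.popleft()
    def max(self):
        return max(v for _, v in self.samples) if self.samples else None
```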

 4.2.2            Peer Monitoring

    As mentioned before, to monitor the mirrors’ health, the scheduler employs two mechanisms: a
UDP-based heartbeat and an HTTP/TCP-level heartbeat.

 UDP Heartbeat

    We use UDP heartbeats as a simple gauge of liveness, which helps us identify the mirrors to avoid.
UDP has low overhead and can be used when socket exhaustion prevents TCP-based communication.
Failure to receive acknowledgements (ACKs) is used to infer packet loss.
    The scheduler sends a heartbeat message every 15 seconds to one of the mirrors, which then responds
with information about its local state. The piggybacked load information includes the mirror’s load
average, system CPU time, and server uptime.
    Heartbeat acknowledgements can get delayed or lost, giving some insight into the current network
state. Acknowledgements received within 3 seconds are considered acceptable; this value can be adapted
to the usage scenario. In this case, with Grid’5000 as the testbed, the typical inter-node RTT was less
than 100 ms, so failure to receive an ACK within 3 seconds is considered abnormal. Information about
these late ACKs is kept to distinguish between overloaded mirrors/links and failed mirrors/links, for
which ACKs are never received.

    We followed CoDeeN’s policies on how to deal with missing ACKs. Any mirror that did not
acknowledge the most recent heartbeat is avoided, since it may have just died. Using a 5% loss rate as a
limit, and given the short-term nature of network congestion, any node missing 2 or more ACKs in the
past 32 is avoided, since that implies a loss rate above 6%. However, any node that responds to the most
recent 12 heartbeats is considered viable: even at a 5% packet loss rate there is a roughly 54% chance
(0.95^12 ≈ 0.54) of 12 consecutive successes, so such a mirror is likely to be usable.
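These policies can be expressed compactly. A sketch, with the rule ordering as described (the 12-consecutive-ACKs check overrides the 2-in-32 rule); the class name is ours:

```python
from collections import deque

class AckHistory:
    """Track ACK outcomes for one mirror over the last 32 heartbeats and
    decide whether the mirror should be avoided."""
    def __init__(self, size=32):
        self.history = deque(maxlen=size)  # True = ACK arrived in time
    def record(self, acked):
        self.history.append(bool(acked))
    def avoid(self):
        h = list(self.history)
        if not h:
            return False              # no data yet: do not avoid
        if not h[-1]:
            return True               # missed the most recent heartbeat
        if len(h) >= 12 and all(h[-12:]):
            return False              # 12 straight ACKs: considered viable
        if h.count(False) >= 2:
            return True               # >= 2 misses in 32 implies > 6% loss
        return False
```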
    By coupling the history of ACKs with their piggybacked local status information, the scheduler can
assess the health of the mirrors.

 HTTP/TCP Heartbeat

    The UDP-based heartbeat is useful for excluding some mirrors, but it cannot definitively determine a
mirror’s health, since it does not exercise some of the paths that may lead to service failures. For
example, there could be a failure in the mirror’s web server, causing HTTP connections to fail.
    The HTTP/TCP heartbeat in CoDeeN is much too complex for this case, since it picks one of its
presumed live peers to act as the origin server, and iterates through all of the possible peers as proxies
using the fetch tool.
    In our case, we used a simple mechanism to test the HTTP connection. Since the content distribution
mechanism is, as mentioned before, limited (described below), we use a reference file that we make sure
is always present on the servers. The name and location of this file are read from a file in the central
server’s project root directory: “http_heartbeat_file”. By guaranteeing that this file is never deleted from
the mirrors, one can test the HTTP connection by using wget [62] to check the HTTP response. A
response other than HTTP 200 OK marks the mirror as avoidable.

 4.3 Content Distribution

    Content outsourcing is performed using either a cooperative push-based, a non-cooperative pull-
based, or a cooperative pull-based approach.
    The non-cooperative pull-based approach is the most widely used: client requests are directed to
their closest surrogate servers and, in case of a cache miss, surrogate servers pull content from the origin
server. Akamai uses this approach. The cooperative pull-based approach differs in that surrogate servers
cooperate with each other to get the requested content in case of a cache miss. This approach is used by
CoralCDN, which implements a variation of a DHT.
    In the BOINC scenario, however, a client request must be satisfied immediately, in order not to delay
the computation. Therefore, a cooperative push-based approach would have to be used. This is based on
the pre-fetching of content to the surrogates. Content is pushed to the surrogate servers from the origin,
and surrogate servers cooperate to reduce replication and update cost. In this scheme, the CDN maintains
a mapping between content and surrogate servers, and each request is directed to the closest surrogate
server or otherwise the request is directed to the origin server.

    However, it is still considered a theoretical approach since it has not been used by any CDN
provider [63] [64]. In this first version of BOINC CDN, the content distribution is still handled “outside
the system”, as it is done with data transfers to mirrors. Project managers coordinate the distribution of
the files before placing the corresponding work units in the server.
    Future work may involve using a system to distribute files from the central server to mirrors, such as
FastReplica [65], and a mechanism to monitor the distribution of the files, and identify the location of
each one.

 4.4 Experimental Results

    Due to time and machine constraints (Grid’5000 was essentially used to test the BitTorrent
scenario), we limited our experiments to the evaluation of the new features, while monitoring the
overhead of the local monitoring program on the machine.
    These tests were performed on a small LAN, with one node acting as the central server and another
as a mirror. We observed that the scheduler on the central server detected when:
         •   The mirror machine was down: this was detected by both the UDP and the HTTP heartbeat;
         •   The mirror machine was up and the local monitoring program running, but Apache was
             down: the UDP heartbeat continued to gather information on the mirror without
             complaining, but the HTTP heartbeat detected the error and marked the node as avoidable;
         •   The mirror machine was up and Apache working, but the local monitoring program was not
             running: the UDP heartbeat detected a problem and marked the node as avoidable; the
             HTTP heartbeat verified the file could be downloaded and corrected the information.

    Running the local monitor consumes little CPU and memory, since it consists mostly of sleeps and
quick probes of a few system parameters. In this initial phase, where this would be applied to BOINC
mirror servers, network throughput should not be a problem either, since the periodic heartbeat messages
would be sent to only a few servers (BOINC projects usually do not have more than 5 mirrors, as in the
Einstein@Home case).

 4.5 Future Work

    The first priority is obviously to run tests on a distributed environment to obtain meaningful results
on this scenario. The Grid5000 network is a good candidate for the experimental runs, since it allows the
use of many nodes, which could be spread through different clusters to try out the request routing
mechanism. Other tests would include stressing a CDS to check if and when the scheduler would redirect
clients to alternative nodes, and when a node is deemed problematic.

    This first version can be applied to the current mirrors used by several BOINC projects such as
Einstein@Home, since it uses the file “download_servers” to identify the alternative data servers. A
future iteration will consist of adapting this architecture to use clients as data servers, instead of just pre-
defined mirrors. This would require changes in content distribution and management, and in the
monitoring mechanisms.
    If clients were used as Content Delivery Servers (CDSs), the distribution of files would have to be
automatic, without requiring user intervention. A possibility would be to use FastReplica, mentioned
earlier, where the file to be replicated is divided into n equal parts and one part is sent to each CDS. Each
CDS is informed of the other mirrors that contain a part of the file. After finishing the download from the
original data server, each CDS opens n-1 connections to send its part to the other CDSs.
    The choice of the CDSs to which to send the data takes into account: CDS disk space; the number of
files already hosted there; and CDS bandwidth. This would require a few additions to the local
monitoring mechanism. Each CDS would have to reply to the heartbeat with a list of the files it hosts, so
the scheduler could maintain an updated list. If a CDS were considered down (avoided), it would not be
considered when distributing files, besides being discarded during request routing.
    If FastReplica were used, each CDS would have to acquire information on disk space and bandwidth
used (and knowledge of its total bandwidth). The list of files on disk would also have to be updated with
each new file received.
    Since communication between the CDSs and the central server goes through the Internet, there is the
possibility of malicious behavior, such as a user acting as a man in the middle and posing as a CDS, or
simply sniffing the data to identify the characteristics of each server. To protect communication and
ensure that the source of the data is trusted, we must use encryption with an RSA key pair to identify the
scheduler. The private key would be kept in the scheduler and the public key would be stored in each
CDS (and not shared further). Every time the scheduler initiates communication with a CDS (during
startup, and again after a defined time period has passed), it generates a random session key and uses it
with an AES algorithm to encrypt the data. The session key would then be encrypted with the scheduler’s
private key and sent along with the encrypted data. After arriving at the CDS, the session key would be
decrypted with the stored public key and used to decrypt the data. The session key would be used
thereafter, until the scheduler determined it was time to use a new one.
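A sketch of the hybrid session-key idea, assuming the third-party `cryptography` package. One deviation from the scheme above: library APIs expose RSA public-key encryption (OAEP) rather than raw private-key encryption, so this sketch wraps the session key under the recipient's public key; the thesis's variant reverses the key roles so the CDS can authenticate the scheduler.

```python
# Assumes the third-party `cryptography` package (pip install cryptography).
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def new_session(recipient_public_key):
    """Generate a random AES session key and wrap it with RSA-OAEP."""
    session_key = AESGCM.generate_key(bit_length=128)
    wrapped_key = recipient_public_key.encrypt(session_key, OAEP)
    return session_key, wrapped_key

def encrypt_message(session_key, plaintext):
    """Encrypt one message under the session key with AES-GCM."""
    nonce = os.urandom(12)
    return nonce, AESGCM(session_key).encrypt(nonce, plaintext, None)

def decrypt_message(session_key, nonce, ciphertext):
    """Recover a message with the (unwrapped) session key."""
    return AESGCM(session_key).decrypt(nonce, ciphertext, None)
```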

Chapter 5
    In this research we have argued that the current centralized client/server architecture applied by
BOINC and other Desktop Grid systems for data distribution is limiting and costly, and these projects
would benefit from P2P data distribution technologies. Specifically, we have presented two approaches
for large-scale data management in Desktop Grid domains: one based directly upon the BitTorrent
protocol, and another employing a simple version of a Content Delivery Network.
    We will conclude by summarizing our lessons learned and presenting the contributions of this thesis.

 5.1 Lessons Learned

    The first lesson was that the current BitTorrent implementation can save the central server an
enormous amount of bandwidth when distributing files. When using BT BOINC, the central server
reduced its network output by over 100% of the original BOINC outgoing bandwidth, which was
surprising even considering the premise of clients sharing the data distribution.

    This led to the second lesson: the BitTorrent client used in this scenario was unable to sustain high
output bandwidth. This limitation severely undermines the potential of BitTorrent, since it cannot
compete with the much higher output of a central server. It is important to determine the cause of this
behavior. A possible culprit is the BitTorrent client itself, which may not be optimized for the
environment where it was deployed, or may not be prepared to sustain high output rates. Further
experiments should be conducted using a different BitTorrent client, or even a BitTorrent library such as
libtorrent, which reportedly can seed up to 3 times faster than a normal client.

    Considering the results obtained, and taking into account previous research on this subject conducted
using XtremWeb and Azureus, another option to improve the BT/HTTP network output ratio is to stress
the central server by placing it further away from the clients. In the XtremWeb tests, for example, the
FTP server tested against BitTorrent only reached outgoing transfer rates of up to 1 MB/s. That can
provide a more suitable scenario for BitTorrent to outperform other data distribution protocols.
    On the other hand, those same experiments also showed a BitTorrent limitation similar to the 1 MB/s
FTP cap. Since an Azureus client was used in that case, it is possible that the problem lies in the
BitTorrent protocol itself. This would mean that BitTorrent cannot take advantage of all the available
bandwidth, and is underused in scenarios where the initial seed has a good outbound connection. Further
experiments would help clarify this problem.

    Regarding the overhead caused by the BitTorrent components added to BOINC, the results showed
that the computation time on the client is practically unaffected. With a seeding time of approximately
half the computation time, the overhead caused was a little under 1% of the total computing time. This is
a very acceptable result: in a two-hour computation with a seeding time of 1 hour, the overhead would be
72 seconds. CPU usage for BT BOINC was generally higher during the download phase, with increases
in the 10 to 20% range. This is also an acceptable value, considering that the computer would normally
be idle during these downloads.

    On the server side, the results were not as promising. The joint use of both the BitTorrent client and
a tracker caused spikes in CPU usage that reached 50%, which can be problematic should the server
require more computing power for other tasks. However, the spikes were short, which suggests an initial
startup cost when establishing connections to the other peers, followed by a slowing-down period.

    On the CDN scenario, we were able to develop the first version of a model that can be used to test
different server-selection algorithms and mechanisms. Using the heartbeats combined with local
monitoring can provide a wealth of data on the mirrors serving the data. In this version, the analysis of
this data is done in a very basic manner, with thresholds on a limited set of probed components. In the
future, the number of probes could be increased and adaptive algorithms could be used to determine
which servers should serve each request.

    The current version is able to detect when a mirror is down, either through the faster UDP or the
more reliable HTTP heartbeats, adding an important feature to the very limited mirror selection
algorithm in current BOINC. The other features should be tested in a medium- or large-scale
environment to evaluate the advantages in a scenario closer to a real-world setting.

 5.2 Contribution

    The work and results presented in this dissertation made several contributions:
         •     A BOINC architecture with BitTorrent capabilities, with a fully fledged prototype
               implemented that can easily be altered to support different types of BitTorrent clients;
         •     The description of a BOINC model with the characteristics of a CDN. The prototype
               created has taken the first steps towards that goal, but it is still an early version;
         •     The identification of a possible BitTorrent limitation, either in the client or in the protocol;
         •     The confirmation that network bandwidth is saved on the central server when using BT
               BOINC, although the results showed a larger difference than expected;
         •     The identification of the overhead in system usage and computation time when using a
               BitTorrent client incorporated in the BOINC client.

 [1] D. P. Anderson and G. Fedak. "The Computational and Storage Potential of Volunteer
 Computing". IEEE/ACM International Symposium on Cluster Computing and the Grid, Singapore,
 May 16-19, 2006

 [2] BOINC: http://boinc.berkeley.edu/

 [3] SETI@Home project: http://setiathome.berkeley.edu/

 [4] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer. ”SETI@home: An
 experiment in public-resource computing”. Communications of the ACM, Nov. 2002, Vol. 45 No. 11,
 pp. 56-61

 [5] David Anderson. "BOINC: A System for Public-Resource Computing and Storage". In Proc. 5th
 IEEE/ACM International Workshop on Grid Computing, 2004

 [6] K. Shudo, Y. Tanaka, S. Sekiguchi. “P3: P2P-based middleware enabling transfer and aggregation
 of computational resources”. In Cluster Computing and the Grid, CCGrid 2005

 [7] J. Verbeke, N. Nadgir, G. Ruetsch, I. Sharapov. “Framework for Peer-to-Peer Distribution
 Computing in a Heterogeneous, Decentralized Environment”, Sun Microsystems, Inc., Palo Alto,
 California, 2002

 [8] Arijit Ganguly, Abhishek Agrawal, P. Oscar Boykin, Renato Figueiredo. "WOW: Self-Organizing
 Wide Area Overlay Networks of Virtual Workstations". In Proceedings of the 15th IEEE
 International Symposium on High Performance Distributed Computing (HPDC), Paris 2006

 [9] A. Chien, B. Calder, S. Elbert, K. Bhatia. “Entropia: architecture and performance of an enterprise
 desktop grid system”. Journal of Parallel and Distributed Computing, Volume 63, Issue 5, May 2003,
 Pages 597-610, Special Issue on Computational Grids

 [10] V. Lo, D. Zappala, Y. Liu, S. Zhao. “Cluster Computing on the Fly: P2P Scheduling of Idle
 Cycles in the Internet”. In Proc. 3rd International Workshop on Peer-to-Peer Systems, San Diego,
 CA, 2004

 [11] D. Zhou, V. Lo. “Wave Scheduler: Scheduling for Faster Turnaround Time in Peer-based
 Desktop Grid Systems”. In Proc.11th Workshop on Job Scheduling Strategies for Parallel Processing
 (ICS 2005), Cambridge, MA, 2005

[12] M. Litzkow, M. Livny, M. Mutka. “Condor - A Hunter of Idle Workstations”. 8th International
Conference on Distributed Computing Systems (ICDCS). Pages 104-111, Washington, DC, 1988

[13] D. Epema, M. Livny, R. van Dantzig, X. Evers, J. Pruyne. “A worldwide flock of Condors: Load
sharing among workstation clusters”. Future Generation Computer Systems, May 1996

[14] A. Butt, R. Zhang, Y. C. Hu. "A self-organizing flock of Condors". Journal of Parallel and
Distributed Computing, Volume 66, Issue 1, Pages 145-161, Jan 2006

[15] W. Gentzsch. “Sun Grid Engine: Towards Creating a Compute Power Grid”. 1st International
Symposium on Cluster Computing and the Grid, 2001

[16] A. Luther, R. Buyya, R. Ranjan, S. Venugopal. “Alchemi: A .NET-based Desktop Grid
Computing Framework”. In High Performance Computing: Paradigm and Infrastructure, Laurence
Yang and Minyi Guo (editors), Wiley Press, 2004

[17] F. Cappello, S. Djilali, G. Fedak, T. Herault, F. Magniette, V. Neri, O. Lodygensky. “Computing
on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and
convergence with Grid”. Future Generation Computer Systems, 2004

[18] D. Schwartz, B. Sterman. “NAT traversal in SIP”. Kayote Networks, September 2005

[19] KaZaA: http://www.kazaa.com

[20] J. Liang, R. Kumar, K. W. Ross. “The kazaa overlay: A measurement study”. Computer
Networks 49, 6, Oct. 2005

[21] P. Cozza, C. Mastroianni, D. Talia, and I. Taylor. “A Super-Peer Model for Multiple Job
Submission on a Grid”. Tech. Rep. TR-0067, Institute on Knowledge and Data Management,
CoreGRID Network of Excellence, January 2007

[22] David P. Anderson, John McLeod VII. “Local Scheduling for Volunteer Computing”. Workshop
on Large-Scale, Volatile Desktop Grids (PCGrid 2007) held in conjunction with the IEEE
International Parallel & Distributed Processing Symposium (IPDPS), Long Beach, March 30, 2007

[23] Skype: http://www.skype.com

[24] BitTorrent: http://www.bittorrent.com

[25] Einstein@home: http://einstein.phys.uwm.edu/

[26] Climateprediction.net: http://climateprediction.net/

[27] Unofficial BOINC Wiki: http://boinc-wiki.ath.cx/

[28] Carl Christensen, Tolu Aina, David Stainforth. “The Challenge of Volunteer Computing With
Lengthy Climate Modelling Simulations”. In Proceedings of the 1st IEEE Conference on e-Science
and Grid Computing, Melbourne, Australia, 5-8 Dec 2005

[29] B. Cohen. “Incentives build robustness in BitTorrent”. In Proceedings of IPTPS, 2003

[30] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. Felber, A. Al Hamra, and L. Garces-Erice.
“Dissecting BitTorrent: Five Months in a Torrent's Lifetime”. Passive and Active Measurements 2004,
April 2004

[31] Baohua Wei, Gilles Fedak and Franck Cappello. “Collaborative Data Distribution with
BitTorrent for Computational Desktop Grids”. ISPDC'05 Lille, France, 2005

[32] B. Wei , G. Fedak , F. Cappello. “Scheduling Independent Tasks Sharing Large Data Distributed
with BitTorrent”. 6th IEEE/ACM International Workshop on Grid Computing, Seattle, USA, 2005

[33] Grid5000: http://www.grid5000.fr/

[34] SimBOINC: http://simboinc.gforge.inria.fr

[35] Martin Placek and Rajkumar Buyya. “A Taxonomy of Distributed Storage Systems”. Technical
Report, GRIDS-TR- 2006-11, Grid Computing and Distributed Systems Laboratory. The University of
Melbourne, Australia. July 3, 2006

[36] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H.
Weatherspoon, W. Weimer, C. Wells and B. Zhao. “Oceanstore: An architecture for global-scale
persistent storage”. In the 9th International Conference on Architectural Support for Programming
Languages and Operating Systems, 2000.

[37] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel and D. C. Steere. “Coda:
A highly available file system for a distributed workstation environment”. IEEE Transactions on
Computers 39, 4, 447–459. April 1990

[38] B. Y. Zhao, J. Kubiatowicz, and A. Joseph. “Tapestry: An Infrastructure for Fault-tolerant Wide-
area Location and Routing”. UCB Tech Report UCB/CSD-01-1141, University of California,
Berkeley, 2001.

[39] A. Adya, W. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch,
M. Theimer and R. P. Wattenhofer. "FARSITE: Federated, Available and Reliable Storage for an
Incompletely Trusted Environment". In the Proceedings of the 5th Symposium on OSDI, 2002

[40] C. A. Thekkath, T. Mann, and E. K. Lee. “Frangipani: A Scalable Distributed File System”. In
Proceedings of the 16th ACM Symposium on Operating Systems Principles, St.-Malo, France, Oct 1997

[41] Sudharshan S. Vazhkudai, Xiaosong Ma, VincentW. Freeh, JonathanW. Strickland, Nandan
Tammineedi, Stephen L. Scott. “FreeLoader: Scavenging Desktop Storage Resources for Scientific
Data”. ACM/IEEE SC 2005 Conference (SC’05), 2005

[42] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. “The Google File System”. In SOSP
’03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29–43,
New York, NY, USA, 2003. ACM Press

[43] Gnutella Project: http://www.gnutella.com/

[44] S. Androutsellis-Theotokis and D. Spinellis. “A survey of peer-to-peer content distribution
technologies”. ACM Computing Surveys 36, 4, 335–371. 2004

[45] S. Venugopal, R. Buyya and K. Ramamohanarao. “A taxonomy of data grids for distributed data
sharing, management, and processing”. ACM Computing Survey 28. 2006

[46] Al-Mukaddim Khan Pathan and Rajkumar Buyya. "A Taxonomy and Survey of Content Delivery
Networks". Technical Report, GRIDS-TR-2007-4, Grid Computing and Distributed Systems
Laboratory, The University of Melbourne, Australia. February 12, 2007

[47] Michael J. Freedman, Eric Freudenthal, and David Mazières. “Democratizing Content
Publication with Coral”. In Proc. 1st USENIX/ACM Symposium on Networked Systems Design and
Implementation (NSDI '04) San Francisco, CA, March 2004

[48] L. Wang, K. Park, R. Pang, V. S. Pai, and L. Peterson. “Reliability and security in the CoDeeN
content distribution network”. In Proceedings of the USENIX Annual Technical Conference, Boston,
MA, June 2004

[49] K. S. Park and V. S. Pai, “Scale and Performance in the CoBlitz Large-File Distribution Service”.
In Proceedings of the 3rd Symposium on Networked Systems Design and Implementation (NSDI
2006), San Jose, CA, USA, May 2006

[50] G. Pierre, and M. van Steen. “Globule: A Collaborative Content Delivery Network”. IEEE
Communications, Vol. 44, No. 8, August 2006

[51] B. Huffaker, M. Fomenkov, D. J. Plummer, D. Moore and K. Claffy. “Distance Metrics in the
Internet”. In Proceedings of IEEE International Telecommunications Symposium, IEEE CS Press, Los
Alamitos, CA, USA, 2002

[52] Akamai Technologies, Inc.: http://www.akamai.com

[53] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl. “Globally Distributed
Content Delivery”. IEEE Internet Computing, pp. 50-58, September/October 2002

[54] M. Gordon. “The Internet Streaming Media Boom: A Powerful Trend that Represents
Fundamental Change”. Limelight Networks White paper, 2007. http://www.limelightnetworks.com

[55] Bram Cohen. “Incentives build robustness in BitTorrent”. In Proceedings of IPTPS, 2003

[56] Gheorghe Silaghi, Luis M. Silva, Patricio Domingues, Alvaro E. Arenas. “Tackling the
Collusion Threat in P2P-Enhanced Internet Desktop Grids”. Presented in CoreGRID Workshop on
Grid programming model, Grid and P2P systems architecture and Grid systems, tools and
environments, Crete, Greece, 2007

[57] Danny Bickson and Dahlia Malkhi. “The Julia Content Distribution Network”. 2nd Usenix
Workshop on Real, Large Distributed Systems (WORLDS ’05), San Francisco, USA, December 2005

[58] J. Li, J. Stribling, R. Morris, M. F. Kaashoek, and T. M. Gil. “A performance vs. cost framework
for evaluating DHT design tradeoffs under churn”. IEEE Conference on Computer Communications
(INFOCOM), 2005

[59] Grid’5000: http://www.grid5000.fr/

[60] M. L. Massie, B. N. Chun, and D. E. Culler. “The Ganglia Distributed Monitoring System:
Design, Implementation, and Experience”. Parallel Computing, Vol. 30, pp. 817-840, July 2004

[61] M. Izal, G. Urvoy-Keller, E.W. Biersack, P. A. Felber, A. A. Hamra, and L. Garces-Erice.
“Dissecting BitTorrent: Five Months in a Torrent’s Lifetime”. In Proceedings of Passive and Active
Measurements (PAM), 2004

[62] GNU wget: http://www.gnu.org/software/wget/wget.html

[63] N. Fujita, Y. Ishikawa, A. Iwata, and R. Izmailov. “Coarse-grain Replica Management Strategies
for Dynamic Replication of Web Contents”. Computer Networks: The International Journal of
Computer and Telecommunications Networking, Vol. 45, Issue 1, pp. 19-34, May 2004

[64] Y. Chen, L. Qiu, W. Chen, L. Nguyen, and R. H. Katz. “Efficient and Adaptive Web Replication
using Content Clustering”. IEEE Journal on Selected Areas in Communications, Vol. 21, Issue 6, pp.
979-994, August 2003

[65] L. Cherkasova and J. Lee. “FastReplica: Efficient large file distribution within Content Delivery
Networks”. In Proceedings of the 4th USITS, Seattle, WA, March 2003


Fernando Costa, Luis Silva, Ian Kelley, and Ian Taylor. "Peer-to-Peer Techniques for Data
Distribution in Desktop Grid Computing Platforms". Presented in a CoreGRID Workshop, Crete,
Greece, June 2007

Fernando Costa, Luis Silva. "Optimizing the Data Management Layer of BOINC with BitTorrent". To
be published as a Technical Report of the CoreGRID Project (http://www.coregrid.net)
