        Dynamic Load Balancing of processes in a peer-to-peer network
                         Project Work Report

 Submitted in partial fulfillment of the requirements for the degree of

                BACHELOR OF TECHNOLOGY in

                    COMPUTER ENGINEERING


    GAURAV T MOGRE                                AVINASH H

         (06CO36)                                  (06CO72)

                        Under the Guidance of
                       Mr. Alwyn Roshan Pais
                           Senior Lecturer

                SURATHKAL, MANGALORE – 575 025
                              April, 2010


       We hereby declare that the Project Work Report entitled “Dynamic Load Balancing of
processes in a peer-to-peer network” which is being submitted to the National Institute of
Technology Karnataka, Surathkal for the award of the degree of Bachelor of Technology in
Computer Engineering is a bonafide report of the work carried out by us. The material
contained in this Project Report has not been submitted to any University or Institution for the
award of any degree.

1. Gaurav T Mogre (06CO36)

2. Avinash H (06CO72)

Place: NITK, Surathkal                                                Date:


        This is to certify that the project entitled “Dynamic Load Balancing of processes in a peer-to-
peer network” was carried out by Gaurav T Mogre (Reg.No.06CO36) and Avinash H
(Reg.No.06CO72) in partial fulfillment of the requirements for the award of the Degree of Bachelor of
Technology in Computer Engineering of National Institute of Technology Karnataka, Surathkal during
the academic year 2009-2010.

Mr. Alwyn Roshan Pais                                                        Dr. Santhi Thilagam

Project Guide                                                               Head of Department

Place: NITK Surathkal



        The successful completion of our project gives us an opportunity to convey heartfelt
regard to each and every one who has been instrumental in shaping the final outcome of this
project.

        We would like to thank Mr. Alwyn Roshan Pais, Senior Lecturer, Department of
Computer Engineering, NITK - Surathkal for his able guidance and encouragement during our
project work which was of great help in the successful completion of the project.

      We thank Dr. Santhi Thilagam, Professor and HOD, Department of Computer
Engineering, NITK Surathkal for her support and for helping us utilize the resources available
in the Department Lab.

        Finally we would like to thank everyone who was a great source of help during various
stages of the project.


The field of high performance computing continues to gain importance. Much of the
research in this field has gone into specialized applications. Our objective in this
project was to harness the power of distributed computing for everyday applications used
by regular users. The main aim of the project is therefore to introduce high performance
computing concepts into the realm of personal computing.

The aim of the project is to allow processes to migrate from one computer to another
transparently. This includes deciding whether to migrate a process, finding the target
computer that would execute it the fastest, migrating the process to that computer, and
building an environment so that the migrated process runs in an environment similar to
that of the host machine. This problem statement has been studied before for systems
such as Unix, and partial solutions for some sub-problems have been discovered. However,
we aim to provide an integrated solution on Linux, an operating system that is more
widely used.

To solve the problem, each sub-problem must be tackled individually. To evaluate the
performance of a machine, we must find the speed of the machine at no load (how fast the
machine runs natively) and the current load on the machine. To balance load effectively,
a distributed peer-to-peer algorithm was devised such that the load is balanced equally.
The problem of actual process migration was also broken down, and various alternatives
were studied.

We see that using the system to migrate a CPU-intensive process from a computer at 90%
load to a lightly loaded computer led to a nearly 10% reduction in overall execution
time.


ABSTRACT
LIST OF TABLES/FIGURES/CHARTS
     List of Tables
     List of Figures
NOMENCLATURE/ABBREVIATIONS

1.     INTRODUCTION
     1.1.       Generalized Process migration objectives
     1.2.       Problems with Process Migration

2.     LITERATURE STUDY
     2.1.       Study of Existing Systems
       2.1.1.     MOSIX
       2.1.2.     Sprite Network OS
       2.1.3.     Mach
       2.1.4.     LSF and Condor
     2.2.       Comparison of Existing Systems
     2.3.       Observations about these systems

3.     PROCESS MIGRATION: THE PROBLEM STATEMENT
     3.1.       Goals
     3.2.       Important Definitions and Terminologies
     3.3.       Modules of the Solution
     3.4.       Software Requirement Specification
       3.4.1.     External Interface Requirements
       3.4.2.     Functional Requirements
       3.4.3.     Performance Requirements
       3.4.4.     Design Constraints
       3.4.5.     Software system attributes

4.     LOAD MEASUREMENT
     4.1.       Foundation of collection
       4.1.1.     Static Data Acquisition
       4.1.2.     CPU Performance: Measurement vs. benchmarking
     4.2.       Benchmarking CPU
       4.2.1.     Principal Component Analysis
       4.2.2.     Processor Characteristics Measurement
       4.2.3.     Running the Benchmarks
       4.2.4.     Clustering
       4.2.5.     Benchmarks obtained
       4.2.6.     Implementing these benchmarks
     4.3.       Measuring System Load
       4.3.1.     Tools for measuring system load
       4.3.2.     Parameters to consider
       4.3.3.     Getting and parsing the information
     4.4.       Combining: The System Metric
       4.4.1.     Distribution Metric

5.     LOAD BALANCING
     5.1.       Generalized Load Balancing
       5.1.1.     Taxonomy of Load Balancing Algorithms
       5.1.2.     Representation of Relationships
       5.1.3.     Types of Load Balancing
     5.2.       Hierarchical Load Balancing
       5.2.1.     The Basic Algorithm
       5.2.2.     Embedding the Directed Acyclic Graph
       5.2.3.     Modifications to the algorithm for DAG embedding
       5.2.4.     Embedding a DAG
       5.2.5.     Conclusion about Hierarchical Algorithm
     5.3.       Symmetric Undirected Hierarchical Algorithm
       5.3.1.     Overcoming DAG obstacles
       5.3.2.     Algorithm
       5.3.3.     Load Balancing Policies

6.     PROBING
     6.1.       Algorithm
       6.1.1.     Sending Probe
       6.1.2.     Propagating the Probe
       6.1.3.     Receiving probe reply
     6.2.       Effect of Parameters on Probing

7.     PROCESS MIGRATION
     7.1.       Process Creation
       7.1.1.     Intercepting fork()
       7.1.2.     Intercepting execve()
     7.2.       Intercepting System Calls
       7.2.1.     Bash
       7.2.2.     GlibC
       7.2.3.     Kernel System Calls Interface
       7.2.4.     Kernel functions
       7.2.5.     Conclusion on Interception Points
     7.3.       Intercepting __libc_fork()
       7.3.1.     Implementation
     7.4.       Intercepting __execve()
       7.4.1.     Implementation
       7.4.2.     Further hooks after process migration
     7.5.       Implementation of Listener
       7.5.1.     Shared Memory Structures

8.     RESULTS

9.     CONCLUSION
     9.1.       Scope for Further Studies

APPENDIX
     Appendix I: Intercepted functions of glibc
       Fork():
       Execve():
     Appendix II: Test Suite
       Brute:
       Prime:
       HelloWorld

References


List of Tables

Table 1 Comparison of Existing Solutions
Table 2 Benchmark Types in SPEC2006
Table 3 Clustered integer benchmarks

List of Figures

Figure 1 Architecture of MOSIX
Figure 2 Mach Process Migration
Figure 4 Configuration Wizard Prototype
Figure 5 Pairing request Prototype
Figure 6 CINT k clustering of SPEC2006
Figure 7 Taxonomy of Load Balancing Algorithms
Figure 8 Example of preliminary representation
Figure 9 Topology of Hierarchical Load Balancing
Figure 10 Sample Graph to show Embedding
Figure 11 Example Tree to show Tree Embedding
Figure 12 Example Directed Acyclic Graph to show DAG embedding
Figure 13 Comparison of Load Balancing Strategies
Figure 14 Results of Process Migration


    Jiffies: a jiffy is the duration of one tick of the system timer interrupt
    DAG: Directed Acyclic Graph
    System Metric: A measure of execution time of a process arriving at the system
    Distribution Metric: The system metric + overhead (network delays)
    PCA: Principal Component Analysis
    Glibc: GNU libc, the C standard library for the GNU platform.


The field of high performance computing is continuously gaining importance. This is
because of the increase in the power of communications, which has made it possible to
harness the power of distributed computing. However, we see that a lot of the research is
focused on specific areas of interest, such as multi-core processors, parallel
implementations of various algorithms, or applications to specific scientific problems.

We wish to shift this paradigm in our project. Our project is geared towards bringing high
performance computing to regular computer applications used by normal home users.
The project works under the assumption that there always exists a scenario in which a
particular process would run better on one system than on another. It is on this
assumption that we build a system that transfers a process to another, more “favorable”
system for execution.

   1.1.    Generalized Process migration objectives

Process migration enables:

       dynamic load distribution, by migrating processes from overloaded nodes to less
       loaded ones
       fault resilience, by migrating processes from nodes that may have experienced a
       partial failure
       improved system administration, by migrating processes from the nodes that are
       about to be shut down or otherwise made unavailable
       data access locality, by migrating processes closer to the source of some data.

   1.2.    Problems with Process Migration

Despite the advantages of process migration, we see that it has not been adopted
commercially. This is mainly because of the following reasons:

       Complexity of adding transparent migration to systems originally designed to run
       stand-alone
       Infeasibility of creating new systems with process migration as a design objective
       No compelling commercial argument for operating system vendors to support
       process migration

In spite of these barriers, process migration continues to attract research. We believe that
the main reasons are the potential offered by mobility and the attraction to hard
problems that is inherent to the research community.

   2.1.    Study of Existing Systems

A number of solutions have been implemented which have tried to address the problem
of process migration. We studied these systems to understand their characteristics,
features, as well as their deficits. The systems that were studied were:

   1. MOSIX
   2. Sprite
   3. Mach
   4. LSF (Load Sharing Facility) and Condor Checkpoint migration library

   2.1.1. MOSIX

MOSIX is a management system for clusters and organizational grids that provides a
single-system image (SSI), i.e. the equivalent of an operating system for a cluster as a
whole. The MOSIX system evolved from the transformation of a single-user operating
system into a distributed operating system.

Objectives of MOSIX

       Dynamic process migration: At context switch time, a MOSIX node may elect to
       migrate any process to another node. The migrated process is not aware of the
       migration.
       Single system image: MOSIX presents a process with a uniform view of the file
       system, devices and networking facilities regardless of the process's current
       location.
       Autonomy of each node: Each node in the system is independent of all other
       nodes and may selectively participate in the MOSIX cluster or deny services to
       other nodes. Diskless nodes in MOSIX rely on a specific node for file services.
       Dynamic configuration: MOSIX nodes may join or leave a MOSIX cluster at any
       time. Processes that are not running on a node or using some node-specific
       resource are not affected by the loss of that node.
       Scalability: System algorithms avoid using any global state. By avoiding
       dependence on global state or centralized control, the system enhances its ability
       to scale to a large number of nodes.

Process migration architecture in MOSIX

                                Figure 1 Architecture of MOSIX

The system architecture separates the UNIX kernel into a lower and an upper kernel.
Each object in MOSIX, like an open file, has a universal object pointer that is unique
across the MOSIX domain. Universal objects in MOSIX are kernel objects (e.g. a file
descriptor entry) that can reference an object anywhere in the cluster. For example, the
upper kernel holds a universal object for an open file; the universal object migrates with
the process while only the host of the file has the local, non-universal file information.
The upper kernel provides a traditional UNIX system interface. It runs on each node and
handles only universal objects. The lower kernel provides normal services, such as device
drivers, context switching, and so on without having any knowledge or dependence on
other nodes. The third component of the MOSIX system is the linker, which maps
universal objects into local objects on a specific node, and which provides internode
communication, data transfer, process migration and load balancing algorithms. When
the upper kernel needs to perform an operation on one of the universal objects that it is
handling, it uses the linker to perform a remote kernel procedure call on the object's host
node. (1)

   2.1.2. Sprite Network OS

The primary goal of Sprite was to treat a network of personal workstations as a time-
shared computer, from the standpoint of sharing resources, but with the performance
guarantees of individual workstations. It provided a shared network file system with a
single-system image and a fully-consistent cache that ensured that all machines always
read the most recently written data. The kernel implemented a UNIX-like procedural
interface to applications; internally, kernels communicated with each other via a kernel-
to-kernel RPC. User-level IPC was supported using the file system, with either pipes or a
more general mechanism called pseudo-devices. Virtual memory was supported by
paging a process's heap and stack segments to a file on the local disk or a file server. (2)

Objectives of Sprite

       Workstation autonomy: Local users had priority over their workstation. Dynamic
       process migration, as opposed to merely remote invocation, was viewed primarily
       as a mechanism to evict other users' processes from a personal workstation when
       the owner returned. In fact, without the assurance of local autonomy through
       process migration, many users would not have allowed remote processes to start
       on their workstation in the first place.
       Location transparency: A process would appear to run on a single workstation
       throughout its lifetime.
       Using idle cycles: Migration was meant to take advantage of idle workstations,
       but not to support full load balancing.
       Simplicity: The migration system tried to reuse other support within the Sprite
       kernel, such as demand paging, even at the cost of some performance. For
       example, migrating an active process from one workstation to another would
       require modified pages in its address space to be written to a file server and
       faulted in on the destination, rather than sent directly to the destination.

Design of Sprite

Transparent migration in Sprite was based on the concept of a home machine. A foreign
process was one that was not executing on its home machine. Every process appeared to
run on its home machine throughout its lifetime, and that machine was inherited by
descendants of a foreign process as well. Some location-dependent system calls by a
foreign process would be forwarded automatically, via kernel-to-kernel RPC, to its home;
examples include calls dealing with the time-of-day clock and process groups. Numerous
other calls, such as fork and exec, required cooperation between the remote and home
machines. Finally, location-independent calls, which included file system operations,
could be handled locally or sent directly to the machine responsible for them, such as a
file server. Foreign processes were subject to eviction — being migrated back to their
home machine — should a local user return to a previously idle machine.

When a foreign process migrated home, it left no residual dependencies on its former
host. When a process migrated away from its home, it left a shadow process there with
some state that would be used to support transparency. This state included such things as
process identifiers and the parent-child relationships involved in the UNIX wait call.
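
The way Sprite serviced system calls for a foreign process, as described above, can be
illustrated with a small lookup table. This is only a sketch of the three categories named
in the text; the call names in the table and the names Route and route_foreign_call are
hypothetical and are not taken from the Sprite kernel.

```python
from enum import Enum


class Route(Enum):
    """Where a system call issued by a foreign process is serviced."""
    FORWARD_HOME = "forwarded to the home machine via kernel-to-kernel RPC"
    COOPERATE = "handled jointly by the remote and home machines"
    LOCAL_OR_SERVER = "handled locally or sent to the responsible machine"


# Hypothetical call names, chosen only to mirror the examples in the text:
# time-of-day and process-group calls, fork/exec, and file system operations.
ROUTES = {
    "gettimeofday": Route.FORWARD_HOME,
    "getpgrp": Route.FORWARD_HOME,
    "fork": Route.COOPERATE,
    "exec": Route.COOPERATE,
    "read": Route.LOCAL_OR_SERVER,
    "write": Route.LOCAL_OR_SERVER,
}


def route_foreign_call(name):
    """Return the routing category for a call made by a foreign process.

    Raises KeyError for calls that this illustrative table does not list.
    """
    return ROUTES[name]
```

A caller would look up the category before dispatching, for example
route_foreign_call("fork") yields Route.COOPERATE.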

   2.1.3. Mach

Mach is a microkernel developed at the Carnegie Mellon University and later at the OSF
Research Institute. A migration mechanism on top of the Mach microkernel was
developed at the University of Kaiserslautern, from 1991 to 1993. Task migration was
used for experiments with load distribution. In this phase, only tasks were addressed,
while UNIX processes were left on the source machine. This means that only Mach task
state was migrated, whereas the UNIX process state that was not already migrated as a
part of the Mach task state (e.g. state in the UNIX “personality server” emulating UNIX
on top of the Mach microkernel) remained on the source machine. Therefore, most of the
UNIX system calls were forwarded back to the source machine, while only Mach system
calls were executed on the destination machine.

                              Figure 2 Mach Process Migration

Goals of Mach

      To provide a transparent task migration at a user level with minimal changes to
      the microkernel
      To demonstrate that it is possible to perform load distribution at the microkernel
      level, based on three distinctive parameters that characterize microkernels:
      processing, VM and IPC

Design of Mach Process Migration

The design of task migration is affected by the underlying Mach microkernel. Mach
supported various powerful OS mechanisms for purposes other than task and process
migration. Examples include Distributed Memory Management (DMM) and Distributed
IPC (DIPC). DIPC and DMM simplified the design and implementation of task
migration. DIPC takes care of forwarding messages to migrated processes, and DMM
supports remote paging and distributed shared memory. The underlying mechanisms for
message redirection and distributed memory management are heavily exercised by task
migration, exposing problems otherwise not encountered. This is in accordance with
earlier observations about message-passing.

   2.1.4. LSF and Condor

LSF provides some distributed operating system facilities, such as distributed process
scheduling and transparent remote execution, on top of various operating system kernels
without change. LSF primarily relies on initial process placement to achieve load
balancing, but also uses process migration via checkpointing as a complement.

Design of LSF

LSF's support for user-level process migration is based on Condor's approach. A
checkpoint library is provided that must be linked with application code. Part of this
library is a signal handler that can create a checkpoint file of the process so that it can be
restarted on a machine of compatible architecture and operating system.

In addition to user-level transparent process checkpointing, LSF can also take advantage
of checkpointing already supported in the OS kernel (such as in Cray Unicos and
ConvexOS), and application-level checkpointing. The latter is achievable in classes of
applications by the programmer writing additional code to save the data structures and
execution state information in a file that can be interpreted by the user program at restart
time in order to restore its state. This approach, when feasible, often has the advantage of
a much smaller checkpoint file because it is often unnecessary to save all the dirty virtual
memory pages as must be done in user-level transparent checkpointing. Application-level
checkpointing may also allow migration to work across heterogeneous nodes.
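
Application-level checkpointing, as described above, can be illustrated with a toy
program that saves only the data it needs to resume. This is a minimal sketch and not the
LSF or Condor library; the Counter class, the function names and the JSON file format are
assumptions made for illustration.

```python
import json
import os
import tempfile


class Counter:
    """Toy long-running job whose entire resumable state is three integers."""

    def __init__(self, limit, i=0, total=0):
        self.limit = limit
        self.i = i
        self.total = total

    def step(self):
        # One unit of work: add the current index to the running total.
        self.total += self.i
        self.i += 1

    def done(self):
        return self.i >= self.limit


def save_checkpoint(job, path):
    """Write only the fields needed to resume, not the whole address space."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"limit": job.limit, "i": job.i, "total": job.total}, f)
    # Rename into place so a reader never sees a half-written checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Rebuild the job from a checkpoint file, possibly on another machine."""
    with open(path) as f:
        return Counter(**json.load(f))
```

Because the checkpoint holds plain data rather than memory pages, the file stays small and
can be read back on a node with a different architecture, which is the advantage the text
attributes to this approach.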

    2.2.     Comparison of Existing Systems

By studying the existing systems, we see the following differences between them.

                             Table 1 Comparison of Existing Solutions

Characteristic             MOSIX         Sprite        Mach          LSF
Initial migration time     Moderate      Moderate      Low           Low
Residual dependency        None          None          Yes           None
Residual time and costs    None          Moderate      High          None
Freeze cost                Moderate      Moderate      Small         Moderate
Freeze time                Moderate      Moderate      Low           Moderate
Transparency               Full          Full          Full          Limited
Decentralization           Distributed   Centralized   Distributed   Centralized
Fault resilience           Yes           Limited       No            Yes
Knowledge                  Aging         Periodic      Negotiation   None

    2.3.     Observations about these systems

We see that the systems that were discussed have their own characteristics and goals. We
see that the following goals are not the focus of these systems:

    1. Implementation on a popular platform: Systems such as MOSIX and Sprite are
       implemented on the Unix operating system. We see that these systems are not
       geared towards more popular commercial platforms. Mach is a microkernel in
       itself, while LSF and Condor aren't full-fledged process migration solutions
       (they are just libraries).
    2. Maintaining a “single-user” interface: The MOSIX and Sprite systems are both
       based on an SSI architecture, in which a distributed system is identified as a
       single entity. The Mach system is partially geared towards SSI. LSF and Condor
       are libraries, which don't offer an integrated solution. Using these systems would
       mean users must get accustomed to a new environment of execution.
    3. Non-preemptive process migration: The systems that were discussed use a
       checkpointing mechanism to store the state of a process before it is migrated.
       This means additional overhead to capture the process state, to send it across a
       network, and to resume it on the target system. We could reduce this overhead by
       allowing only non-preemptive process migration. While this may reduce the
       amount of task migration, it would also reduce the overhead associated with
       migration.

We have seen the problem of process migration and the existing solutions. Here, we aim
to formalize the problem and the goals of the project. We do not discuss the issue of
generalized process migration, but focus on the aspects that we deemed most important
to implement in the project.

   3.1.        Goals

       An initial configuration interface must be available to the user to set up an
       overlay network.
       Processes can be run at any instant, at any CPU load.
       Processes may be dispatched to other computers only when it is more efficient to
       dispatch them than to process them locally.
       Dispatching of processes must depend on the total and current capability of the
       target systems.
       A target system must not suffer noticeable slowdown in interactive tasks due to
       the processing of dispatched processes.
       A process executing on a target machine must minimize its interactions with the
       source machine.
       The dispatching of processes must not be visible or noticeable to the user
       working on the source machine.
       There must be no regression caused by changes in the OS. When this system is not
       in use (e.g. in the absence of a network connection), regular operation of the
       OS must be possible.
       The dispatching framework must effectively dispatch any application (running
       in application space).

   3.2.        Important Definitions and Terminologies

       Execution Time: The time between the arrival of the process, and the end of
       execution of the process
       System Metric: The system metric is a measure of the execution time of a
       process running on a particular computer. The system metric depends on a variety
       of parameters:
          o Current load on the system: Decides how long a process waits in the wait
          o The capability of the system: The speed at which the system executes.
          o The structure of the program
       Distribution Metric: The Distribution Metric is the System Metric + The
       approximate network delay for transferring a process to a particular system. It is
       the distribution metric that is used for making load balancing decisions.
       Pairing: 2 systems can migrate processes amongst one another only when both
       machines agree. The process of negotiation is known as pairing.
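
The definitions above can be illustrated with a minimal sketch. It assumes that both the system metric and the network delay are expressed as estimated times in seconds, so that a lower value is better; the `Peer` class and the function names are hypothetical and are not part of the project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Peer:
    name: str              # hypothetical identifier of a paired system
    system_metric: float   # assumed: estimated execution time on the peer (s)
    network_delay: float   # assumed: estimated transfer delay to the peer (s)


def distribution_metric(peer: Peer) -> float:
    """Distribution Metric = System Metric + approximate network delay."""
    return peer.system_metric + peer.network_delay


def choose_target(local_system_metric: float, peers: List[Peer]) -> Optional[Peer]:
    """Return the peer to dispatch to, or None when local execution is preferred."""
    if not peers:
        return None
    best = min(peers, key=distribution_metric)
    # Dispatch only when it is strictly cheaper than processing locally.
    if distribution_metric(best) < local_system_metric:
        return best
    return None
```

With these assumptions, a process stays on the source machine whenever no paired peer offers a lower distribution metric than the local system metric.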

   3.3.        Modules of the Solution

The entire process of process migration can be divided into modules. Each module is
associated with a singular task, and can be completed independent of other modules. The
modules are:

       Load Information: The entire process of load balancing depends on the
       distribution metric, which in turn depends on the system metric. Thus, a module
       to get the system metric is needed. The system metric is derived by running
       benchmark tests, and querying the operating system on the current load.

       Load Balancing: It is of primary importance that the entire system be a
       peer-to-peer network. Thus, load balancing involves exchanging distribution metric data,
       such that each node can make an informed decision on load balancing when a new
       process must be migrated.
       Load Balancing (Probing): Once (partial) information about peers is maintained,
       a more thorough traversal must be conducted when a process can be migrated.
       Process Migration: Once a target system is identified, the actual process
       migration must take place. This includes transferring the process code+data,
       stopping/preventing the process from starting on the source machine, and starting
       a new process on the target machine
       Hooking IO: Once a process is migrated and is running on a target machine, its
       input/output operations must be intercepted, so that the source system is called
       upon when needed.

These modules are expanded as chapters in this report, since each module is important for
successful implementation of the project.

   3.4.          Software Requirement Specification

The following is an SRS of the project. It highlights the usability of the project in terms of
end users. It does not, however, focus on the algorithms, modules or internal mechanisms
of the project.

   3.4.1.        External Interface Requirements

User Interface

1. The configuration wizard: The Configuration wizard must be simple to use and
   design. Priority of the project is given to simplicity rather than configurability. A
   sample configuration wizard could be:

                         Figure 3 Configuration Wizard Prototype

2. Pairing Requests: Pairing requests must also be handled to allow both systems to
   be paired. The pairing process ensures the 2 systems indeed can share processes.
   A sample pairing wizard can be:

                            Figure 4 Pairing request Prototype

3. In Action: The rest of the project is meant to run in the back end, in the
   kernel, during process scheduling and execution. Thus, no additional interfaces
   need to be exposed to the user.

Hardware Interfaces

A high-speed network card supported by the operating system is used as the medium of
communication between computers. The hardware needed to run the operating system is
also required.

Communication Interfaces

A TCP/IP implementation in the kernel is necessary.

    3.4.2.     Functional Requirements

Configurations/Pairing Process

Ability to Pair with Computers Connected to the Network

Description: The project must allow the system to be paired with another computer on
the network. It is reasonable to assume that pairing can occur only when both computers
are simultaneously available. The pairing must be remembered even after computers are

Importance: 8

Secure Pairing

Description: During the Pairing process, both systems must actively acknowledge and
accept the pairing proposal, without which the Pairing process cannot complete.
Moreover, to ensure correct pairing, a common code/password must be entered in both
the systems participating in pairing

Importance: 8

Un-pairing

Description: If at any point in time a system wishes to dissociate from its pairing with
another computer, it can do so immediately and without notifying the other system.
Once a system is unpaired, it will not accept any process dispatching requests.

Importance: 7

Load Distribution

System Metrics

Description: At any point of time, if a target system is available for dispatching work, it
must be possible to determine the “system metric” of that system. The system metric
determines the time needed for completely executing a task on a target computer. A
higher system metric would imply a faster computing system.

Importance: 10

Distribution Metrics

Description: The distribution metric depends on both the system metric and
the network latency of a target machine. For any target system, it must be possible to
determine the distribution metric at any time. The distribution metric is used for load
balancing decisions between multiple target systems.

Importance: 10

Self-Execution vs. Dispatch

Description: At the time of load balancing, a decision must be made whether the thread is
allowed to execute on the same system, or whether it must be dispatched. The decision is made
with the help of the distribution metric of the “best” target machine, the system metric of
the source machine, and the results of scanning through the thread code.

Importance: 10

Transitivity of Transference

Description: A target computer receiving a dispatched thread can dispatch the thread
again to other computers, provided it is more efficient to do so. Such transference must
not result in cycles. Since scheduling overhead must be minimized, finding a shortest
path from source machine to a desired (but not directly accessible) target machine can be
avoided by heuristic methods

Importance: 5

Scheduling

Minimize scheduling overhead

Description: In order to account for dispatched processes on target systems, additions
and modifications must be made in the scheduling algorithm. It is important to minimize
the overhead caused due to these changes.

Importance: 9

Optimize Scheduler Results

Description: The scheduling algorithm used by default in the Operating System should
be used even for the dispatched tasks. A dispatched task must however have a much
lower priority than native tasks on the target system. The System Metric must also
account for the scheduling delay.

Importance: 8

Code Analysis

Description: During initialization, a thread must undergo an analysis to determine its
performance when it is dispatched. This analysis should take into consideration the
system calls, the access to hardware, synchronization structures used, etc.

Importance: 5

Dispatching Processes

Non erroneous results

Description: A task that has been dispatched to a target system must return exactly the
same results to the source system as when the task is run on the source
system itself. If there is a possibility of results not matching, then the task must not be

Importance: 10

Memory Coherence

Description: Any changes made to a global memory by a dispatched thread must be
reflected in the global memory area of the source machine. This includes modifications to
critical sections in memory.

Importance: 10

Thread Synchronization

Description: Any thread synchronization API used in a dispatched thread must be
synchronous with other related threads, which may be on source system, or in any other
target system.

Importance: 9

Stream Synchronization

Description: A dispatched process must have access to all currently opened streams
which are accessible on the source system. Streams may be opened by the dispatched
process and written to. These changes must be reflected in the source system.

Importance: 10

Minimize Communications

Description: The number of times the target machine must communicate with the source
machine must be minimized to reduce overhead. Piggybacking of requests must be
performed, and bulk data transfer is preferred.

Importance: 8

Sandbox Environment

Description: Changes made by a dispatched thread must not be retained globally on a
target system. The changes made must be recorded and incorporated only on the source
machine. These changes include changes to date, shutdown of system, etc.

Priority: 10

Asynchronous Dispatch

Description: The process of dispatching of a thread (as well as getting results back from
a dispatched thread) must run asynchronously on both, the source and target machines. If
it cannot be asynchronous, it must not be a part of the kernel scheduling thread.

Priority: 9

Abstraction

Application Abstraction

Description: Dispatch mechanisms must not affect application development. A thread
must not find out whether it is run on the source computer, or it is dispatched. The
environment available to a thread must always remain consistent and the same as that on
the source system.

Priority: 8

Kernel Methods

Description: A part of the dispatching system must be opened, so that future additions to
the kernel could make use of the dispatching system. Moreover, kernel exceptions must
be made to not allow new additions to the kernel cause regressions in the dispatching

Priority: 5

   3.4.3.      Performance Requirements

       The time required for execution completion of a dispatched thread must always be
       less than or equal to the time required for execution completion of the thread on
       the source system. (This is a hard constraint, which must be followed unless an
       exception occurs.)

       The overhead caused due to modifications in scheduling must not exceed 30% of
       the normal scheduling time, and must not add to the complexity of the scheduling
       Heuristic algorithms for load balancing are preferred if they result in lower
       overhead. A tradeoff of 1:2 (speed to error ratio) or below is sufficient for
       accepting a better algorithm.

   3.4.4.      Design Constraints

Standards Compliance

       C coding will be done in accordance with the GNU coding convention.
       Documentation will be available in the man pages.

Hardware Limitations

Minimum System Requirements:

       IA32 compliant processor
       32 MB of RAM
       1 GB Hard disc
       Hi-speed Ethernet NIC

   3.4.5.      Software system attributes

Availability

The project will be available at all times during the functioning of the operating system. It
can be turned off, in which case the dispatching system will not work. The availability of
a peer computer depends entirely on the status of the dispatching system running on it.

Security

The project must not cause easily exploitable security vulnerabilities. Dispatching of
threads must be allowed only on systems which are managed by a single entity. It must
not be possible to cause a pairing without the consent of both the systems. It must
not be possible to trigger thread dispatch without a pairing of both systems with each
other.

Maintainability

         The project must be well documented and commented to allow for future
         Interfaces must be provided to especially add functionalities in areas susceptible
         to modifications.
         Interfaces of the project must also be available for use by kernel programmers
         coding functionality related to process management.

Portability

         The project must work on all systems which follow the minimum specifications
         The project is not intended to run on different types of processor architecture
         The project's scope is limited to execution on the operating system on which
         it is implemented.


A major component of the project is an estimate of the relative performance of 2
different systems executing a common task at a given instant of time. This is to be
represented by a single figure, a System Metric. This metric forms the basis of the load
balancing procedure.

The system metric can be considered a composite of 2 components:

       The performance of the hardware system itself: The effects of heterogeneous
       processors, different memory layouts, architectural differences, etc.
       The performance as measured by the OS: Factors such as current load on the
       system, memory load, available memory, etc. This can further be classified into:
           o Static Data Acquisition: These metrics remain more or less constant
              during the operation of the system
           o Dynamic Data Acquisition: These metrics depend on the dynamically
              changing resources available

   4.1.    Foundation of collection
   4.1.1. Static Data Acquisition

We follow the assumption that once a system is in operation, the hardware configuration
of the machine is not susceptible to change. If this is assumed, then the processor
performance may be calculated:

       Either once overall in the system
           o Advantage: Since performed lesser number of times, can be more
           o Disadvantage: Hardware changes would cause the metrics to be
       Every time the system is rebooted
           o Advantage: Hardware changes when system is offline can be accounted
               for in the metric

          o Disadvantage: Since performance is graded in every reboot, the
              performance analysis must be done with minimal overhead
       Every time a ping request is sent
          o Advantage: Gives the most up-to-date information
          o Disadvantage: Infeasible.

Weighing these approaches, we decided to use the 2nd approach. We measure the static
data at the time the system is rebooted.

   4.1.2. CPU Performance: Measurement vs. benchmarking

There are 2 approaches that could be used to measure the performance of a processor in
a system:

       Measurement: The various parameters of the processor are accurately measured.
       This includes Clock speed, pipeline length, cache memory, bus speed, mean
       memory access time, etc. These measurements are then combined into a single
       representative metric.
       Benchmarking: Specialized sample programs are run on the machine. Each
       program measures some aspects of the processor. The throughput of each program
       is taken as indicative of the CPU performance. The results of all the programs
       are then combined into a single representative metric.
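
As an illustration of how per-program throughputs might be combined into a single figure, the sketch below takes the geometric mean of throughput ratios against a reference machine. Both the choice of the geometric mean and the existence of a reference machine are assumptions made for the example; the report does not state which combining function is used.

```python
import math


def composite_score(throughputs, reference):
    """Geometric mean of per-benchmark throughput ratios.

    `throughputs` and `reference` map a benchmark name to the throughput
    measured on this machine and on an assumed reference machine.
    """
    ratios = [throughputs[name] / reference[name] for name in reference]
    # Averaging the logarithms and exponentiating gives the geometric mean.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

A geometric mean keeps one unusually fast or slow benchmark from dominating the combined score, which is why it is a common choice for ratio data.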

We choose the Benchmarking approach. The reasons for this are:

       It is not always possible to determine programmatically the various parameters of
       the processor. For example, the size of the L2 cache cannot be directly determined.
       It is not always possible to quantify the measurement of a certain parameter. For
       example, the memory architecture, branch prediction schemes, etc. cannot be directly
       quantified without a context.
       There exist benchmarking standards which accurately measure the performance of
       the processor.
Disadvantages of benchmarking

        Most benchmarking solutions take a long time to execute on a machine.
        The solutions of the benchmarking tools are usually comprehensive, and touch on
        the various aspects of the system. To collect these aspects into a single metric
        poses another problem.
        The comprehensive nature of the benchmarks can also be partly attributed to the
        high workload (inputs)

We hence find that we cannot use conventional benchmarking solutions directly.

    4.2.    Benchmarking CPU

We will use the SPEC CPU2006 benchmarking suite as the starting point of the solution.
CPU2006 is a leading industry benchmark for processors.

CPU2006 draws benchmarks from real-life applications, rather than artificial loop kernels
and synthetic benchmarks. The most important part of the benchmarking suite is the
benchmarks themselves. They provide an indicative of performance of common

CPU2006 benchmarks are subdivided into 2 components: CINT (Integer benchmarking
component) and CFP (Floating point benchmarking component). These 2 groups test 2
separate areas of processor computation, and thus we shall also treat these 2 groups are

CPU2006, however, has long benchmark run times, so that each benchmark runs long
enough to give meaningful and accurate measurements. Longer run times, in turn,
increase the cost of performance evaluation. Thus, there are 2 methods to reduce
the cost of performance evaluation:

       Develop techniques to reduce simulation time for each benchmark
       Find a smaller, but representative set of benchmark programs

We see that the former approach requires knowledge of the processor on which the
benchmarks must be run, to optimize the simulations. The approach usually requires
knowledge about the micro-architecture of the CPU. Since we deal with heterogeneous
systems, we shall not deal with this aspect.

We must hence carefully choose the set of benchmarks to be run. A poorly chosen suite
may not depict accurately the true performance of the CPU. Similarly, selecting too few
benchmarks may not cover the entire spectrum of applications run on the computer. Thus,
the suite of benchmark applications must be carefully selected.

We thus will first discuss ways to cluster benchmarks to give a true indication of CPU
performance with minimal run time. However, even these techniques to reduce the
number of benchmarks may not be sufficient to result in fast measurements. Thus, the
minimal subset of benchmarks selected by clustering will be analyzed; the essence of
each benchmark would be extracted, and a new benchmark would be devised for our

   4.2.1. Principal Component Analysis

Principal component analysis is a mathematical technique that transforms a number of
possibly correlated variables into a smaller number of uncorrelated variables called
principal components. We see that during benchmarking tests, each benchmark will
measure multiple variables simultaneously to give a composite variable, the time taken to
run that particular benchmark. A single benchmark result would result from the
performances of various parameters. We can employ multiple benchmark results to
estimate the principal components. Once the principal components are found, the
benchmarks will then be subject to clustering based on the various components (and then
each cluster is simplified to exclusively test the principal component representative of
that cluster).

In case of sub-setting of benchmarks, PCA is used to recognize the important
characteristics of a particular benchmark, fairly independent of the machine being used.

Steps in PCA

    1. Data Representation: The data available should be represented by readings of
       data in the different dimensions of measurement.
    2. Normalization: Mean of the data in different dimensions is found. The mean is
       subtracted from each measurement of each dimension.
    3. Covariance Matrix: The covariance matrix is calculated for the normalized data
    4. Eigenvectors: For the obtained covariance matrix, obtain all possible eigenvectors
       and the associated eigenvalues. The eigenvectors represent the data in the
       different dimensions, with varying degree of correlation.
    5. Elimination: The eigenvalue for a particular eigenvector is proportional to the
       variance of the data along that eigenvector. We can thus eliminate eigenvectors with low
       eigenvalues to reduce the number of dimensions. The remaining eigenvectors are the
       principal components.
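
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the procedure used to obtain the published measurements: it finds eigenvectors by power iteration with deflation (a substitute for a full eigen-decomposition), and all function names are hypothetical.

```python
import math


def mean_center(rows):
    """Step 2: subtract the per-dimension mean from every measurement."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def covariance_matrix(centered):
    """Step 3: sample covariance matrix of the mean-centred data."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dominant_eigenpair(matrix, iterations=500):
    """Step 4 (one pair): largest eigenvalue and its unit eigenvector."""
    dims = len(matrix)
    # Fixed start vector; it must not be orthogonal to the dominant eigenvector.
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    value = sum(v * w for v, w in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def principal_components(rows, keep):
    """Steps 2 to 5: the `keep` strongest (eigenvalue, eigenvector) pairs."""
    cov = covariance_matrix(mean_center(rows))
    dims = len(cov)
    pairs = []
    for _ in range(keep):
        value, vec = dominant_eigenpair(cov)
        pairs.append((value, vec))
        # Deflation: remove the component just found so the next one dominates.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(dims)]
               for i in range(dims)]
    return pairs
```

For measurements that lie exactly on the line y = 2x, the first component returned points along (1, 2) and carries all of the variance. A production implementation would more likely call a linear algebra library for the eigen-decomposition.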

It is possible to derive a new data set. However, our interest lies in getting the Principal
components and using them for clustering.

    4.2.2. Processor Characteristics Measurement

In order to measure the performance of a benchmark, we must measure metrics
associated with the running of that benchmark.

Ideally, we can recognize a few micro-architecture independent characteristics of
benchmarks. These include:

          Instruction Mix: Relative frequencies of various operations in the program.
          Decides if program is CPU intensive or memory access intensive.
          Branch Direction: The number of backward branches v/s forward branches.
          Dependency Distance: Measurement of Instruction level parallelism possible.
          Locality metrics: Measure of caching of data.

However, it was found that evaluating micro-architecture independent characteristics is a
time consuming task. Thus, it is more feasible to use metrics that can be easily attained
by hardware performance monitoring counters. The SPEC2006 suite consists of 2 sets of
benchmarks: CINT and CFP, and thus, we can derive 2 classes of such measurements:

                              Table 2 Benchmark Types in SPEC2006

Integer Benchmarks                              Floating Point Benchmarks
Integer operations per instruction              Floating point operations per instruction
(measurement of IPC, instructions per cycle)    (measurement of IPC)
L1 cache misses per instruction                 Memory references per instruction
(measurement of memory access)                  (measurement of memory access)
Branching ratio                                 L2 D-cache misses per instruction
Mispredicted branch ratio                       L2 D-cache misses per L2 access
L2 data cache misses per instruction            Data TLB misses per instruction

     4.2.3. Running the Benchmarks

In order to get the characteristics, we use hardware performance counters. To access this,
an interface such as PAPI could be used. The counters must be reset, and then the
benchmarks are run. At the end of the benchmarks, the PAPI GUI could be used to
analyze the hardware counters. This data is collected over several machines.

Problems in running benchmarks

We could not run the benchmarks ourselves, due to the following difficulties:

       Although the SPEC2006 specification is open, the suite itself is only
       commercially available.
       There were not enough heterogeneous machines available to run the benchmarks.
       Measurements have already been taken by others and been published. We used
       the same data.

We thus rely on the measurements in the papers.

At the end of PCA, we get 'k' Principal components for each benchmark. These PCs are
used for comparing the similarity and dissimilarity of each benchmark.

   4.2.4. Clustering

Clustering is a statistical method which groups programs with similar characteristics
together. There are 2 commonly used clustering techniques: k-means clustering and
hierarchical clustering. Hierarchical clustering was chosen since it allows multiple
clustering possibilities while having a choice over the number of clusters, without a
complete recalculation. In hierarchical clustering, the data is represented in the form of

The Euclidean distance between principal components (found out by PCA) is used as a
measure of similarity between the various benchmarks. Initially, each benchmark is
placed in a separate cluster. Then, the linkage distance is increased to allow for
clustering, which leads to similar benchmarks being clustered together.
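
The merging procedure described above can be sketched as follows. The sketch assumes single linkage (the distance between two clusters is the distance between their closest members), which the report does not specify, and the point names used in any call are placeholders rather than measured benchmark data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two principal-component coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(points, linkage_distance):
    """Merge clusters until the closest pair is farther than linkage_distance.

    `points` maps a benchmark name to its principal-component coordinates.
    """
    clusters = [[name] for name in sorted(points)]  # one cluster per benchmark

    def gap(c1, c2):
        # Single linkage (assumed): distance between the closest members.
        return min(euclidean(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        dist, i, j = min((gap(clusters[i], clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        if dist > linkage_distance:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)
```

Calling `cluster` with increasing values of `linkage_distance` reproduces the levels of a dendrogram: a small distance leaves every benchmark alone, and a large one merges everything into a single cluster.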

Measurements have shown the following dendrogram for CINT:

                           Figure 5 CINT k clustering of SPEC2006

We can thus form clusters by analyzing the dendrograms at linkage distances:

                               Table 3 Clustered integer benchmarks

Subset of 5       458.sjeng, 462.libquantum, 403.gcc, 456.hmmer, 483.xalancbmk
Subset of 6       458.sjeng, 429.mcf, 462.libquantum, 403.gcc, 456.hmmer,

   4.2.5. Benchmarks obtained

Since we aim for fast benchmarks, we decided to use CINT benchmarks highlighted by
the clustering. These benchmarks are:

       458.sjeng (Artificial Intelligence): 458.sjeng is based on Sjeng 11.2, which is a
       program that plays chess and several chess variants, such as drop-chess (similar to
       Shogi), and 'losing' chess. Based on sjeng:
       462.libquantum (Physics/Quantum Computing): libquantum is a library for the
       simulation of a quantum computer. Based on libquantum:
       403.gcc (C compiler): Based on the gcc compiler:
       456.hmmer (Search Gene Sequence): making and using Hidden Markov Models
       (HMMs) of biological sequences:
       483.xalancbmk (XML parser): This benchmark is a modified version of Xalan-
       C++, an XSLT processor written in a portable subset of C++:

   4.2.6. Implementing these benchmarks

The benchmark programs were taken from their respective websites and then
run on the system. The programs were modified so they would work in batch mode. The
workloads were also modified to allow for fast execution.

   4.3.    Measuring System Load

The system load in Linux can be measured in a variety of ways. We first conduct a
survey of existing tools that could be used to measure the system load

   4.3.1. Tools for measuring system load

vmstat

vmstat reports information about processes, memory, paging, block IO, traps, and cpu
activity. The first report produced gives averages since the last reboot. Additional reports
give information on a sampling period of length delay. The process and memory reports
are instantaneous in either case.

The statistics output by vmstat include:

            o r: The number of processes waiting for run time.
            o b: The number of processes in uninterruptible sleep.
            o swpd: the amount of virtual memory used.
            o free: the amount of idle memory.
            o buff: the amount of memory used as buffers.
            o cache: the amount of memory used as cache.
            o si: Amount of memory swapped in from disk (/s).
            o so: Amount of memory swapped to disk (/s).
            o bi: Blocks received from a block device (blocks/s).
            o bo: Blocks sent to a block device (blocks/s).
            o in: The number of interrupts per second, including the clock.
            o cs: The number of context switches per second.
       CPU: These are percentages of total CPU time.
            o us: Time spent running non-kernel code. (user time, including nice time)
            o sy: Time spent running kernel code. (system time)
            o id: Time spent idle.
            o wa: Time spent waiting for IO.
            o st: Time stolen from a virtual machine.

To get these statistics, vmstat uses the following system files:

   1. /proc/stat: To get CPU utilization
   2. /proc/meminfo: To get the memory information

It also uses the following system calls

   1. sysctl: To get kernel parameters
   2. devstat: To get device parameters

mpstat

The mpstat command writes to standard output activities for each available processor,
processor 0 being the first one. Global average activities among all processors are also
reported. The mpstat command can be used both on SMP and UP machines, but in the
latter, only global average activities will be printed.

The report generated by the mpstat command has the following format:

       CPU: Processor number. The keyword all indicates that statistics are calculated as
       averages among all processors.
       %user: Show the percentage of CPU utilization that occurred while executing at
       the user level (application).
       %nice: Show the percentage of CPU utilization that occurred while executing at
       the user level with nice priority.

       %sys: Show the percentage of CPU utilization that occurred while executing at
       the system level (kernel). Note that this does not include time spent servicing
       interrupts or softirqs.
       %iowait: Show the percentage of time that the CPU or CPUs were idle during
       which the system had an outstanding disk I/O request.
       %irq: Show the percentage of time spent by the CPU or CPUs to service interrupts.
       %soft: Show the percentage of time spent by the CPU or CPUs to service
       softirqs. A softirq (software interrupt) is one of up to 32 enumerated software
       interrupts which can run on multiple CPUs at once.
       %steal: Show the percentage of time spent in involuntary wait by the virtual CPU
       or CPUs while the hypervisor was servicing another virtual processor.
       %idle: Show the percentage of time that the CPU or CPUs were idle and the
       system did not have an outstanding disk I/O request.
       intr/s: Show the total number of interrupts received per second by the CPU or CPUs.

To generate these stats, the /proc/stat system file is mainly used.

top

The top program provides a dynamic real-time view of a running system. It can display
system summary information as well as a list of tasks currently being managed by the
Linux kernel. The types of system summary information shown and the types, order and
size of information displayed for tasks are all user configurable and that configuration
can be made persistent across restarts. The program provides a limited interactive
interface for process manipulation as well as a much more extensive interface for
personal configuration -- encompassing every aspect of its operation.

top displays a variety of information about the processor state:

       "uptime": This line displays the time the system has been up, and the three load
       averages for the system. The load averages are the average number of processes
       ready to run during the last 1, 5 and 15 minutes.
       processes: The total number of processes running at the time of the last update.
       This is also broken down into the number of tasks which are running, sleeping,
       stopped, or undead. The processes and states display may be toggled by the t
       interactive command.
       "CPU states": Shows the percentage of CPU time in user mode, system mode,
       niced tasks, iowait and idle. (Niced tasks are only those whose nice value is
       positive.) Time spent in niced tasks will also be counted in system and user time,
       so the total will be more than 100%. The processes and states display may be
       toggled by the t interactive command.
       Mem: Statistics on memory usage, including total available memory, free
       memory, used memory, shared memory, and memory used for buffers. The
       display of memory information may be toggled by the m interactive command.
       Swap: Statistics on swap space, including total swap space, available swap space,
       and used swap space. This and Mem are just like the output of free(1).
       PID: The process ID of each task.
       PPID: The parent process ID of each task.
       UID: The user ID of the task's owner.
       USER: The user name of the task's owner.
       PRI: The priority of the task.
       NI: The nice value of the task. Negative nice values are higher priority.

SIZE: The size of the task's code plus data plus stack space, in kilobytes, is
shown here.
TSIZE: The code size of the task. This gives strange values for kernel processes
and is broken for ELF processes.
DSIZE: Data + Stack size. This is broken for ELF processes.
TRS: Text resident size.
SWAP: Size of the swapped out part of the task.
D: Size of pages marked dirty.
LC: Last used processor. (That this changes from time to time is not a bug; Linux
intentionally uses weak affinity. Also notice that the very act of running top may
break weak affinity and cause more processes to change current CPU more often
because of the extra demand for CPU time.)
RSS: The total amount of physical memory used by the task, in kilobytes, is
shown here. For ELF processes used library pages are counted here, for a.out
processes not.
SHARE: The amount of shared memory used by the task is shown in this column.
STAT: The state of the task is shown here. The state is either S for sleeping, D
for uninterruptible sleep, R for running, Z for zombies, or T for stopped or traced.
These states are modified by trailing < for a process with negative nice value, N
for a process with positive nice value, W for a swapped out process (this does not
work correctly for kernel processes).
WCHAN: Depending on the availability of either /boot/psdatabase or the kernel
link map /boot/ this shows the address or the name of the kernel
function the task currently is sleeping in.
TIME: Total CPU time the task has used since it started. If cumulative mode is
on, this also includes the CPU time used by the process's children which have
died. You can set cumulative mode with the S command line option or toggle it
with the interactive command S. The header line will then be changed to CTIME.
       %CPU: The task's share of the CPU time since the last screen update, expressed
       as a percentage of total CPU time per processor.
       %MEM: The task's share of the physical memory.

   4.3.2. Parameters to consider

As we saw from the preceding tools, a large number of parameters can be used to
determine the current load in the system. The parameters that could be considered are:

   1. The number of jiffies that were spent by User processes
   2. The number of jiffies that were spent idling
   3. The number of jiffies that were spent performing system/kernel operations
   4. The number of jiffies that were spent processing interrupts
   5. The number of processes currently running
   6. The amount of free memory
   7. The number of faults in the swap memory

We can formulate many other parameters as well by analyzing relations between these
parameters, for each process.

Next, we must look at how the Linux scheduler shares the CPU. The approximate load is a
measure of how much CPU time a new process entering the system can occupy. Given a new
process P and n processes already running:

       If I jiffies are spent by the CPU idling, then all I jiffies can be assigned to
       the process that is entering the system.

       If X jiffies are spent by the CPU executing user processes, then on average
       the additional process receives X/(n+1) jiffies.

Thus, we see that the approximate load can be described as

We see that the number of processes running affects the interactivity of the system. This
is because a larger number of processes would require a higher number of context
switches to be interactive. But, since we consider only the throughput of the system, we
do not need the number of processes currently running in the system. The amount of time
that is spent for context switching is considered as part of the interrupt handling jiffies
(since context switching happens by either software or hardware (clock) interrupts)

The amount of free memory does not influence the throughput unless very little free
memory remains. Thus, we perform a bound check on the percentage of free memory on the
machine, and consider its effect on the throughput only if it falls below a threshold.

Lastly, the number of faults in the virtual memory usually depends on the amount of free
memory. Since we consider that process migration happens when a CPU starts getting
moderately loaded, we assume the contribution of swap faults is negligible.

We see that when calculating the approximate load, we consider the total number of
jiffies that could be expended running a new process. Thus, this variable not only gives
the relative load of the system, it gives a cumulative measure on how much of the CPU a
new process will get, which is directly related to the execution time of the process.

   4.3.3. Getting and parsing the information

The CPU usage information can be found in the /proc/stat file. This is a virtual file
maintained by the Linux kernel; it contains a variety of statistics about the system
since it was last restarted. The file starts with one or more "cpu" lines.

The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. The
label is followed by these numbers, in order:

         user: normal processes executing in user mode
         nice: niced processes executing in user mode
         system: processes executing in kernel mode
         idle: twiddling thumbs
         iowait: waiting for I/O to complete
         irq: servicing interrupts
         softirq: servicing softirqs

For example, a sample output is:

cpu     2255 34 2290 22625563 6290 127 456
cpu0 1132 34 1441 11311718 3675 127 438
cpu1 1123 0 849 11313845 2614 0 18

We see that at any instant, this file would give the total number of jiffies spent on various
operations since the last restart.
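
Because the counters are cumulative, a load figure has to come from the difference between two readings. The following is a minimal Python sketch of that idea, not the project's implementation: the function names `parse_cpu_line` and `idle_fraction` are hypothetical, the first snapshot is the sample "cpu" line shown above, and the second snapshot is invented for illustration.

```python
def parse_cpu_line(line):
    """Return the jiffy counters of one 'cpu' line as a dict."""
    fields = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")
    parts = line.split()
    # parts[0] is the label ("cpu", "cpu0", ...); the counters follow it.
    values = [int(v) for v in parts[1:1 + len(fields)]]
    return dict(zip(fields, values))


def idle_fraction(before, after):
    """Fraction of jiffies spent idle between two snapshots."""
    delta = {k: after[k] - before[k] for k in before}
    total = sum(delta.values())
    return delta["idle"] / total if total else 0.0


# First snapshot: the sample line above. Second snapshot: made up.
first = parse_cpu_line("cpu 2255 34 2290 22625563 6290 127 456")
second = parse_cpu_line("cpu 2305 34 2310 22625593 6290 127 456")
print(idle_fraction(first, second))  # 30 idle jiffies out of 100 -> 0.3
```

On a live system the two lines would be read from /proc/stat with a short sleep in between.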

Thus, in order to get the Approximate Load, we perform the following:

   1. Get the stats from /proc/stat
   2. Sleep for a small duration
   3. Get the new stats from /proc/stat
   4. Find the difference in the 2 readings to get the number of jiffies spent on various
       operations
   5. Find the number of jiffies that were spent idling. Find the total number of jiffies
   6. Find the ratio to get the approximate load

   Getting Memory stats

Memory stats can be obtained from the virtual file /proc/meminfo, which is also
maintained by the kernel. The file lists the total memory and how it is allocated to
various uses. The results reported are:

   1. The total available memory
   2. The total free memory
   3. The cache memory and swap memory
   4. The distribution of the used memory, such as for holding files, etc.

As mentioned earlier, we consider the effect of free memory only if it falls below a
threshold, called freemem_threshold. We assume a value of 15% of memory as the
threshold. We thus calculate the system metric from the approximate load as follows

Hence, we have successfully derived the system metric for the dynamic component.
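
The bound check on free memory can be sketched in Python as below. This is an illustrative sketch under stated assumptions: `parse_meminfo` and `memory_is_scarce` are hypothetical names, the sample text is invented, and only the `MemTotal` and `MemFree` fields of /proc/meminfo are used. The sketch stops at the threshold test and does not attempt the adjustment of the metric itself.

```python
FREEMEM_THRESHOLD = 0.15  # the 15% threshold stated in the text


def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])
    return info


def memory_is_scarce(info, threshold=FREEMEM_THRESHOLD):
    """True when the free share of memory is below the threshold."""
    return info["MemFree"] / info["MemTotal"] < threshold


# Invented sample; a live system would read the text from /proc/meminfo.
sample = "MemTotal:  2000000 kB\nMemFree:    200000 kB\nCached:     500000 kB\n"
info = parse_meminfo(sample)
print(memory_is_scarce(info))  # 10% free, below the 15% threshold -> True
```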

   4.4.    Combining: The System Metric

We have seen how to derive the various metrics. Now, these must be combined together
to get the overall system metric.

In order to do this, we must first get the static system metric. This can be derived from
static benchmark results. Each benchmark's performance is evaluated based on the time
taken to complete execution of a given workload. The time taken is then compared to the
time taken to complete under a “very” fast system, and the result is noted as a ratio.

If the ratio is >1, then it is taken to be 1. The individual results are then averaged
to get the Static System Metric.

To get the final system metric, the static and dynamic system metrics are combined
in an appropriate ratio. A weight of 30% for the static system metric was used in
our project.

Thus, using benchmarking and measurements, we successfully derived the system metric.
For simplicity and efficiency, we convert the sysmetric from 0-1 domain to 0-1000
domain, and round the decimals, so that we only need to perform integer operations.
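
A minimal Python sketch of this combination follows. It rests on two assumptions that the text does not spell out: each benchmark ratio is taken as reference time divided by measured time (so a faster machine scores higher), and the remaining 70% of the weight goes to the dynamic metric. The function names are hypothetical.

```python
STATIC_WEIGHT = 0.30  # share given to the static metric in the text


def static_metric(reference_times, measured_times):
    """Average of per-benchmark ratios, each capped at 1."""
    ratios = [min(ref / t, 1.0) for ref, t in zip(reference_times, measured_times)]
    return sum(ratios) / len(ratios)


def system_metric(static, dynamic, static_weight=STATIC_WEIGHT):
    """Blend both metrics (each in 0..1) and scale to the integer range 0..1000."""
    blended = static_weight * static + (1.0 - static_weight) * dynamic
    return round(blended * 1000)


s = static_metric([2.0, 4.0], [4.0, 4.0])  # ratios 0.5 and 1.0 -> 0.75
print(system_metric(s, 0.5))               # 0.3*0.75 + 0.7*0.5 = 0.575 -> 575
```

Working in the 0-1000 integer domain after this point means later comparisons between nodes need no floating point arithmetic.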

   4.4.1. Distribution Metric

The distribution metric, as defined earlier, is basically:

The network delays are considered since:

   1. The network delay includes the delay for the process migration itself. This
       includes the time to send the probe, the process, as well as the overhead for
       hooking IO functions
   2. Since transitivity is required, adding this extra constant ensures that within the
       overlay network, propagation happens only for drastic improvements of the
       system metric over nodes.

It is this distribution metric that is considered for load balancing of nodes.


The capacity of a cluster of computers cannot be fully utilized without proper means to
distribute the load in the cluster. Thus, an adequate load balancing algorithm is required.
Achieving optimal load balancing is a problem of continual research, and usually the cost
of computing the optimal load balancing outweighs the advantage it offers. Hence, in
order to efficiently compute the system to which a process must be sent, data can be
collected in 2 stages:

   1. A preliminary measurement of distribution metrics of neighbors. This will reduce
       the search domain for the next stage
   2. A probe of the distribution metrics of various systems, which retrieves the optimal
       distribution node.

In this chapter, the first technique is detailed: Creating a preliminary overlay network to
assist in step 2 of load balancing.

   5.1.    Generalized Load Balancing
   5.1.1. Taxonomy of Load Balancing Algorithms

                         Figure 6 Taxonomy of Load Balancing Algorithms

In the global picture, load balancing algorithms divide into 2 types: Static and
Dynamic. In our system, the order of processes and their assignment cannot be predicted
beforehand. Thus, we must adopt a dynamic load balancing approach. Since we agree to a
pairing scheme, a cooperative approach suffices. An optimal solution would be the best
for load balancing; however, the computation cost would be high. Thus, we adopt an
approximate suboptimal solution, since heuristics would require further knowledge of the
particular network topology.

   5.1.2. Representation of Relationships

The network must first be represented in an appropriate format to prepare an algorithm
for load balancing. Since we use a distributed load balancing algorithm, a graph is the
natural representation. We represent each system in the cluster as a vertex in the
graph. For every paired connection between 2 systems, we add an edge between the
corresponding vertices in the graph.

For example, consider a cluster with the systems: A, B, C, D and E. Let us assume that
there exists a pairing between:

       A and B
       B and C
       D and C
       E and B
       D and B

Then this can be represented by the graph:

                          Figure 7 Example of preliminary representation

Note that this representation is a basic mapping between the overlay network and a
suitable data structure. Specific representations will be required for specific algorithms.
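
As an illustration of such a data structure, the example pairings above can be held in an adjacency map. This is a sketch only; `build_overlay` is a hypothetical helper, not part of the project code.

```python
from collections import defaultdict


def build_overlay(pairings):
    """Return an adjacency map: system name -> set of paired systems."""
    graph = defaultdict(set)
    for a, b in pairings:
        graph[a].add(b)
        graph[b].add(a)  # a pairing is mutual, so the edge is undirected
    return dict(graph)


pairings = [("A", "B"), ("B", "C"), ("D", "C"), ("E", "B"), ("D", "B")]
overlay = build_overlay(pairings)
print(sorted(overlay["B"]))  # ['A', 'C', 'D', 'E']
```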

    5.1.3. Types of Load Balancing

Load balancing can also be classified depending on who initiates the load balancing.
While this mainly concerns the process migration stage, we note that there are 3 types:

    1. Sender Initiated: An overloaded system searches for a free system to which it
        can transfer load:
            a. overloaded node: when a new task makes the queue length exceed
                   threshold, T
            b. under loaded node: if accepting a task will not cause queue to exceed
                   threshold, T
            c. overloaded node attempts to send task to under loaded node
            d. only newly arrived tasks considered for transfers
            e. location policies: random, threshold, or shortest
            f. information policy: demand-driven
            g. can render the system unstable at high load
    2. Receiver Initiated: A free system searches for an overloaded system, and asks
        the system to transfer load:
            a. If a task departure drops node queue length below T, the node becomes a
                   receiver
            b. Under loaded node tries to obtain task from overloaded node
            c. Initiate search for sender either on a task departure or after a
                   predetermined period
            d. information policy: demand-driven
            e. remain stable at high and low loads
            f. Drawback: most transfers are preemptive and expensive, especially if
                   round-robin CPU scheduling is employed at nodes.
    3. Symmetric: Depending on the load of the overall cluster, either the sender-
        initiated or receiver-initiated method is used.
            a. senders search for receivers:
                    i. successful in low load situations
                    ii. high polling overhead in high load situations
            b. receivers search for senders:
                    i. useful in high load situations
                    ii. preemptive task transfer facility is necessary

    Choosing an appropriate strategy

Having reviewed the earlier policies, we primarily choose a sender-initiated policy, since:

   1. We assume the cluster itself isn't heavily loaded
   2. Using a symmetric policy would lead to a higher overhead, especially since the
       cluster is expected to primarily be in light to moderate load.

The policy is:

   1. We use a round robin policy to initially get an approximate measurement of the
       loads on the systems. Each node is responsible for sending a periodic “ping” of its
       current minimum distribution metric. The nodes cache this information to gain an
       idea of the load of systems in the locality.
   2. We use a sender-initiated policy when a process must be migrated. The initial
       cached measurements are then used to direct a probe message through the
       network to gain a node with minimum distribution metric to perform the process
       migration on.

   5.2.    Hierarchical Load Balancing

In hierarchical load balancing, we embed a tree in the graph. Then, each node caches the
distribution metric that it obtains from its parent and all its children. This data is then
compared with its own system metric to determine the distribution metric of the node.

The structure of the network can be represented as:

                         Figure 8 Topology of Hierarchical Load Balancing

   5.2.1. The Basic Algorithm

The basic algorithm for a particular node may be summarized as:

   1. For the current node, find the parent node and all the child nodes
   2. Keep listening to all the child nodes for pings.
           a. If any child sends a ping, then update the local table
           b. For each ping received, store in local table: The ping received (the
               dist_metric of child) + constant (to indicate overhead and network delay)
   3. Periodically, check the current system metric
           a. If current system metric is lower than the highest stored distribution metric
               of the child nodes (current system load is higher than that of the child),
               then the current distribution metric = Highest distribution metric of child
           b. If the current system metric is higher than the highest stored distribution
               metric of the child nodes (current system load is lower than that of any
               child), then the current distribution metric = system metric
   4. Send the current distribution metric to the parent.
   5. Go to step 2

   5.2.2. Embedding the Directed Acyclic Graph

One of the first steps in using this algorithm is to embed a tree within a particular graph.
We see the motivation behind this is:

   1. There are no cycles, and hence no possibility of a distribution metric of a node
         being pinged to itself
   2. There exists a hierarchical structure, in which every node is connected to a parent.

Now, we see that all communications in the algorithm are performed only in a certain
direction. We also know that every time a node's information propagates, it gains a
“constant” addition, and hence we should aim to maximize the number of links between
nodes.

Thus, to achieve the following objectives, we can instead use a Directed Acyclic Graph:

   1. There must exist no cycles in the embedding
   2. There must exist an ordering of nodes, such that each node has a parent. We can
       modify the algorithm such that a node could have multiple parents (which might
       reduce efficiency of pings, but improve process migration)
   3. To maximize load sharing, we must introduce the maximum number of edges.

All 3 requirements are met by using a Directed Acyclic Graph instead of a Tree.

For example, consider the following cluster:

                           Figure 9 Sample Graph to show Embedding

We see that the corresponding tree could be:

                           Figure 10 Example Tree to show Tree Embedding

We also see that if a DAG is drawn, the number of edges is increased, while maintaining
the characteristics of the tree.

                   Figure 11 Example Directed Acyclic Graph to show DAG embedding

Hence, the initial problem of finding an embedded tree transforms to finding an
embedded Directed Acyclic Graph.

   Properties of DAG

We see the following needs to be proved:

   Existence of DAG

Theorem: For every undirected graph, we can find a corresponding directed acyclic
graph.

Proof: Consider a graph containing 2 vertices and an edge. We can trivially find a DAG
embedding on this graph by directing the edge either way. This is the base case.

Consider a graph G with 'k' vertices, and assume it contains a DAG embedding. Then, let
us add another single vertex 'k+1'. Now, for every edge between a vertex of G and vertex
'k+1', add a directed edge from that vertex of G to 'k+1'. All the new edges are
directed towards vertex 'k+1', so no path leaves 'k+1' and no cycle can pass through it.
Moreover, all the edges that were attached to 'k+1' in the graph are accounted for in
the new DAG. Thus, the newly formed DAG corresponds to the graph G with vertex 'k+1'
added.

Thus, by mathematical induction, we conclude that for every undirected graph, a
corresponding DAG with the same number of edges can be found.
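
The induction is constructive: if the vertices are taken in any fixed order and every edge is directed from the earlier vertex to the later one, the result is a DAG with the same number of edges. The Python sketch below illustrates this on the example pairings used earlier. The names `orient_edges` and `is_acyclic` are hypothetical, and the acyclicity check uses Kahn's algorithm, which the report itself does not mention.

```python
def orient_edges(edges, order):
    """Direct every undirected edge from the earlier to the later vertex in `order`."""
    rank = {v: i for i, v in enumerate(order)}
    return [(a, b) if rank[a] < rank[b] else (b, a) for a, b in edges]


def is_acyclic(vertices, directed_edges):
    """Kahn's algorithm: the graph is a DAG if every vertex can be removed."""
    indegree = {v: 0 for v in vertices}
    for _, dst in directed_edges:
        indegree[dst] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for src, dst in directed_edges:
            if src == v:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(vertices)


edges = [("A", "B"), ("B", "C"), ("D", "C"), ("E", "B"), ("D", "B")]
dag = orient_edges(edges, order="ABCDE")
print(dag)
print(is_acyclic("ABCDE", dag), len(dag) == len(edges))  # True True
```

This needs a global vertex order, which is exactly what is hard to obtain in a peer-to-peer network without a traversal.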

   5.2.3. Modifications to the algorithm for DAG embedding

Using an embedded DAG leads to the new algorithm:

   1. For the current node, find all the in edges and out edges.
   2. Designate all nodes that are connected to the current node by in edges as parent
       nodes. Designate all nodes that are connected to the current node by out edges as
       child nodes.
   3. Keep listening to all the child nodes for pings.
           a. If any child sends a ping, then update the local table
           b. For each ping received, store in local table: The ping received (the
               dist_metric of child) + constant (to indicate overhead and network delay)
   4. Periodically, check the current system metric
           a. If current system metric is lower than the highest stored distribution metric
               of the child nodes (current system load is higher than that of the child),
               then the current distribution metric = Highest distribution metric of child
           b. If the current system metric is higher than the highest stored distribution
               metric of the child nodes (current system load is lower than that of any
               child), then the current distribution metric = system metric
   5. Send the current distribution metric to all the parent nodes.
   6. Go to step 3

   5.2.4. Embedding a DAG

The main obstacle with hierarchical load balancing is the need to embed a directed
acyclic graph within the graph. We have proved that within a given graph, there must
exist an embedded DAG; however, we cannot get the DAG directly from the given graph
without traversing the entire graph.

However, traversing the entire graph requires every node in the graph to be visited.
Thus, as the number of nodes in the network increases, the delay for getting the
embedded DAG also increases. Moreover, to get the embedded DAG, all systems in the
cluster must be visited individually and linearly. Thus, to maintain scalability, it is
not possible to traverse the entire graph.

We must, hence, come up with solutions to create an embedded DAG without traversing
the graph. The following options were investigated:

    1. Naming based on “link” position: Initially, when there are only 2 nodes in the
        network, one node arbitrarily becomes the parent, and the other a child. The child
        is named with respect to a parent. For example, if A is the parent, and B is the
        child, then B would be named A.B. If a new parent C was added to B, then B would
        be renamed to A-C.B.
            a. If a new node is added, add it as a child to the cluster node and give it
                an appropriate name.
            b. If a new edge is added between 2 nodes of the same cluster, then check the
                list of parents, and find if one is an ancestor of another. If one node is
                an ancestor of another, then add the directed edge accordingly. If no such
                relationship exists, then arbitrarily decide a parent and a child, and
                rename the child accordingly.
            c. The disadvantage of this convention is that, to maintain coherency,
                whenever a node is added, the change must be propagated downwards in the
                tree. This is equivalent to traversal.
            d. While most of the time coherency would not be necessary, in a few
                cases, especially when connecting an ancestor and a descendant,
                inconsistencies could lead to cycles.
    2. Reverse “link” position: This approach basically reverses the earlier approach,
        with the naming of the parent changing, depending on the name of the child. This
        change would reduce the amount of time required for propagation of information
        if coherency of node information is to be maintained. However, the worst case
        scenario of this approach is also O(n), which is essentially the same as
        traversal.
    3. Subset naming: One of the problems of assigning static names is that when
        changes higher up (or lower down) in the DAG are made, the changes must be
        propagated to achieve coherency amongst nodes. An alternative approach is to
        assign a particular “namespace” to a node. This is much like the DNS on
        the internet. However, in DNS, the hierarchy leads to an imbalance in load, since
        the higher level servers are accessed more often. An alternative to using
        dynamically assigned namespaces is to instead use static namespaces.

        In particular, we assume a domain of 1-100000 for the entire cluster. When there
        are 2 systems in the cluster, the root node assumes the entire domain, and assigns
        a part of it to the other child node. For example, consider A and B are connected
        initially, and A is arbitrarily decided to be the root. Then A will be in control
        of the domain 1-100000 while B is in control of a subset of this domain (assigned
        by A), such as from 50000-88000. All the children are then assigned subsets of
        these domains. When a new node joins the cluster, it is assigned a subset
        dynamically by its parent and it immediately becomes part of the cluster. When a
        connection is to be established between 2 nodes of the same cluster, the subsets
        are checked. If one node's set is a subset of the other node's set, then one node
        is an ancestor of the other, and the new edge is directed from the ancestor to the
        descendant.

        We see that this approach solves the main problem of propagation, since the sets
        are static, though assigned dynamically, and hence there is no need to propagate
        node information during joining/leaving the cluster to maintain coherency.
        However, we find that if we use this approach, we would not be able to identify
        clusters on different parts of the DAG, unless they are represented by discrete
        and disjoint sets. Maintaining the consistency such that one set is not a subset
        of another would require further traversals.
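
The subset check used by the third option can be sketched in Python as below. The ranges for A and B are the ones from the example; the range for C, the tuple representation, and the function names are assumptions made for illustration.

```python
def is_ancestor(a, b):
    """True if range b lies inside range a (and is not the same range)."""
    return a[0] <= b[0] and b[1] <= a[1] and a != b


def direct_edge(name_u, range_u, name_v, range_v):
    """Return (ancestor, descendant), or None when neither range contains the other."""
    if is_ancestor(range_u, range_v):
        return (name_u, name_v)
    if is_ancestor(range_v, range_u):
        return (name_v, name_u)
    return None


A = (1, 100000)     # root domain from the text
B = (50000, 88000)  # subset assigned to B in the text
C = (100, 40000)    # hypothetical sibling of B
print(direct_edge("B", B, "A", A))  # ('A', 'B')
print(direct_edge("B", B, "C", C))  # None
```

The `None` case is the situation described above: two nodes with disjoint ranges give no information about how to direct a new edge between them.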

   5.2.5. Conclusion about Hierarchical Algorithm

The Hierarchical Load Balancing algorithm offers a very good solution to distributed load
balancing within a true peer-to-peer network. The algorithm offers the following
characteristics:

       The load calculation is distributed almost equally amongst the nodes.
       There is sufficient redundancy to prevent a single point of failure.
       The algorithm can be easily implemented.
       The algorithm gets an approximate measure of the nodes within the cluster. This
       basic data is sufficient for further investigation during the probing phase.

   5.3.    Symmetric Undirected Hierarchical Algorithm

To solve the problem of embedding the DAG within the graph, we modified the heuristic
algorithm even further, so that forming the DAG is no longer required. This is
the algorithm that is used within the project.

   5.3.1. Overcoming DAG obstacles

We have seen that a DAG was chosen earlier because of certain characteristics that it
offered. Each of these constraints can be overcome without a DAG:

   1. There must exist no cycles in the embedding: This condition is required so that
       the data about a node does not reach itself. In the original algorithm, if this
       happens, there is a chance that the probe message goes in a cycle. We can
       overcome this restriction by making sure that even if there exists a cycle, the
       node's distribution metric obtained after the cycle is traversed cannot be better
       than the current node's system metric. This can be done by adding a sufficient
       constant every time a system metric is propagated between a pair of nodes.
   2. There must exist an ordering of nodes, such that each node has a parent: This was
       required to allow for hierarchical storing of data. However, we can store the data
       in a mesh. A node, instead of becoming a mediator between the top and bottom of
       a DAG, can become a simple node of communication between any 2 nodes on the
       graph. The node accepts pings from all its neighbors, and after it checks with its
       own system metric, it also sends the min distribution metric it stores to all the
       neighbors. Thus, a parent and child relation is replaced by a pure peer-to-peer
       relation.
   3. To maximize load sharing, we must introduce the maximum number of edges:
       No. of edges in DAG = No. of edges in the graph

   5.3.2. Algorithm

The algorithm for load balancing is:

   1. Keep listening to all the neighbor nodes for pings.
           a. If any neighbor sends a ping, then update the local table
           b. For each ping received, store in local table: The ping received (the
              dist_metric of neighbor) + constant (to indicate overhead and network
              delay)
   2. Periodically, check the current system metric
           a. If current system metric is lower than the highest stored distribution metric
              of the neighbors (current system load is higher than that of the neighbor),
              then the current distribution metric = Highest distribution metric of
              neighbors + constant
           b. If the current system metric is higher than the highest stored distribution
              metric of any of the neighbors (current system load is lower than that of
              any neighbor), then the current distribution metric = system metric
   3. Send the current distribution metric to all neighbors
   4. Go to step 1

The algorithm can more precisely be given by the pseudo code:

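//Lines marked (assumed) fill text that is truncated in the source
function main()                                              //(assumed)
      while(true)
      {
            updateTable();                                   //(assumed)
            //Service any process that a peer has migrated here
            while(process *p = get_incoming_process())
                  distribute(p);                             //(assumed)
            //Check if new process has come and service it
            while(process *p = get_new_process_from_local_scheduler())
                  distribute(p);                             //(assumed)
      }

function updateTable()
      //Get the updated values
      int child_metrics[];
      int i = 0, min = INFINITY;
      foreach(system c = get_child())
      {
            //Get each of child metrics
            child_metrics[i] = get_metric(c);
            if(child_metrics[i] < min)
                  min = child_metrics[i];
            ++i;
      }
      //Make decision based on child.
      if((min + OVERHEAD) < self_metric())
            node_metric = min + OVERHEAD;                    //(assumed)
      else
            node_metric = self_metric();                     //(assumed)

      int parent_metric = get_metric(get_parent());
      //Other than calculating Node_metric, all peers are equal
      store_peer_metrics(child_metrics[], parent_metric);

function distribute(process p)
      if(self_metric() > (OVERHEAD + get_process_overhead(p) +
                          get_metric(get_min_overhead_peer())))   //(assumed)
      {
            //Distribute process again only now
            system peer = get_min_overhead_peer();
            set_process_overhead(p, get_process_overhead(p) +
                                    OVERHEAD);               //(assumed)
            send_process(p, peer);
      }
      else
      {
            //Let local scheduler run it
            run_locally(p);                                  //(assumed)
      }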

The features of this algorithm:

       Every time a process migrates, an overhead is added to the process. This ensures
       that the process migration does not take place through too many levels
       By keeping a periodic update and an overhead with each migration, the possibility
       of process getting migrated cyclically is negligible.
       The hierarchical structure is maintained only for determining the metric of a
       particular node. A parent will contain the minimum of the metrics of its children.
       During the load balancing, all nodes are treated equally, and no bias is given to
       either parent or children.
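
The claim that a cycle cannot make a node look better than it really is can be checked with a small simulation. The Python sketch below follows the convention of the pseudo code, in which a smaller metric is preferred and OVERHEAD is added on every hop. The synchronous rounds, the value of `OVERHEAD`, the three-node cycle, and the metric values are all assumptions made for illustration.

```python
OVERHEAD = 50  # hypothetical per-hop penalty


def update_round(graph, self_metric, advertised):
    """One synchronous round: every node recomputes the metric it advertises."""
    new = {}
    for node, neighbors in graph.items():
        best_remote = min(advertised[m] + OVERHEAD for m in neighbors)
        new[node] = min(self_metric[node], best_remote)
    return new


graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}  # a 3-cycle
self_metric = {"A": 100, "B": 500, "C": 900}
advertised = dict(self_metric)
for _ in range(5):
    advertised = update_round(graph, self_metric, advertised)
print(advertised)  # {'A': 100, 'B': 150, 'C': 150}
```

Node A's own value never drops below 100 even though its metric travels around the cycle, because every hop adds the overhead before the value can return.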

    5.3.3. Load Balancing Policies

We discussed the generalized load balancing strategies. We shall now classify our
algorithm among these:

Sender Initiated v/s Receiver Initiated: In sender initiated policies, congested
nodes attempt to move work to lightly-loaded nodes. In receiver-initiated policies,
lightly-loaded nodes look for heavily-loaded nodes from which work may be
received. The sender-initiated policy performs better than the receiver-initiated
policy at low to moderate system loads, while at high system loads the
receiver-initiated policy performs better (6). We assume a lightly
loaded environment, as is the case in most household applications, and hence use
a sender-initiated policy.

                 Figure 12 Comparison of Load Balancing Strategies

Global v/s Local Strategies: In global policies, the load balancer uses the
performance profiles of all available workstations. In local policies workstations
are partitioned into different groups. The choice of a global or local policy
depends on the behavior an application will exhibit. For global schemes, balanced
load convergence is faster compared to a local scheme since all workstations are
considered at the same time. However, this requires additional communication
and synchronization between the various workstations; the local schemes
minimize this extra overhead. But the reduced synchronization between
workstations is also a downfall of the local schemes if the various groups
exhibit major differences in performance. In order
to obtain the best performance, we use a mixture of these two strategies. We
partition the network into various groups (each group consisting of a parent and
its children, with groups having multiple overlaps). At the end of each tick, the
group data is trickled to other groups, and thus the algorithm uses a global
strategy, with minimal overhead. The downside of this policy however, is that the
values of distribution metrics for nodes are not necessarily consistent on all the
nodes.

Centralized v/s Decentralized Strategies: A load balancer is categorized as either
centralized or distributed, both of which define where load balancing decisions
are made. In a centralized scheme, the load balancer is located on one master
workstation node and all decisions are made there. In a distributed scheme, the
load balancer is replicated on all workstations. For centralized schemes, the
reliance on one central point of balancing control could limit future scalability.
Additionally, the central scheme also requires an “all-to-one” exchange of profile
information from workstations to the balancer as well as a “one-to-all” exchange
of distribution instructions from the balancer to the workstations. The distributed
scheme helps solve the scalability problems, but at the expense of an “all-to-all”
broadcast of profile information between workstations. We again use a hybrid
approach, by using primarily a decentralized strategy for decisions. However,
rather than broadcasting, we divide the entire network into subsets, and
communications take place only within each subset, and between representatives
of each subset.


Probing is the procedure to find the exact node to which a process must be migrated,
before the process migration takes place. It forms the second step of process migration,
and uses the results of the symmetric undirected hierarchical load balancing algorithm to
direct the probe.
The aims of probing are:

          To get the node with the optimal load, to handle the process execution
          To allow for the transitivity property to hold: A process may be transferred from a
          source node to a destination node, where the source and destination need not
          necessarily be directly paired

Unlike the hierarchical load balancing algorithm, which works on periodic pings, the
probe algorithm is a sender-initiated strategy. The Probe message is sent just before the
process is going to start executing, to find an appropriate destination node to deliver the
process to.

   6.1.       Algorithm

The probe message itself is composed of 3 fields: P(I, J, K). These fields are:

          I: The source where the probe started (an overloaded system)
          J: The node from which the probe was sent
          K: The node to which the probe is sent

   6.1.1. Sending Probe

Initially, when a process is about to start in the system I, the system performs the
following steps:

   1. Get the current system metric, and compare it with highest distribution metric of
       the neighbor.
   2. If the current system metric is greater, then execute the process locally on the
       system itself.
   3. If the highest distribution metric of the neighbor is higher, then first get the
       neighbor with the highest distribution metric.
   4. Send a probe message to the neighbor.
   5. Wait for a reply. If a reply arrives, store it.

This procedure sends a probe to the nearest neighbor if the process should (and can) be
migrated. The probe message is then propagated across the network.
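
The decision in steps 1 to 4 can be sketched in C as below. This is only an illustration: the `neighbor` structure and the function name `pick_probe_target` are hypothetical, and treating equal metrics as "run locally" is an assumption, since the steps above do not say what happens on a tie.

```c
#include <stddef.h>

/* Hypothetical view of one neighbor as seen by node I. */
struct neighbor {
    long ipaddr;       /* neighbor address */
    int  dist_metric;  /* last distribution metric received from it */
};

/* Returns the neighbor that should receive the probe, or NULL when the
   process should run locally (steps 1 to 3 of the list above). */
const struct neighbor *pick_probe_target(int sys_metric,
                                         const struct neighbor *nbrs,
                                         size_t count)
{
    const struct neighbor *best = NULL;

    /* Find the neighbor with the highest distribution metric. */
    for (size_t i = 0; i < count; i++)
        if (best == NULL || nbrs[i].dist_metric > best->dist_metric)
            best = &nbrs[i];

    /* No neighbor, or the local metric is at least as high: run locally.
       (Equal metrics are treated as local; this tie rule is assumed.) */
    if (best == NULL || sys_metric >= best->dist_metric)
        return NULL;

    return best;   /* step 4: the probe goes to this neighbor */
}
```

A caller would send the probe only when the return value is non-NULL, and execute the process locally otherwise.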

   6.1.2. Propagating the Probe

Consequently, the algorithm for handling probe messages P(I, J, K) is:

   1. Consider that node K receives the probe P(I, J, K).
   2. Check if the current system metric is greater than the distribution metric obtained
       from any neighbor.
   3. If the system metric of K is higher, then send back the probe reply PR(K, I,
       sys_metricK). Go to end.
   4. If the system metric is lower, get the neighbor with the highest distribution metric.
       Let this neighbor be M.
   5. Send the probe P(I, K, M) to the neighbor M. Go to end.

This algorithm handles a probe message that arrives at a node. The node checks whether it
is sufficiently idle to handle the process (and replies to the probe if it is); otherwise
it forwards the probe to the next neighbor.
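
A minimal sketch of this handling step is given below, assuming node addresses are stored as plain `long` values. The names `struct probe`, `enum probe_action` and `handle_probe` are hypothetical and only illustrate how the fields I, J and K change from hop to hop; the actual sending of messages is left out.

```c
/* Probe P(I, J, K) as described in 6.1. */
struct probe {
    long source;  /* I: overloaded node where the probe started */
    long from;    /* J: node that sent this hop */
    long to;      /* K: node receiving this hop */
};

enum probe_action { PROBE_REPLY, PROBE_FORWARD };

/* Decide what node K does with an incoming probe.
   self_metric:     K's current system metric.
   best_nbr:        neighbor M with the highest distribution metric.
   best_nbr_metric: that neighbor's distribution metric.
   On PROBE_FORWARD, *next is filled with P(I, K, M). */
enum probe_action handle_probe(const struct probe *in, int self_metric,
                               long best_nbr, int best_nbr_metric,
                               struct probe *next)
{
    /* Step 3: K is the better host, so it answers the source I.
       (Equal metrics are treated as "reply"; this tie rule is assumed.) */
    if (self_metric >= best_nbr_metric)
        return PROBE_REPLY;

    /* Steps 4 and 5: pass the probe on to M. */
    next->source = in->source;  /* I is preserved across hops */
    next->from   = in->to;      /* K becomes the sender */
    next->to     = best_nbr;    /* M is the next receiver */
    return PROBE_FORWARD;
}
```

Because the source field I never changes, the node that finally replies can address the reply directly to the originating node, which is what gives the transitivity property listed among the aims of probing.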

    6.1.3. Receiving probe reply

The source sends the probe message before the process is actually created (by hooking
fork()). It is important that process creation at the source is not blocked by the probe
message, since the probe message propagation takes place across the network and
blocking would induce extra overhead in the process execution. Instead, we perform
the following overall steps:

    1. When a process is about to be executed on a node I, send the probe message
        (based on the algorithm outlined in 6.1.1).
    2. Do not wait for the probe reply. Continue normal execution.
    3. If a probe reply is received, then store the value in probe_reply. Also get the time
        the probe reply arrived. Add this to a constant integer “Freshness” and store this value
        in probe_reply.
    4. When a process is executed:
            a. If the probe_reply variable is set, check if the time stored is greater than
                the current time.
                    i. If the time stored is greater or equal, then perform the process
                        migration.
                   ii. If the time stored is lesser, then the probe reply has expired. In this
                        case, the process is executed locally.
            b. If probe_reply is not set (yet), execute the process locally.

As we see, process creation and probing take place asynchronously. The
“Freshness” constant determines for how long the probe reply's data is still
considered fresh. If much time has passed between the probe reply and a process
creation, then the data retrieved from the probe would no longer be consistent with the
current state of the target system. Thus, the “freshness” constant determines how long the
data is assumed to be reasonably close to the load on the target system.
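
The freshness check in steps 3 and 4 can be sketched as follows. The structure layout, the function names and the value of `FRESHNESS` are assumptions made for illustration; the report does not state the value that was used.

```c
#include <time.h>

#define FRESHNESS 2   /* seconds; illustrative value only */

struct probe_reply {
    int    is_set;    /* non-zero once a reply has been stored */
    long   node;      /* node that answered the probe */
    time_t expires;   /* arrival time + FRESHNESS */
};

/* Step 3: record a reply together with its expiry time. */
void store_probe_reply(struct probe_reply *r, long node, time_t arrived)
{
    r->node    = node;
    r->expires = arrived + FRESHNESS;
    r->is_set  = 1;
}

/* Step 4: returns 1 to migrate to r->node, 0 to execute locally. */
int should_migrate(const struct probe_reply *r, time_t now)
{
    if (!r->is_set)
        return 0;               /* 4b: no reply has arrived yet */
    return r->expires >= now;   /* 4a: migrate only while the reply is fresh */
}
```

With `FRESHNESS` set to 2, a reply stored at time 100 allows migration for processes executed at times 100 to 102 and is ignored from time 103 onward.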

   6.2.    Effect of Parameters on Probing

Various parameters determine the probing and the final process migration:

   1. The “freshness” constant: A large freshness constant might lead to incoherency. A
       node that sent a probe reply could become heavily loaded, but this would not be
       reflected in the probe_reply variable. This could lead to extra process
       migration, with the possible result of migrating the process indefinitely.
       Similarly, if the freshness constant is too low, then the probe reply would expire
       too soon, and almost no process would be migrated. Thus, an adequate freshness
       constant is necessary.
   2. Probe message processing priority: The priority of processing probes is also very
       important. Probe messages must take priority over the processing of
       pings and the establishing of connections with new nodes. However, the priority of
       probe message handling should still be lower than that of process creation and loading.

Process migration is the stage that performs the actual migration of the process from a
source to a destination node. At the time when the process migration is to be performed,
the following details are known:

   1. The source node, on which the process originates
   2. The destination node, to which the process is migrated
   3. The system metric of source node and distribution metric of destination node

We see that there are 2 types of process migrations:

       Preemptive process migration: A process that is currently running in memory is
       paused and saved. It is then migrated across to the destination node. There it is loaded
       into memory, and resumed.
       Non-preemptive process migration: Process migration takes place only before
       the process is created on the source node. The process is directly sent across to the
       destination node, and it is executed there.

We have discussed these 2 approaches earlier, and we implement the 2nd
approach, due to its lower overhead.

A further investigation into process migration requires investigation into the operating
system itself. In our case, we have investigated a standard Linux installation.

    7.1.   Process Creation

Process creation in Linux takes place in 2 stages:

    1. fork(): The current process first calls a fork(). This function is responsible for
       creating a new process. A fork() creates a new blank process, and in it, copies the
       content of the existing process.
    2. execve(): The execve() is used to load a particular executable into the process
       space. It takes an argument of the program to execute.

Process creation hence typically looks like this:

pid_t pid = fork();          // Create a new process
if(pid == 0){                // If this is the new process
       // Load the program
       execve(PROGRAM_PATH, argvs, envs);
}

    7.1.1. Intercepting fork()

fork() creates a child process that differs from the parent process only in its PID and
PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals
are not inherited. Under Linux, fork() is implemented using copy-on-write pages, so the
only penalty that it incurs is the time and memory required to duplicate the parent's page
tables, and to create a unique task structure for the child.

On success, the PID of the child process is returned in the parent's thread of execution,
and a 0 is returned in the child's thread of execution.

We see that if we intercept fork() and perform process migration at fork(), then we must
migrate the entire parent process that is currently loaded in memory. To implement this,
we must stop the process that is currently running in memory, an approach similar to
preemptive process migration.

Instead, we notice that a fork() is usually called before an execve(). Thus, if we
perform process migration during execve(), we can hook the fork() call to send the probe
message.
   7.1.2. Intercepting execve()

int execve(const char *filename, char *const argv[],
             char *const envp[]);

execve() executes the program pointed to by filename. argv is an array of argument
strings passed to the new program. envp is an array of strings, conventionally of the form
key=value, which are passed as environment to the new program. Both argv and envp
must be terminated by a null pointer.

execve() does not return on success, and the text, data, bss, and stack of the calling
process are overwritten by that of the program loaded. The program invoked inherits the
calling process's PID, and any open file descriptors that are not set to close-on-exec.
Signals pending on the calling process are cleared.

We can hence intercept the execve() call, since it is this call that loads a program into
memory.
   7.2.    Intercepting System Calls
Let us consider a program that is run through the bash shell. We see that the process is
created and loaded in a variety of steps. In order to hook onto the process creation
functions, we must first study the points at which we can intercept the process creation.
The points of interception are:

   7.2.1. Bash

The bash sources reveal that the creation of a process is caused by the lines:

if (nofork && pipe_in == NO_PIPE && pipe_out == NO_PIPE)
     pid = 0;
else
     pid = make_child (savestring (command_line), async);
/* ... */
exit (shell_execve (command, args, export_env));

We can hence intercept the call to make_child(), so that a probe is sent before a child is
made. The call to shell_execve() can be hooked to test and facilitate process migration.

This approach has the following advantages and disadvantages:

            o No modifications required to be done to the kernel
            o It would allow for migration of processes created explicitly by the users
            o Changes are made only to bash shell. Using any other shell, or any other
                interface could negate the changes.
            o Processes created by processes (other than bash) cannot be migrated. This
                severely limits applicability of process migration

The make_child() function of bash in turn calls the fork() function. Similarly,
shell_execve() calls the execve() function. These functions are defined in libC.

   7.2.2. GlibC

GlibC is the C standard library released by the GNU project. It is the standard C
library in Linux, and provides abstraction over the kernel system calls. The functions
fork() and execve() are defined in libC. When the fork() and execve() functions are
called in the bash, the corresponding fork() and execve() functions are executed in the
libC library. These functions process and parse the arguments before calling the
corresponding system calls.

Since libC is used to create most of the processes on the system, intercepting fork() and
execve() in libC would widen the scope of the changes. We see the advantages and
disadvantages of making changes in libC:

           o No kernel modifications required
           o Changes to libC can be made locally. libC can be statically linked, and
               thus changes to libC are far easier to test.
           o No kernel modules are needed for additional support
           o Might be less efficient than changes in kernel
           o Would not function for all processes, since a process may call the system
               call directly.

The libC fork() and execve() in turn perform the system calls to the kernel, using either
software interrupts or the sysenter/sysexit mechanism on Pentium II and later processors.

   7.2.3. Kernel System Calls Interface

The kernel handles all system calls using small stubs, which are part of the kernel itself.
The fork() and execve() system calls also have corresponding stubs. These stubs in turn
call the actual handlers in the kernel which do the processing of fork() and execve(). It is
possible to intercept the process creation at the interrupt handlers.

           o Easier than modifying the do_fork() and do_execve() implementations (kernel functions)
           o Cannot avail of further interrupt handling functions, which will be
               required for sending probes, etc.
           o Changes are made to stubs, which are supposed to run fast

   7.2.4. Kernel functions

The fork() is finally processed in the kernel by the do_fork() function. Similarly, execve()
is processed in the kernel by the do_execve() function. It is possible to modify these
functions themselves to account for process migration.

           o Most efficient solution
           o Ability to better analyze the ELF header, if the process migration is to be
               carried out in a restricted environment
           o Involves directly modifying the kernel code
           o Changes made are system wide and hard to recover from

We've mentioned that modifications to the kernel, though efficient, are usually
unnecessary. This is because modifications to the kernel have the following disadvantages:

   1. External modifications to the kernel are hard to commit to the upstream version
   2. Programming in kernel requires learning a completely new set of functions for
       performing tasks.
   3. Bugs in kernel have a system wide effect
   4. Making changes to the kernel and testing them is much harder. Testing a kernel
       involves not only compiling the kernel, but also installing it. Thus carrying out unit
       tests becomes very tough.

   7.2.5. Conclusion on Interception Points

After studying the 4 possible points of interception, we chose to hook into the libC library.
This is because:

   1. Changes made are generalized enough to allow for process migration for most
       processes running on the system
   2. Changes are made in the user mode, not the kernel mode
   3. It is possible to compile the libC library and link it statically. Thus, it is very easy
       to carry out unit tests on the changes.
   4. Mechanisms offered by the kernel, such as shared memory, can easily be used.

   7.3.    Intercepting __libc_fork()

The fork() function is actually a weak alias for __libc_fork(), which is responsible for
forking in libC. The function source can be found in Appendix I.

We see that the __libc_fork() function first executes all the handlers associated with the
forking process. It then synchronizes the thread and then calls the system call to fork. It
then executes handlers post fork for both, the parent and the child process.

As decided earlier, the hook in __libc_fork() simply sends a probe message. In order
to do this, we modify __libc_fork() so that it first constructs a UDP probe and then sends it to
the nearest neighbor.

   7.3.1. Implementation

Sending the UDP packet during the fork() itself would lead to an overhead
while forking. To decrease this overhead, we carry out the sending of the probe message
asynchronously with the fork().

We do this by creating a simple user-mode application, called “Listener”. When a fork()
is called, it sends a signal to the listener process. The listener process, in turn, is
responsible for sending the probe message. It is also responsible for getting the probe
result, and putting the probe_reply, along with the freshness, in a shared variable.
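
The signalling between the fork() hook and the listener can be sketched as below. The report does not name the signal that is used, so `SIGUSR1` is an assumption, as are the function names; the handler only sets a flag, and the listener's main loop would act on that flag by sending the probe.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>

/* Set by the handler; the listener's main loop would poll and clear it. */
static volatile sig_atomic_t probe_requested = 0;

static void on_probe_request(int signo)
{
    (void)signo;
    probe_requested = 1;   /* keep the handler async-signal-safe */
}

/* Listener side: register the handler once at start-up.
   Returns 0 on success, -1 on failure. */
int install_probe_handler(void)
{
    struct sigaction sa;
    sa.sa_handler = on_probe_request;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* fork() side: ask the listener to send a probe, without waiting for it.
   listener_pid would be read from shared memory. */
int notify_listener(pid_t listener_pid)
{
    return kill(listener_pid, SIGUSR1);
}
```

Since kill() returns as soon as the signal has been queued, the cost added to fork() is a single system call, and the network traffic happens entirely in the listener.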

   7.4.    Intercepting __execve()

The execve() function is actually a weak alias for the function __execve() located in
GlibC. The __execve() function is a very simple abstraction of the execve() system call.
The function takes the arguments. If libC is compiled with bounded pointers, its
arguments are first checked for bounding. It then simply calls the system interface. The
source can be found in Appendix I.

We see that __execve() must be responsible for the actual process migration. A
process calls execve() only when it is about to be overwritten by the
new program that is loaded in its place. Thus, it is not necessary to make this call
non-blocking when process migration is the suitable option.

   7.4.1. Implementation

When execve() is called and process migration begins, the shared variable
probe_reply is checked first. It is checked for freshness, and the node to send to is obtained.
Then, this function itself gets the program that is to be loaded into memory, and sends it
over UDP to the destination host. At the destination, a listener user process is
running, which receives the program, performs a fork() and
then an execve() of the downloaded program.

   7.4.2. Further hooks after process migration

After the process migration, the process which was created by fork() still runs. This
process is used to act as a stub to the process executing remotely. In this new process, a
new stub process, pmhandler, is loaded, using the __execve(). This stub will be
responsible for:

       Listening for any request for input from the migrated application
       Collecting input from user if needed by the migrated application
       Collecting information about local files on a system
       Sending the collected information remotely
       Listening for any output that is made available by the migrated application
       Listening for quit command when the migrated process stops execution. End the
       current stub on this as well.

     7.5.   Implementation of Listener

The listener application is a user-mode application that is an integral part of the project.
As we've seen, in order to minimize the overheads of load balancing and process
migration, load balancing and the handling of incoming processes are carried out
asynchronously. This is done by creating a user-mode process called the “listener”, which
is responsible for the following:

        Periodically sending the ping to all neighbors
        Collecting ping data from all the neighbors
        Finding out the highest distribution metric amongst the neighbors
        Propagating and replying to probe messages
        Accepting process transfers and launching the corresponding processes

The listener process can hence be described by the following C code. The helper
routines are not shown, and their prototypes are assumed here:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define FRIENDLIST "ffip.txt"
#define PING_PORT 4123
#define PINGDELAY 5
#define PROBE_PORT 4124
#define PM_PORT 4125

typedef struct friendList{
        int sock;
        long ipaddr;
        int load;
        struct friendList *next;
} friendList;    //Linked list
friendList* g_friendlist;

friendList* g_minFriend;
int g_minLoad;

//Helpers defined elsewhere in the listener (prototypes assumed)
long* loadFriendsIP(const char *file);
int openListener(int port);
int openUdpListener(int port);
void fillFD(fd_set *set, int listener, int probe, int pm);
void addFriend(int sock, struct sockaddr_in addr);
int getSystemLoad(void);

int main(void){
      long* ipList;
      ipList = loadFriendsIP(FRIENDLIST);
      int listenerSocket = openListener(PING_PORT);
      int probeListener = openUdpListener(PROBE_PORT);
      int pmListener = openUdpListener(PM_PORT);
      //Populate the fd_set
      fd_set fdsocks;
      fillFD(&fdsocks, listenerSocket, probeListener, pmListener);

      time_t rawtime, prevtime = 0;
      struct timeval timeout;

      while(1){      //Outer loop assumed: one pass per ping period
             time(&rawtime);
             while((rawtime - prevtime) < PINGDELAY){
                     //select() may modify the timeout, so reset it every time
                     timeout.tv_sec = 0;
                     timeout.tv_usec = 500;
                     int selectRes = select(FD_SETSIZE, &fdsocks, 0, 0, &timeout);
                     if(selectRes > 0){
                            //Probe check
                            if(FD_ISSET(probeListener, &fdsocks)){
                                  //Get nearest neighbor stats
                            }
                            //Check for incoming process
                            if(FD_ISSET(pmListener, &fdsocks)){
                                  //Get process and run
                            }
                            //Check listener socket
                            if(FD_ISSET(listenerSocket, &fdsocks)){
                                  struct sockaddr_in claddr;
                                  socklen_t claddrlen = sizeof(struct sockaddr_in);
                                  int newcon = accept(listenerSocket,
                                        (struct sockaddr*)&claddr, &claddrlen);
                                  addFriend(newcon, claddr);
                            }
                            //Check and update friends
                     }
                     fillFD(&fdsocks, listenerSocket, probeListener, pmListener);
                     time(&rawtime);
             }
             prevtime = rawtime;

             //Calculate the current load
             int load = getSystemLoad();
             if(g_minLoad >= 0){
                   //Some other neighbor with load. Compare
                   if(g_minLoad > load){
                         //Replace load
                   }
             }
      }
      return 0;
}

     7.5.1. Shared Memory Structures

In order to share data between the listener and the fork()/execve() libC functions, shared
memory is used. This shared memory is obtained from the kernel using the shmget() call. It
contains:

     1. The pid of the listener process (so fork() can send signals)
     2. Whether a neighbor node with a distribution metric higher than the current
        system's sysmetric exists.
     3. The IP of the neighbor with the highest distribution metric. (0 if 2nd field is false)
     4. Whether the probe_reply is set
     5. The IP of the probe_reply node
     6. The Freshness time of the probe_reply

Thus, this can be represented as:

struct shared_mem{
        pid_t listener_pid;
        char is_highneighbor;          /*0 or 1*/
        long ip_neighbor;              /*4 byte long. atomic*/
        char is_probe_reply;           /*0 or 1*/
        long ip_probe_reply;           /*4 byte long*/
        time_t freshness_probe_reply;
};
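
A minimal sketch of how such a segment could be created and mapped with the System V calls is shown below. The function names, the permission bits and the way the key is chosen are assumptions; the structure is repeated only so that the example is self-contained.

```c
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <time.h>

struct shared_mem {
    pid_t  listener_pid;
    char   is_highneighbor;        /* 0 or 1 */
    long   ip_neighbor;
    char   is_probe_reply;         /* 0 or 1 */
    long   ip_probe_reply;
    time_t freshness_probe_reply;
};

/* Create (or open) the segment and map it. Returns NULL on failure.
   key: IPC key agreed between the listener and the libC hooks
   (hypothetical); IPC_PRIVATE can be used for a single-process trial. */
struct shared_mem *attach_shared_mem(key_t key, int *shmid_out)
{
    int shmid = shmget(key, sizeof(struct shared_mem), IPC_CREAT | 0600);
    if (shmid < 0)
        return NULL;

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1)
        return NULL;

    *shmid_out = shmid;
    return (struct shared_mem *)addr;
}

/* Unmap the segment and mark it for removal. Returns 0 on success. */
int release_shared_mem(struct shared_mem *mem, int shmid)
{
    if (shmdt(mem) != 0)
        return -1;
    return shmctl(shmid, IPC_RMID, NULL);
}
```

In this arrangement the listener would create the segment and store its own pid in listener_pid, while the hooked fork() and execve() would only attach to it and read.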

A major problem with accessing of shared memory is synchronization between 2
accesses. One approach is to use synchronization mechanisms such as mutexes or
semaphores. However, using such synchronization mechanisms could possibly lock one
of the processes from accessing the shared memory. This would induce the possibility
that the fork() and execve() be blocked indefinitely, which must not arise.

Thus, we instead use carefully crafted code, which does not synchronize the shared
memory. There exists only one writer for the shared memory: The listener.

We know that setting a variable is an atomic operation in the CPU. Thus, we use 2 flags
within the shared memory itself as mutexes: is_highneighbor and is_probe_reply. The
ip_neighbor and probe_reply fields are respectively the critical sections of the memory. Thus,
we apply the same principle as a mutex to achieve synchronization. However,
fork() and execve() don't block on these mutexes. Instead, if a critical section is in use, the
function behaves as if the critical section does not exist, and continues to take steps toward local
execution.
       8. RESULTS

To test the project system, we first create a test suite of programs that are to be run. This
suite is listed in Appendix II.

The host node, A, is then overloaded so that each process gets fewer jiffies.
The programs are then first run on this node directly, without linking with the
process migration library.

         Brute: Runs in 7405 ms
         Prime: Runs in 6782 ms
         HelloWorld: Runs in 0052 ms

Now, another node, B, is added, and is not loaded. Then, process migration is allowed to
continue. The approximate execution times for the processes being migrated are:

         Brute: Runs in 6755 ms
         Prime: Runs in 6293 ms
         HelloWorld: Runs in 0842 ms

Hence, we see that the project has improved the execution times of the CPU intensive
processes (Brute and Prime) by roughly 7 to 9%.

For HelloWorld, however, the execution time increases from 52 ms to 842 ms because of
the overhead of migration. This is because the program is IO intensive, and thus there exists a high
overhead of communication between the source and destination node for IO transfers.
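
The per-program change can be checked from the timings listed above with a small helper such as the one below. The function name is hypothetical, and the formula is the usual relative reduction in run time.

```c
/* Percentage reduction in run time; a negative result means the
   program ran slower after migration. */
double improvement_pct(double before_ms, double after_ms)
{
    return (before_ms - after_ms) / before_ms * 100.0;
}
```

Applied to the measurements above, this gives about 8.8% for Brute (7405 ms to 6755 ms) and about 7.2% for Prime (6782 ms to 6293 ms), while the result for HelloWorld (52 ms to 842 ms) is negative.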





               Figure 13 Results of Process Migration (bar chart of execution times for
               Brute, Prime and HelloWorld; legend entry: No PM)


From the results, we see that there is a noticeable improvement on the execution time of
processes in a heavily loaded system. We hence conclude that the system achieves the
following goals:

       Process migration successfully takes place from a heavily loaded node to a lightly
       loaded node
       The load balancing is performed with a dynamic and distributed 2 step algorithm
       The measurement of the load of a system based on the system metric does give an
       adequate picture of the expected execution time on the system.
       Probing of target nodes is successfully done with a freshness parameter that
       sufficiently balances the migration possibility and coherency.
       The system has no single point of failure
       The process migration leads to an improvement in performance for CPU intensive
       applications
       Process creation calls have been properly hooked.
       There is no noticeable difference in the usability of the operating system.

   9.1.    Scope for Further Studies

The process migration solution that was devised offers a base on which improvements
can be made. These could be:

       Introducing process preprocessing: This involves parsing the program before it
       is migrated. A program can be analyzed for how many IO operations are required,
       and hence can be classified as a CPU intensive or IO intensive application. We
       saw that process migration benefits CPU intensive applications more than IO
       intensive applications.
       Piggybacking and Caching: In our project, the outputs of the migrated process were
       cached for as long as necessary. However, further caching could be made possible by
       piggybacking. By preprocessing the program before execution, a list of possible
       IO accesses could be compiled, and these files and inputs could be piggybacked when
       migrating the process.
       Wider system call support: Our project was meant to be a proof of concept of an
       idea. The number of system calls that were modified was limited to demonstrate
       the viability of the solution. More system calls could be modified to make the
       project more practical.
       IO cache sharing for Migrated Processes: Multiple processes that have been
       migrated could use piggybacked messages to transfer IO.


Appendix I: Intercepted functions of glibc


pid_t
__libc_fork (void)
{
  pid_t pid;
  struct used_handler
  {
    struct fork_handler *handler;
    struct used_handler *next;
  } *allp = NULL;

  /* Run all the registered preparation handlers.  In reverse order.
     While doing this we build up a list of all the entries.  */
  struct fork_handler *runp;
  while ((runp = __fork_handlers) != NULL)
    {
      /* Make sure we read from the current RUNP pointer.  */
      atomic_full_barrier ();

      unsigned int oldval = runp->refcntr;

      if (oldval == 0)
        /* This means some other thread removed the list just after
           the pointer has been loaded.  Try again.  Either the list
           is empty or we can retry it.  */
        continue;

      /* Bump the reference counter.  */
      if (atomic_compare_and_exchange_bool_acq (&__fork_handlers->refcntr,
                                                oldval + 1, oldval))
        /* The value changed, try again.  */
        continue;

      /* We bumped the reference counter for the first entry in the
         list.  That means that none of the following entries will
         just go away.  The unloading code works in the order of the
         list.

         While executing the registered handlers we are building a
         list of all the entries so that we can go backward later on.  */
      while (1)
        {
          /* Execute the handler if there is one.  */
          if (runp->prepare_handler != NULL)
            runp->prepare_handler ();

          /* Create a new element for the list.  */
          struct used_handler *newp
            = (struct used_handler *) alloca (sizeof (*newp));
          newp->handler = runp;
          newp->next = allp;
          allp = newp;

          /* Advance to the next handler.  */
          runp = runp->next;
          if (runp == NULL)
            break;

          /* Bump the reference counter for the next entry.  */
          atomic_increment (&runp->refcntr);
        }

      /* We are done.  */
      break;
    }

  _IO_list_lock ();

#ifndef NDEBUG
  pid_t ppid = THREAD_GETMEM (THREAD_SELF, tid);
#endif

  /* We need to prevent the getpid() code to update the PID field so
     that, if a signal arrives in the child very early and the signal
     handler uses getpid(), the value returned is correct.  */
  pid_t parentpid = THREAD_GETMEM (THREAD_SELF, pid);
  THREAD_SETMEM (THREAD_SELF, pid, -parentpid);

#ifdef ARCH_FORK
  pid = ARCH_FORK ();
#else
# error "ARCH_FORK must be defined so that the CLONE_SETTID flag is used"
  pid = INLINE_SYSCALL (fork, 0);
#endif

  if (pid == 0)
    {
      struct pthread *self = THREAD_SELF;

      assert (THREAD_GETMEM (self, tid) != ppid);

      if (__fork_generation_pointer != NULL)
        *__fork_generation_pointer += 4;

      /* Adjust the PID field for the new process.  */
      THREAD_SETMEM (self, pid, THREAD_GETMEM (self, tid));

      /* The CPU clock of the thread and process have to be set to zero.  */
      hp_timing_t now;
      HP_TIMING_NOW (now);
      THREAD_SETMEM (self, cpuclock_offset, now);
      GL(dl_cpuclock_offset) = now;

      /* Reset the file list.  These are recursive mutexes.  */
      fresetlockfiles ();

      /* Reset locks in the I/O code.  */
      _IO_list_resetlock ();

      /* Reset the lock the dynamic loader uses to protect its data.  */
      __rtld_lock_initialize (GL(dl_load_lock));

      /* Run the handlers registered for the child.  */
      while (allp != NULL)
        {
          if (allp->handler->child_handler != NULL)
            allp->handler->child_handler ();

          /* Note that we do not have to wake any possible waiter.
             This is the only thread in the new process.  The count
             may have been bumped up by other threads doing a fork.
             We reset it to 1, to avoid waiting for non-existing
             thread(s) to release the count.  */
          allp->handler->refcntr = 1;

          /* XXX We could at this point look through the object pool
             and mark all objects not on the __fork_handlers list as
             unused.  This is necessary in case the fork() happened
             while another thread called dlclose() and that call had
             to create a new list.  */

          allp = allp->next;
        }

      /* Initialize the fork lock.  */
      __fork_lock = LLL_LOCK_INITIALIZER;
    }
  else
    {
      assert (THREAD_GETMEM (THREAD_SELF, tid) == ppid);

      /* Restore the PID value.  */
      THREAD_SETMEM (THREAD_SELF, pid, parentpid);

      /* We execute this even if the 'fork' call failed.  */
      _IO_list_unlock ();

      /* Run the handlers registered for the parent.  */
      while (allp != NULL)
        {
          if (allp->handler->parent_handler != NULL)
            allp->handler->parent_handler ();

          if (atomic_decrement_and_test (&allp->handler->refcntr)
              && allp->handler->need_signal)
            lll_futex_wake (allp->handler->refcntr, 1, LLL_PRIVATE);

          allp = allp->next;
        }
    }

  return pid;
}
weak_alias (__libc_fork, __fork)
libc_hidden_def (__fork)
weak_alias (__libc_fork, fork)


int
__execve (file, argv, envp)
     const char *file;
     char *const argv[];
     char *const envp[];
{
#if __BOUNDED_POINTERS__
  {
    char *const *v;
    int i;
    char *__unbounded *__unbounded ubp_argv;
    char *__unbounded *__unbounded ubp_envp;
    char *__unbounded *__unbounded ubp_v;

    for (v = argv; *v; v++)
      ;
    i = v - argv + 1;
    ubp_argv = (char *__unbounded *__unbounded) alloca (sizeof (*ubp_argv) * i);

    for (v = argv, ubp_v = ubp_argv; --i; v++, ubp_v++)
      *ubp_v = CHECK_STRING (*v);
    *ubp_v = 0;

    for (v = envp; *v; v++)
      ;
    i = v - envp + 1;
    ubp_envp = (char *__unbounded *__unbounded) alloca (sizeof (*ubp_envp) * i);
    for (v = envp, ubp_v = ubp_envp; --i; v++, ubp_v++)
      *ubp_v = CHECK_STRING (*v);
    *ubp_v = 0;

    return INLINE_SYSCALL (execve, 3, CHECK_STRING (file), ubp_argv, ubp_envp);
  }
#else
  return INLINE_SYSCALL (execve, 3, file, argv, envp);
#endif
}
weak_alias (__execve, execve);

Appendix II: Test Suite


#include <stdio.h>

int main(){
         int i,j,k;
         return 0;
}


#include <stdio.h>

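int isPrime(int a){
         int p;
         /* Loop header assumed (trial division); it is missing from the listing */
         for(p = 2; p < a; p++)
                  if(a%p == 0) return 0;
         return 1;
}
int main(){
         int num = 223810;
         int p;
         return 0;
}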


#include <stdio.h>

int main(){
      printf("Hello World");
      return 0;
}


1. Process Migration. Dejan S. Milojicic, Fred Douglis, Yves Paindaveine, Richard
Wheeler, Songnian Zhou. September 2000, ACM Computing Surveys (CSUR), pp.

2. The Sprite Network Operating System. John K. Ousterhout, Andrew R. Cherenson,
Frederick Douglis, Michael N. Nelson, Brent B. Welch. Berkeley : s.n., 1987.

3. Malik, Shahzad. Dynamic Load Balancing in a Network of Workstations. Toronto :
s.n., 2000. 95.515F Research Report.

4. Distributed Scheduling Support in the Presence of Autonomy. Chapin, S.J. s.l. :
Proceedings of the 4th Heterogeneous Computing Workshop, 1995. IPPS. pp. 22-29.

5. Lawrence, R. A survey of process migration mechanisms. s.l. : University of
Manitoba, 1998.

6. Rumelhard, Hinton, McClelland. A general framework for Parallel Distributed

7. A taxonomy of scheduling in general-purpose distributed computing systems. T.L.
Casavant, J.G. Kuhl. 2, s.l. : IEEE Transactions on Software Engineering, 1988, Vol.
14. ISSN:0098-5589.

