									                                             (IJCSIS) International Journal of Computer Science and Information Security,
                                             Vol. 8, No. 7, October 2010

          Automating the fault tolerance process in
                    Grid Environment
                 Inderpreet Chopra                                            Maninder Singh
                      Research Scholar                                        Associate Professor
       Thapar University Computer Science Department             Thapar University Computer Science Department
                        Patiala, India                                            Patiala, India

Abstract:
Grid computing encourages the dynamic addition of resources, which are unlikely to benefit from manual management techniques because these are time-consuming, insecure and error-prone. A new paradigm of self-management is displacing the old manual approach and opening the next generation of computing. In this paper we discuss the different self-healing approaches that current grid middleware uses and, after analyzing them, propose a new approach, the Self-healing Management Unit (SMU), which provides an automated way of dealing with failures.

Keywords: SMU, heartbeat

1. Introduction

In recent years the Grid, which facilitates the sharing and integration of large-scale, heterogeneous resources, has been widely recognized as the future framework of distributed computing [1, 2]. However, the increasing complexity of Grid services and systems demands correspondingly larger human effort for system configuration and performance management. These tasks are mainly done manually today, which makes them time-consuming, error-prone and even unmanageable for human administrators.

Autonomic computing [4, 5], presented and advocated by IBM, suggests a desirable solution to this problem. The vision of autonomic computing is to design and build computing systems that possess inherent self-managing capabilities [6]. This paper describes a self-healing model, SMU (Self-healing Management Unit), which applies autonomic computing to raise the level of automation and self-management in Grid computing systems well beyond what is available today. The SMU aims to:
   • keep track of job submission and execution
   • recover lost jobs
   • administer the complexity of the grid
   • check the efficiency of resource discovery
   • keep all the services running.

2. Self-Healing Mechanisms

Self-healing [7] is the ability of a system to recover from faults that might cause some parts of it to malfunction. For a system to be self-healing, it must be able to recover from a
failed component by first detecting and isolating the failed component, taking it off line, fixing it, and reintroducing the fixed or replacement component into service without any apparent overall disruption. A self-healing system also needs to predict problems and take action to prevent a failure from having an impact on applications. The objective of self-healing must be to minimize all outages in order to keep the system up and available at all times.

Faults in grids can occur for many different reasons. Some of the reasons we have identified are as follows:
   • Hardware Faults: Hardware failures take place due to faulty hardware components such as CPU, memory, and storage devices [8].
   • Application and Operating System Faults: Application and operating system failures occur due to application-specific or operating-system-specific faults such as memory leaks, deadlocks and inefficient resource management.
   • Network Faults: In a grid, computing resources are connected over multiple and different types of distributed networks. As a result, physical damage or operational faults in the network are more likely [9]. The network may exhibit significant packet loss or packet corruption. Moreover, individual nodes in the network or the whole network may go down.
   • Software Faults: Several highly resource-intensive applications run on the grid to do particular tasks. Software failures such as unhandled exceptions or unexpected input can take place while this software is running.

In addition to ad-hoc mechanisms (based on user complaints and log file analysis), grid users have used automatic ways to deal with failures in their Grid environment. Various fault tolerance mechanisms exist to provide this automation. Some of these self-healing mechanisms are:

Application-dependent: As grids are increasingly used for applications requiring high levels of performance and reliability, the ability to tolerate failures while effectively exploiting resources in a scalable and transparent manner must be an integral part of grid computing resource management systems. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [10]. To date, there has been limited support for application-level fault tolerance in computational grids. Support has consisted mainly of failure detection services or fault-tolerance capabilities in specialized grid toolkits. Neither solution is satisfactory in
the long run. The former places the burden of incorporating fault-tolerance techniques into the hands of application programmers, while the latter only works for specialized applications. Even in cases where fault-tolerance techniques have been integrated into programming tools, these solutions have generally been point solutions, i.e., tool developers have started from scratch in implementing their solution and have neither shared nor reused any fault tolerance code. A better way is to use the compositional approach, in which fault-tolerance experts write algorithms and encapsulate them into reusable code artifacts, or modules.

Monitoring Systems: In this mechanism a fault monitoring unit is attached to the grid. The base technique that most monitoring units follow is heartbeating. The heartbeating technique [11] is further classified into 3 types:

- Centralized heartbeating: sending heartbeats to a central member creates a hot spot, an instance of high asymptotic complexity.

- Ring-based heartbeating: sending heartbeats along a virtual ring suffers from unpredictable failure detection times when there are multiple failures, an instance of the perturbation effect.

- All-to-all heartbeating: sending heartbeats to all members causes the message load in the network to grow quadratically with group size, again an instance of high asymptotic complexity.

Checkpointing-recovery: Checkpointing and rollback recovery provide an effective technique for tolerating transient resource failures and for avoiding total loss of results. Checkpointing involves saving enough state information of an executing program on stable storage so that, if required, the program can be re-executed starting from the state recorded in the checkpoints. Checkpointing distributed applications is more complicated than checkpointing non-distributed ones. When an application is distributed, the checkpointing algorithm has to capture not only the state of all individual processes but also the state of all the communication channels. Checkpointing [12] is basically divided into 2 types:

- Uncoordinated Checkpoint: In this approach, each process that is part of the system determines its local checkpoints individually. During restart, these checkpoints have to be searched in order to construct a consistent global checkpoint.

- Coordinated Checkpoint: In this approach, the checkpointing is orchestrated such that the set of individual checkpoints always results in a consistent global checkpoint. This minimizes the storage overhead, since only a single global checkpoint needs to be maintained on stable storage. Algorithms used in this approach are either blocking or nonblocking.
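
To make the coordinated, blocking variant concrete, the following is a minimal single-machine sketch in Python. It is an illustration under stated assumptions, not code from any grid middleware: the `Worker` class, the function names and the JSON file format are hypothetical, each process state is reduced to one counter, and communication channels are not modelled at all.

```python
import json
import os
import tempfile


class Worker:
    """A toy process whose whole state is a counter (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.counter = 0

    def step(self):
        self.counter += 1


def coordinated_checkpoint(workers, path):
    """Blocking variant: no worker steps while every state is captured,
    so the saved set forms one consistent global checkpoint."""
    snapshot = {w.name: w.counter for w in workers}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(snapshot, f)
    # Swap the new file in only after it is fully written, so the
    # previous checkpoint is still there if the write is interrupted.
    os.replace(tmp, path)


def recover(workers, path):
    """Roll every worker back to the state in the last global checkpoint."""
    with open(path) as f:
        snapshot = json.load(f)
    for w in workers:
        w.counter = snapshot[w.name]


workers = [Worker("p0"), Worker("p1")]
path = os.path.join(tempfile.mkdtemp(), "global.ckpt")

for w in workers:
    w.step()
coordinated_checkpoint(workers, path)   # saved state: p0=1, p1=1

workers[0].step()                       # progress made after the checkpoint...
recover(workers, path)                  # ...is discarded on rollback
restored = [w.counter for w in workers]
print(restored)                         # [1, 1]
```

Only one checkpoint file is kept, which mirrors the storage argument above. An uncoordinated scheme would instead keep several per-process files and search them for a consistent combination at restart.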

Fault Tolerant Scheduling: With grid computing systems gaining momentum, deploying support for integrated scheduling and fault-tolerant approaches becomes of paramount importance [13]. For this purpose, most fault-tolerant scheduling algorithms couple scheduling policies with job replication schemes so that jobs are executed efficiently and reliably. Scheduling policies are further classified on the basis of time sharing and space sharing.

3. SMU Approach (Self-Healing Management Unit)

Our approach is to provide efficient self-healing functionality in grids. Most of the existing approaches follow the heartbeat technique to find the working nodes. This new approach proposes another way of handling failures. In SMU there is one central node for each cluster, called the Managing Node (MN) (figure 1). The MN is responsible for all job submission and job execution of applications submitted to the grid. Apart from this, the MN also monitors the status of the job at different time intervals to make sure that the job is executing well.

Figure 1: Grid Network

The MN comprises (figure 2) an Application Receiving Unit (ARU), a Node Receiving Unit (NRU), a Processing Unit (PU) and a database (DB). The ARU receives requests for jobs to be executed. The NRU gathers the status of the execution nodes attached to the MN and is responsible for monitoring both the requests and the responses between the MN and the execution nodes. The PU takes input from the ARU, NRU and DB to decide to which execution node the job has to be assigned.

Figure 2: SMU Managing Node

PU uses the following algorithm to decide the execution of the application.
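
As a rough illustration of the procedure the PU follows, the sketch below models node selection and the choice of monitoring delay in Python. It is only a sketch under stated assumptions: the class and method names, the status strings other than "OK", and the 10-minute value standing in for the paper's "X minutes" are hypothetical and are not taken from the SMU implementation.

```python
from dataclasses import dataclass, field

DEFAULT_DELAY_MIN = 10  # stands in for "X minutes"; the value 10 is an assumption


@dataclass
class ManagingNode:
    """Toy stand-in for the MN; all names here are hypothetical."""

    node_status: dict                             # node -> status, as gathered by the NRU
    history: dict = field(default_factory=dict)   # job type -> expected minutes (the DB)

    def select_node(self):
        """PU: return the first execution node whose status is "OK", else None."""
        for node, status in self.node_status.items():
            if status == "OK":
                return node
        return None

    def monitoring_delay(self, job_type):
        """PU: wait for the recorded execution time, or the default for new job types."""
        return self.history.get(job_type, DEFAULT_DELAY_MIN)

    def record_finished(self, job_type, minutes):
        """Save the observed execution time once a job has finished."""
        self.history[job_type] = minutes


mn = ManagingNode(node_status={"n1": "DOWN", "n2": "OK"})
node = mn.select_node()                       # skips n1, picks n2
delay_new = mn.monitoring_delay("render")     # unknown job type: default delay
mn.record_finished("render", 25)
delay_known = mn.monitoring_delay("render")   # now taken from the history
print(node, delay_new, delay_known)           # n2 10 25
```

The point of the design is that monitoring of a job starts only once its expected execution time has elapsed, and only for the node that was actually selected, rather than polling every node at a fixed interval.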

  1. Check the current status of the execution node that is going to handle the job.
         CHK ‘STATUS’
  2. If the status is “OK”, assign the job to that node. Otherwise, find another node and go to step 1.
  3. Once a node is selected, PU asks NRU to check the status of the job by monitoring the node after a certain interval of time. To optimize this process, we follow these steps:
     a. Maintain records of the kinds of job and their expected execution times in the database. Based on these, start the monitoring after the expected interval if PU has not received any input from NRU.
     b. If the job type is new, i.e. there is no information about it, save it in the history database after it finishes execution. For monitoring it, choose some time, say X minutes, after which the monitoring starts.
  4. A self-failure detection option is also available, in which the node detects the failure itself and sends the status to NRU. This failure alarm can be raised based on CPU usage, insufficient resources to execute the job, etc.

4. Results

To test SMU, we submitted jobs and introduced faults into the grid environment we had set up. To produce the faults, we shut down a few nodes when a job was submitted to them; another way we injected faults was by overloading the nodes used to execute the job. SMU automatically resubmits a job when an error occurs during its execution. We found that using SMU prevents a large number of jobs from being lost. To test this, we submitted a large number of jobs and recorded how many of them were lost. Figure 3 shows that the rate of jobs executed in the normal scenario is lower than with the SMU system. SMU helps the system to recover from faults and thus helps to increase the job execution rate.

Figure 3: Default vs SMU execution rate

SMU also reduces resource usage by optimizing the monitoring procedure (figure 4). If there are n nodes, and they keep sending their alive signals at intervals of t seconds, then the total time the managing node remains busy is n*t. In the present approach this is reduced by a large percentage, since only the nodes needed for executing the submitted job are monitored.

Figure 4: Optimized Monitoring reduces resource usage

This increases the scalability of resources: more resources can now be attached to a single node. With the normal heartbeat technique, it becomes difficult to monitor nodes as the number of execution nodes increases. SMU also speeds up the execution of jobs that already have an entry in the history table.

5. Conclusions:

As grid is a distributed and unreliable system involving heterogeneous resources located in
different geographical domains, fault tolerant resource allocation services [14] have to be provided. In particular, when crashes occur, tasks have to be reallocated quickly and automatically, in a way that is completely transparent from the user’s point of view. Different grid middleware uses different strategies to overcome faults. For example, Globus [15] and Alchemi [16] use heartbeats to learn the current status of their execution nodes. On the other hand, SUN N1GE6 [17] uses user-level and kernel-level checkpointing strategies to restart the job from where it failed. Condor [18] provides system-level checkpoints of distributed applications, but it is not geared to high-performance parallel programs.

The proposed SMU approach is intended to increase efficiency by reducing the unnecessary monitoring of nodes that are not going to participate in the execution of the job. At the same time, information from past executions of similar jobs is maintained and determines the time interval after which monitoring of the job starts. As the number of execution nodes in the grid increases, the SMU approach becomes all the more efficient.

References:

[1] Ian Foster, Carl Kesselman, Jeffrey M. Nick, and Steven Tuecke. The physiology of the Grid. Global Grid Forum, June 2002.
[2] Ian Foster, Carl Kesselman, and Steven Tuecke, "The anatomy of the Grid", Copyright 2003 John Wiley & Sons, Ltd.
[3] Manish Parashar and Salim Hariri. Autonomic Computing: An Overview. Springer-Verlag Berlin Heidelberg, 2005.
[4] Jeffrey O. Kephart and David M. Chess. "The Vision of Autonomic Computing", IEEE Computer, January 2003.
[5] A. G. Ganek, T. A. Corbi, "The Dawning of the Autonomic Computing Era", IBM Systems Journal, v.42 n.1, p.5-18, January 2003
[6] Hang Guo, Ji Gao, Peiyou Zhu, Fan Zhang, "A Self-Organized Model of Agent-Enabling Autonomic Computing for Grid Environment", Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21 - 23, 2006, Dalian, China
[7] P. Horn. Autonomic computing: IBM’s perspective on the state of information technology, October 2001.
[8] Gerald Tesauro and David M. Chess and William E. Walsh and Rajarshi Das and, "A Multi-Agent Systems Approach to Autonomic Computing", AAMAS'04 ACM (Jul 2004)
[9] Luis Ferreira, Viktors Berstis and ,"Introduction to Grid Computing with f/ IBM globus.pdf,IBM
[10] Anh Nguyen-Tuong, "Integrating Fault-Tolerance Techniques in Grid Applications", PhD Thesis, University of Virginia, August 2000
[11] Amit Jain and R.K. Shyamasundar, "Failure Detection and Membership Management in Grid Environments", Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04), pp. 44-52
[12] Sriram Krishnan, "An Architecture for Checkpointing and Migration of Distributed Components on the Grid", PhD Thesis, Department of Computer Science, Indiana University, November 2004
[13] J.H. Abawajy, "Fault-Tolerant Scheduling Policy for Grid Computing Systems", IPDPS’04
[14] Kamana Sigdel. Resource allocation in heterogeneous and dynamic networks. Master’s thesis, Delft University of Technology, 2005.
[15] Paul Stelling, Ian Foster, Carl Kesselman, Craig Lee, Gregor von Laszewski, "A Fault Detection Service for Wide Area Distributed Computations", Volume 2, Number 2, Springer, pp. 117-128(12), 1999
[16] Krishna Nadiminti, Akshay Luther, Rajkumar Buyya, "Alchemi: A .NET-based Enterprise Grid System and Framework", December 2005
[17] Liang PENG, Lip Kian NG, "N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart", Dec 28, 2004
[18] James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", 10th IEEE International Symposium on High Performance Distributed Computing (HPDC-10 '01)