Automating the fault tolerance process in Grid Environment
Vol. 8 No. 7 October 2010 International Journal of Computer Science and Information Security
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 7, October 2010
Automating the fault tolerance process in
Grid Environment
Inderpreet Chopra, Research Scholar, Thapar University Computer Science Department, Patiala, India, inderpreet@thapar.edu
Maninder Singh, Associate Professor, Thapar University Computer Science Department, Patiala, India, msingh@thapar.edu
Abstract:
Grids encourage the dynamic addition of resources, and such resources are unlikely to benefit from manual management techniques, which are time-consuming, insecure and error-prone. A new paradigm of self-management is replacing the old manual approach and ushering in the next generation of computing. In this paper we discuss the self-healing approaches used by current grid middleware and, after analyzing them, propose a new approach, the Self-healing Management Unit (SMU), which provides an automated way of dealing with failures.
Keywords: SMU, heartbeat
1. Introduction
In recent years the Grid, which facilitates the sharing and integration of large-scale, heterogeneous resources, has been widely recognized as the future framework of distributed computing [1, 2]. However, the increasing complexity of Grid services and systems demands correspondingly larger human effort for system configuration and performance management. These tasks are mainly done manually today, which makes them time-consuming, error-prone and even unmanageable for human administrators.
Autonomic computing [4, 5], presented and advocated by IBM, suggests a desirable solution to this problem. The vision of autonomic computing is to design and build computing systems that possess inherent self-managing capabilities [6]. This paper describes a self-healing model, the SMU (Self-healing Management Unit), which follows the autonomic computing approach and aims to raise the level of automation and self-management in Grid computing systems well beyond what it is today.
The SMU aims to:
• keep track of job submission and execution
• recover lost jobs
• administer the complexity of the grid
• check the efficiency of resource discovery
• keep all the services running.
2. Self-Healing Mechanisms
Self-healing [7] is the ability of a system to recover from faults that might cause some parts of it to malfunction. For a system to be self-healing, it must be able to recover from a
failed component by first detecting and isolating it, taking it offline, fixing it, and reintroducing the fixed or replacement component into service without any apparent overall disruption. A self-healing system also needs to predict problems and take action to prevent a failure from affecting applications. The objective of self-healing must be to minimize all outages in order to keep the system up and available at all times.
Faults in grids can have many different causes. Some of the causes we have identified are as follows:
• Hardware Faults: Hardware failures take place due to faulty hardware components such as CPU, memory, and storage devices [8].
• Application and Operating System Faults: Application and operating system failures occur due to application-specific or operating-system-specific faults such as memory leaks, deadlocks and inefficient resource management.
• Network Faults: In a grid, computing resources are connected over multiple and different types of distributed networks. As a result, physical damage or operational faults in the network are more likely [9]. The network may exhibit significant packet loss or packet corruption. Moreover, individual nodes in the network, or the whole network, may go down.
• Software Faults: Several resource-intensive applications run on the grid to perform particular tasks. Software failures such as unhandled exceptions and unexpected input can occur while these applications are running.
In addition to ad-hoc mechanisms based on user complaints and log file analysis, grid users have used automatic ways to deal with failures in their Grid environment. Various fault tolerance mechanisms exist for this purpose. Some of these self-healing mechanisms are:
Application-dependent: Grids are increasingly used for applications requiring high levels of performance and reliability, so the ability to tolerate failures while effectively exploiting resources in a scalable and transparent manner must be an integral part of grid resource management systems. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [10]. To date, there has been limited support for application-level fault tolerance in computational grids. Support has consisted mainly of failure detection services or fault-tolerance capabilities in specialized grid toolkits. Neither solution is satisfactory in
the long run. The former places the burden of incorporating fault-tolerance techniques on application programmers, while the latter only works for specialized applications. Even in cases where fault-tolerance techniques have been integrated into programming tools, these solutions have generally been point solutions, i.e., tool developers have started from scratch in implementing their solution and have neither shared nor reused any fault tolerance code. A better way is to use a compositional approach in which fault-tolerance experts write algorithms and encapsulate them into reusable code artifacts, or modules.
Monitoring Systems: Here a fault monitoring unit is attached to the grid. The base technique that most monitoring units follow is heartbeating. The heartbeating technique [11] is further classified into 3 types:
- Centralized heartbeating: sending heartbeats to a central member creates a hot spot, an instance of high asymptotic complexity.
- Ring-based heartbeating: sending heartbeats along a virtual ring suffers from unpredictable failure detection times when there are multiple failures, an instance of the perturbation effect.
- All-to-all heartbeating: sending heartbeats to all members causes the message load in the network to grow quadratically with group size, again an instance of high asymptotic complexity.
Checkpointing-recovery: Checkpointing and rollback recovery provide an effective technique for tolerating transient resource failures and for avoiding total loss of results. Checkpointing involves saving enough state information of an executing program on stable storage so that, if required, the program can be re-executed starting from the state recorded in the checkpoints. Checkpointing distributed applications is more complicated than checkpointing non-distributed ones. When an application is distributed, the checkpointing algorithm has to capture not only the state of all individual processes but also the state of all the communication channels.
Checkpointing [12] is basically divided into 2 types:
- Uncoordinated checkpointing: In this approach, each process in the system determines its local checkpoints individually. During restart, these checkpoints have to be searched in order to construct a consistent global checkpoint.
- Coordinated checkpointing: In this approach, checkpointing is orchestrated such that the set of individual checkpoints always results in a consistent global checkpoint. This minimizes the storage overhead, since only a single global checkpoint needs to be maintained on
stable storage. The algorithms used in this approach are either blocking or non-blocking.
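The checkpoint-and-restart idea described above can be illustrated with a minimal single-process sketch. It is not taken from any grid middleware: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON file standing in for stable storage, and the `fail_at` parameter used to simulate a transient failure are all assumptions made for illustration. A distributed, coordinated scheme would additionally have to capture channel state, which this sketch does not attempt.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the job state to stable storage (assumed here to be a local file)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # Replace the old checkpoint in a single rename step so that a crash
    # during writing never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_item": 0, "total": 0}


def run_job(items, path, checkpoint_every=2, fail_at=None):
    """Sum `items`, saving a checkpoint every `checkpoint_every` items.

    `fail_at` is a hypothetical hook that simulates a resource failure
    just before the item with that index is processed.
    """
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += items[index]
        state["next_item"] = index + 1
        if state["next_item"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run_job` again with the same `path` after a failure resumes from the last saved state, so only the work done since that checkpoint is repeated.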
Fault Tolerant Scheduling: As grid computing systems gain momentum, deploying support for integrated scheduling and fault-tolerant approaches becomes of paramount importance [13]. Most fault-tolerant scheduling algorithms therefore couple scheduling policies with job replication schemes so that jobs are executed efficiently and reliably. Scheduling policies are further classified on the basis of time sharing and space sharing.
3. SMU Approach (Self-Healing Management Unit)
Our approach is to provide efficient self-healing functionality in grids. Most existing approaches use the heartbeat technique to find the working nodes; the new approach proposes another way of handling failures. In SMU there is one central node for each cluster, called the Managing Node (MN) (figure 1). The MN is responsible for the submission and execution of all jobs submitted to the grid. In addition, the MN monitors the status of each job at intervals to make sure that the job is executing well.
Figure 1: Grid Network
The MN comprises (figure 2) an Application Receiving Unit (ARU), a Node Receiving Unit (NRU), a Processing Unit (PU) and a database (DB). The ARU receives requests for jobs to be executed. The NRU gathers the status of the execution nodes attached to the MN and is responsible for monitoring both the requests and the responses between the MN and the execution nodes. The PU takes input from the ARU, the NRU and the DB to decide which execution node a job should be assigned to. The PU uses the following algorithm to decide the execution of the application.
Figure 2: SMU Managing Node
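As a rough illustration of how the Managing Node units described above might fit together with the status check and history lookup of the algorithm in this section, consider the following minimal sketch. The class, method and field names (`ManagingNode`, `select_node`, `monitoring_delay`, `record_run`, `submit`, `default_delay`) are hypothetical, and the 600-second default merely stands in for the unspecified "X minutes"; the paper does not give an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ManagingNode:
    """Toy model of an SMU Managing Node; all names are illustrative."""

    # NRU view of the cluster: node name -> last reported status string.
    node_status: dict = field(default_factory=dict)
    # DB history table: job type -> expected execution time in seconds.
    history: dict = field(default_factory=dict)
    # Assumed fallback delay for job types with no history ("X minutes").
    default_delay: float = 600.0

    def select_node(self):
        """PU steps 1-2: return the first node whose status is "OK", else None."""
        for name, status in self.node_status.items():
            if status == "OK":
                return name
        return None

    def monitoring_delay(self, job_type):
        """PU step 3: seconds to wait before the first status poll of a job."""
        return self.history.get(job_type, self.default_delay)

    def record_run(self, job_type, seconds):
        """Step 3b: store the observed execution time of a finished job."""
        self.history[job_type] = seconds

    def submit(self, job_type):
        """ARU entry point: return (chosen node, delay before first poll)."""
        node = self.select_node()
        if node is None:
            raise RuntimeError("no execution node reports OK")
        return node, self.monitoring_delay(job_type)
```

In this sketch only the node chosen for a job is polled, and only after the expected execution time has passed, which mirrors the idea of avoiding continuous heartbeats from every node.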
Algorithm:
1. Check the current status of the execution node that is going to handle the job.
CHK ‘STATUS’
2. If the status is “OK”, assign the job to that node. Otherwise, find another node and go to step 1.
3. Once a node is selected, the PU asks the NRU to monitor the node at certain intervals in order to check the status of the job. To optimize this process, we follow these steps:
a. Maintain records of each kind of job and its expected execution time in the database. Based on that, start monitoring after the expected interval if the PU has not received any input from the NRU.
b. If the job type is new, i.e. no information about it exists, save it in the history database after it finishes execution. To monitor such a job, choose some time, say X minutes, after which monitoring starts.
4. A self-failure detection option is also available, which lets a node detect its own failure and send its status to the NRU. This failure alarm can be raised based on CPU usage, insufficient resources to execute the job, etc.
4. Results
To test SMU, we submitted jobs and introduced faults into the grid environment we had set up. To produce faults, we shut down a few nodes while jobs were submitted to them; another way we introduced faults was by overloading the nodes used to execute the jobs. SMU automatically resubmits a job when an error occurs during job execution. We found that SMU prevents a large number of jobs from being lost. To test this, we submitted a large number of jobs and recorded how many were lost. Figure 3 shows that the rate of jobs executed in the normal scenario is lower than with the SMU system. SMU helps the system recover from faults and thus helps to increase the job execution rate.
Figure 3: Default vs SMU execution rate
SMU also reduces resource usage by optimizing the monitoring procedure (figure 4). If there are n nodes, and each keeps sending its alive signal at an interval of t seconds, then the total time the managing node remains busy is n*t. In the present approach this is reduced by a large percentage, as only those nodes that are needed for the execution of the submitted job are monitored.
Figure 4: Optimized Monitoring reduces resource usage
This increases the scalability of resources: more resources can now be attached to a single node. With the normal heartbeat technique, it becomes difficult to monitor nodes as the number of execution nodes increases. The approach also speeds up the execution of jobs that already have an entry in the history table.
5. Conclusions:
As the grid is a distributed and unreliable system involving heterogeneous resources located in
different geographical domains, fault-tolerant resource allocation services [14] have to be provided. In particular, when crashes occur, tasks have to be reallocated quickly and automatically, in a way that is completely transparent from the user's point of view. Different grid middleware uses different strategies to overcome faults. For example, Globus [15] and Alchemi [16] use heartbeats to learn the current status of their execution nodes. On the other hand, SUN N1GE6 [17] uses user-level and kernel-level checkpointing strategies to restart a job from where it failed. Condor [18] provides system-level checkpoints of distributed applications, but it is not geared to high-performance parallel programs.
The proposed SMU approach is intended to increase efficiency by eliminating the unnecessary monitoring of nodes that are not going to participate in the execution of the job. At the same time, information from past executions of similar jobs is maintained and determines the time interval after which monitoring of the job starts. As the number of execution nodes in the grid increases, the SMU approach becomes all the more efficient.
References:
[1] Ian Foster, Carl Kesselman, Jeffrey M. Nick, and Steven Tuecke. The physiology of the Grid. Global Grid Forum, June 2002.
[2] Ian Foster, Carl Kesselman, and Steven Tuecke, "The anatomy of the Grid", Copyright 2003 John Wiley & Sons, Ltd.
[3] Manish Parashar and Salim Hariri. Autonomic Computing: An Overview. Springer-Verlag Berlin Heidelberg, 2005.
[4] Jeffrey O. Kephart and David M. Chess. "The Vision of Autonomic Computing", IEEE Computer, January 2003.
[5] A. G. Ganek, T. A. Corbi, "The Dawning of the Autonomic Computing Era", IBM Systems Journal, v.42 n.1, p.5-18, January 2003
[6] Hang Guo, Ji Gao, Peiyou Zhu, Fan Zhang, "A Self-Organized Model of Agent-Enabling Autonomic Computing for Grid Environment", Proceedings of the 6th World Congress on Intelligent Control and Automation, June 21 - 23, 2006, Dalian, China
[7] P. Horn. Autonomic computing: IBM's perspective on the state of information technology, October 2001. http://www.research.ibm.com/autonomic/.
[8] Gerald Tesauro, David M. Chess, William E. Walsh, Rajarshi Das et al., "A Multi-Agent Systems Approach to Autonomic Computing", AAMAS'04, ACM (Jul 2004)
[9] Luis Ferreira, Viktors Berstis et al., "Introduction to Grid Computing with Globus", http://www.liv.ac.uk/escience/beowulf/ IBM globus.pdf, IBM
[10] Anh Nguyen-Tuong, "Integrating Fault-Tolerance Techniques in Grid Applications", PhD Thesis, University of Virginia, August 2000
[11] Amit Jain and R.K. Shyamasundar, "Failure Detection and Membership Management in Grid Environments", Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04), pp. 44-52
[12] Sriram Krishnan, "An Architecture for Checkpointing and Migration of Distributed Components on the Grid", PhD Thesis, Department of Computer Science, Indiana University, November 2004
[13] J.H. Abawajy, "Fault-Tolerant Scheduling Policy for Grid Computing Systems", IPDPS'04
[14] Kamana Sigdel. Resource allocation in heterogeneous and dynamic networks. Master's thesis, Delft University of Technology, 2005.
[15] Paul Stelling, Ian Foster, Carl Kesselman, Craig Lee, Gregor von Laszewski, "A Fault Detection Service for Wide Area Distributed Computations", Volume 2, Number 2, Springer, pp. 117-128(12), 1999
[16] Krishna Nadiminti, Akshay Luther, Rajkumar Buyya, "Alchemi: A .NET-based Enterprise Grid System and Framework", December 2005
[17] Liang PENG, Lip Kian NG, "N1GE6 Checkpointing and Berkeley Lab Checkpoint/Restart", Dec 28, 2004
[18] James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", 10th IEEE International Symposium on High Performance Distributed Computing (HPDC-10 '01)
http://sites.google.com/site/ijcsis/
ISSN 1947-5500