                                     DataGrid

                                  WP4 RELEASE 2
                    PRELIMINARY DESIGN AND DEVELOPMENT PLAN:
                             FAULT TOLERANCE TASK




                                     Document identifier:

                                     Date:                  08/08/2012

                                     Work package:          WP04: Fabric Management

                                     Partner(s):            KIP

                                     Lead Partner:          CERN

                                     Document status        DRAFT

                                     Author(s):             Lord Hess

                                                            Hess@kip.uni-heidelberg.de

                                     File:




Abstract: The document presents the Release 2 preliminary design and development
planning for the Fabric Fault Tolerance subtask of WP4.




IST-2000-25182                               PUBLIC                                      1/9
                                                                                                                                                 Doc. Identifier:

                                                               WP4 RELEASE 2
                                     preliminary design and development plan: Fault Tolerance                                                  Date: 08/08/2012
                                                               task


                                                                       CONTENT
1. INTRODUCTION
   1.1. OBJECTIVES OF THIS DOCUMENT
   1.2. APPLICATION AREA
   1.3. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS
   1.4. TERMINOLOGY
2. OVERVIEW
   2.1. PURPOSE, SCOPE AND OBJECTIVES
   2.2. ASSUMPTIONS AND CONSTRAINTS
   2.3. PRELIMINARY DESIGN
      2.3.1. Monitoring Sensors (MS)
      2.3.2. Monitoring Sensor Agent (MSA)
      2.3.3. Fault Tolerance Correlation Engine
      2.3.4. Actuator Dispatcher (AD)
   2.4. DELIVERABLES
   2.5. RISKS MANAGEMENT AND FALLBACKS
   2.6. PROJECT MONITORING AND CONTROL
   2.7. WORK BREAKDOWN STRUCTURE (WBS)







1. INTRODUCTION

1.1. OBJECTIVES OF THIS DOCUMENT
This document describes the Release 2 preliminary design and planning for the Fabric Fault Tolerance
part of the Fabric Monitoring and Fault Tolerance subsystem. The planning takes into account the
software release planning proposed in [A1], where four intermediate releases are foreseen in PM13,
PM15, PM17 and PM19 (January, March, May and July 2002) prior to the official DataGrid Release 2
scheduled for PM22 (September 2002). The preliminary design is based on the architecture described
in [R1]. The planning includes the estimated resources for the foreseen components. The detailed
timeline remains to be coordinated with the other WP4 tasks, in particular the Monitoring task.

1.2. APPLICATION AREA
This document serves as input to the global WP4 planning for Release 2, which will be
presented to the rest of the DataGrid project by the end of 2001.

1.3. APPLICABLE DOCUMENTS AND REFERENCE DOCUMENTS
Applicable documents
[A1] Software release planning, Bob Jones, 10 August 2001, DataGrid-12-TED-0500-0-1
     http://web.datagrid.cnr.it/pls/portal30/docs/1881.pdf

Reference documents
[R1] Architectural design and evaluation criteria, DataGrid-4-D4.2-0119-2_0
     http://cern.ch/hep-proj-grid-fabric/architecture/eu/WP4-architecture/eu/WP4-architecture-2_0.pdf
[R2] OPC (OLE for Process Control)
     http://www.opcfoundation.org/
[R3] PVSS
     http://www.pvss.com/
[R4] PEM (Performance and exception monitoring)
     http://cern.ch/proj-pem/


1.4. TERMINOLOGY


Glossary
API                       Application Programming Interface
AD                        Actuator Dispatcher
FTA                       Fault Tolerance Actuator
FTCE                      Fault Tolerance Correlation Engine
RMS                       Resource Management System
HLD                       High Level Definition (as defined by WP4 configuration management task)
MR                        Monitoring Repository
MS                        Monitoring Sensor
MSA                       Monitoring Sensor Agent
PEM                       Performance and Exception Monitoring project
NVA API                   Node View Access API of the WP4 configuration management system
WBS                       Work Breakdown Structure







2. OVERVIEW

2.1. PURPOSE, SCOPE AND OBJECTIVES
The purposes of the Fault Tolerance framework delivered in Release 2 are:
       Deliver failover tools for computing nodes, based on information from monitoring.
       Use other fabric components to trigger automatic remedy actions.



The scope of the delivered system is:
       The delivered system should be able to correlate information provided by the monitoring task
        to detect unusual behaviour of a computing node.
       The delivered system should support only computing nodes and not additional parts of a grid
        such as storage systems, network switches and so on.
       The delivered system should use the WP4 configuration management subsystem for managing
        all configuration information.
       The delivered system should be able to react quickly to emergency situations to prevent
        damage to the hardware.
       The delivered system should provide a subscription mechanism for external applications to be
        notified when fault tolerance actuator software is started.
       The delivered system should use the GUI for visualization of alarms provided by the
        monitoring task.
       The delivered system should provide an open interface for plug-in actuators.
       The delivered system will not provide tools for controlling software run by a grid user
        or other WP4 software. Such tools are foreseen for Release 3.
       The delivered system will not provide a complete set of actuators or a full-featured
        correlation unit.
       The delivered system will not provide final sensor software, which will be provided by the WP4
        monitoring task. The delivered system will offer some simple sensors intended only for
        demonstrating and testing the fault tolerance software. These sensors will be replaced
        as soon as the monitoring software is available.
The objectives are:
       The delivered system should be demonstrated on a cluster with at least 1,000 nodes, and the
        design should not show any obvious limitation to scaling beyond that.
       The delivered system should allow several hundred independent quantities to be
        correlated on each node.
       The delivered system should be able to start emergency actuators in less than 5 seconds after
        receiving information about a hardware failure.
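
The subscription mechanism named in the scope could, for example, follow a simple observer pattern. The sketch below is only an illustration under stated assumptions: the class ActuatorStartNotifier, its methods and the event fields "node" and "actuator" are hypothetical names and are not part of any WP4 interface definition.

```python
from typing import Callable, Dict, List

# Hypothetical event payload: which actuator was started on which node.
ActuatorEvent = Dict[str, str]
Subscriber = Callable[[ActuatorEvent], None]


class ActuatorStartNotifier:
    """Keeps a list of subscriber callbacks and informs them of actuator starts."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []

    def subscribe(self, callback: Subscriber) -> None:
        """Register an external application's callback."""
        self._subscribers.append(callback)

    def unsubscribe(self, callback: Subscriber) -> None:
        """Remove a previously registered callback."""
        self._subscribers.remove(callback)

    def actuator_started(self, node: str, actuator: str) -> int:
        """Notify every current subscriber; return how many were notified."""
        event = {"node": node, "actuator": actuator}
        targets = list(self._subscribers)  # snapshot: callbacks may unsubscribe
        for callback in targets:
            callback(event)
        return len(targets)


# Usage: an external application simply collects the events it receives.
received: List[ActuatorEvent] = []
notifier = ActuatorStartNotifier()
notifier.subscribe(received.append)
notified = notifier.actuator_started("node042", "shutdown")
```

In this sketch the notification is synchronous and in-process; a real deployment across nodes would need a transport, which the present document does not specify.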

2.2. ASSUMPTIONS AND CONSTRAINTS
The developments should be synchronized with the monitoring task (which uses the PVSS control system
[R3] for fabric monitoring), the resource management system task and the configuration management
task. Specific items that have been identified are:
        Sensor interface (API).
        Interface to the monitoring database.
        Interface to the configuration management database.
        The kind of data to be stored in the configuration database.
        Interface to the resource management system (RMS).
More items may be added to the list during the detailed design phase.

2.3. PRELIMINARY DESIGN
The deployment diagram for the fault tolerance framework, with its interfaces to monitoring,
configuration management and the RMS, is shown in Figure 1. A further internal subdivision of the
Fault Tolerance Correlation Engine (FTCE) component exposes the detailed development steps that
have to be performed. A brief description of the components follows below.




                        Monitoring          Configuration DB               RMS


        MS                                                                             FTA
        MS                 MSA                   FTCE            Actuator Dispatcher   FTA
        MS                                                                             FTA

     Local Node                               Remote Node

Figure 1: Detailed deployment view of the fault tolerance component

2.3.1. Monitoring Sensors (MS)
Monitoring Sensors are part of the monitoring task and are discussed there.

2.3.2. Monitoring Sensor Agent (MSA)
The Monitoring Sensor Agent is part of the monitoring task and is discussed there.

2.3.3. Fault Tolerance Correlation Engine
The Fault Tolerance Correlation Engine is the most complex part of the framework. It is divided into
two parts for independent development: the correlation engine and the decision unit.
The correlation engine is the part that correlates two or more values to detect critical states on a
computing node, a cluster or a grid. Status: one prototype exists which can correlate two and three
values.
The decision unit checks the information from the correlation engine and advises the actuator
dispatcher to start an actuator. Status: one prototype exists which is implemented as a rule-based
system. This prototype will be replaced in the near future.
The communication between the fault tolerance correlation engine and the configuration management
system and the resource management system does not exist at the moment.
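
The split between correlation engine and decision unit can be illustrated with a small rule-based sketch. It is not the existing prototype: the Rule class, the quantity names (cpu_temp_c, fan_rpm, and so on), the thresholds and the actuator names are all hypothetical and chosen only to show a two-value and a three-value correlation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Readings = Dict[str, float]  # quantity name -> latest value on one node


@dataclass
class Rule:
    """Hypothetical rule: a condition over several quantities plus an actuator name."""
    name: str
    quantities: List[str]
    condition: Callable[..., bool]
    actuator: str


def correlate(rule: Rule, readings: Readings) -> bool:
    """Correlation step: evaluate the rule's condition on all required quantities."""
    if any(q not in readings for q in rule.quantities):
        return False  # incomplete data: do not report a critical state
    return rule.condition(*(readings[q] for q in rule.quantities))


def decide(rules: List[Rule], readings: Readings) -> Optional[str]:
    """Decision step: return the actuator of the first rule that fires, if any."""
    for rule in rules:
        if correlate(rule, readings):
            return rule.actuator
    return None


# Illustrative rules only; thresholds are assumptions, not measured limits.
RULES = [
    Rule("overheating", ["cpu_temp_c", "fan_rpm"],
         lambda temp, rpm: temp > 85.0 and rpm < 500.0, "shutdown"),
    Rule("runaway_load", ["load_avg", "free_mem_mb", "swap_used_mb"],
         lambda load, free, swap: load > 20.0 and free < 50.0 and swap > 1000.0,
         "restart_daemon"),
]

healthy = {"cpu_temp_c": 55.0, "fan_rpm": 3000.0}
critical = {"cpu_temp_c": 92.0, "fan_rpm": 0.0}
overloaded = {"load_avg": 35.0, "free_mem_mb": 12.0, "swap_used_mb": 1800.0}

for readings in (healthy, critical, overloaded):
    print(decide(RULES, readings))
```

Keeping correlate and decide as separate functions mirrors the independent development of the two parts: the rule-based decide could be replaced later without touching the correlation step.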

2.3.4. Actuator Dispatcher (AD)
The actuator dispatcher receives orders to start some actuator software from the FTCE. Status: the AD
is not implemented as an independent software module. A complete redesign is necessary to achieve
the objectives.

2.3.4.1. Fault Tolerance Actuator (FTA)
The fault tolerance actuators are independent software modules which are controlled by the actuator
dispatcher. These modules can be simple programs such as a shutdown, or more complex programs such
as a reinstallation actuator, which needs to be coordinated with the configuration management or the
install management system. Status: some less complex actuators exist.
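
One possible shape for an open plug-in interface between the dispatcher and its actuators is a name-based registry, sketched below. Since the AD still requires a redesign, this is only an assumption-laden illustration: the ActuatorDispatcher class, its register and dispatch methods, and the actuator signature are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical actuator signature: takes the node name, returns a status text.
Actuator = Callable[[str], str]


class ActuatorDispatcher:
    """Maps actuator names to plug-in callables and starts them on request."""

    def __init__(self) -> None:
        self._actuators: Dict[str, Actuator] = {}
        self.log: List[str] = []

    def register(self, name: str, actuator: Actuator) -> None:
        """Plug in a new actuator under a unique name."""
        if name in self._actuators:
            raise ValueError(f"actuator {name!r} is already registered")
        self._actuators[name] = actuator

    def dispatch(self, name: str, node: str) -> str:
        """Start the named actuator for a node, as ordered by the FTCE."""
        actuator = self._actuators.get(name)
        if actuator is None:
            raise KeyError(f"no actuator registered under {name!r}")
        status = actuator(node)
        self.log.append(f"{name}@{node}: {status}")
        return status


def shutdown(node: str) -> str:
    # Placeholder only: a real actuator would act on the node itself.
    return f"shutdown requested for {node}"


dispatcher = ActuatorDispatcher()
dispatcher.register("shutdown", shutdown)
result = dispatcher.dispatch("shutdown", "node042")
```

A more complex actuator, such as a reinstallation, would be registered the same way; its coordination with configuration or install management is outside this sketch.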

2.4. DELIVERABLES
The development items described in previous sections are summarized in the table below together with
a preliminary estimate of the required effort (PM = Person Months). The proposed delivery schedule
is subject to a WP4 collaboration agreement on the allocation of efforts for next year.






Label  Description                         Type                 Effort (PM)  Depends on     Proposed delivery

       Code developments
C1     Correlation Unit                    Code, doc, test      2                           Release 1.1
C2     Decision Unit rule-based            Code, doc, test      1                           Release 1.1
C3     Decision Unit second version        Code, doc, test      4                           Release 2
C4     Actuator Dispatcher                 Code, doc, test      0,5                         Release 1.1
C5     Actuator Part 1                     Code, doc, test      1                           Release 1.1
C6     Actuator Part 2                     Code, doc, test      2                           Release 1.2
C7     Actuator Part 3                     Code, doc, test      2                           Release 2

       Public interface definitions
PI1    Monitoring API definition           javadoc or doxygen   0,5          Monitoring     Release 1.1
PI2    Configuration Mgmt API definition   javadoc or doxygen   0,5          Configuration  Release 1.1

       Internal interface definitions
II1    FTCE - actuator dispatcher API      javadoc or doxygen   0,5                         Release 1.4

       Detailed design and unit test plans
D1     FTCE - CE                           Design report        0,5                         end-February 2002
D2     Actuator dispatcher                 Design report        0,5                         end-February 2002
D3     FTCE - DU rule-based                Design report        0,5                         end-April 2002
D4     FTCE - DU second version            Design report        1                           Release 2
D5     Actuators                           Design report        0,5                         Release 1.2

       Total                                                    17




Table 1: Fault Tolerance task deliverables. PM = Person Months.
The proposed release schedule in [A1] is as follows:
           Release 1.1 – end of January 2002
           Release 1.2 – end of March 2002
           Release 1.3 – end of May 2002
           Release 1.4 – end of July 2002
           Release 2 – end of September 2002 (fixed)
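
As a cross-check of Table 1, the effort figures can be summed overall and per proposed delivery. The short sketch below copies the labels, efforts and delivery targets from the table (decimal commas written as points); the variable names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (label, effort in person months, proposed delivery), copied from Table 1.
DELIVERABLES: List[Tuple[str, float, str]] = [
    ("C1", 2.0, "Release 1.1"),
    ("C2", 1.0, "Release 1.1"),
    ("C3", 4.0, "Release 2"),
    ("C4", 0.5, "Release 1.1"),
    ("C5", 1.0, "Release 1.1"),
    ("C6", 2.0, "Release 1.2"),
    ("C7", 2.0, "Release 2"),
    ("PI1", 0.5, "Release 1.1"),
    ("PI2", 0.5, "Release 1.1"),
    ("II1", 0.5, "Release 1.4"),
    ("D1", 0.5, "end-February 2002"),
    ("D2", 0.5, "end-February 2002"),
    ("D3", 0.5, "end-April 2002"),
    ("D4", 1.0, "Release 2"),
    ("D5", 0.5, "Release 1.2"),
]

# Overall effort: should reproduce the "Total" row of the table.
total = sum(effort for _, effort, _ in DELIVERABLES)

# Effort grouped by proposed delivery.
by_delivery: Dict[str, float] = defaultdict(float)
for _, effort, delivery in DELIVERABLES:
    by_delivery[delivery] += effort

print(f"Total effort: {total} PM")
for delivery, effort in by_delivery.items():
    print(f"{delivery}: {effort} PM")
```

The sum reproduces the 17 PM stated in the Total row, with the largest shares falling on Release 2 (7 PM) and Release 1.1 (5.5 PM).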

2.5. RISKS MANAGEMENT AND FALLBACKS
There are three identifiable risks with the proposed plan:
           Not all installed computing nodes will be able to use the complete set of sensors and actuators.


The fault tolerance software system contains a very low-level part which depends on the hardware
used, such as the processor or main board. In cases where this low-level part does not work, we will
offer a “light” version of the software with reduced functionality, and we will try to implement
interfaces between the specific hardware and our software.

2.6. PROJECT MONITORING AND CONTROL




      Internal coordination: weekly task meetings at Heidelberg.
      External coordination: bi-weekly WP4 phone conferences with WP4 members. Those
       meetings decide on corrective actions and possibly the re-allocation of effort.
      Progress is reported to the WP4 manager for inclusion in the normal WP4 quarterly reports.

2.7. WORK BREAKDOWN STRUCTURE (WBS)



