Docstoc

thesis

Document Sample
thesis Powered By Docstoc
					Fault Tolerance in an
Event Rule Framework
for Distributed Systems
  Hillary Caituiro Monge



                           1
Contents
1.   Introduction
2.   Related Works
3.   Overview of the Event Rule Framework (ERF)
4.   Overview of the Fault Tolerant CORBA
5.   Design of the Fault tolerant ERF (FT-ERF)
6.   Performance Analysis
7.   Conclusions



                                                  2
1. Introduction
   Justification
     Distributed Systems (DS)
     Fault Tolerance (FT)
     Reactive Components (RC)
     The Event Rule Framework (ERF)
     Motivation

   Objectives

                                       3
                 Distributed Systems (DS)
   A DS is a
       Collection of software components distributed among processors of
        heterogeneous platforms


   DSs purpose are:
       Sharing resources and workload, and
       Maximizing availability.                      PDA
                                                              Server   Workstation



   Design goals of DSs are:              Laptop

       Transparency,
       Scalability,
       Reliability and
                                                   Computer
       Performance.                                                          Mainframe
                                                                                          4
                   Fault Tolerance (FT)
 FT is the ability of a system to continue
  operating as expected, despite internal or
  external failures.
 DSs are prone to failures.
     Some faults can be detected.
     Some others cannot be detected.
   FT of a DS can be improved through the
    redundancy, i.e. replication of its hardware
    or software components.
                                               5
           Reactive Components (RC
   Reactive Components
     React to external   stimulus (i.e. events)
     Initiate action


   A RC can be
     Asynchronous or synchronous
     Non-deterministic or deterministic


   A reactive component can be asynchronous and
    non-deterministic (ANDRC).

                                                   6
     The Event Rule Framework (ERF)

   An example of a DS framework having ANDRCs is:
     ERF   (Event/Rule Framework).
   ERF
     Developed   at the Center for Computing Research and
      Development of University of Puerto Rico – Mayagüez
      Campus.
     It is an Event-Rule Framework for developing distributed
      systems.
   In ERF, Events and rules are used as abstractions
    for specifying system behavior.
                                                            7
                                 Motivation (1/2)
   There is a challenge to achieve fault tolerance in
    ANDRCs.
   In non-deterministic components:
     The output could   be different;
     Even if the same   sequence of stimuli is input with the
      same initial state.
   Since the component is asynchronous:
     Timing   assumptions are not valid.
   Moreover, ANDRCs behavior fulfills the
    Heisenberg’s uncertainly principle:

                                                                 8
                                 Motivation (2/2)
   Existing fault-tolerance techniques
     Failure   detectors
        Timing assumptions
        Synchronous or semi synchronous systems,

     State   transfer protocols
        Deterministic systems
        Very intrusive

     Duplicates
               detection and suppression
      mechanisms
          Sequencers
                                                    9
                                     Objectives (1/2)
   This research is about the use of active and
    semi-active replication techniques for achieving
    fault tolerance in ERF, which is a framework that
    uses ANDRCs.

   Active replication technique
     Allreplicated components accept third-party incoming
      events.
     A middle-tier component is in charge of
           Event multicasting
           Detecting and suppressing duplicated events.
                                                           10
                              Objectives (2/2)
   Semi-active replication technique
     Allreplicated components accept third-party incoming
      events
     Only one (“the leader”) is able to post events,
     Backup replicas listen to the leader to make a
      consistent production of events.
     Each replicated component is in charge of the
      detection and suppression of duplicated events.



                                                         11
2. Related Works
 Generic support of FT in DSs
 FT Event-Based DSs




                                 12
        Generic support of FT in DSs
   OMG Fault-Tolerant CORBA Standard

                  IRL                   Eternal       OGS           SENSEI
OGM FT-CORBA Yes                        Yes           Soon          No
Compliant
Design Style      Non-intrusive         Non-intrusive Non-intrusive Non-intrusive
Interoperability  Free                  Expensive     Free          Free
Replication Logic Centralized with      Group toolkit Group toolkit Group toolkit
Implementation    Passive Replication
Asynchronism      Yes                   No            No            No
support
Non-determinism No                      No            No            No
support


                                                                              13
                     FT Event-Based DSs



                  NODS              ISEE           YEAST

Fault tolerance   Framework RS2.7   Not directly   Watchd and
                                    supported      Libft




                                                            14
3. Overview of the Event Rule
   Framework (ERF)
   Model
     Event Model
     Rule Model
     Behavioral Model

   Components
     EventChannel
     RUBIES

   Architecture of ERF-CORBA

                                15
                                                                           Model
                                     package erf;
   Event Model
                                     import erf.lang.*;
       ERF provides the event       import java.io.Serializable;

        abstraction to represent     public class Event implements Serializable
                                     {
        significant occurrences          /* Attributes */
                                         public String id = "";
        in a distributed system.         public TimeValue ttl;
                                         public TimeValue daytime;
                                         public DistributedObject producer;

       i.e. Flood alert system.         /* Methods */
                                         public TimeValue t() {...}
                                         public TimeValue ts() {...}
                                         public TimeValue ttl() {...}
       The base class Event             public void setttl(long tv) {...}
                                         public DistributedObject getProducer() {...}
        defines the structure and        public void setProducer(DistributedObject producer) {...}
                                         public void sett(long tv) {...}
        behavior applicable to all       public String pName() {...}
                                         public boolean isDead() {...}
        types of events.                 public String getTypeName() {...}
                                     }


                                     Figure 3.2 Java definition of the class Event
                                                                                            16
                                                                             Model
   Rule Model
     In ERF, the behavior of a DS is defined in terms of
      rules.
     A rule is an algorithm that is triggered when events in
      the event set match a rule’s event pattern

              [package <package_specification> ]
              rule <rule_id>
              [priority <priority_number>]
              on <trigger_events>
              [use <usage_specification>]
              [if <condition> then <actions> [else <alternative_actions>]]
              [do <unconditional_actions>]




           Figure 3.5 Syntax of rule definition language (RDL)
                                                                                 17
                                          Model
   Behavioral Model
     Defines  how rules are triggered and evaluated
      upon the occurrence of events.
     Evaluation of rules needs to be made
      periodically because RUBIES receive events
      constantly.
     The evaluation of rules is performed based on
      a rule priority.

                                                  18
                        Components (1/2)
   Event Channel
     Is a middleware distributed component
     It allows sending events to consumers.
     It allows receiving events from producers.
     Events are treated as objects.




                                                   19
                        Components (2/2)
   Rule Based Intelligent Event Service
    (RUBIES)
     Is  the main component of ERF.
     It is an engine that handles events through
      the evaluation of rules.
     RUBIES is a distributed component
     It is registered to the event channel both as a
      consumer and as a producer.

                                                    20
Architecture of ERF-CORBA
StructuredEvent
                   *                                                             RUBIES

         *
                                             *


                                CORBAEventChannel




                                                     StructuredPushSupplier    RuleCompiler
             StructuredPushConsumer




         StructuredProxyPushSupplier             StructuredProxyPushConsumer




                                      *
                                      EventChannel

              Figure 3.8 Architecture of ERF-CORBA                                            21
4. Overview of the Fault
   Tolerant CORBA (FT-CORBA)
 Fault Tolerant CORBA (FT-CORBA)
 Replication Management
 Fault Management
 Logging and Recovery Management




                                    22
                  Fault Tolerant CORBA
   Adopted by OMG through 2000.
   Commitments rather than a solution.
   Full interoperability among different products.
   It provides support for applications that require
     High levels of reliability
     With minimal modifications.


   This research was addressed to be compliant
    with this standard.
                                                        23
                           Replication Management
                                                               Replication management
                      GenericFactory
                                                                covers a Fault Tolerant
                                                                Domain.
ObjectGroupManager
                                          PropertyManager
                                                               It is done through the
                                                                Replication Manager
                     ReplicationManager
                                                                component, which
                                                                inherits from the
        Figure 4.3 Hierarchy of the                             Property Manager,
         Replication Management                                 Object Group Manager,
                                                                and Generic Factory
                                                                components.
                                                                                      24
                                               Fault Management
                                                              It includes the Fault
                                                               Notification, Fault
StructuredPushConsumer    FaultNotifier    FaultAnalyzer       Detection, and Fault
                                                               Analysis services.

SequencePushConsumer     FaultDetector                        The Fault Notifier sends
                                                               informs to its consumers.
                                                              The Fault Detectors are
                         PullMonitorable                       connected to replicas or
                                                               host and provide “faults” to
   Figure 4.8 Architecture of Fault                            the Fault Notifier.
            Management                                        The Fault Analyzer
                                                               analyzes faults and produce
                                                               reports to the Fault Notifier.
                                                                                         25
                     Logging and Recovery
                              Management
   Loggin mechanism.
     Log   the state of the primary member.

   Recovery mechanism
     Acton fails or for new members.
     Recover from the log to the new primary.


   Consistency must be controlled by the
    infrastructure.
                                                 26
5. Design of the Fault tolerant
   ERF (FT-ERF)
   Scalability and Fault Tolerance Problems in ERF
    CORBA
   Architecture of Scalable and Fault Tolerant ERF
   Architecture of Fault-Tolerant ERF-CORBA
   EID Uniqueness
   Events and Pattern equality rules.
   Pattern Management
   Active Replication
   Semi-Active Replication

                                                  27
         Scalability and Fault Tolerance
              Problems in ERF CORBA
                                           RUBIES




                                           (b)

                       (a)




                       RULES

                             DB




Figure 5.1 Two possible points of scalability and fault-tolerance problems
    in ERF: (a) the size of the rules database; (b) a crash of RUBIES.
                                                                        28
                                         DISTRIBUTION DIMENSION




                        RUBIES             RUBIES                                 RUBIES
                        (γ11, δ1)          (γ21, δ2)                              (γN1, δN)
REPLICATION DIMENSION




                        RUBIES             RUBIES                                 RUBIES
                        (γ12, δ1)          (γ22, δ2)                              (γN2, δN)




                        RUBIES             RUBIES                                 RUBIES
                        (γ1M, δ1)          (γ2M, δ2)                             (γNM, δN)




                            Figure 5.3 Architecture of Scalable and fault-tolerant ERF
                                                                                              29
                                      PullMonitor                Monitorable



        PropertyManager



                                    FaultDetector              PullMonitorable



       ReplicationManager                                                           RUBIES
                                               Updateable



                                                                                             RuleCompiler
                               FaultNotifier                FTRUBIESIntOperations




             GenericFactory
                                  FTRUBIESServant

                                                                                  FTRUBIES
ObjectGroupManager




                              ServerRequestInterceptor                    ClientRequestInterceptor

   ObjectGroup


                                 ServerIOGRSupport                             ClientIOGRSupport



                                                     CORBAEventChannel

                       Figure 5.3 Architecture of FT ERF-CORBA                                              30
                      EID Uniqueness (1/2)
   Each event in the system need to be uniquely
    identified by an event identifier
     EID.
   EID uniqueness must guaranteed in different
    contexts
     Local,   replication group, system.
   The use of sequencers is an option to achieve
    EID uniqueness
     Each   replica start a sequencer.
     But, is only valid with deterministic components.

                                                          31
                       EID Uniqueness (2/2)
   Events can be identified by its history.
     Each   event is produced due to an event pattern.
   Such history includes
     The list of previous events that triggered the event.
     The function or rule that caused its production.




       Figure 5.5 Conceptual View of the Event Unique Identification   32
             EVENT EQUALITY RULE
   Two events are equal if:
     Both are of the same Type.
     Both were produced due to the same Rule.
     Both have the same Order of production in the
      time when the Rule was triggered.
     Both have the same Pattern.




                                                  33
      PATTERN EQUALITY RULE
   Two event patterns are equal if:
     Both have the same number of events.
     Both have events in the same order.
     Two events for the same position are equal if
      the Event Equality Rule is accomplished as
      previously defined.




                                                      34
           Pattern Management (1/2)
   Rules use a pattern management framework
     To prevent events being triggered more than once for
      a given event pattern.
   In this framework, patterns are defined in terms
    of:
     Source   events (i.e., events that cause rules to trigger)
      and
     Target events (i.e., events that are produced by
      rules).

                                                              35
                  Pattern Management (2/2)
                                                                        Rule
                               SourceEvents
                 0..*



         Event                                 Pattern              PatternManager
                                                         0..*   1



                                TargetEvents
                        0..*
                                                                       Indexer
                  Figure 5.6 Architecture of Pattern Management

   The framework has three main components for pattern management:
       Pattern Manager to manage patterns of events.
       Pattern to store patterns of events.
       Indexer to organize patterns of events.

                                                                                     36
                  Active Replication (AR)
   For systems with tight time constraints.
   All replicas are running at the same time.
     Are   accepting events.
     Are   sending events.
   So, duplicated events are going around.
   Therefore, it is crucial.
     To detect and suppress duplicated-events.
     To deliver a unique reply.
     To keep consistency.
     To be fault tolerant transparent.


                                                  37
                          AR: Pattern Naming
   For Duplicated-Events Detection and
    Suppression
   Is a centralized Mid-tier component that
       Through an analysis of an event’s history
       Detects if the event has already been delivered.
   It relies on two primitives:
       Event binding
           Register an event.
       Pattern solving.
           Resolve if an equivalent event was already delivered.

                                                                    38
                    AR: Pattern Naming

                                      1       0..*
   EventChannel
                             PatternContext     PatternName
          1                    1



          0..1           1

         1              1
CORBAEventChannel
                     EventHandler
      1



          0..*


    FTRUBIES


    Figure 5.9 Architecture of the Pattern Naming
                                                              39
    Semi-Active Replication (SAR)
   For systems with relatively loose time constraints.
   All replicas are running at the same time.
   Only the primary is able to reply to clients.
     When   the primary fails, a new member is selected.
     When a backup member fails, it is released from the
      group.
   Failure detectors are used to detect failures in
    group members.
     Time   delay before the selection of new primary (sec).
                                                            40
                SAR: Production Controller
   For Duplicated-Events Detection and Suppression
   It is distributed within each replica.
   The following algorithm is executed on backup members.
    On incoming event P from the primary
        If in queue BQ is an equivalent event B for the event P then
             Update B.id with P.id across the entire system
             Remove P
        Else
             Enqueue P in PQ
    On produced event B from the backup
        If in queue PQ is an equivalent event P for the event B then
             Update B.id with P.id across the entire system
             Remove P
        Else
             Enqueue B in BQ
    On fail and if the backup replica is elected as new primary
        Post all events of the queue BQ                                41
                    SAR: Production Controller

                                                               -primaryEvent
                   EventChannel                                                               0..*
                          1                          Updater                      Event              SourceEvents
                                                                -backupEvent

                         0..1                                                         0..*
                                                                               TargetEvents             Pattern
                CORBAEventChannel                                                                      0..*
                     1

                                                                                                        1
                         0..*                                                  Rule                  PatternManager
                                        ProductionController
                         1          1
FaultNotifier       FTRUBIES

                     Figure 5.14 Architecture of the Production Controller
                                                                                                                  42
6. Performance Analysis
 Objectives
 Methodology
     Test Scenarios
     Test Procedure

   Test Results



                          43
                                            Objectives
   Measure the execution time of fault-tolerant ERF
    using active and semi-active replication
    techniques for:
     An increasing number of replicas.
     An Increasing number of failures.
     An increasing workload.
   Compare the execution time of:
     Active versus semi-active replication techniques.
     Failure-free versus failure execution scenarios.
     Fault-tolerant   versus non fault-tolerant execution.


                                                              44
                                 Test Scenarios: Services
                                              distribution
          ece                        china                                                         sorelsvr1

                                         test                                                                  event
                  http server                                                                                 channel
                                         app                    *                          *

                                                                          10. jobo
                   name
                   server                                                       factory
                                                                      9.
                                                                      * quayaba
                                                                                                          *

                  Replication        *                               8. chironja factory
                                     *                               *                  ft
          *
                   manager
                                                                7. guineo factory rubies
      *                                                          *                                *
                                                                                     ft
                                 *
                                 *                          6. acerola factory rubies
                fault detector                               *                                *
                                                                                  ft
                                                       5. pajuil     factory rubies
                                                        *                     ft          *
                                                   4. quenepafactory rubies                                *
                                                   *                                  *
                                                                         ft
                                               3. toronja factoryrubies
                                               *
                                                                    ft             *
                                           2. melon       factory rubies                    *
                                            *
                                                                 ft             *           *     adaselsvr1
                                 *     1. parcha factory rubies                                              corba
                                       *                     ft             *                                event
                                     *                                                          *
                                                                                                *           channel
                                               factory    rubies
                                                       ft              *
                                 *                  rubies
                                                 fault          *
                                                tolerant
                                                rubies
                                                            *



Figure 7.1 UML deployment diagram of the test environment. (The domain for
                                                                       45
                     all computers is ece.uprm.edu)
     Test Scenarios: Failure schedule:
                       First scenario
   Six workstations,
   3 to 8 replicas.
   193 rules.
   Failure schedule defined by power set F.


Where
 n is the number of replicas
 f(p=n) = ∞
 f(p=1...n-1) = p*T/n determines the time of the failure
       p is the position of the replica in the sub set
       T is the arithmetic average of the execution time of ten free
        failure runs with n replicas.                                   46
     Test Scenarios: Failure schedule:
                    Second scenario
   Ten workstations,
   Ten replicas.
   193 rules.
   Failure schedule defined by set G.


Where
 n is the number of replicas
 g(p=n) = ∞
 g(p=1...n-1) = p*T/n determines the time of the failure
       p is the position of the replica in the sub set
       T is the arithmetic average of the execution time of ten free
        failure runs with n replicas.                                   47
      Test Scenarios: Failure schedule
                       Third scenario
   Ten workstations,
   Ten replicas.
   Six rule sets of 6, 12, 24, 48, 96, and 193 rules
    each time.
   The failure schedule was given by the function
    G(n) defined for the second scenario.




                                                        48
Test Scenarios: Test application
   Client consumer/producer of the event channel.

   It starts sending two events of GageLevelReport
    type to start the test
   It ends its execution when an event of the
    TestEventEnd type arrives.

   Measures the execution time
     Starting just after second event is posted, and
     Ending just after a event of TestEventEnd type
      arrives.
                                                        49
      Methodology: Test Procedure
   The procedure consisted of three major steps:
     Clearthe environment;
     Launch the infrastructure; and
     Run the test application.

   The results are
     The   arithmetic media of 10 runs on each test case.
   The arithmetic media of the standard deviation
    was 1.46%

                                                             50
                            140000



                            120000



                            100000
      Time (milliseconds)




                              80000



                              60000



                              40000



                              20000                                                                           8
                                                                                                          7
                                                                                                      6
                                      0
                                                                                                  5
                                          0                                                   4
                                                1
                                                       2
                                                                3
                                                                          4
                                                                              5
                                                                                          3        Number
                                                                                                                  1722
                                              -608   Number of failures           6
                                                                                      7
                                                                                                  of replicas




Figure 7.4 Cost of fault tolerant ERF execution using active replication
technique with increasing number of replicas and number of failures.
                                                                                                                         51
                        140000



                        120000



                        100000
      Time (msec) (t)




                         80000


                          60000


                          40000


                          20000                                                                          8
                                                                                                     7
                                                                                                 6
                                 0
                                                                                             5
                                     0                                                   4
                                           1                                                 Number of
                                                  2
                                                           3
                                                                     4
                                                                         5
                                                                                     3        Replicas       956
                                         3051   Number of Failures           6
                                                                                 7



  Figure 7.5 Cost of fault tolerant ERF execution using semi-active
replication technique with increasing number of replicas and number
                               of failures.                                                                        52
                    160000

                    140000

                    120000
  Time (msec) (t)




                    100000

                     80000
                                              Failure-free !!!
                     60000

                     40000

                     20000

                        0
                             3     4    5          6       7           8         9     10
                                            Number of Replicas (n)


                                 1406       Failure-free       (n-1) failures   1176

Figure 7.6 Impact of (n-1) failures execution over failure- free execution
                    in active replication technique.
                                                                                            53
                    160000

                    140000

                    120000
  Time (msec) (t)




                    100000

                    80000

                    60000

                    40000

                    20000

                        0
                             3     4    5          6       7           8        9      10
                                            Number of Replicas (n)


                                 1377       Failure-free       (n-1) failures   5341

Figure 7.7 Impact of (n-1) failures execution over failure-free execution
                 in semi-active replication technique.
                                                                                            54
                  160000

                  140000

                  120000
Time (msec) (t)




                  100000

                   80000

                   60000

                   40000

                   20000

                      0
                           3      4       5         6       7          8     9    10
                                              Number of Replicas (n)


                                   1406         Active       Semi active   1377

                  Figure 7.8 Comparison between active and semi active replication
                                techniques on failure-free scenario.
                                                                                       55
                  160000

                  140000

                  120000
Time (msec) (t)




                  100000

                  80000

                  60000

                  40000

                  20000

                      0
                           3      4       5         6       7          8      9   10
                                              Number of Replicas (n)


                                   1176         Active       Semi active   5341

                  Figure 7.9 Comparison between active and semi active replication
                                techniques on nine-failures scenario.
                                                                                       56
                  160000

                  140000

                  120000
Time (msec) (t)




                  100000

                  80000

                  60000

                  40000

                  20000

                      0
                           0        500          1000          1500         2000           2500    3000
                                                               Events

                               RUBIES     Non fault tolerant     Failure-free      Nine failures

 Figure 7.10 Workload impact on the time of the failure-free execution and
   of the nine-failures execution using the active replication technique.
    Overload respect to the non fault tolerant execution and to the pure
                            RUBIES execution.                             57
                  160000

                  140000

                  120000
Time (msec) (t)




                  100000

                   80000

                   60000

                   40000

                   20000

                       0
                           0       500       1000             1500         2000        2500         3000
                                                            Events

                                    RUBIES     Non fault tolerant    Failure-free   Nine failures


                  Figure 7.11 Workload impact on the time of the failure-free execution and
                   the nine-failures execution using the semi-active replication technique.
                     Overload respect to non fault tolerant execution and pure RUBIES
                                                  execution.                                58
7. Conclusions
   The performance results of the implementation of the
    active and semi-active replication techniques
       Shows linear time curves both
            For increasing number of replicas, failures and work load.
   Therefore,
       The proposed solutions were proved to be feasible, and
       Their performance results were proved to be acceptable.

   Additionally,
       The active replication technique has better overall performance
        than semi-active replication technique.


                                                                          59
                Research Contributions
   Active Replication for Asynchronous, non-
    deterministic Reactive Components
     Tighttime constrains.
     The replication logic is in a centralized component
     Advantages
          It is not significantly affected by either increasing number of
           replicas, failures, or workload.
          Both clients and replicas need not to be aware of the
           replication mechanism.
          It can be used for large distributed systems.
     Disadvantage
        Relies on a centralized component.

                                                                             60
               Research Contributions
   Semi-active Replication for Asynchronous,
    non-deterministic Reactive Components
     Loose time constraints
     The replication logic is distributed.
     Advantages
          The replication mechanism is distributed,
     Disadvantages
          Relies on failure detectors.


                                                       61
Questions and/or Comments?




          ?
                         62

				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:11
posted:4/29/2011
language:English
pages:62