Network Fault Tolerance

					HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK




     Zuverlässige Systeme für Web und E-Business
     (Dependable Systems for Web and E-Business)

                                Lecture 9

                NETWORK FAULT TOLERANCE


                          Winter Semester 2000/2001

                     Instructor: Prof. Dr. Miroslaw Malek

                   www.informatik.hu-berlin.de/~rok/zs


                                  DS - IX - NFT - 1
       NETWORK FAULT TOLERANCE


• OBJECTIVES:
  – TO INTRODUCE FAULT TOLERANCE TECHNIQUES USED IN
    COMPUTER NETWORKS



• CONTENTS:
  – COMPUTER NETWORKS
  – BASIC TECHNIQUES
  – EXAMPLE: MULTISTAGE NETWORKS




                       DS - IX - NFT - 2
               COMPUTER NETWORKS

•   PACKET SWITCHING VS. CIRCUIT SWITCHING
•   POINT-TO-POINT VS. INDIRECT
•   STATIC VS. DYNAMIC
•   SINGLE PATH VS. MULTIPATH

• EXAMPLES:
    –   BUS
    –   RING
    –   MULTISTAGE (e.g., BANYAN)
    –   CUBE
    –   STAR
    –   TREE



                             DS - IX - NFT - 3
               BASIC TECHNIQUES


• RETRY (RETRANSMISSION)
• COMPLEMENTED RETRY WITH CORRECTION
• REPLICATION (e.g., dual bus)
• CODING
• SPECIAL PROTOCOLS (single handshake, double handshake,
  etc.)
• TIMING CHECKS
• REROUTING
• RETRANSMISSION with SHIFT (INTELLIGENT RETRY)
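
One way to read "complemented retry with correction" is: send the word, send its bitwise complement, and compare the two copies. A line that is stuck at 0 or 1 damages at most one of the two copies at that position, so a simple check can pick the good one. The Python sketch below is an illustration under stated assumptions, not the lecture's definition. It assumes a 9-line link (8 data bits plus one even-parity bit), at most one stuck line, and that both copies are always sent. The names `add_parity`, `stuck_line` and `complemented_retry` are hypothetical.

```python
WIDTH = 9                      # assumed layout: 8 data bits + 1 even-parity bit
MASK = (1 << WIDTH) - 1


def add_parity(data: int) -> int:
    """Append an even-parity bit as bit 8 of the word."""
    parity = bin(data & 0xFF).count("1") & 1
    return (data & 0xFF) | (parity << 8)


def parity_ok(word: int) -> bool:
    """True if the 9-bit word contains an even number of ones."""
    return bin(word & MASK).count("1") % 2 == 0


def stuck_line(word: int, bit: int, value: int) -> int:
    """Model a link whose line `bit` is stuck at `value` (0 or 1)."""
    if value:
        return (word | (1 << bit)) & MASK
    return word & ~(1 << bit) & MASK


def complemented_retry(data: int, channel) -> int:
    """Send the word and its complement; return the recovered data byte."""
    word = add_parity(data)
    first = channel(word)                      # true copy as received
    second = ~channel(~word & MASK) & MASK     # complemented copy, re-inverted
    # With a single stuck line, the two copies differ only at that line,
    # and exactly one of them still has correct parity.
    for candidate in (first, second):
        if parity_ok(candidate):
            return candidate & 0xFF
    raise ValueError("uncorrectable: more than one faulty line?")
```

For example, `complemented_retry(0xA5, lambda w: stuck_line(w, 3, 1))` recovers `0xA5` even though line 3 is stuck at 1. A real switch would normally send the complemented copy only after an error has been detected.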




                         DS - IX - NFT - 4
                 EXAMPLE
        MULTICOMPUTER NETWORKS (1)


• OBJECTIVE:
   – RELIABLE, TIMELY, HIGH-BANDWIDTH DATA TRANSFER


• ISSUES:
   –   FAULT IMPACT
   –   RELIABILITY EVALUATION
   –   TESTING
   –   FAULT DIAGNOSIS
   –   RECOVERY
   –   FAULT TOLERANCE




                           DS - IX - NFT - 5
               EXAMPLE
      MULTICOMPUTER NETWORKS (2)
• LEVELS:
   – SWITCH LEVEL
      •   CODES
      •   PROTOCOLS
      •   CONTROL
      •   DATA
      •   TIME
   – SYSTEM LEVEL
      •   CODES
      •   PROTOCOLS
      •   CONTROL
      •   DATA
      •   TIME




                      DS - IX - NFT - 6
       MULTICOMPUTER NETWORK
    FAULT CLASSES AND THEIR IMPACT
• FAULT CLASS I - DATA LINK OR DATA REGISTERS
   –   STUCK AT 0
   –   STUCK AT 1
   –   OR-BRIDGE
   –   AND-BRIDGE


• FAULT CLASS II - CONTROL LINES
   – (DATA VALID LINE) STUCK AT VALID
   – (REQUEST/ACK) STUCK-AT-0, STUCK-AT-1
   – (DATA STROBE) STUCK-AT-1, STUCK-AT-0
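
The Class I faults listed above can be written down as small bit-level functions, which is handy for fault-injection experiments. This is a minimal Python sketch. It assumes a link word is held in an integer with line 0 as the least significant bit; the function names `stuck_at` and `bridge` are made up for this example.

```python
def stuck_at(word: int, line: int, value: int) -> int:
    """Force `line` of the word to `value` (stuck-at-0 or stuck-at-1)."""
    if value:
        return word | (1 << line)
    return word & ~(1 << line)


def bridge(word: int, a: int, b: int, kind: str) -> int:
    """Short lines a and b: both carry the OR (or the AND) of their values."""
    if kind not in ("or", "and"):
        raise ValueError("kind must be 'or' or 'and'")
    bit_a = (word >> a) & 1
    bit_b = (word >> b) & 1
    shared = (bit_a | bit_b) if kind == "or" else (bit_a & bit_b)
    cleared = word & ~((1 << a) | (1 << b))      # drop the two original bits
    return cleared | (shared << a) | (shared << b)
```

With lines 0 and 1 bridged, the word `0b0001` becomes `0b0011` under an OR-bridge and `0b0000` under an AND-bridge. The same functions apply to Class II faults if the control lines are packed into a word in the same way.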




                         DS - IX - NFT - 7
                   FAULT IMPACT
• DATA BIT ERROR
  – NO IMMEDIATE IMPACT, BUT THE ERROR WILL SHOW UP AT
    HIGHER LEVELS LATER. IT MAY BE OUTSIDE THE SPHERE OF
    CONTROL BY THE TIME IT IS DETECTED.
• ADDRESS TAG ERROR
  – DATA PACKET CANNOT REACH THE INTENDED
    DESTINATION. THIS MAY CAUSE WRONG DATA TO BE
    RETRIEVED.
• STUCK AT SOME VALID CONFIGURATION
  – DATA PACKET WILL BE MISDIRECTED
• OPEN CONNECTION
  – COMPLETE DATA LOSS
• SHORT CONNECTION
  – MAY CAUSE A BROADCASTING EFFECT; THE DATA PACKET IS
    MISDIRECTED

                         DS - IX - NFT - 8
    THE FAULT IMPACTS CAN BE GROUPED
                  INTO:
1. CORRUPTED DATA

2. LOST DATA

3. UNEXPECTED DATA




•   THESE FAULTS CAN BE ABSTRACTED FROM THE SWITCH
    AND MODELED BY A FAULTY CHANNEL THAT WILL
    CORRUPT, LOSE, OR DELAY DATA TRANSMITTED THROUGH
    IT.
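
The faulty-channel abstraction can be sketched as a small simulation object. This Python sketch is only one possible model. The probabilities, the single-bit corruption, the one-tick delay and the class name `FaultyChannel` are all assumptions chosen for illustration; none of them comes from the lecture.

```python
import random
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class FaultyChannel:
    p_corrupt: float = 0.0   # probability that one bit is flipped
    p_lose: float = 0.0      # probability that the word is dropped
    p_delay: float = 0.0     # probability that the word arrives one tick late
    width: int = 8           # number of data lines (assumed)
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def send(self, word: int) -> Tuple[Optional[int], int]:
        """Return (received word, delay in ticks); the word is None if lost."""
        if self.rng.random() < self.p_lose:
            return None, 0
        if self.rng.random() < self.p_corrupt:
            word ^= 1 << self.rng.randrange(self.width)   # flip one random bit
        delay = 1 if self.rng.random() < self.p_delay else 0
        return word, delay
```

A fault-free channel (`FaultyChannel()`) returns the word unchanged with zero delay, while `FaultyChannel(p_lose=1.0)` drops every word. Detection and recovery schemes can then be tested against the channel without modeling the switch internals.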


                       DS - IX - NFT - 9
      WHERE TO DETECT AND RECOVER

•   THERE ARE THREE LEVELS WHERE WE CAN PERFORM
    ERROR DETECTION AND RECOVERY

1. SWITCH LEVEL

2. PME LEVEL

3. SOFTWARE LEVEL




                      DS - IX - NFT - 10
                SWITCH LEVEL

• COSTS THE LEAST (IN TERMS OF COMPUTATION) TO
  RECOVER

• HAS THE HIGHEST COVERAGE: MOST ERRORS ARE WITHIN
  THE "SPHERE OF CONTROL"

• NEEDS EXTRA HARDWARE

• THE DESIGN OF THE DETECTION/CORRECTION MECHANISM
  NEEDS TO CONSIDER IMPLEMENTATION LIMITS SUCH AS
  LOGIC COMPLEXITY AND I/O PIN USAGE




                      DS - IX - NFT - 11
           LOCALIZED RECOVERY

• SINCE 99 PERCENT OF ERRORS ARE "SOFT", RETRY IS
  AN EFFECTIVE WAY TO RECOVER FROM FAULTS

• 100 PERCENT COVERAGE OF SINGLE MESSAGE LOSS

• REQUIRES ONLY A MODEST NUMBER OF PINS

• ERROR-CORRECTING CODES HAVE A PROHIBITIVE PINOUT
  (62% OVERHEAD FOR AN 8-BIT DATA CHANNEL).
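
The slide does not say how the 62% figure was obtained. One calculation that lands on it assumes a single-error-correcting, double-error-detecting (SEC-DED) Hamming code. For 8 data bits this needs 4 Hamming check bits plus 1 overall parity bit, so 5 extra lines on top of 8, or 62.5%. The Python sketch below reproduces that arithmetic; the function names are hypothetical.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_overhead(data_bits: int) -> float:
    """Extra lines per data line for a SEC-DED Hamming code."""
    check = hamming_check_bits(data_bits) + 1   # +1 overall parity bit
    return check / data_bits
```

Here `secded_overhead(8)` evaluates to `0.625`. By comparison, a single parity bit for detection followed by retry costs 1 extra line in 8, or 12.5%, which is consistent with the slide's preference for retry at the switch level.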




                       DS - IX - NFT - 12
     FAULT TOLERANCE TECHNIQUES
        FOR GLOBAL RECOVERY
1. DYNAMIC FULL ACCESS (DFA)
   – IF THE NETWORK GRAPH IS MAXIMALLY CONNECTED,
     RECOVERY IS FEASIBLE


2. MULTIPLE NETWORKS (FAULT TOLERANCE + IMPROVED
   PERFORMANCE)
   – WITH OR WITHOUT BRIDGES


3. REDUNDANT SWITCHES

4. EXTRA-STAGE

5. CODING
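
Dynamic full access is commonly understood as: every node can still reach every other node in a finite number of passes through the network, relaying through intermediate nodes when a direct path is faulty. Under that assumption, the property can be checked as reachability on a "one-pass" graph. The Python sketch below is illustrative only. The dictionary `one_pass`, which maps each node to the set of nodes it can reach in a single pass under the current faults, is a hypothetical input format.

```python
from typing import Dict, Set


def reachable(one_pass: Dict[int, Set[int]], start: int) -> Set[int]:
    """All nodes reachable from `start` using any number of passes."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in one_pass.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def has_dfa(one_pass: Dict[int, Set[int]]) -> bool:
    """True if every node can reach every other node in some number of passes."""
    nodes = set(one_pass)
    for targets in one_pass.values():
        nodes |= targets
    return all(reachable(one_pass, n) >= nodes for n in nodes)
```

For instance, `has_dfa({0: {1}, 1: {2}, 2: {3}, 3: {0}})` is `True`: faults have removed most direct paths, but data can still be relayed around the ring. `has_dfa({0: {1}, 1: {0}, 2: {3}, 3: {2}})` is `False` because the network has split into two halves.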

                       DS - IX - NFT - 13
                  PME LEVEL

• THERE ARE 8 BYTES IN ONE REQUEST, THEREFORE 3
  EXTRA BITS MAY BE NEEDED FOR SEQUENCING. ON A
  4X4 UNIDIRECTIONAL SWITCH, THIS MEANS 24 MORE
  PINS.
• FOR REQUESTS WHOSE RELATIVE ORDER NEEDS TO BE
  KEPT, SOME EXTRA BITS ARE NEEDED OR ELSE
  SEQUENTIAL CONSISTENCY MAY BE VIOLATED.
  ANOTHER WAY TO GET AROUND THIS IS TO ALLOW ONLY
  ONE OUTSTANDING REQUEST FOR SHARED DATA.
• HOWEVER, NOT ALL SHARED DATA MAY BE USED FOR
  SYNCHRONIZING, SO A FENCE COUNTER SHOULD BE
  PROVIDED TO LET THE PROGRAMMER DECIDE ON THE
  NUMBER OF ALLOWED OUTSTANDING REQUESTS.
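
The pin count in the first bullet can be reproduced with a short calculation. The Python sketch below assumes that the 8 bytes of a request are numbered individually (hence 3 bits) and that every input and output port of the 4x4 unidirectional switch receives its own set of sequence lines. The function name `sequencing_pins` is made up for this example.

```python
def sequencing_pins(bytes_per_request: int, inputs: int, outputs: int):
    """Return (sequence bits per port, total extra pins on the switch)."""
    # Bits needed to number bytes 0 .. bytes_per_request - 1.
    seq_bits = (bytes_per_request - 1).bit_length()
    ports = inputs + outputs          # unidirectional: separate in and out ports
    return seq_bits, seq_bits * ports
```

`sequencing_pins(8, 4, 4)` gives `(3, 24)`: 3 sequence bits on each of 8 ports, matching the 24 extra pins quoted above.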



                      DS - IX - NFT - 14
              SOFTWARE LEVEL

• WHEN AN ERROR IS DETECTED, IT MAY BE TOO LATE TO
  RECOVER. EVEN IF IT IS STILL POSSIBLE, IT IS OFTEN
  EXPENSIVE (IN TERMS OF COMPUTATION REQUIRED). TO
  BE ABLE TO ROLL BACK, CHECKPOINT INFORMATION HAS
  TO BE SAVED FREQUENTLY. THIS INCREASES SYSTEM
  OVERHEAD.

• RESTART (OR GLOBAL RESET) IS VERY EXPENSIVE IN
  TERMS OF TIME.




                       DS - IX - NFT - 15
               OBSERVATIONS
• THE IMPACT OF A FAULT ON A MULTISTAGE NETWORK
  MAY BE SEVERE.
• THE FAULT IMPACT DEPENDS ON THE FAULT LOCATION
  (LEVEL).
• A SWITCH FAULT IS OBVIOUSLY MORE SEVERE THAN A
  LINE FAULT.
• AN EXTRA STAGE WILL NOT HELP IF INSTANTANEOUS
  RECOVERY IS NOT ASSURED.
• USE RETRY FOR TRANSIENT AND INTERMITTENT FAULTS.
• USE LOCALIZED REROUTING FOR PERMANENT FAULTS.
• DFA AND EXTRA-STAGE COMBINED MAY PROVIDE A
  VERY EFFECTIVE SOLUTION IN THE CASE OF MULTIPLE
  FAULTS.
• FAULT-TOLERANT SWITCHING ELEMENT PROTOCOL AND
  MINIMIZATION OF ERROR LATENCY ARE CRUCIAL TO
  SATISFACTORY SYSTEM OPERATION.
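
The two recommendations above (retry for transient and intermittent faults, rerouting for permanent ones) can be combined into one delivery policy: retry a path a few times, and if it keeps failing, treat the fault as permanent and move to an alternate path. The Python sketch below is a simplified illustration. It assumes each path is a callable that returns `True` once the word has been delivered and acknowledged; the names `deliver` and `flaky` and the retry limit are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Path = Callable[[int], bool]   # hypothetical: True means delivered and acknowledged


def deliver(word: int, paths: Sequence[Path], max_retries: int = 2) -> Tuple[int, int]:
    """Return (path index, retries used) for the first successful delivery."""
    for index, path in enumerate(paths):
        for retries in range(max_retries + 1):
            if path(word):
                return index, retries
        # Still failing after all retries: assume a permanent fault and reroute.
    raise RuntimeError("no fault-free path left")


def flaky(failures: int) -> Path:
    """Build a path that fails `failures` times (a transient fault), then works."""
    remaining = [failures]

    def path(word: int) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return path
```

With a transient fault, `deliver(0x2A, [flaky(1), lambda w: True])` returns `(0, 1)`: the first path succeeds on its first retry. With a permanently failing first path, `deliver(0x2A, [lambda w: False, lambda w: True])` returns `(1, 0)` after the retries are exhausted and the word is rerouted.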

                      DS - IX - NFT - 16

				