Fault Tolerance in VHDL Description Transient Fault

Fault-Detection Capability Analysis of a Hardware-Scheduler IP-Core in
Electromagnetic Interference Environment

J. Tarrillo1, L. Bolzani1, F. Vargas1, E. Gatti2, F. Hernandez3, L. Fraigi2

1 Electrical Engineering Dept., Catholic University – PUCRS. Porto Alegre, Brazil.
2 Inst. Nacional de Tecnologia Industrial (INTI). Buenos Aires, Argentina.
3 Universidad ORT. Montevideo, Uruguay.


Nowadays, safety-critical embedded systems support real-time (RT)
applications that have to respect strict timing constraints.

They have to provide logically and temporally correct results!

The high complexity of these systems requires the adoption of Real-Time
Operating Systems (RTOS) that manage the task-switching process,
concurrency between tasks, memory, and time, as well as interrupts.

Understanding the Problem …

The increasing hostility of the electromagnetic
environment, caused by the widespread adoption of electronics
and in particular of wireless technologies, represents a huge
challenge for the reliability of RT embedded systems.

Electromagnetic interference (EMI) may induce Power
Supply Disturbances (PSD) that can generate transient faults.

These faults can affect not only the applications running on
embedded systems but also the RTOS executing the application
code, by causing scheduling dysfunctions that could lead to
incorrect system behavior.

  Several solutions have been proposed. However, they provide
  fault tolerance only at the application level and do NOT
  consider faults affecting the RTOS that propagate to
  application tasks.

 e.g.: about 34% of the faults injected in processor’s registers
 led to scheduling dysfunctions:

    - 44% of these dysfunctions led to system crashes,
    - 34% caused RT problems and
    - 22% generated incorrect outputs (propagate to system

 If not detected at the RTOS-level, these faults escape detection by
 conventional (app-level) techniques as well!

In this context…

We propose a Hardware-based Scheduler (Hw-S) IP core
to improve the robustness of embedded systems based
on RTOS.

The Hw-S targets faults that are NOT detected by the
native structures present in the RTOS kernel.


1. The Proposed Approach

2. Practical Experiments

3. Discussion: The Benefits

4. Conclusions

                        1. The Proposed Approach


Figure: Block diagram of the target embedded system.

  - Inputs to the Hw-S: events (tick, interruption, ...), used as the
    reference for switching task context, and the memory addresses
    accessed by the processor.
  - The Hw-S identifies the current task under execution and correlates it
    with the information stored in an Address Table generated during
    the compilation process.
Figure: Block diagram of the Hw-S. The diagram shows three blocks:

  - One block is in charge of identifying the task under execution, based on
    the addresses accessed by the CPU and on the information stored in an
    Address Table generated during the compilation process.
  - One block, based on the tick and on any other event (interrupts), is in
    charge of defining the Time Limit (tl) for the processor to execute each
    task, as well as detecting the events that can possibly interrupt the
    task in execution.
  - One block implements the scheduling algorithm based on the RTOS kernel
    and provides fault detection according to:
       - the task in execution,
       - the analysis of the tl, and
       - the events (interrupts) that can influence the RT-system.
    Its output is the Error Indication to System Level.
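
The task-identification step described above can be illustrated in software. The following is only a minimal sketch, assuming that the Address Table maps contiguous code-address ranges to tasks; the table layout, the address values and the names ADDRESS_TABLE and identify_task are hypothetical and are not taken from the Hw-S implementation.

```python
from bisect import bisect_right

# Hypothetical Address Table: (start address, end address exclusive, task).
# In the real system this table is generated during compilation; the
# ranges below are invented purely for illustration.
ADDRESS_TABLE = [
    (0x1000, 0x1400, "T1"),
    (0x1400, 0x1900, "T2"),
    (0x1900, 0x2000, "T3"),
]

# Sorted list of range start addresses, used for binary search.
_STARTS = [start for start, _, _ in ADDRESS_TABLE]


def identify_task(address):
    """Return the task whose address range contains `address`, else None."""
    index = bisect_right(_STARTS, address) - 1
    if index < 0:
        return None  # address lies below every known range
    start, end, task = ADDRESS_TABLE[index]
    return task if start <= address < end else None
```

For example, identify_task(0x1450) falls in the second hypothetical range and yields "T2", while an address outside all ranges yields None. A hardware block would more likely use parallel comparators than a binary search; the search is used here only to keep the sketch short.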

Figure: Context Switch and Time Limit.

  Labels: External Event; current task retirement into the execution queue;
  next task recover from the execution queue; Time for Context Switching
  (Δ time, proportional to the number and complexity of resources used by
  the RTOS); Time Limit for Switching Context.

  Regarding the fault detection capability, the Hw-S targets two types of
errors:

  Sequence error (E_seq): occurs at the end of the Time Limit, tl, when
the current task is not the expected one according to the
tasks’ execution flow.

  Time error (E_time): occurs when a task switching process takes place
in between two consecutive context switching events (e.g., two
consecutive ticks), thus violating the time constraints associated with the
real-time system.
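
The two error definitions above can be sketched as a small software monitor. This is an illustrative model only, assuming a fixed round-robin execution order and assuming that the monitor is told which task is running both between ticks and right after each tick; the class SchedulerMonitor and its method names are hypothetical and do not describe the actual Hw-S logic.

```python
from itertools import cycle


class SchedulerMonitor:
    """Toy model of E_seq / E_time detection (assumed round-robin order)."""

    def __init__(self, expected_order):
        # Assumption: tasks are expected to run in a fixed, repeating order.
        self._expected = cycle(expected_order)
        self.current = next(self._expected)

    def observe(self, task):
        """Task seen running between two consecutive ticks.

        A task other than the current one means a switch happened
        outside a context switching event: a time error.
        """
        return "E_time" if task != self.current else None

    def tick(self, task_after_switch):
        """Context switching event at the end of the time limit.

        The task that starts running must be the next expected one,
        otherwise a sequence error is reported.
        """
        expected = next(self._expected)
        self.current = task_after_switch
        return "E_seq" if task_after_switch != expected else None
```

With the order T1, T2, T3, a monitor that sees T3 running while T2 owns the time slot reports E_time, and a tick that hands the CPU to T1 when T3 was expected reports E_seq.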

                    2. Practical Experiments

Case study:

 Von Neumann 32-bit RISC Plasma microprocessor running a RTOS (

 The Plasma instruction set is compatible with the MIPS architecture.

 We developed and validated three benchmarks that exploit different services offered by
the Plasma RTOS:

     BM1: T1 -> Variable 1, T2 -> Variable 2, T3 -> Variable 3.
          Tasks T1, T2 and T3 access and update the value of
          three different global variables.

     BM2: T1 -> QM -> T2; T3 -> Variable 3.
          Tasks T1 and T2 communicate by message queue. T1
          sends a value to the queue and T2 reads this value.
          Task T3 writes a value into a global variable.

     BM3: T1, T2, T3 -> MUTEX -> Global variable.
          Tasks T1, T2 and T3 access a global variable which has
          been protected by a mutual exclusion semaphore (MUTEX).

Figure: Block diagram of the test board.

  Labels: Temp Sensor; Test Side with FPGA0 and FPGA1, each with its own
  SRAM (SRAM 0, SRAM 1), Flash (Flash 0, Flash 1) and RS232 port, linked by
  a 32-bit connection; Glue Logic Side with an 8051, CLK, RS232 and 8-bit
  connections; supplies F0, F1, M0, M1, MSC and C.

Test board designed for IEC 62132-2 and IEC 61000-4-29 electromagnetic susceptibility analysis

Test Conditions:

  - Freq. range: 150 kHz – 3 GHz
  - Field range: 10 – 200 V/m
  - Signal modulation: AM 80%
  - Total exposure time: 27 hours

Figure: Fault injection environment.

  Labels: GTEM Cell; Test Host; RF Noise Generator and Amplifier;
  Power-Supply Noise Generator Board; Test Board and Shielding Box.

Injected noise at the FPGA power bus (conducted EMI): voltage dips of
about 4.2 % (from 1.2 volts to 1.15 volts).

Summary of the obtained results

After 27 hours, number of erroneous outputs observed per benchmark: 65

Benchmark   RTOS detection [%]   Hw-S detection [%]   RTOS/Hw-S latency [clock cycles]   FPGA configuration lost [%]
BM1         33.8                 100.0                1523                               7.7
BM2         43.1                 100.0                498                                1.5
BM3         1.5                  100.0                810                                -

Slide annotations: highest fault detection; coverage of faults that
propagated to outputs; minimum fault latency.
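
To make the detection percentages more concrete, they can be converted back into numbers of erroneous outputs. The short sketch below assumes that every percentage is relative to the 65 erroneous outputs observed per benchmark, which the slide suggests but does not state explicitly; the names used are illustrative only.

```python
# Erroneous outputs observed per benchmark after 27 hours (from the slide).
ERRONEOUS_OUTPUTS = 65

# RTOS detection percentages reported per benchmark (from the slide).
RTOS_DETECTION_PCT = {"BM1": 33.8, "BM2": 43.1, "BM3": 1.5}


def detected_count(percent, total=ERRONEOUS_OUTPUTS):
    """Convert a percentage of `total` into the nearest whole count."""
    return round(percent * total / 100)


# Outputs flagged by the RTOS alone, and outputs it missed.
rtos_detected = {bm: detected_count(p) for bm, p in RTOS_DETECTION_PCT.items()}
rtos_missed = {bm: ERRONEOUS_OUTPUTS - n for bm, n in rtos_detected.items()}

for bm in RTOS_DETECTION_PCT:
    print(bm, "RTOS detected:", rtos_detected[bm], "missed:", rtos_missed[bm])
```

Under that assumption, the RTOS alone flagged 22, 28 and 1 of the 65 erroneous outputs for BM1, BM2 and BM3 respectively, whereas the 100.0 % Hw-S figure corresponds to all 65 in each benchmark.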


After inspection …

  Errors detected by the Hw-S:
    - Time_Errors (CPU switched to another task between two
      consecutive ticks)
    - Sequence_Errors (CPU executed an unexpected task from the
      Task Execution Queue)

  Errors signalled by the RTOS:
    - RTOS lost “semaphore information”, so preventing the CPU from
      continuing the proper execution of the tasks
    - RTOS lost information associated with the “next thread”, so
      preventing the CPU from switching to the next task in the
      execution queue

  Migrate to HW the weakest reliability points of the RTOS

  Figure: Percentage of E_seq and E_time detected by the Hw-S.
  Figure: Percentage of assert() sent by the RTOS.
                       4. Final Conclusions

We presented a Hardware-based Scheduler (Hw-S) IP core to
improve the robustness of embedded systems based on RTOS.

The Hw-S targets scheduling dysfunctions that could
lead to incorrect system behavior.
These faults are NOT detected by the native structures
present in the RTOS kernel.

The IP core is attached to the processor bus to monitor
the tasks’ execution flow.

Practical experiments indicate that the technique is effective in
increasing the fault detection coverage provided by the RTOS-native
structures.
