Complex Upset Mitigation Applied to a Re-Configurable Embedded Processor




 EEL 6935
 Lu Hao
 Wenqian Wu
                                                           Outline



•   Issues with SRAM-based FPGAs in space applications
•   Upset mitigation solutions
•   Resource usage and performance analysis
•   Summary
                                 System on Programmable Chip

•   A soft-core processor implemented in an SRAM-based FPGA is very attractive to
    spacecraft designers: a complete computer system can be created on a single FPGA
    chip.
                                                             MicroBlaze core


•    MicroBlaze is a soft processor core designed for Xilinx FPGAs.
•    Many aspects of the MicroBlaze are user-configurable: cache size, pipeline depth
     (3-stage or 5-stage), embedded peripherals, memory management unit, and bus
     interfaces.
     [Figure: MicroBlaze block diagram, labeling the on-chip peripheral bus and the local memory bus]
                                                     Space application issues


•   Radiation environment
    In space, high-energy ionizing particles exist as part of the natural background.
    Additional sources are solar particle events and high-energy protons trapped in the Earth's magnetosphere
    (the Van Allen radiation belts).
    This radiation poses a potential threat to electronic devices.


•   Single Event Upset (SEU)
    An SEU is a change of state caused by ions or electromagnetic radiation striking a sensitive node in
    a microelectronic device, such as a microprocessor, a semiconductor memory, or a power
    transistor. The state change results from the free charge created by ionization in or close to an
    important node of a logic element (e.g. a memory "bit").
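
    As a minimal illustration of the state change described above, an SEU in a stored word can be modeled as the inversion of a single bit. The function name `inject_seu` and the 32-bit word width are assumptions made for this sketch; they do not come from the slides.

```python
def inject_seu(word: int, bit: int, width: int = 32) -> int:
    """Model a single event upset by flipping one bit of a stored word."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the word")
    # XOR with a one-hot mask inverts exactly the selected bit.
    return (word ^ (1 << bit)) & ((1 << width) - 1)


stored = 0x000000F0
upset = inject_seu(stored, bit=4)   # bit 4 was 1, so it becomes 0
print(hex(stored), "->", hex(upset))
```

    Applying the same flip twice restores the original word, which is why a later rewrite with the correct value (scrubbing) is enough to repair this kind of soft error.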



•   SRAM-based FPGAs are susceptible to SEUs in:
    data and instructions stored in block memory (BRAM)
    configuration bits stored in the SRAM configuration memory


•   Upset mitigation is one of the key issues in SRAM-based FPGA
    design for space applications
                                          Proposed upset mitigation



•   To ensure reliable space applications based on SRAM FPGAs, the author
    investigates three levels of upset mitigation:
     – Functional-block design triplication
     – Continuous external configuration scrubbing
     – Independent internal BRAM scrubbing (also triplicated)
                                          Tool, device and environment


•   Tools:
    Xilinx TMR tool: makes it easy to trade off maximum radiation-effect immunity against area, pinout, and board
    layout considerations.
•   Device:
    Xilinx Virtex-II XQR2V6000 FPGA
•   Program running on the MicroBlaze:
    Integer-based FFT
•   Test environment:
    Crocker Nuclear Laboratory at the University of California, Davis, using a 63.3 MeV proton beam.
•   Test board:
    Two FPGAs: one is the device under test (DUT), the other is the service FPGA.
                                                DUT and Service FPGA


•   The service FPGA performs two functions:
    1) configuration readback, and scrubbing of the DUT when there is a readback error
    2) control and monitoring of the functional operation of the MicroBlaze running the
    FFT program

•   The program (FFT) is stored in internal BRAM each time the DUT is configured.
•   Data is sent to the DUT's internal BRAM by the service FPGA.
•   The results of the FFT program are returned to the service FPGA and compared with the
    expected results.




    [Figure: test board block diagram showing the service FPGA and the DUT, which contains the uBlaze and BRAM]
                                                 Upset Mitigation



• Mitigation solution
   1. Functional-block design triplication
   2. Continuous external configuration scrubbing
   3. Independent internal BRAM scrubbing (also triplicated)
                                                                                 TMR


•   Triple Module Redundancy (TMR)
    Three modules perform the same task; the voter passes only the majority value
    to the output.




    [Figure: TMR]


    If any one of the three modules fails, the other two mask the fault. If the voter
    fails, the complete system fails, so the voter is a critical component and, in a
    good TMR system, should be much more reliable than the other components.
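
    The majority vote described above can be sketched as a bitwise 2-of-3 function. This is a generic illustration, not the voter primitive generated by the Xilinx tool; the name `vote` and the 8-bit test values are assumptions.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant values."""
    # A bit is 1 in the result only if it is 1 in at least two inputs.
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
faulty = good ^ 0b00001000      # one module suffers a single-bit upset

masked = vote(good, faulty, good)         # single fault: masked
double = vote(faulty, faulty, good)       # same bit wrong in two modules
print(masked == good, double == good)
```

    The second call shows the limit of the scheme: when two modules agree on the same wrong bit, the voter outputs the wrong value, which is why accumulated upsets have to be removed by scrubbing.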
Xilinx TMR
                       External Configuration Scrubbing


•   Configuration scrubbing is the process of rewriting the configuration
    memory of an FPGA for the purpose of correcting any errors that may have
    accumulated since the device was last configured.

•   The service FPGA detects readback errors and scrubs the configuration by
    reloading the bitstream to correct upsets.

•   Transparent process:
    normal device operation runs concurrently and without interruption.


•   Configuration scrubbing frequency: 16 MHz, i.e. 4 scrub cycles per second
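
    A minimal software sketch of the readback-and-scrub loop described above. It is a toy model, assuming the configuration memory can be treated as a list of frames that are compared with a golden copy and rewritten one at a time; the names `scrub`, `golden` and `config` are hypothetical and do not describe the service FPGA's actual interface.

```python
SCRUB_CYCLES_PER_SEC = 4                      # figure quoted on the slide
SCRUB_PERIOD_S = 1 / SCRUB_CYCLES_PER_SEC     # 0.25 s per full scrub cycle


def scrub(config: list, golden: list) -> int:
    """Read back each frame, compare with the golden copy, rewrite on mismatch.

    Returns the number of frames that were corrected.
    """
    corrected = 0
    for i, (frame, reference) in enumerate(zip(config, golden)):
        if frame != reference:        # readback error detected
            config[i] = reference     # reload this frame from the golden copy
            corrected += 1
    return corrected


golden = [bytes([i] * 4) for i in range(8)]   # toy "bitstream" of 8 frames
config = list(golden)                          # device configuration memory
config[3] = bytes([3, 3, 3 ^ 0x10, 3])         # inject one upset in frame 3

first_pass = scrub(config, golden)             # repairs the upset frame
second_pass = scrub(config, golden)            # nothing left to repair
print(first_pass, second_pass, SCRUB_PERIOD_S)
```

    At 4 scrub cycles per second, one full cycle takes 0.25 s, so in this simplified model a configuration upset persists for roughly that long at most before it is rewritten.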
Independent internal BRAM scrubbing
                                     BRAM Triplication


Port A: used by the MicroBlaze processor

Port B: connected to the counter; used for error detection and correction
                                                   BRAM Triplication

•   TMR counter
     – Allows continuous refreshing of the BRAM contents.
     – Cycles through the memory addresses by incrementing the BRAM address of the
       second port.
     – When the first port of the BRAM is not being used, it rewrites the BRAM content
       at the current address with the voted value from the associated voter (TRV16).


•   BRAM
     – Conventional BRAM

•   Associated voter (TRV16)
     – Compares the three values read from the same address of the three BRAMs, selects the
       majority, and writes it back to the corresponding address.
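
    The interaction of counter, BRAM copies and voter can be sketched as a small simulation. It is a behavioral model under stated assumptions: a single software counter stands in for the triplicated counter, `vote16` stands in for the TRV16 voter, and the class name `BramScrubber` and the `port_a_busy` flag are hypothetical.

```python
def vote16(a: int, b: int, c: int) -> int:
    """Bitwise majority of three 16-bit words (stand-in for the TRV16 voter)."""
    return ((a & b) | (a & c) | (b & c)) & 0xFFFF


class BramScrubber:
    """Three BRAM copies refreshed through the second port by a counter."""

    def __init__(self, contents):
        self.copies = [list(contents) for _ in range(3)]
        self.addr = 0                       # scrub counter (port B address)

    def tick(self, port_a_busy: bool = False) -> None:
        """One scrub step: write back the voted word unless port A is in use,
        then advance the counter to the next address."""
        if not port_a_busy:
            voted = vote16(*(copy[self.addr] for copy in self.copies))
            for copy in self.copies:
                copy[self.addr] = voted
        self.addr = (self.addr + 1) % len(self.copies[0])


program = [0x1234, 0xABCD, 0x0F0F, 0xFFFF]
mem = BramScrubber(program)
mem.copies[1][2] ^= 0x0100                  # upset in one copy at address 2
for _ in range(len(program)):               # one full pass over all addresses
    mem.tick()
print(all(copy == program for copy in mem.copies))
```

    Because the counter keeps cycling, an upset in one copy is overwritten on the next pass that finds port A idle, before a second upset at the same address can defeat the vote.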
                                                                               Testing


 • Two mitigated versions of the MicroBlaze design architecture have been
   implemented and tested:
    – with the BRAM scrubber.
    – without the BRAM scrubber.
 • Error types:
       – Type 1 errors: FFT outputs were wrong.
           • Type 1a: Corrected after a configuration scrub cycle
           • Type 1b: Not corrected after a scrub cycle, even after a reset of the DUT design
       – Type 2 errors: Nonresponsiveness of the DUT, requiring a reset and
         synchronization
           • Type 2a: Corrected by scrubbing and hence referred to as a recovering reset
           • Type 2b: Not corrected by scrubbing and referred to as a runaway reset.
               – This type of error (runaway reset) is an uncorrected error condition that
                 causes the functional monitor to continually attempt to reset the MicroBlaze
                 processor each time the watchdog timer set for the handshaking between
                 the two FPGAs reaches its limit value. This is the error type we focus on.
       – Type 3 errors: Occurrence of an exception or interrupt detection.
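
 The error taxonomy above can be written down as a small classification routine. This is an illustrative sketch only: the four boolean observations and the precedence between them (non-responsiveness first, then exceptions, then wrong outputs) are assumptions, since the slides do not say how overlapping symptoms were counted.

```python
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    TYPE_1A = "wrong FFT output, corrected after a scrub cycle"
    TYPE_1B = "wrong FFT output, not corrected after scrub or reset"
    TYPE_2A = "recovering reset"
    TYPE_2B = "runaway reset"
    TYPE_3 = "exception or interrupt detected"


def classify(wrong_output: bool, unresponsive: bool, exception: bool,
             fixed_by_scrub: bool) -> Optional[ErrorType]:
    """Map one observation of the DUT to an error type (None if no error)."""
    if unresponsive:                      # DUT needs a reset and resync
        return ErrorType.TYPE_2A if fixed_by_scrub else ErrorType.TYPE_2B
    if exception:
        return ErrorType.TYPE_3
    if wrong_output:
        return ErrorType.TYPE_1A if fixed_by_scrub else ErrorType.TYPE_1B
    return None


observed = classify(wrong_output=False, unresponsive=True,
                    exception=False, fixed_by_scrub=False)
print(observed.name, "-", observed.value)
```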
[Test results: no BRAM scrubber]

[Test results: BRAM scrubber]

Is BRAM code corruption the main cause of runaway resets?
                                        Standalone test


• To check whether BRAM code corruption is likely to
  be the cause of these runaway resets, the BRAM
  mitigation design was implemented in standalone
  mode and tested under proton beams at similar fluxes
  and at the same facility.
                     Runaway Resets Caused by BRAM Corruption




•   At a flux of 1.70×10⁸, at least 17% (1.21×10⁻¹¹ / 6.82×10⁻¹¹) of the runaway resets are
    due to errors in the BRAM code, while at a flux of 1.70×10⁹, 23% of them are caused
    by code corruption.
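
    The first percentage can be checked directly from the two values quoted in parentheses. The variable names below are assumptions, and the units of the two values are not given on the slide.

```python
bram_code_value = 1.21e-11      # numerator quoted on the slide
runaway_value = 6.82e-11        # denominator quoted on the slide

fraction = bram_code_value / runaway_value
print(f"{fraction:.1%}")        # prints 17.7%
```

    The ratio is about 17.7%, which the slide reports conservatively as "at least 17%".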
                    Exceptions Caused by BRAM Runaway Resets




•   Design 1: An average of 64% of the unrecovered resets (due to BRAM code
    corruption) were detected by exceptions (64% at flux 1 and 80% at flux
    2).
•   Design 2: exceptions were observed only after an increase of two orders of
    magnitude of the flux (1.70×10⁹), and only 25% of the runaway resets were
    detected.
•   Not all illegal states are detected by the exception mechanism.
     – At the lower flux (1.70×10⁸), although seven resets were observed, no
        exceptions were detected.
•   The MicroBlaze was optimized to fit in Xilinx FPGAs, and the exception circuitry
    was designed to detect only major illegal operations.
                                                       Conclusion


• Issues with SRAM-based FPGAs in space applications
   – Single Event Upsets (SEUs) can be caused by the radiation
      environment
   – A fault-tolerant system is therefore needed
• A complete upset mitigation solution implemented on a Xilinx Virtex-II
  FPGA
   – Continuous external configuration scrubbing
   – Functional-block design triplication
   – Independent internal BRAM scrubbing (also triplicated)
• Testing results
   – BRAM code corruption is a significant cause of runaway
      resets (at least 17% to 23% of them in the standalone test)
Thanks!
    Questions?

								