The First Summary Report
on the
Independent Review
of
SX-S FPGA Reliability
on
NASA Space Flight Missions




Meeting Date: January 7, 2004




NASA Office of Logic Design
February 11, 2004



                                       February 11, 2004

564

TO:              Director of Engineering/Arthur F. Obenschain

FROM:            Head, Office of Logic Design

SUBJECT:         First Summary Report on the Independent Review of SX-S FPGA
                 Reliability on NASA Space Flight Missions




Enclosed is the first summary report on SX-S FPGA reliability for NASA space flight
missions. This report provides a snapshot of the work accomplished to date, including the
meeting held at NASA GSFC on January 7, 2004, a review of all available data and reports,
and extensive analysis performed over the past month. A diverse team of 10 engineers from
various NASA Centers and the Department of Defense was assembled for this task.

The primary objective of this activity is to determine the root cause of failures of Actel SX-S
devices, which are used extensively in NASA's spacecraft, both crewed and robotic, as well
as to offer guidance to engineers testing and using these devices.

This summary documents the findings and recommendations for the use of these devices in
NASA and other mission- or safety-critical systems. A NASA Advisory will be submitted for
dissemination, application notes published, and seminars held to maximize the reliability of
digital electronics systems.

Thank you for your support in this effort.

Sincerely,




______________________
    Richard B. Katz




                                                   Distribution
Team Members

Rod Barto           Office of Logic Design
Lisa Coe            Marshall Space Flight Center
Martin Fraeman      Applied Physics Laboratory
Kevin Hames         Johnson Space Center
Richard Katz¹       Office of Logic Design
Robert Kichak²      NESC
Andy Kostic         Missile Defense Agency
Henning Leidecker   Goddard Space Flight Center
Jay Schaefer        National Security Agency
Geoffrey Yoder      Johnson Space Center
        ¹ Chair
        ² Co-Chair



cc:

560     J. Day
564     R. Kasa
NESC    R. Roe
500     M. Ryschkewitsch
500     S. Scott
100     W. Townsend




Contents

KEY REFERENCES
EXECUTIVE SUMMARY
INTRODUCTION
BACKGROUND
REPORTED FAILURES AND REVIEWS
MEETING OBJECTIVES
TECHNICAL BACKGROUND SESSION
FAILURES AND ANALYSES SESSION
FINDINGS AND RECOMMENDATIONS
SUMMARY OF THE MEETING AND DISCUSSION
   DEVICE RELIABILITY AND APPLICATION
   MER AUTOMATED TEST AND POST-PROGRAMMING BURN-IN EQUIPMENT
   FAILURE ANALYSIS EVALUATION
      MER S/N 132
      MER S/N 50
      MER Setup Part
      MER S/N 52
      MER S/N 153 and S/N 179
      MER S/N 117, 118, 119, 120, 121, 123, 124, and 125
      MER S/N 91
      MRO S/N 8068, S/N 8069, and S/N 8129
   TEST EQUIPMENT EVALUATION
   RECOMMENDATIONS – POST-PROGRAMMING BURN-IN
   RECOMMENDATIONS – AUTOMATED TEST EQUIPMENT AND TESTING
TEAM MEMBERS, PRESENTERS, AND ATTENDEES
   TEAM MEMBERS
   CONSULTANTS
   PRESENTERS
   GENERAL ATTENDEES



Key References
      1.   "Summary of October 8, 2003 Meeting on Actel FPGA Failures," R. Katz, M. Fraeman, and J. Boldt to S.
           Scott and E. Hoffman, October 2003.
      2.   "Study on the increase in ICCI Current for RT54SX32S lot # T25JSP03 and RT54SX72S lot T25KS001
           with VCCI = 5V (± 10%) Operation," Actel Corporation, November 21, 2001.
      3.   "MER FPGA Tiger Team Report," Jonathan Perret to Kelly Stanford, IOM 5140-03-031, April 4, 2003.




                                        Executive Summary
Throughout the industry, the 0.25 µm SX-A/SX-S parts have been used successfully, with over a million units in use,
and have a rated reliability of approximately 10 FITs¹, based on a qualification program of approximately 3,000
parts that underwent post-programming dynamic testing. However, there have been "clusters of failures" at
several aerospace organizations; previous generations of devices operated essentially failure-free throughout the
entire industry. At this point no data indicates device failure when operated in compliance with the manufacturer's
specification. However, a correlation between damage to the parts and exceeding absolute maximum ratings has
been shown. While the sensitivity of the programmed antifuse is considered the weakest link of these devices in
particular, the lower margins and increased susceptibility to damage are a trend of modern, high-performance digital
electronics in general.

On Wednesday, January 7, 2004, a meeting was held at the NASA Goddard Space Flight Center (GSFC) to address
the issue of Actel SX-S FPGA reliability as related to the failures reported by General Dynamics, Boeing Satellite
Systems, and the Jet Propulsion Laboratory (JPL). In approximately May 2003, General Dynamics reported
failures (both SX-A and SX-S) on their Department of Defense (DOD) programs but declined to participate or
provide access to relevant data. On other DOD programs, Boeing Satellite Systems reported failures in the latter
half of 2003 and presented their preliminary data. For the Mars Exploration Rover (MER) in 2002 and, more
recently, the Mars Reconnaissance Orbiter (MRO) programs, JPL reported manufacturing defects at a rate far higher
than that expected from the manufacturer's published reliability data. The reported failures raise concern for all
users of these devices throughout NASA, the Department of Defense, and other organizations.

The electrical environment created by both the automatic test equipment and the post-programming burn-in
equipment at the facility used for MER parts testing has been found to be electrically "dirty²." Standard design and
test practices (e.g., terminating the 50 Ω outputs of laboratory instruments, adequate supply bypassing, proper
termination of device inputs, prevention of bus contention, as well as others) for digital systems were not followed.
As a result, supply voltages exceeding absolute maximum values, violations of input transition time limits, large
device currents from improper configuration, and potential bus contention were observed. These factors contribute
to excessive stress levels potentially leading to device damage. It was also observed that the test personnel did
not fully understand the operation of the equipment, the implications of exceeding specifications, or the performance
requirements of the parts. The design and operation of test equipment for modern, high-performance digital
electronics should be performed to the same electrical standards as flight hardware, with specifications followed and
conservative and careful design practices utilized. Test equipment design and operation for testing parts of this class
should be reviewed and problems tracked in the same manner as flight hardware. This approach will detect
problems at the earliest possible time and prevent installation of potentially overstressed devices in flight hardware,
as was the case for MER, where the manufacturer had provided a "don't fly" recommendation based on concerns
arising from the out-of-specification electrical test environment.

The MER program had an unusually high fallout rate in screening, with 16 parts failing out of approximately 75, or
over 20%. The failures fit into several categories. The three programming failures are not considered a problem;
they are inherent in the technology and have been credibly explained by the manufacturer, starting with the first
generation of these parts for aerospace systems over a decade ago. However, most of the failures were not benign.
One was the result of a test configuration error, with clock inputs not properly terminated, resulting in high device
currents. Several other devices were most likely victims of electrical overstress from the "dirty" environment
created by the test equipment. One such case had a severely damaged clock input; in another, damage was observed
in the "SLX transistors," known to be the electrically weakest point in those circuits for the revision of die MER
utilized. In addition, the device used for verifying the safety of the test equipment sustained significant damage that
was not detected until tests recommended by an independent reviewer were implemented. While this failure was
not analyzed by the parties involved, all failures during all phases of flight part testing should be properly understood.



¹ One FIT is equal to 1 failure per billion device-hours of operation.
² A "dirty burn-in" is a burn-in test where the electrical environment is outside of the device manufacturer's
specifications.


Based on the data and analysis presented, it is concluded that the likely causes of the MER failures were related to
the testing and not the devices. This conclusion is supported by the available data, analysis, the known electrically
dirty environment, and experience with these devices, but it must be tempered by the incomplete nature of the failure
analyses: much of the work presented was, in the Team's opinion, superficial, with no failure analysis reports and in
some cases no physical analysis.

The MRO program's fallout rate was even higher than MER's: 50%, with three out of six parts rejected.
Some of these parts had an inexplicable test signature of higher device leakage currents at 25 °C than at either
-55 °C or +125 °C, a pattern typical of neither these devices in particular nor CMOS devices in general. Similar
symptoms resulting from test set problems were reported by Actel at the October meeting; no additional work at JPL
was reported to follow up on this over the past three months. Device evaluation in the NASA test fixture has been
recommended, with data collected at additional temperature points to permit a proper evaluation.

The details of post-programming burn-in (PPBI) and the use of automated test equipment (ATE) were discussed
extensively. The discussion included the capabilities of the equipment with respect to fault detection, the design of
the equipment and its operating principles, a review of the data, and the performance and risks associated with these
test sets. Also discussed was the demonstrated reliability of the devices in qualification tests along with the
vulnerability of these parts to damage from "dirty" electrical environments. It was concluded that post-
programming burn-in and the automated testing of these devices are not recommended, since in many cases the risk
of the tests outweighs the benefits.

If PPBI and/or ATE testing is to be performed, a quantitative justification of the desired reliability improvement
should be required, along with an analysis of the fault acceleration and detection capabilities of the test. In addition,
the test equipment must be designed, reviewed, and qualified to flight-type electrical standards and meet all device
specifications.

Specific findings and recommendations are listed and discussed in a separate section. They include suggestions for
user application and testing of the devices as well as improvements in the devices themselves, their documentation,
and the programming software. The Team emphasizes the importance of thorough and extensive testing in the target
electrical environment and conservative application of these devices, which together provide a safe and effective
fault-detection environment.

These findings and recommendations must be viewed in light of the fact that not all failures have been properly
analyzed with the root cause determined, and that some data was not made available to this Team. Modern
digital technologies have smaller margins than previous generations and "raise the bar" for device handling and
testing. As a result, the recommendations given in this report stress the conservative application of these parts,
eliminating or minimizing the conditions that are suspected of being capable of causing device damage.




                                                Introduction
On Wednesday, January 7, 2004, a meeting was held at the NASA Goddard Space Flight Center to address the issue
of Actel SX-S FPGA reliability for NASA space flight missions. The JPL parts effort for the MER and MRO
programs reported manufacturing defects at a rate far higher than that expected from the manufacturer's published
reliability data, raising concern for all users of these devices. These programmable logic devices are ubiquitous and
are used, or planned for use, throughout virtually all NASA spacecraft, both crewed and robotic, as well as in many
Department of Defense applications.

JPL's conclusions on the root cause of failure of SX-S devices contradict those of the manufacturer, the Actel
Corporation. The subsequently issued screening recommendations are in direct conflict with Actel's long-standing
guidance for use of the devices.

A team of 10 independent technical experts from various NASA Centers and DOD organizations, led by the NASA
Office of Logic Design (OLD), conducted a technical evaluation of the failures reported by JPL, the analyses
performed, and their follow-on recommendations and corrective actions. This effort included collecting,
distributing, and analyzing all available data, analyses, test protocols/equipment, and related material, and
subjecting them to a thorough, independent, and critical peer review. The engineers ranged from logic designers
and analysts to failure analysis specialists and physicists. Richard Katz of OLD chaired the meeting, with Robert
Kichak of the NASA Engineering and Safety Center (NESC) co-chairing.



                                                Background
Actel field programmable gate arrays (FPGAs) have been used in NASA spaceborne electronics systems for over a
decade. These devices have an array of logic modules interconnected by user-configurable routing. The routing
determines both the interconnections of the modules and each module's logical function. All programmable
connections are made via antifuses, elements that are initially high-impedance but become "low resistance" paths
when programmed.

The first three generations of Actel FPGAs used ONO (oxide-nitride-oxide) antifuses with programmed resistances
of several hundred ohms, which were located in channels of the gate array. Starting with the SX series of devices,
metal-to-metal antifuses were employed, lowering programmed resistances by approximately an order of magnitude
and resulting in greater speed. Another fundamental change was placing the antifuses above the logic modules,
between the upper layers of metallization in the modern semiconductor processes, eliminating the routing channels;
this makes the devices smaller and further increases device performance. For the Actel metal-to-metal antifuse-based
microcircuits that are the topic of discussion here, the first devices were 0.6 µm, 3.3V prototype devices
fabricated at what is now BAE Systems in Manassas, VA. Radiation-tolerant devices for space, the SX series, also
0.6 µm, 3.3V devices, were fabricated at Matsushita Electric Co. (MEC) in Japan. MEC fabricated most previous
antifuse-based FPGAs for NASA. Commercial SX devices are based on a 0.35 µm, 3.3V process at Chartered
Semiconductor. The next generation of devices for both commercial and military/aerospace applications, SX-A
and SX-S, was fabricated at MEC in a 0.25 µm, 2.5V process. The SX-S devices shared a common process and
radiation-hardened antifuse with the commercial SX-A devices, with the detailed SX-S design modified for radiation
hardness (TID, SEU) and I/O performance. SX-A production devices were shipped starting in 1999, and to date over
1 million devices have been delivered; approximately 10 thousand SX-S devices have been delivered. Note that the
SX-A devices for commercial use have migrated to the 0.22 µm, 3.3V process at UMC. Those devices are not directly
considered in this study since their antifuse design and process differ from the devices utilized by NASA.

The manufacturer's reliability numbers for these SX-A/SX-S devices are considered "high-rel" (the FIT rate is
approximately 10).
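
To put the 10 FIT figure in perspective, the following minimal sketch converts a FIT rate into an expected number of
failures; it is an illustrative calculation only, and the device count and mission duration are assumptions, not values
from this report.

    # Illustrative only: convert a FIT rate (failures per 1e9 device-hours)
    # into the expected number of failures for an assumed flight complement.
    fit_rate = 10                 # approximately 10 FITs, per the manufacturer
    device_count = 100            # assumed number of FPGAs across a spacecraft
    mission_years = 10            # assumed mission duration
    device_hours = device_count * mission_years * 365.25 * 24
    expected_failures = fit_rate * device_hours / 1e9
    print(f"Expected failures: {expected_failures:.2f}")   # about 0.09 for these assumptions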




                             Reported Failures and Reviews
JPL reported failures of SX-S devices during their MER screening almost two years ago. An initial evaluation of
their testing found deficiencies in equipment and procedures. Subsequently, numerous requests for data, testing
information, and analyses were made to JPL via various independent channels with little success; most requests for
technical information were either not answered or were answered in a technically non-responsive manner. Some
related memos were obtained by OLD from outside of NASA. A JPL-led independent review of their tiger team's
effort on October 30, 2002 produced a report that was subsequently withdrawn since, in part, no consensus was
reached. Additionally, up until the beginning of 2004, JPL did not authorize Actel to release data and evaluations.

In approximately May 2003, General Dynamics reported failures (both SX-A and SX-S) on their Department of
Defense programs. General Dynamics was invited by both OLD and NESC to participate in this meeting but
declined. They did not contribute their data, analysis, or other pertinent technical information, nor did they permit
Actel to share the detailed technical information that Actel generated during the failure analysis effort.

Boeing has reported three failures of SX-S devices and provided their information prior to the meeting, participating
via a telephone connection. However, failure analysis has not yet begun and thus no conclusions can be reached.
One of their failed devices was on a test board at Boeing, with the other two tested at Raytheon, the same test
contractor that JPL used for the MER and MRO programs. Some ESD, electrical environment, and test equipment
issues have been identified.

In October 2003, The Aerospace Corporation hosted an "industry meeting" regarding SX-A/SX-S failures. Little
relevant technical data was exchanged. It was clear, however, that the failures of these devices are appearing in
clusters and not randomly distributed throughout the industry. A summary of this meeting was prepared by NASA
OLD and APL, and is listed in the reference section of this report.



                                         Meeting Objectives
The Team decided to limit discussion to the SX-A/SX-S devices of concern and not try to address the broader
implications of new, high-speed digital technologies in general. The five objectives of the meeting are listed below:

    1.   Review all information about failures, including symptoms, test conditions, test equipment, procedures and
         methods, analyses, and conclusions.
    2.   Understand the mechanisms and root causes of all reported failures.
    3.   Provide guidance:
             a. On the application of these devices for space flight or other mission critical systems.
             b. On the programming, testing and handling of these devices.
    4.   Define further work to be done (e.g., tests, analysis, etc.).
    5.   Provide recommendations to Actel for product improvement.




February 11, 2004 – Final Version               Page 8 of 22
                             Technical Background Session
The review proceeded with a detailed technical background session, laying the foundation for the review of specific
device failures, following up on the papers and reports acquired by the Team prior to the meeting. Specifically, the
items covered were:

    1.   Device Technology and Reliability History
             a. Manufacturer reliability tests and data
             b. SX-A vs. SX-S; '32 vs. '72 – Implications for reliability.
             c. Field (user) reliability
    2.   Possible damage mechanisms for SX-A and SX-S FPGAs
    3.   Possible triggers of damage mechanisms.



                              Failures and Analyses Session
The bulk of the meeting comprised a detailed discussion of all known failures, with most of the discussion
concerning failures during manufacturing tests and the MER/MRO programs. Special attention was paid to the
electrical environment at the time of failure, such as supply voltage characteristics, input stimulus, output loads, and
other conditions. The goal was to determine the failure mechanism and the root cause for each failure.




                           Findings and Recommendations
Device Reliability
    Based on the currently available data which was reviewed, the Team concluded that the devices are reliable
        when operated per specification with no data presented to indicate otherwise.
    Actel's procedures for testing and burning-in unprogrammed parts for space class applications appear
        sufficiently effective in capturing part manufacturing defects (infant mortality). Programmed antifuse
        reliability is discussed in a separate section below.
    A metric of stress on the programmed antifuse is the ratio of the peak current passing through it to the
        programming current. Although the SX-S series devices are physically larger than their SX-A
        counterparts, Actel presented normalized data demonstrating that the stress level in the SX-S series is lower
        than in SX-A due to current limiting resistors in the SX-S modules, which are not present in SX-A modules.
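
As an illustration of this metric, a minimal sketch follows; the current values are placeholders for illustration, not
data from the report or the manufacturer.

    # Illustrative antifuse stress metric: peak operating current through a
    # programmed antifuse relative to the current used to program it.
    # The example values are placeholders, not measured data.
    def antifuse_stress_ratio(peak_current_a, programming_current_a):
        return peak_current_a / programming_current_a

    # Assumed 3 mA peak transient vs. an assumed 15 mA programming current.
    print(antifuse_stress_ratio(3e-3, 15e-3))   # 0.2 -- lower means less stress on the antifuse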

Programmed Antifuse Reliability
    There are variations in programmed antifuses. Some are not as robust as others.
    Actel developed a post-programming screening test that identified parts (approximately 2%) with the most
       sensitive antifuses. However, subsequent reliability testing of the passing set of devices showed that they
       were no longer reliable, rendering the procedure impractical for screening of flight parts.
    A "dirty burn-in" test, inadvertently performed as a result of unknown facility issues at Actel's burn-in
       contractor, where the power supply levels were not always in specification, resulted in a fallout of
       approximately 2% of the devices.³
    The limited available data shows that the propagation delay of signals passing through damaged antifuses
       can change substantially. Increases can range from tens of nanoseconds up to microseconds.
    Antifuses can be damaged during testing and operation if the devices are operated out of specification.
       Once damaged they are "unstable." The long-term behavior of "unstable" antifuses is not currently known.
    For devices that are operated within data sheet limits, the programmed antifuse is a reliable circuit element.
       Qualification testing of approximately 3,000 SX-A and SX-S devices supports this finding.

Manufacturing Defects
    JPL findings of manufacturing defects for SX-S devices were not supported by the presented data or
      analysis.

Device Safety Margins
    The amount of margin in these newer devices is lower than in previous devices and lower than historically
        encountered. Thus, particular care must be taken in the application of these FPGAs.
    The Team also noted that the margin reductions are not a phenomenon particular to SX-A/SX-S devices,
        but are characteristic of modern high-speed digital technologies in general.
    These parts have "raised the bar" for device handling and testing.
    A NASA advisory should be issued highlighting the additional handling and design precautions that must
        be followed.

Device Usage and Application – Conservative Design Practices
    Simultaneous Switching Outputs (SSO) should be limited and properly distributed. Optimization of layout
        for printed circuit board routability may be a contributing factor to device failure. For example, one device
        reported as failing had 64 SSOs, a layout optimized for printed circuit board routing, all high slew
        drivers, and no robust bypass capacitor placement. While not shown to be a root cause of this failure at this
        time, this output configuration is suspect and should always be avoided.
    Low slew output configurations should be used for all buses, and for other signals when possible. High slew
        output usage should be minimized and justified.


³ A "dirty burn-in" is a burn-in test where the electrical environment is outside of the device manufacturer's
specifications.


        Loading on all outputs should be conservative (e.g., buffers should be used for driving memory arrays).
         Long lines should be driven with either buffers (preferred) or isolation resistors, as appropriate.
        A robust bypass capacitor strategy should be employed with careful layout to maximize effectiveness (a
         first-order sizing sketch follows this list).
        Signal integrity issues must be given particularly careful consideration. This requirement includes both
         inputs and outputs to ensure that voltage excursions do not exceed the manufacturer's recommendations.
         The potential signal integrity problems are exacerbated by the fact that the drivers used in SX-A/SX-S are
         both fast and powerful.⁴
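
As a first-order illustration of why local bypassing matters, the following minimal sketch applies the standard
ΔV = I·Δt/C estimate; the current step, time interval, and allowable droop are assumptions for the example, not
values from the report.

    # First-order bypass estimate: the local capacitance needed to hold up the
    # supply during a fast current step, before bulk capacitance and the supply
    # can respond. All numbers are assumptions for illustration.
    delta_i = 0.5          # assumed transient current step from switching outputs, A
    delta_t = 20e-9        # assumed interval the local capacitance must cover, s
    allowed_droop = 0.05   # assumed allowable droop on the supply rail, V

    required_c = delta_i * delta_t / allowed_droop
    print(f"Required local capacitance: {required_c * 1e9:.0f} nF")   # 200 nF for these assumptions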

MER Failures
  The burn-in setup used for the MER devices at Raytheon, JPL's contractor, had critical design deficiencies that
  operated the parts in violation of multiple of the manufacturer's specifications. Standard engineering practices such
  as bypassing and clock signal termination were not employed. The violated requirements included maximum
  supply voltage and input transition times. Based on JPL's presentation, Actel's analysis, and all available data,
  the Team concludes that the MER failures were most likely caused by overstress in testing or handling. The
  available data included Actel qualification testing and extended life testing on a set of parts (approximately 75)
  from the MER lot with zero failures.


Improve Data Sheet for Designers
    Although the data sheet states the specifications for safe device operation, the implications of exceeding the
       limits should be strengthened and placed in a prominent location.
    A cautions and warnings section should discuss in detail key limits of the electrical environment:
             o   Power supply characteristics such as overshoot, ripple, and bypassing requirements. Power
                 supply turn-on is a time of vulnerability for these parts. The manufacturer indicated that this
                 period, with the charge pump starting, is particularly sensitive for the device.
            o Signal integrity issues such as input signal limits, output loading, and the effects of any overshoot
                from unmatched transmission lines.
            o Other guidelines such as simultaneous switching outputs limits.

Improve Future Devices for Designability and Testability
    "Low slew" drivers should have specified characteristics. At this time, the slew is only reduced for the
       falling edge; it should be reduced for both edges. The difference in performance between high and low
       slew should be increased significantly. More than two levels of slew should be considered.
    Inputs should be tolerant of longer transition times to aid the use of protection networks, particularly for off-
       board signals.
    Consider adding a scan chain for R-cells, including the redundant latches, to permit verification of TMR cells
       (redundant flip-flops and voter circuits) during board and system level environmental tests as well as final
       test before shipment.

Requested NESC Support
    Additional support from the NESC is needed to "break free" the General Dynamics data. There are no
       known reasons for not having access to this data. Preventing NASA access to this data is unjustifiable.
    Provide funding for the Sandia National Laboratories failure analysis group, which has experience with Actel
       parts in general and the MER failures in particular.
    Provide funding for NASA to complete the failure analysis of the MER devices. After almost two years,
       physical analysis is still in the preliminary stages.

Review Data for All Failures
    NASA to complete the failure analysis of the MER devices as soon as possible. The analysis effort is in
       the preliminary stages, proper failure analysis has not been performed, and the conclusions are not
       supported by the available data or facts. Failure analysis reports do not appear to be available.


⁴ Fast, powerful drivers, as found in the SX-A/SX-S devices, have fast voltage transition times and high current
capability.


        NASA to obtain, understand, and review the data from all relevant failures, the failure mechanisms, and the
          root causes. In particular, there are several "clusters of failures" at Boeing and General Dynamics which
          are considered open.
        NASA OLD to establish communications with the manufacturer to ensure that all failures are reported,
         understood, and the implications are factored into any follow-on recommendations. A NASA OLD-Actel
         agreement should be set up giving NASA access to all relevant device failure reports and data.
        All NASA projects, both in-house and contractor-based, should send all failure reports of this class to the
          Office of Logic Design for analysis and trending. All failures will be cross-checked with the manufacturer
          to ensure that no failed devices "fall through the cracks," that proper failure reports are generated, and that
          NASA is fully informed of the results and can distribute recommendations and advisories, as appropriate.

Damaged Antifuses
    The long-term reliability and stability of a damaged antifuse is not presently known. A model of damaged
      antifuse behavior based on experimental data is needed from the manufacturer. The model must include the
      behavior of the antifuse over time so that flight hardware tests can be appropriately designed.
    A test to accelerate programmed antifuse damage should be developed, if practical and safe, for flight
      hardware, so that defects are detectable prior to shipment of the assemblies containing Actel SX-A/SX-S
      FPGAs.
    The available data shows that the symptoms of a damaged antifuse are not always detectable by either
      traditional-style board level or ATE testing.
    An increase in propagation delay is a typical indication of a damaged programmed antifuse. Such increases
      can range from tens of nanoseconds to microseconds (a simple trending check is sketched after this list).
    The "KTAG test," currently under development, may be able to non-invasively detect a large subset of
      damaged antifuses by exploiting the device's existing internal test structures.
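
A minimal sketch of the kind of delay trending this suggests follows; the data structures, baseline values, and flag
threshold are assumptions for illustration, not part of the report or of any Actel tool.

    # Illustrative check: flag timing paths whose measured propagation delay has
    # grown well beyond a recorded baseline, a possible damaged-antifuse signature.
    # Threshold and example data are assumptions for illustration only.
    FLAG_THRESHOLD_NS = 10.0   # lower end of the "tens of nanoseconds" range

    def flag_suspect_paths(baseline_ns, measured_ns):
        return [path for path, delay in measured_ns.items()
                if delay - baseline_ns.get(path, delay) >= FLAG_THRESHOLD_NS]

    baseline = {"clk_to_out_A": 12.5, "clk_to_out_B": 14.0}   # ns, assumed baseline
    measured = {"clk_to_out_A": 12.7, "clk_to_out_B": 86.0}   # ns, path B grew by ~72 ns
    print(flag_suspect_paths(baseline, measured))             # ['clk_to_out_B']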

Design of Test Equipment
     Improperly performed electrical measurements have the capacity to degrade or damage the parts.
     A survey of test equipment outside of the manufacturer's facility has failed to find a single facility that
        meets acceptable and safe standards, including the ATE and burn-in fixtures for MER and MRO. The
        observed examples include lack of control of key clock signals, absence of adequate bypassing and voltage
        control of the supplies, and failure to prevent bus contention.
     The personnel designing and operating test equipment are often not sufficiently familiar with modern,
        complex devices under test, device design considerations, or device limitations, or with some critically
        important operational characteristics of their own equipment. As an example, for the MER devices, the test
        personnel did not know whether their test vectors implement a "break before make" philosophy when
        switching the direction of bidirectional buses (a minimal contention check is sketched after this list), what
        the current limit settings on the pin drivers were, or whether bus contentions were present. The SX-S
        devices have large short-circuit output currents exceeding 225 mA per pin.
     The design of test equipment, with respect to the electrical environment to which the device is subjected,
        must be performed to flight standards including application analysis, margin analysis, and device protection
        analysis. For modern, high-performance devices with reduced margins for damage, these standards will be
        of increasing importance.
     The design of test equipment currently falls outside of the flight review process and constitutes a risk to the
        flight hardware. All test equipment must be thoroughly and properly reviewed for safety to the devices
        under test.
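
A minimal sketch of the vector-level contention check implied by the "break before make" concern follows; the
vector representation is an assumption made for illustration and is not a format used by the MER test sets.

    # Illustrative "break before make" check: verify that the tester and the DUT
    # never drive the same bidirectional pin in the same cycle when bus direction
    # changes. The vector format is an assumption for this sketch:
    # each vector maps pin name -> (tester_drives, dut_drives).
    def find_contention(vectors):
        conflicts = []
        for cycle, vec in enumerate(vectors):
            for pin, (tester_drives, dut_drives) in vec.items():
                if tester_drives and dut_drives:
                    conflicts.append((cycle, pin))
        return conflicts

    vectors = [
        {"D0": (True, False)},   # tester drives the bus
        {"D0": (True, True)},    # direction switched with no tri-state cycle: contention
        {"D0": (False, True)},   # DUT drives the bus
    ]
    print(find_contention(vectors))   # [(1, 'D0')]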




Programming Devices - User
    For each programmer used, all programming activity should be logged with programming yields computed
       (see the logging sketch after this list).
    Actel normally achieves a 95% programming yield. The 5% dropout is normal since the programming
       involves paths not previously tested and not all antifuses are expected to program satisfactorily.
    Actel customers on average achieve a significantly lower programming yield than the manufacturer, with
       an average dropout rate of approximately 10%, twice as high as Actel's. Neither the discrepancy itself
       nor its potential implications are currently understood. One possible explanation for the difference
       is lack of proper care of the programmer, inadequate power conditioning, and poor device handling
       practices. Note that several MER parts submitted for failure analysis were in physically poor condition.
    Based on the currently unexplained lower programming yield, the following defensive and conservative
       practices are recommended:
            o All programmers should be on properly conditioned power.
            o The programmer and adapter socket should be checked by the calibration routines prior to the
                 programming of each device.
            o Each programmer should have complete programming records to detect any trends.
             o    A single programmer should be used for flight devices at each facility.
            o Actel should offer all SX-S and military grade device users a programming service documented in
                 the data sheet.
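
A minimal sketch of the per-programmer logging and yield computation recommended above follows; the record
fields, file name, and format are assumptions for illustration, not a prescribed procedure.

    # Illustrative programming log: one record per programming attempt per
    # programmer, with the yield computed from the accumulated log.
    import csv
    from datetime import datetime, timezone

    LOG_FILE = "programmer_A_log.csv"   # assumed: a dedicated log per programmer

    def log_attempt(serial_number, passed, log_file=LOG_FILE):
        with open(log_file, "a", newline="") as f:
            csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(),
                                    serial_number, "PASS" if passed else "FAIL"])

    def programming_yield(log_file=LOG_FILE):
        with open(log_file, newline="") as f:
            results = [row[2] for row in csv.reader(f)]
        return results.count("PASS") / len(results) if results else float("nan")

    log_attempt("S/N 0001", True)
    log_attempt("S/N 0002", False)
    print(f"Programming yield to date: {programming_yield():.1%}")   # 50.0% for these two records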

Device Programmer Functions and Features – Proposed Additions
    Display and log all device supply currents before and after programming.
    Display and log a list of all tests performed for device verification, after the programming sequence is
        complete.
    Record the antifuse programming currents and the number of pulses needed to program each antifuse in a
        log file.
    Always obtain user permission prior to overwriting any log file.
    Tighten the programming tolerances for all SX-S and military grade devices to screen devices that are
        outside the normal distribution if field reliability can be improved and/or margins increased as a result.
    Restore programmer functionality that allows device stimulation, debugging, and functional verification.

Enhance Communication from Manufacturer to High-Reliability Users
    For all devices that fail in qualification testing, publish the failure mechanism and root cause of failure in
        the reliability report. The manufacturer's failure reports should be available on request.
    The data sheets specify absolute limits for the devices. The guard bands used for acceptance testing should
       be available in report form to users on request.
    Produce failure analysis reports that can be shared between customers to ensure that all users make fully
       informed decisions while not releasing proprietary information and keeping user identities anonymous.
    Ensure that users have the opportunity to review data packages prior to device shipment.

Recommended Actel Testing
    During qualification, test a set of devices to failure over all operating parameters. This will establish the
      limits and the margins of the devices, enable future device improvements, and provide guidance to
      designers on areas that need extra margin in the system design rules.
    Conduct testing in various simulated application environments. This will permit the determination of
      reliability for credible worst-case application conditions.
    Screening devices with tightened guardbands for critical parameters such as device currents will permit
      identification of outlier devices. These outliers can still be used for non-critical engineering and other
      applications that are not mission- or safety-critical.




User Testing – Automatic Test Equipment (ATE)
     Actel indicated that judicious post-programming ATE voltage and schmoo testing should not degrade the
        devices. The Team concurs.
     User post-programming electrical test (PPET) is not encouraged or recommended. The Team concludes
        that the risk to the health of the part outweighs the possible benefits. This is the case for many of the test
        sets that have been examined, including the one used for MER.
             o Stuck-at fault coverage testing is poor, ranging from approximately 15% to 60%, and could
                  provide false confidence (see the defect-level sketch after this list). The NASA ASIC guide
                  recommends a level of at least 99%. This is consistent with BAE Systems guidelines for their
                  0.25 µm gate array product.
             o The type of testing conducted on these testers, including "at-speed testing," is unlikely to detect
                  cases of damaged antifuses unless the damage manifests itself as a gross timing violation; in such a
                  case, the failure should be detectable in a proper board level test. ATE and board-level tests
                  cannot determine the true slack for the majority of timing paths.
             o A survey of various ATE testers, including those that have tested SX-S parts for extended periods
                  of time, has shown that every test system failed the review. This represented a credible risk to the
                  part by not operating the part as designed, exceeding specification limits, and/or stressing the
                  device.
     If special ATE testing is required, these tests should be performed by the device manufacturer prior to
        shipment, thus minimizing the number of different boards interacting with the flight device.
     If ATE is used for PPET, then the tester and all test programs must be qualified to flight (electrical)
        standards, including a full analysis combined with direct measurements of the electrical environment to
        which the part will be subjected. This is a higher level of review and certification than what has been
        practiced historically. It was found that operators of this equipment are often not familiar with the
        electrical environment that they are creating for the flight devices.
     The use of the ActionProbe feature of SX-A and SX-S devices permits many delay paths to be measured
        non-invasively on the flight board and trends observed over time, voltage, and temperature, without
        limiting measurements to the path with the minimum slack time. This test is safe since it utilizes the
        standard Actel Silicon Explorer and an oscilloscope.
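
To illustrate why low stuck-at fault coverage can provide false confidence, the following minimal sketch uses the
widely cited Williams-Brown approximation, DL = 1 - Y^(1 - T), which relates fault coverage T and yield Y to the
defect level DL among parts that pass the test. The formula and the assumed yield are supplied here for
illustration; neither is taken from this report.

    # Illustrative only: Williams-Brown approximation for the fraction of passing
    # parts that still contain a defect, given test fault coverage and yield.
    # The assumed yield below is a placeholder, not data from the report.
    def defect_level(yield_fraction, fault_coverage):
        return 1.0 - yield_fraction ** (1.0 - fault_coverage)

    assumed_yield = 0.90
    for coverage in (0.15, 0.60, 0.99):
        print(f"coverage = {coverage:.0%}: defect level = {defect_level(assumed_yield, coverage):.2%}")
    # coverage = 15%: defect level = 8.57%
    # coverage = 60%: defect level = 4.13%
    # coverage = 99%: defect level = 0.11%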

User Testing – Post-Programming Burn-In (PPBI)
     There is no standard or generally accepted procedure for burning-in programmed units.
             o The MER burn-in was performed for 96 hours. It was noted that this number was dictated by
                 schedule considerations and was not based on engineering analysis.
             o The MER and MRO burn-in tests utilized a pseudo-random bit sequence without disabling random
                 signals from asserting device reset pins, needlessly minimizing the dynamic exercising of the part.
                 JPL personnel could not state what the toggle rate was, if any, for the damaged node in one of the
                 failed devices. The toggle rate can and should have been determined, as it is critical to
                 understanding the electrical environment of the node that failed during their test (a toggle-rate
                 sketch follows this list).
     Actel data shows that a properly constructed test set for post-programming burn-in will not damage the
        parts. A poorly constructed test set for post-programming burn-in may damage devices under test.
             o Actel has tested approximately 3,000 devices for qualification with two failures.
             o The two failures were system failures, with facility noise coupling into the device's supply pins,
                 resulting in electrical overstress, a.k.a. a "Dirty Burn-In."
             o Special tests, in support of this problem investigation, resulted in additional failures during these
                 unintentional "Dirty Burn-Ins." The dropout rate during these runs was approximately 2%.
                 Identifying the source of the problem and protecting the devices under test was non-trivial even
                 for the device manufacturer.
     Post-programming burn-in is neither encouraged nor recommended. The Team concludes that the risk to
        the health of the part outweighs the possible benefits. This conclusion is strengthened by the results of the
        examination of PPBI test sets, including the one used for MER.⁵

⁵ Note: Blank Actel military and space grade devices are subject to a dynamic burn-in as part of the manufacturer's
processing flow. This dynamic burn-in exercises all device elements except the programmed antifuses, which are
formed at the end-user facility. The programmer measures each antifuse, ensuring that its resistance is within
predefined limits.


            o      Post-programming burn-in tests to date have not shown any failure mechanism that is accelerated
                   by the post-programming burn-in test. These tests have only demonstrated that PPBI is often
                   improperly conducted and damages and/or stresses devices.
              o The MER PPBI set utilized long runs from the power supply to the board, approximately 5 feet
                   total, and had no capacitors on the DUT card. This is a violation of standard flight design rules. It
                   can lead to overshoot on power-up, a sensitive condition for these parts, as well as excessive noise
                   on power and ground. JPL data showed that the device supply voltage levels were out of
                   specification.
              o The MER PPBI board used a general purpose laboratory signal generator with an output impedance
                   of 50 Ω to supply a 5 MHz clock to four Actel FPGAs: the pattern generator and three devices
                   under test (DUTs). No controlled impedance cables or line terminations were used. The "Phase 2"
                   burn-in system had a clock buffer board with some of the same characteristics. JPL personnel
                   informed the Team that the implemented modifications were limited by available budget and
                   schedule. The signal integrity of these lines is critical, and maintaining it over long distances is
                   non-trivial. A flight design would normally utilize buffers (and often a standard clock distribution
                   IC), controlled impedance runs, and proper termination to ensure safe and reliable operation of the
                   device. Measurements made of these signals were inconclusive: it was reported that there was no
                   difference in quality between the two setups, with neither setup meeting all of the device's
                   specifications. A credible, but not proven, explanation of this result is that the measurements
                   simply reflected the scope probe's characteristics. The scope probe characteristics, including
                   impedance and bandwidth, were not known at the time of the meeting.
              o The utilization of one stimulation integrated circuit to drive three devices under test can result in
                   many arrangements of drivers with no mechanism for synchronization. This can potentially result
                   in illegal voltage levels on the inputs to the devices under test. JPL provided no analysis to show
                   that the devices were always operated in a legal configuration.
              o It was found that operators of the equipment are not familiar with either the electrical environment
                   that they are creating for the flight devices or the key characteristics of the devices they are testing.
       If post-programming burn-in testing is performed, the equipment and all procedures must be qualified to
        flight (electrical) standards.
              o A full analysis combined with direct measurements of the electrical environment that the part will
                   be subjected to must be performed.
              o Operators of the equipment must be trained to understand the electrical environment that they are
                   creating for the flight devices, requirements for that device, and how to properly monitor that
                   environment.
              o Properly conditioned power should be used with appropriate power monitors.
              o If PPBI is to be performed, the flight project (and not the test facility) must establish the safety of
                   the burn-in equipment. This will require a detailed analysis and a test plan that measures the
                   electrical environment the DUTs are subjected to.
              o A sufficient set of devices must be subjected to the same equipment, facility, and procedures prior
                   to flight devices to qualify the test environment.
       If post-programming burn-in testing is performed, then the rationale must be clearly stated with a
        quantitative analysis providing justification for the risk.
              o A failure mechanism and a model for its acceleration by the dynamic and environmental burn-in
                   conditions must be unambiguously and verifiably established.
              o The flight program must state reliability goals that significantly exceed the reliability level
                   established by the manufacturer (approximately 10 FITs for SX-A/SX-S).
              o No data demonstrating inconsistency with the manufacturer's reliability numbers was presented.
                   Field performance for most customers, totaling more than 10⁶ devices, is consistent with the
                   manufacturer's claim.
       It is further recommended that once the component is installed in the hardware assembly, the device should
        be thoroughly and extensively tested in the target electrical environment.
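
A minimal sketch of the toggle-rate determination called for above follows; the LFSR polynomial, sequence length,
and the idea of computing the rate offline are assumptions for illustration, not the stimulus or procedure actually
used on MER or MRO.

    # Illustrative toggle-rate calculation for a pseudo-random stimulus bit.
    # A 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) generates the sequence; the
    # toggle rate is the fraction of cycles on which the observed bit changes.
    def lfsr_bits(seed=0xACE1, n=10_000):
        state = seed
        for _ in range(n):
            yield state & 1
            bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)

    def toggle_rate(bits):
        bits = list(bits)
        toggles = sum(b1 != b0 for b0, b1 in zip(bits, bits[1:]))
        return toggles / (len(bits) - 1)

    print(f"Toggle rate of the PRBS output: {toggle_rate(lfsr_bits()):.2f}")   # about 0.50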




                    Summary of the Meeting and Discussion
This short narrative will describe some of the meeting's highlights and the Team's conclusions.


Device Reliability and Application
It is clear that there are issues with the SX-S devices that need to be addressed by users of these devices. While
these devices have generally been used without failure, there are "clusters of failures" at several organizations. In
contrast, essentially zero parts failed in user testing and operations for previous generations of Actel FPGA devices.

Based on the currently available data, the Team concludes that the devices are reliable when operated per
specification, with no data presented to indicate otherwise. However, the amount of margin in these newer devices
is lower than in previous devices and lower than historically encountered. Thus, particular care must be taken in the
application of these FPGAs, especially with respect to power supplies and signal integrity issues, consistent with
Actel's publications. The Team also noted that this is not a phenomenon particular to SX-A/SX-S devices but one
characteristic of modern high-speed digital technologies in general.

The Team also concluded that the parts are nontrivial to use and easy to damage. The device documentation should
be strengthened to ensure proper application. While the data sheets do specify limits on the electrical environment
required for safe operation, additional cautions and warnings need to be prominently presented, making device safety
as important to the user as functionality and performance. Future devices with advanced semiconductor technology
will also have similar constraints.


MER Automated Test and Post-Programming Burn-In Equipment
The automated test and post-programming burn-in equipment used for the MER devices had critical design
deficiencies that operated the part in violation of the manufacturer's specification. Standard engineering practices
such as power supply bypassing, clock signal terminations, and maintaining valid logic levels on CMOS inputs were
not employed. Some violated requirements included maximum supply voltage and input transition times. Based on
JPL's presentation, Actel's analysis, and all available data, the Team concludes that the MER failures were most
likely caused by overstress in testing and handling.


The Team has also concluded that Actel's procedures for testing and burning-in unprogrammed parts for high
reliability applications appear to have high fault coverage and are sufficiently effective in capturing infant mortality
problems. The most significant structure of the device not tested at the factory is the set of programmed antifuses,
which are verified by an impedance check during and after programming. An inherent limitation of the
manufacturer's burn-in test is the fact that the antifuses are not programmed prior to shipment; thus that structure is
not burned in. It is noted that Actel, which tests and handles these parts on a day-to-day basis, also experienced
occasional problems during testing. The root causes of these problems, according to the information presented by
Actel, were facility-induced "systems accidents" and issues with sockets.




Failure Analysis Evaluation
Actel's analysis was stopped prior to completion, but the Team found the work to be of reasonable quality and well
documented.

JPL indicated that no formal reports existed, and aside from notes in a lab book, much of the work was incomplete
and could not be defended technically. The Team found the work to be insufficient to support claims of
manufacturing defects in the part. Additionally, it was found that after almost two years, contracts are not yet in
place to continue this work, and neither Actel nor Sandia National Labs is still involved. There are strong indications
that the parts were subjected to a dangerous electrical environment, strongly suggesting electrical overstress as the
cause of failure. Because of the lack of data and analyses, however, electrical overstress cannot at this time be
proven to be the root cause for all of the device failures.

MER S/N 132: This device had a damaged HCLK input (~44 mA IIH). JPL determined "manufacturing defect" as
the root cause since only 1 of the 3 devices under test at the time was damaged. This conclusion was drawn from
the symptoms and did not attempt to identify the root cause of failure. The Team concluded that most likely the
failed device clamped the input and protected the other two devices in the test fixture. The damaged input's low
impedance likely limited subsequent clock pulses to lower voltage levels. Based on the physical data, the structure
was damaged by electrical overstress, not ESD. The claim of "infant mortality" for this failure was clearly
inconsistent with the fact that the failed structure of this Grade E device was already subject to a 240 hour dynamic
burn-in as well as a static burn-in. Transients from the external clock source or a faulty setting in the test equipment
were not considered by JPL as a possible cause of failure. Additionally, the cable from the signal generator's 50 Ω
output was not terminated. Given that these inputs are designed for use in up to 5V systems (higher than
the 3.3V MER application) and that Actel analysis showed that far higher voltages were required for the destruction
observed, the failure was most likely the result of a testing error.

MER S/N 50: JPL stated, based on Sandia TIVA images, that the failure was a metal-to-metal short between two
output buffers. No direct linkage from the TIVA images to the schematics could be made, and follow-up questions
were not answered. Following the meeting, the full set of 26 TIVA images was obtained from Sandia and the
damaged areas were compared against the layout of the device. The locations identified corresponded to the
locations of transistors involved in the "SLX problem," and the increased current in the device was similarly consistent. In
summary, the SLX problem originated from the porting of the SX-S design from the SX series: several 3.3 V-rated
transistors inadvertently remained in the design when the output cells' drive levels were increased to 5 V, so the
device is reliable only for 3.3 V I/O. If 5 V is applied to the device (a standard voltage used for previous generations of
parts), there will be up to three gate-to-substrate breakdowns, with a current signature matching that found for S/N
50. Thus, the JPL conclusion was not accepted by the Team. Since the failed transistors were the "weak link" on
the VCCI supply, and there were no data showing any other structures damaged, the cause of failure was most likely
electrical overstress.

MER Setup Part: This device was used to verify the safety of the various test fixtures. After an outside reviewer
suggested that the critical ICCA measurements be added to the test procedure, it was found that this device had an ICCA of
approximately 50 mA, far out of specification and an indication of serious damage. No other information related
to this failure was available.

MER S/N 52: This device failed the ICCA delta limits intermittently. The JPL presentation attributed this behavior to
the fact that the CLKA and CLKB inputs were not terminated properly on the test card. It is known and well
documented that not terminating these inputs results in high currents, since they are not pulled down internally within the
device. Clock inputs were properly terminated by the design software in earlier generations of Actel devices, but
this family's architecture was optimized for speed and omits built-in termination.
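
As a simple illustration of the termination concern, the following Python sketch (using assumed, representative
values rather than SX-S datasheet limits) sizes an external pull-down resistor so that worst-case input leakage
cannot float an otherwise undriven clock pin above a valid logic '0' level:

    # Sketch only: sizing an external pull-down for an otherwise undriven CMOS
    # clock input. The VIL and leakage figures below are assumed, representative
    # values, not SX-S datasheet limits.
    def max_pulldown_ohms(vil_max_volts, input_leakage_amps):
        """Largest pull-down resistance that still holds the pin at or below
        VIL(max) against the worst-case input leakage current."""
        return vil_max_volts / input_leakage_amps

    vil_max = 0.8       # volts, assumed VIL(max)
    leakage = 10e-6     # amps, assumed worst-case input leakage
    print(f"Pull-down must be <= {max_pulldown_ohms(vil_max, leakage):,.0f} ohms")
    # In practice a much smaller value (e.g., 1-10 kohm) is chosen for noise margin.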




MER S/N 153 and S/N 179: These two devices failed to program and were described by JPL as manufacturing
defects with cause unknown. Based on the design and technology of these devices, a dropout rate of approximately
5% during programming is considered normal and is seen at Actel during in-house programming, since not all structures
associated with the programmed antifuse can be tested until the device is personalized. The dropout rate seen in this
lot (2 out of 75 devices, or approximately 2.7%) is therefore considered acceptable. Note that a dropout rate of approximately this magnitude is not
particular to this family but is an inherent property of all Actel antifuse-based FPGAs since the introduction of their
first family, Act 1.

MER S/N 117, 118, 119, 120, 121, 123, 124, and 125: These devices were initially (as of the October meeting)
flagged as having the input logic threshold (VIH) out of specification (too high). These 8 device failures were listed
as "cause unknown," and it was speculated that the failures may be design dependent. The speculated cause of
failure was not accepted at the prior meeting, and the results appeared to be an artifact of a noisy test. One example
was the HCLK input of the "nvm_vme_i/f_b" devices S/N 124 and S/N 125, where the routing and loading of this
hardwired clock are fixed by design. An examination of the VIH and VIL values presented showed an anomalously
large spread between the two measurements. NASA GSFC data as well as Actel characterization data were self-
consistent and reasonable, both for absolute values and for the VIH and VIL spread. The data strongly suggested that the
measurements were in fact severely affected by noise, which is design and test pattern dependent. The test
method used is known to produce noisy measurements. At the meeting, JPL agreed with this assessment. A further
examination of the test equipment showed inadequate bypass capacitors on the DUT card. Furthermore, the
measurement technique is also a concern since it results in "non-logic levels," or out-of-specification voltages, on the
clock input and potential device overstress.

MER S/N 91: This device showed symptoms of increased propagation delay (microseconds instead of the expected
nanoseconds) after the post-programming burn-in test. Actel diagnostics indicated that the failure mechanism was
increased resistance of an antifuse. This signature is similar to several other failures recently diagnosed.

This device failed to program on the first attempt and should have been immediately discarded rather than continued
through the flight processing flow. The Team cannot dismiss the failure on this basis, since there is no known linkage
from the failure to program to the damaged antifuse. The "invalid electronic signature" message reported by the
programming software indicated a probable initial setup problem.

The JPL analysis indicated "outlier" VIL (the logic '0' input threshold voltage) performance during pre-burn-in
testing. However, no linkage between the threshold of an input buffer and the resistance change of an antifuse in the
core of the device was shown.

Measurements on the burn-in board also showed that the maximum observed voltage of 3.2 volts exceeded the
absolute maximum specification. The integration time for this measurement was only a few seconds, which is
considered short and may not capture the worst-case transient event on the supply, as has been seen during other
characterizations. JPL analysis estimated that a transient of at least 16 V would be required to degrade the
antifuse, a claim withdrawn at the meeting. Data presented earlier by Actel indicated that far lower voltages can
damage a programmed antifuse.

There is a lack of consensus between Actel and JPL on the root cause of this failure. No analysis showing a
manufacturing defect was presented at the meeting; indeed, the physical analysis of the device has yet to be started.
Actel's position is that the data suggest a voltage transient, a position supported by the design of the dynamic burn-in
fixture, JPL's measurements of voltages exceeding the device's absolute maximum rating, and other failures with a
similar signature. While there is no direct evidence showing the linkage from electrical overstress to antifuse
damage, the characteristics of this particular burn-in test set, given the small margins in this device family, suggest
that electrical overstress is a likely root cause of failure. No other credible mechanism backed by verifiable
engineering data was presented. This determination must be made cautiously, as antifuse damage is not yet fully
understood.




MRO S/N 8068, S/N 8069, and S/N 8129: Three of six devices for the Mars Reconnaissance Orbiter failed the
ICCA delta limit at 25 °C, with two of the devices' currents decreasing by about 6 mA. The measurements at -55 °C,
+25 °C, and +125 °C showed a characteristic unusual for CMOS devices in general and these FPGAs in particular,
with the highest current reported at room temperature; normally, device leakage current increases with
temperature. Additionally, the room temperature current of 6 mA is substantially higher than normally observed,
although still within the device specification. Actel commented that such currents could be a result of socket problems,
with the p- and n-channel transistor leakage currents on tri-stated, bidirectional pins determining the bias of the
transistors' gates and the subsequent totem-pole currents, a condition that Actel had observed at its own facility. The
Team also noted that the three measurement points were insufficient to determine what was happening within the
device. It was recommended that these devices be tested in a NASA universal test board to verify the reported
measurements, which may be an artifact of the test setup.
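
To illustrate why the reported temperature behavior was considered anomalous, the following Python sketch (with
made-up current values, not the MRO measurements) flags devices whose room-temperature supply current exceeds
their high-temperature current, the opposite of the leakage trend normally expected for CMOS:

    # Sketch only: screening ICCA-vs-temperature data for the expected CMOS
    # trend (leakage should rise with temperature). Values are illustrative.
    def anomalous_icca(icca_by_temp_ma):
        """Return True if ICCA does not increase monotonically with temperature."""
        temps = sorted(icca_by_temp_ma)
        currents = [icca_by_temp_ma[t] for t in temps]
        return any(later < earlier for earlier, later in zip(currents, currents[1:]))

    normal_device = {-55: 0.05, 25: 0.2, 125: 1.5}     # mA, illustrative
    suspect_device = {-55: 0.1, 25: 6.0, 125: 1.0}     # mA, peak at room temperature
    print(anomalous_icca(normal_device))    # False
    print(anomalous_icca(suspect_device))   # True: warrants retest in a known-good fixture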

Test Equipment Evaluation
The Team has noted that SX-A/SX-S devices are nontrivial to use, easy to damage, and not sufficiently documented
to assure proper application. Thus, for any testing on these devices, the test equipment and operating
procedures must be designed with the same care as flight hardware with respect to the electrical environment to which the
devices are subjected. The "test as you fly, fly as you test" approach is especially important for these devices, as
testing can inadvertently reduce their reliability. In recent evaluations of test equipment, serious deficiencies have
been observed. Such instances include contention between large buses, floating inputs, unterminated clock pins,
signal integrity issues, and a lack of bypassing to limit transients on the power supplies, as well as other issues. In general,
the problems stem from a lack of a thorough understanding of these modern, advanced-technology,
high-speed devices. For instance, the test personnel did not know whether their automated test equipment had issues
with bus contention or implemented a "break before make" approach, the standard flight design practice for
bidirectional buses. While older devices were often tolerant of such conditions, the modern devices, with reduced
safety margins, are not. This question was initially raised at the October "Industry Meeting" and there was no
follow-up.

Recommendations – Post-Programming Burn-In
The Team has noted that no credible engineering case has been made for performing post-programming burn-
in on the 0.25 µm SX-A or SX-S devices. No failure mechanism has been shown that would be accelerated by the
post-programming burn-in as performed by MER at high temperature. In fact, based on data and analysis from both
Quicklogic and Actel, a low temperature burn-in should be used for burn-in of the programmed antifuse, as low temperature
increases the peak current through the antifuse, a stressing condition. A review of Actel data and procedures shows
that they use both low and high temperature testing for device qualification. The Team agrees with this conservative
approach.

A credible case against post-programming burn-in can be made since, from the data available, the benefits to be
gained are small and the risks high. For this technology, Actel has burned in approximately 3,000 devices for device
qualification, mostly without failures. During the course of this testing two devices failed, with both failures
traced to issues with the facility.
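
For context, the following Python sketch (standard exact Poisson statistics applied to assumed counts, not an
analysis performed by the Team) shows the upper bound that such qualification results place on the underlying
per-device failure rate:

    # Sketch only: one-sided exact (Poisson/chi-square) upper confidence bound
    # on a per-device failure rate. The counts below are assumed for illustration.
    from scipy.stats import chi2

    def failure_rate_upper_bound(failures, devices, confidence=0.95):
        """Upper bound on the failure rate given `failures` in `devices` screened."""
        return chi2.ppf(confidence, 2 * (failures + 1)) / (2 * devices)

    # ~3,000 qualification devices: zero device-attributable failures gives the
    # classic "rule of three" bound; counting both facility-induced failures
    # raises the bound only modestly.
    print(f"{failure_rate_upper_bound(0, 3000):.2%}")   # about 0.10%
    print(f"{failure_rate_upper_bound(2, 3000):.2%}")   # about 0.21%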

During recent tests in support of this failure investigation, using the same burn-in boards used previously, Actel
observed a high dropout rate of approximately 2%. This change led to instrumentation of the facility to track
down the triggering event. It was found, again, that noise introduced into the system by the burn-in facility via
the heating and cooling equipment was violating the device specifications. Care must be taken not only in the
application of the device but in the system design, as even the experienced manufacturer encounters problems.
"Dirty burn-in" tests, with the electrical environment outside of the Actel specification, have produced significant
fallout rates, while "clean" tests have so far resulted in zero failures.

In the case of the MER devices, both JPL and Actel tested sets of approximately 75 devices, with all devices from
the same lot. While JPL reported a high level of failure for their post-programming burn-in test, the Actel
qualification tests showed zero failures. A small sample of the same lot was tested at NASA GSFC over a wide
temperature range and an extended time period, with a pattern designed to stress the programmed antifuse; no
failures were detected. Based on the data and information presented, the tests conducted by JPL represented a case
of "dirty burn-in" and are the likely cause of the MER failures.

Recommendations – Automated Test Equipment and Testing
The Team extensively discussed the benefits and risks of post-programming electrical test (PPET) on
automated test equipment (ATE). In principle, such testing is used to catch failures early, since that is in theory more cost-
effective than finding problems later during board level testing. However, at the failure rates being observed, the
test development and equipment certification are more expensive than fixing faults found later. Indeed, APL has been
testing its SX-S FPGAs and, to date, has found no failures.

Our survey of ATE has found significant issues in every test set examined, as discussed in this report. In spite
of what appears to be a common perception, ATE testing is neither simple nor risk-free.

ATE testing does offer larger degrees of freedom than most board or system level tests, since
temperatures, voltages, and frequencies can be varied without any system limitations. A damaged antifuse, however,
frequently manifests itself through a propagation delay increase that does not exceed the available slack time and,
therefore, is not detectable with traditional ATE methods. For faults with gross propagation delay increases, the
damage would be detected by either properly designed board level testing or ATE testing.
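
The slack argument can be made concrete with a short Python sketch (the path delays and clock rates below are
assumed for illustration, not SX-S timing data):

    # Sketch only: a delay fault is caught by at-speed testing only when the
    # degraded path no longer fits in the clock period. Numbers are illustrative.
    def delay_fault_detectable(nominal_delay_ns, delay_increase_ns,
                               clock_period_ns, setup_ns=0.0):
        slack_ns = clock_period_ns - setup_ns - nominal_delay_ns
        return delay_increase_ns > slack_ns

    # A 12 ns path tested at 20 MHz (50 ns period) has ~38 ns of slack, so even
    # a 30 ns degradation (several times the nominal delay) escapes detection.
    print(delay_fault_detectable(12.0, 30.0, 50.0))   # False: escapes the test
    print(delay_fault_detectable(12.0, 30.0, 25.0))   # True at 40 MHz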

The discussion of ATE testing as a means of detecting programmed antifuse damage may prove academic. The data
sets and analysis strongly suggest that the devices' faults are introduced via electrical overstress, either as a result of
additional testing or of the application environment. For the former, devices stressed in testing may be caught, but devices
passing the screen may have latent damage, depending on the fault coverage of the test. For the latter, devices
damaged in board and system level tests will not be detected, since it is not practical to remove these quad flat
package devices from surface mount boards and retest them. Lastly, no evidence demonstrating antifuse damage
when the device is removed from the programmer has been presented.

There are several additional issues with ATE testing. The first is fault coverage. Unlike ASICs,
which use a design-for-test methodology, FPGA test vectors on programmed devices typically have low stuck-at
fault coverage, often ranging between 15% and 60%, while the NASA ASIC guide specifies fault coverage of 99%.
Additionally, ATE cannot be used to observe trends in the parts' performance, since removing devices from
boards and retesting them is not practical with the ceramic quad flat packages being used. ATE is also limited because
the test vectors normally stop when the first, or worst-case, path is encountered; thus the margin in most paths cannot
be determined, only that of the path with minimum timing slack.
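
As a rough illustration of the fault-coverage concern, the following Python sketch (a strong simplification, with
assumed numbers) compares the expected number of defective devices that would escape screens of different
stuck-at coverage:

    # Sketch only: expected escapes from a screen of a given stuck-at fault
    # coverage, assuming a defect is equally likely to land on any modeled fault
    # site (a strong simplification). Counts are illustrative.
    def expected_escapes(defective_devices, fault_coverage):
        return defective_devices * (1.0 - fault_coverage)

    # 10 defective parts screened with typical FPGA functional vectors versus
    # the 99% coverage the NASA ASIC guide requires.
    for coverage in (0.15, 0.60, 0.99):
        print(f"coverage {coverage:.0%}: ~{expected_escapes(10, coverage):.1f} escapes")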

User post-programming electrical test (PPET) is neither encouraged nor recommended. The Team concludes that the risk
to the health of the part outweighs the possible benefits. This is the case for many of the test sets that have been
examined, including the one used for MER. If ATE is used for PPET, then the tester and all test programs
must be qualified to flight (electrical) standards, including a full analysis combined with direct measurements of the
electrical environment to which the part will be subjected. This is a higher level of review and certification than
has been practiced historically.

The Team believes that an in situ approach in the target hardware, using the safe, non-invasive test port of the
SX-A/SX-S devices, is safer and more effective for fault detection. This technique allows many propagation delays
to be measured directly and permits trending over time, voltage, and temperature. This type of testing will be
cheaper and will have higher fault coverage, since the "test vectors" come from system operations and tests; it will
also be safer, since all testing and handling are performed in a flight environment using simple, standard interfaces. With a small
modification to off-the-shelf Silicon Explorer software, these tests can be automated using standard laboratory
equipment.




Team Members, Presenters, and Attendees
Attendees at the meeting fell into three groups: the Team, the main presenters, and others who chose to participate.
General Dynamics was absent and did not participate or contribute data or analysis.

The Team was designed to be diverse, both organizationally and technically. The Team and consultants were
drawn from 5 NASA organizations and 4 non-NASA organizations. Organizations contributing to the Team,
including consultants, were:

       Applied Physics Laboratory
       Missile Defense Agency
       NASA Engineering and Safety Center
       NASA Goddard Space Flight Center
       NASA Johnson Space Center
       NASA Marshall Space Flight Center
       NASA Office of Logic Design
       National Security Agency
       University of Maryland/CALCE


Similarly, technical diversity was established. The Team and consultants had members with the following
skills/titles:

       Chief Engineer, NASA Goddard
       Head, NASA Office of Logic Design
       Logic/Electrical Design: Design Engineers, Analysts, and Supervisor
       NESC Discipline Chief Engineers: Power and Avionics; Software
       Physicist
       Project Manager
       Research Scientist
       Parts: Branch Chief, JSC; Chief Parts Engineer GSFC
       POC, Parts, Materials, and Processes, Missile Defense Agency (Representative)


General attendees represented the following organizations:

       Aerospace Corporation
       Applied Physics Laboratory
       Defense Threat Reduction Agency
       Jackson & Tull
       Jet Propulsion Laboratory
       NASA Langley Research Center
       NASA Goddard Space Flight Center
       NASA Office of Logic Design
       NEPP
       Orbital Sciences Corporation
       QSS




Team Members

Rod Barto           NASA Office of Logic Design                 Analyst
Lisa Coe            NASA Marshall Space Flight Center           Design Engineer
Martin Fraeman      Applied Physics Laboratory                  Section Supervisor
Kevin Hames         NASA Johnson Space Center                   Project Manager
Richard Katz1       NASA Office of Logic Design                 Head, OLD
Robert Kichak2      NASA Engineering and Safety Center          Discipline Chief Engineer, Power and Avionics
Andy Kostic         Missile Defense Agency                      Technical Risk Manager
Henning Leidecker   NASA Goddard Space Flight Center            Chief Parts Engineer
Jay Schaefer        National Security Agency                    Senior Engineer
Geoffrey Yoder      NASA Johnson Space Center                   Branch Chief, EV5
        1 Chair
        2 Co-Chair



Consultants

Sanka Ganesan       CALCE/University of Maryland                Research Scientist
Steven Scott3       NASA Goddard Space Flight Center            Chief Engineer, NASA Goddard
                    NASA Engineering and Safety Center          Discipline Chief Engineer, Software
        3 Holds both Center and NESC positions



Presenters

Dan Elftmann        Actel Corporation                           Director, Product Engineering
Kay Jobe            Boeing Corporation
John McCollum       Actel Corporation                           Founder
Jonathan Perret     Jet Propulsion Laboratory                   Principal Engineer



General Attendees

Richard Brace       Jet Propulsion Laboratory                   Manager, Mission Assurance Office
Arthur Bradley      NASA Langley Research Center                GIFTS Electrical
Anne Clark          Defense Threat Reduction Agency             Project Manager
Robert Hodson       NASA Langley Research Center                Lead Engineer
Igor Kleyner       Orbital Sciences Corporation/OLD            Principal Engineer
Ken LaBel           NASA Goddard Space Flight Center            NEPP Manager
Joanna Mellert      Applied Physics Laboratory                  Staff Engineer
Bruce Meinhold      Jackson & Tull                              ST5 Parts Engineer
Mike Sampson        NASA Goddard Space Flight Center            EEE Parts Manager
Noman Siddiqi       QSS                                         Parts Engineer
Thomas Tsubota      The Aerospace Corporation                   Member of Technical Staff



