SEUs

Introduction

As key components of space electronic systems, static random-access memory (SRAM)-based field-programmable gate arrays (FPGAs) are inevitably exposed to single-event upsets (SEUs) in the space environment, an unavoidable concern for long-term space missions. SEUs can flip bits in memory elements and induce transient faults in semiconductor devices, which in turn corrupt the data flow or control flow of a program and can lead to system failure or accidents. SEUs mainly comprise single-cell upsets (SCUs) and multiple-cell upsets (MCUs). In an SCU, also known as a single-bit upset (SBU), a single event changes the state of only one cell (bit) in the configuration memory. If a single event deposits sufficient charge, it may affect multiple memory cells, resulting in an MCU.
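
The difference between an SCU and an MCU can be illustrated with a toy model of configuration memory. This is only a sketch: the memory is assumed to be a flat list of bits, an MCU is assumed to hit physically adjacent cells, and the names `inject_upset` and `count_upsets` are hypothetical helpers introduced for illustration.

```python
def inject_upset(memory, start, width=1):
    """Return a copy of `memory` with `width` adjacent bits flipped.

    width == 1 models an SCU (single-bit upset);
    width > 1 models an MCU (assumed here to affect adjacent cells).
    """
    flipped = list(memory)
    for i in range(start, min(start + width, len(flipped))):
        flipped[i] ^= 1  # a bit flip toggles the stored value
    return flipped


def count_upsets(golden, observed):
    """Count the cells whose state differs from the golden configuration."""
    return sum(g != o for g, o in zip(golden, observed))


golden = [0, 1, 1, 0, 1, 0, 0, 1]
scu = inject_upset(golden, start=2)           # one cell changes state
mcu = inject_upset(golden, start=2, width=3)  # three cells change in one event

print(count_upsets(golden, scu), count_upsets(golden, mcu))
```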

High-level MCU mitigation is a low-cost approach to fault tolerance. It mainly comprises scrubbing, redundancy, and partitioning. In practice, a fault-tolerant strategy usually combines several of these techniques, so it is important to analyze their mitigation efficiency to inform the final design decision. The structure and characteristics of these design modes are briefly introduced as follows:

  • Triple Modular Redundancy (TMR): TMR is the most commonly used fault-tolerant design technique, as shown in Fig.1. Three identical modules perform the same operation simultaneously, and the majority output is taken as the correct result. As long as two of the three modules do not produce the same error at the same time, the error of the faulty module is masked and the system output remains correct. Because TMR is simple and highly reliable, it is widely used to protect FPGA designs against SEUs.
    Fig.1 Block diagram of TMR mode
  • Logical Partition: Logical partitioning based on TMR is a more effective fault-tolerant method than plain TMR, as shown in Fig.2. The idea is to divide the TMR-based design into N logic blocks. Each block contains three identical redundant parts and its own voting circuits, so the design can tolerate multiple failures as long as they fall in different partitions.
    Fig.2 Block diagram of Partition mode
  • Scrubbing strategies: Scrubbing is a mechanism that rewrites the configuration memory at a given time interval from a bitstream holding the original configuration; this bitstream is usually stored in memory that is not affected by SEUs. One of the main advantages of scrubbing is that it does not completely interrupt system operation, because only part of the configuration is updated in each scrub interval. However, frequent scrubbing causes the system to consume excessive energy.
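
The three design modes above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the demos: module outputs are assumed to be small integers, a partition is assumed to fail whenever two different replicas inside it are faulty (a conservative simplification), and the names `majority`, `faults_masked`, and `scrub_frame` are hypothetical.

```python
from collections import defaultdict


def majority(a, b, c):
    """Bitwise 2-of-3 majority vote over three module outputs."""
    return (a & b) | (a & c) | (b & c)


def faults_masked(faults, blocks_per_partition):
    """Check whether every partition has at most one faulty replica.

    `faults` is a list of (block_index, replica_index) pairs. Blocks are
    grouped into partitions of `blocks_per_partition` blocks, each with
    its own voter. Assumption: two faulty replicas in one partition
    defeat that partition's voter.
    """
    faulty_replicas = defaultdict(set)
    for block, replica in faults:
        faulty_replicas[block // blocks_per_partition].add(replica)
    return all(len(r) <= 1 for r in faulty_replicas.values())


def scrub_frame(memory, golden, start, length):
    """Rewrite one frame of configuration memory from the golden copy."""
    repaired = list(memory)
    repaired[start:start + length] = golden[start:start + length]
    return repaired


# TMR: the third module has an upset in its top bit; the vote masks it.
voted = majority(0b1011, 0b1011, 0b0011)

# Two faults in different blocks and different replicas of a 4-block design.
faults = [(0, 0), (3, 1)]
plain_tmr_ok = faults_masked(faults, blocks_per_partition=4)    # one voter
partitioned_ok = faults_masked(faults, blocks_per_partition=1)  # N = 4

# Scrubbing: only one frame is rewritten per scrub interval.
golden = [1, 0, 1, 1, 0, 0, 1, 0]
upset = [1, 1, 1, 1, 0, 0, 0, 0]  # bits 1 and 6 flipped
after_first = scrub_frame(upset, golden, start=0, length=4)
after_second = scrub_frame(after_first, golden, start=4, length=4)
```

In this toy model the single-voter design is defeated by the two faults, while the partitioned design masks both because each lands in a different partition. The scrubbing example shows why the scrub interval matters: the second upset persists until its frame is reached.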

Material

  • Demo 1 describes the effect of different mitigation modes against SCU;
  • Demo 2 describes the effect of different mitigation modes against MCU;
  • Demo 3 analyzes the reliability of a phased mission system with different design modes considering common cause failures.

Details of these demos can be found in (the files are about 10 MB, so they may take some time to load):

Provider: Shao Qi, Beihang University, sq5670063@buaa.edu.cn