# Radiation Hardening By Design: A Novel Gate Level Approach

Massoud Mokhtarpour Ghahroodi and Mark Zwoliński School of Electronics & Computer Science University of Southampton Southampton, SO17 1BJ, UK Email: [mmg08r, mz]@ecs.soton.ac.uk

Abstract—Soft errors induced by radiation, which cause malfunctions in electronic systems and circuits, have become one of the most challenging issues affecting the reliability of modern processors, even in sea-level applications. In this paper we present two novel radiation-hardening techniques at gate level. We present a Single-Event-Upset (SEU) tolerant flip-flop design with 38% less power overhead and 25% less area overhead at 65nm technology compared to the conventional Triple Modular Redundancy (TMR) flip-flop design. We also present an SEU-tolerant clock-gating scheme with less than 50% area-power overhead and no performance penalty compared to the conventional TMR for clock gating. Our simulations show that the proposed schemes can recover from SEU errors in 99% of cases.

## I. INTRODUCTION

The first report of a serious industrial problem due to soft errors dates back to 1978 and concerned the 2107-series 16-KB DRAMs by Intel; the errors were reported to be caused by $\alpha$ particles emitted by traces of radioactive material in the package, which led to radiation-induced Single-Event Upsets (SEUs) at sea level, referred to as "soft errors" [1]. Since then, radiation-induced problems have remained among the most challenging reliability issues in circuits and systems, not only in safety-critical applications and avionics but also in commercial off-the-shelf (COTS) products.

Soft errors are a subset of non-destructive single-event effects (SEEs) [3]. Among all SEEs, Single-Event Upsets (SEUs) are the major concern for radiation hardening, because 80% of system malfunctions in space are caused by SEUs on memory elements, as reported in [2]. This paper focuses on SEUs in sequential elements such as flip-flops and latches. Other memory elements such as SRAMs and DRAMs are usually protected by error-detection-and-correction coding techniques such as ECC and Reed-Solomon codes, and are therefore outside the scope of our discussion.

In Section II, we briefly survey the techniques proposed in the literature for dealing with SEUs at various levels of abstraction. In the following two sections, we present two novel radiation-hardening techniques at gate level: one for designing SEU-tolerant flip-flops and the other for designing an SEU-tolerant clock-gating scheme.

Emre Özer ARM Ltd. 110 Fulbourn Road, Cambridge, CB1 9NJ, UK Email: emre.ozer@arm.com

## II. BACKGROUND

#### A. Radiation Hardening By Design (RHBD) Techniques

In this section, we briefly survey the RHBD techniques proposed in the literature. Our main focus is on error detection and correction techniques rather than detection-only methods.

*1) RHBD at Layout Level:* The simplest solution at layout level is to increase the charge needed for an SEU to occur, which is known as the "critical charge". This can be achieved by increasing the capacitance of the sensitive nodes. The larger the capacitance, the higher the immunity to SEUs, at the cost of more power and area overhead [6]. In [4], [5], Enclosed Layout Transistors (ELTs) have been proposed to eliminate the radiation-induced current between source and drain, thereby preventing the upset. This has been demonstrated to be very effective in CMOS processes of different technology nodes. However, due to challenges such as modelling ELT transistors to compute W/L, the limited W/L ratios that can be achieved and the lack of symmetry in the device, very few such radiation-hardened cell libraries exist.
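The link between node capacitance and critical charge mentioned above can be written as a first-order relation. This is a common textbook approximation and not a formula taken from [6]; the symbols $C_{node}$ (capacitance of the sensitive node) and $V_{DD}$ (supply voltage) are introduced here for illustration, and the restoring current of the feedback transistors is ignored:

$$Q_{crit} \approx C_{node} \cdot V_{DD}$$

Under this approximation, doubling $C_{node}$ at a fixed supply voltage roughly doubles the charge a particle strike must deposit to flip the node, which is why the added capacitance trades area and power for immunity.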

*2) RHBD at Transistor Level:* Most of the techniques proposed at transistor level and above are based on spatial or temporal redundancy. At transistor level, Heavy Ion Tolerant (HIT) [7] and Dual Interlock Cell (DICE) [8] designs have been proposed. In both, the state-holding nodes are duplicated to avoid upsets. However, for 90nm technologies and below, the SEU immunity achieved by these techniques is reported to be only 10 times better than that of standard cells. Moreover, a particle strike on one of the state-holding nodes can make the cell output temporarily wrong, which can be fatal if the error propagates to the next logic stage [9].

*3) RHBD at Gate Level:* Among all of the techniques proposed at gate level, TMR is the most effective and has been used extensively in industry. The TMR concept can be applied at gate level or at higher levels of abstraction. For RHBD at gate level, usually all the sequential elements in the design are triplicated, with a majority voting circuit at the end. This imposes a 3.2X overhead in terms of area and power compared to a non-TMR sequential cell.

*4) RHBD at Register Transfer Level:* The concept of TMR can be applied at register transfer level too. In [10], a method for the automatic insertion of radiation-hardened modules in designs at Register Transfer Level (RTL) is described. In this approach the VHDL RTL code is taken and the desired replicated blocks are added to the design along with the required auxiliary signals. This is done in two steps: 1) target selection and replication, 2) resolution function. However, there is no commercial tool for automatic RHBD at RTL. In [11], an SEU error correction method is proposed in which the data paths are duplicated and the outputs of every stage are monitored continuously. In the case of a mismatch at a stage, a second computation is triggered on one of the two data paths while the other data path continues processing the next input. The assumption here is that neither of these computations requires error monitoring, because an SEU is unlikely to occur on two consecutive iterations.

*5) RHBD at Software Level:* When RHBD techniques cannot be applied in hardware (because of architectural or technological limitations), the software level is an interesting option. Various approaches have been proposed at software level, such as Computation Duplication [14], Procedure-level Duplication [15], Program-level Duplication [16] and Redundant Multi-Threading (RMT) [12], [13]. In all of these approaches, the error detection and correction capabilities are obtained by virtually adding Dual Modular Redundancy (DMR) or TMR schemes at different levels of granularity: instruction, instruction block, procedure, program, etc.

Applying RHBD techniques at each level of abstraction has its own advantages and drawbacks. There is a trade-off between overhead and efficiency, and RHBD at higher levels of abstraction usually adds to the complexity of such techniques. Among all of them, radiation hardening at gate level is the simplest and one of the most effective, and it is also supported by conventional EDA tools. It is worth noting that the DMR and TMR concepts are also applicable at system level, where the whole core (sequential cells and combinational blocks) is replicated; however, triplication adds more than 200% overhead to the total area and power at system level. In our discussion, we use the terms TMR and DMR for the sequential cells only and not for the replication of the whole system. In the next part, we propose two novel radiation-hardening techniques at gate level: one for SEU-tolerant flip-flop design and the other for an SEU-tolerant clock-gating scheme in a fully synchronous system.

## III. SEU-TOLERANT FLIP-FLOP DESIGN

In this part, we present a novel SEU-tolerant flip-flop design. The main difference between our design and other detection and recovery methods, which are typically based on the TMR concept, is that ours is based on DMR, which imposes less area and power overhead. Conventional DMR methods can only detect errors, with no recovery; the presented method can both detect and recover from SEU errors. During any given clock cycle, the two flip-flops in the DMR scheme shown in Fig. 1 should hold the same value. If an SEU occurs on one of the flip-flops during a clock cycle, the comparator detects the mismatch between the flip-flop outputs, but it cannot determine which of the two flip-flops was hit by the particle, so error recovery is not possible. However, right before the SEU occurs and the outputs diverge, both flip-flops were holding the correct value, as depicted in Fig. 2. We exploit this fact and propose the SEU-tolerant scheme depicted in Fig. 3. The timing diagram of the proposed scheme is shown in Fig. 4.



Fig. 1. Dual-Module-Redundancy (DMR)



Fig. 2. DMR Timing Diagram

In SEU-free situations, the XOR output is always low and the active-low latch is transparent. The delayed version of the output of one of the flip-flops passes through the active-low transparent latch to the main output. When a particle hits one of the flip-flops and causes an SEU, the XOR output goes high, indicating the mismatch, and closes the latch. Since the latch is fed by the delayed version of one of the flip-flop outputs (the delay being greater than the XOR propagation delay), the latch always closes on the correct value (the value before the SEU occurred) and holds it. Therefore the main output remains unchanged and is always correct.



Fig. 3. Proposed SEU-Tolerant Scheme - DMR with Error Recovery.



Fig. 4. DMR with Error Recovery Timing Diagram - In the occurrence of an SEU, the latch closes on the correct value (the region under the oval), thus the main output is always correct.

In other words, the latch is in transparent mode all the time, behaving as a combinational gate, and is in state-holding mode only during an SEU occurrence. The advantage of such a circuit is that even if a particle hits the latch in any given clock cycle, it can only cause a glitch on the main output, because the latch is in transparent mode and not holding any state. This also means that if, in any given clock cycle, two particles strike the module in such a way that the latch is hit first and one of the flip-flops is hit next, the circuit can again recover from the error: the latch closes on the second particle hit and stores the correct value, although with a glitch on the main output caused by the first particle hit.
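The recovery mechanism described above can be illustrated with a small discrete-time behavioural model. This is only a sketch under simplifying assumptions, not the authors' transistor-level implementation: time is divided into unit steps, the function name `simulate_dmr_recovery` and the parameters `xor_delay` and `elem_delay` are hypothetical, and the latch is assumed to be fed by the delayed output of the first flip-flop.

```python
def simulate_dmr_recovery(q1, q2, xor_delay=1, elem_delay=2):
    """Behavioural sketch of DMR with error recovery (hypothetical model).

    q1, q2     : lists of 0/1 samples of the two flip-flop outputs over time
    xor_delay  : XOR propagation delay, in time steps (assumed value)
    elem_delay : delay-element delay, in time steps (assumed value)
    Returns the main output seen after the active-low transparent latch.
    """
    out = []
    held = q1[0]  # value currently stored in the latch
    for t in range(len(q1)):
        # The latch enable at time t reflects the XOR inputs xor_delay steps earlier.
        tx = max(t - xor_delay, 0)
        mismatch = q1[tx] ^ q2[tx]
        # The latch data input is q1 delayed by the delay element.
        td = max(t - elem_delay, 0)
        delayed_q1 = q1[td]
        if not mismatch:
            held = delayed_q1  # XOR low: latch is transparent
        # XOR high: latch is closed and keeps the pre-SEU value
        out.append(held)
    return out


# Both flip-flops store 1; an SEU flips one of them to 0 at time step 4.
struck = [1, 1, 1, 1, 0, 0, 0, 0]
healthy = [1] * 8
print(simulate_dmr_recovery(struck, healthy))   # SEU on the flip-flop feeding the latch
print(simulate_dmr_recovery(healthy, struck))   # SEU on the other flip-flop
# With a delay element faster than the XOR, the corrupted value slips through.
print(simulate_dmr_recovery(struck, healthy, xor_delay=1, elem_delay=0))
```

In this model the main output stays at the stored value in the first two calls, while the third call shows why the delay element must not be faster than the XOR gate.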

The scheme has been implemented at transistor level and gate level for more accurate analysis. The proposed scheme can also be implemented at register-transfer level; however, care should be taken at the place & route stage to reduce charge sharing and charge collection between the sensitive nodes in a DMR/TMR sequential cell [17], [18]. It is also worth noting that an RTL implementation can complicate timing by placing the storage elements of a DMR/TMR sequential cell too far from each other, which complicates clock network synthesis at the place & route stage.

We have used 65nm technology standard cells with a 600 MHz clock frequency. The proposed flip-flop scheme contains 70 transistors in total, compared to 101 transistors for an equivalent TMR sequential cell (comprising three flip-flops and the majority voting circuit, implemented using standard cells with the same cell size and driving strength). On average there is 38% less power overhead and 25% less area overhead, because the scheme can be implemented with fewer transistors and gates than a TMR sequential cell. The comparisons are depicted in Fig. 5.
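As a rough cross-check on the transistor counts quoted above, the relative reduction in transistor count alone works out as follows. This is simple arithmetic on the two stated counts and only a proxy; the 38% and 25% figures come from the reported power and area measurements, not from this ratio:

$$\frac{101 - 70}{101} = \frac{31}{101} \approx 0.31$$

That is, the proposed cell uses roughly 31% fewer transistors than the TMR cell.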

The delay overhead in the TMR cell is due to the majority voting circuit, which comprises three 2-input AND gates and one 3-input OR gate in our implementation. The propagation delay of the TMR flip-flop cell is the sum $T_{Clock-to-Q}$ (of a non-TMR flip-flop) + $T_{Majority-Voter}$ + $T_{Interconnects}$, while the propagation delay of the DMR with recovery cell is the sum $T_{Clock-to-Q}$ (of a non-TMR flip-flop) + $T_{(delay-element + latch(D-to-Q))}$ + $T_{Interconnects}$. The delay overhead in the DMR with recovery scheme is caused by the delay element and the latch. On average, there is a 10% increase in the Clock-to-Q delay compared to the TMR cell, as shown in Fig. 5(b). This delay can be reduced by using smaller delay elements and faster latches, or by redesigning and characterizing the DMR cell as a new cell and adding it to the cell library.
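The AND-OR majority voter described above can be sketched at the logic level as follows. The function name `majority_vote` is hypothetical and the model is purely functional, so it says nothing about the voter's delay.

```python
def majority_vote(a, b, c):
    """Majority voter built from three 2-input ANDs and one 3-input OR."""
    return (a & b) | (b & c) | (a & c)


# A single upset on any one of the three flip-flop copies is outvoted.
stored = 1
for upset_copy in range(3):
    copies = [stored, stored, stored]
    copies[upset_copy] ^= 1  # flip one copy, as an SEU would
    print(upset_copy, majority_vote(*copies))
```

The voter sits in the data path of every TMR cell, which is where the $T_{Majority-Voter}$ term in the TMR propagation delay comes from.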

To validate the SEU immunity of the proposed scheme, transistor-level simulations have been used for statistical SEU fault injection. SEUs have been injected into either of the two flip-flops at different times during a given clock cycle in 10K Monte-Carlo runs to achieve a high level of confidence. The results show that the proposed scheme can statistically detect 100% of SEU errors and recover from 99.1% of them. In less than 1% of cases, the SEU occurs right at the rising edge of the clock, so that one of the flip-flops has no chance to store the input value. In this case, the XOR output goes high right at the rising clock edge, indicating the error, but whether the main output is correct depends on which of the two flip-flops was struck. If the struck flip-flop is not the one connected to the delay element, the main output is still correct, since the latch is fed by the unaffected flip-flop and closes when the SEU occurs; nevertheless, because of the mismatch at the XOR inputs, the Error signal goes high and the output is considered faulty.
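The structure of such a fault-injection campaign can be sketched with a very coarse timing model. This is not the transistor-level simulation used in the paper: the function `run_injection_campaign` is hypothetical, strike times are assumed to be uniformly distributed over one clock period (about 1667 ps at 600 MHz), and `edge_window_ps` is an arbitrary placeholder for the interval after the rising edge in which recovery is assumed to fail. It is not a value reported by the authors.

```python
import random


def run_injection_campaign(runs=10_000, period_ps=1667.0,
                           edge_window_ps=15.0, seed=0):
    """Coarse Monte-Carlo sketch of SEU injection (hypothetical model).

    Each run injects one SEU at a uniformly random time within the clock
    period. Detection is assumed to always succeed (the XOR sees the
    mismatch); recovery is assumed to fail only if the strike falls inside
    an assumed window right after the rising clock edge.
    """
    rng = random.Random(seed)
    detected = 0
    recovered = 0
    for _ in range(runs):
        strike_time = rng.uniform(0.0, period_ps)  # time since the rising edge
        detected += 1
        if strike_time >= edge_window_ps:
            recovered += 1
    return detected / runs, recovered / runs


detection_rate, recovery_rate = run_injection_campaign()
print(f"detected: {detection_rate:.1%}, recovered: {recovery_rate:.1%}")
```

In this model the recovery rate simply approaches one minus the ratio of the assumed window to the clock period, so the result is only as meaningful as the window chosen; a real campaign derives it from circuit simulation.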

## IV. RADIATION-HARDENING AND CLOCK-GATING DESIGN

One of the most important issues that is usually ignored in radiation-hardening at gate-level is the radiation susceptibility of the low-power design techniques such as clock-gating. To save power, the clock signal is gated with an enable signal, in such a way that, when the flip-flop is holding its previous state and should not get updated, the clock will be disabled by the enable signal.

Fig. 5. Power, Area & Delay Comparisons between two radiation-hardened sequential cells: TMR cell *vs.* the proposed DMR with Recovery cell. Panel (b): Area & Delay Comparisons.

Conventional clock-gating schemes use a latch to provide a glitch-free gated clock to a number of flip-flops, as depicted in Fig. 7. This adds more state-holding elements to the design, with the same radiation susceptibility as the flip-flops. A particle hit on one of these clock-gating latches can create an SEU on the latch, which can override the required enable signal status and update the stored values of the flip-flops during a clock cycle in which they must hold their previous values (or, conversely, prevent an update in a cycle in which they must be updated).

Fig. 6. Spice-Level Simulation. Although one of the flip-flop outputs, q1, is almost destroyed by an SEU, the main output *ff-out* is still correct.

Fig. 7. Conventional Clock-Gating Scheme.

The conventional solution is to use the TMR scheme on the clock-gating latches as well. This imposes a 3.2x overhead in terms of area and power, plus the performance overhead due to the majority voting circuit. Since clock gating is a special case, an alternative hardening technique is our proposed SEU-tolerant clock-gating scheme, shown in Fig. 8. A conventional TMR clock-gating scheme uses three latches with the majority voting circuit. In our case, using the 65nm standard cell library, a TMR clock-gating latch contains 65 transistors, whereas the proposed scheme can be implemented using 27 transistors. This imposes less than 50% area-power overhead compared to the TMR version. Moreover, there is no considerable delay overhead, because the scheme does not have a majority voting circuit.
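The "less than 50%" statement is consistent with the ratio of the two transistor counts quoted above. As before, this is only arithmetic on the stated counts and a proxy for the actual area and power figures:

$$\frac{27}{65} \approx 0.42$$

That is, the proposed clock-gating scheme uses roughly 42% of the transistors of the TMR clock-gating latch.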



Fig. 8. Proposed SEU-tolerant Clock-Gating Scheme

The proposed clock-gating scheme comprises two active-high latches and one 3-input AND gate, as depicted in Fig. 8. Two different scenarios exist:

- Scenario 1: The SEU occurs when the Enable signal must be '0'. Because the controlling value of the AND gate is '0', an SEU on one of the latches that changes '0' to '1' has no impact. This scheme guarantees that no single SEU can activate the gated-clock signal; therefore, in 100% of cases where the enable signal should be '0' it remains '0', and an SEU on either latch cannot corrupt the flip-flop data through unwanted activation of the gated-clock signal "CLK-G", as shown in Fig. 9(a).
- Scenario 2: The SEU occurs when the Enable signal must be '1'. Since the controlling value of the AND gate is '0', an SEU on one of the latches that flips '1' to '0' forces the gated-clock signal low. This causes the gated-clock signal "CLK-G" connected to the flip-flops to have a narrower high phase, depending on when the SEU occurs during the clock cycle, as shown in Fig. 9(b). Our Spice-level simulations using 65nm technology show that this leads to data corruption on the flip-flop in less than 1% of cases. For instance, in a worst-case scenario where the SEU occurs right at the rising edge of the clock signal, the gated-clock signal is just a very narrow pulse that looks like a glitch (Fig. 10), but the flip-flop still gets updated properly.
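The two scenarios above can be reproduced with a small logic-level sketch. It assumes that the three AND inputs are the clock and the two latch outputs, which is an inference from the description since Fig. 8 is not reproduced here. The names `gated_clock` and `clk_g_waveform` are hypothetical, and an SEU is modelled as inverting one latch output from a given sample until the end of the cycle.

```python
def gated_clock(clk, latch_a, latch_b):
    """3-input AND: CLK-G is high only if the clock and both latches are high."""
    return clk & latch_a & latch_b


def clk_g_waveform(clk_samples, enable, seu_at=None, struck="a"):
    """Sample CLK-G over one clock cycle (hypothetical model).

    Both latches are assumed to hold `enable`; an SEU at sample index
    `seu_at` inverts the output of the struck latch from then on.
    """
    out = []
    for i, clk in enumerate(clk_samples):
        a = b = enable
        if seu_at is not None and i >= seu_at:
            if struck == "a":
                a ^= 1
            else:
                b ^= 1
        out.append(gated_clock(clk, a, b))
    return out


clk = [1, 1, 1, 1, 0, 0, 0, 0]  # one clock cycle: high phase, then low phase
# Scenario 1: Enable is '0'; an SEU flips one latch to '1' but CLK-G stays low.
print(clk_g_waveform(clk, enable=0, seu_at=1))
# Scenario 2: Enable is '1'; an SEU flips one latch to '0' and narrows the pulse.
print(clk_g_waveform(clk, enable=1, seu_at=2))
```

Whether the narrowed pulse in Scenario 2 still updates the flip-flop depends on its minimum pulse-width requirement, which this logic-level model does not capture.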

Note that this section focuses only on RHBD clock gating. The flip-flops connected to this scheme need their own radiation-hardening protection.



Fig. 9. Timing Diagram for the proposed SEU-tolerant Clock-Gating Scheme



Fig. 10. SEU-tolerant Clock-Gating Scheme: A worst case scenario - The clock signal is almost destroyed by the SEU, but the flip-flop still gets updated properly, with a slightly longer clock-to-q delay.

## V. CONCLUSION

In this paper we presented a novel gate-level technique for designing radiation-hardened sequential cells. The approach is based on DMR with error recovery and results in 30% less area and power overhead compared to TMR sequential cells. We also presented a novel technique for designing a radiation-hardened clock-gating scheme, which results in less than 50% area and power overhead and no performance overhead compared to the TMR version. Since we use conventional standard cell libraries and EDA tools to apply these techniques, no additional modifications or custom-made libraries or tools are needed. Our Spice-level simulations show that these methods are statistically able to recover from 99% of SEU errors.

## REFERENCES

- [1] T.C. May and M.H. Woods, "A New Physical Mechanism for Soft Errors in Dynamic Memories", *16th Annual Symposium on Reliability Physics*, pp. 33-40, April 1978.
- [2] K.L. Bedingfield, R.D. Leach and M.B. Alexander, "Spacecraft system failures and anomalies attributed to the natural space environment", *NASA reference publication 1390*, August 1996.
- [3] JEDEC Standard JESD89A, "Measurement and reporting of alpha particle and terrestrial cosmic ray-induced soft errors in semiconductor devices", October 2006.
- [4] G. Anelli, M. Campbell, M. Delmastro, F. Faccio, S. Floria, A. Giraldo, E. Heijne, P. Jarron, K. Kloukinas, A. Marchioro, P. Moreira and W. Snoeys, "Radiation Tolerant VLSI Circuits in Standard Deep Submicron CMOS Technologies for the LHC Experiments: Practical Design Aspects", *IEEE Trans. Nucl. Science*, Vol. 46, No. 6, pp. 1690-1696, December 1999.
- [5] W. Snoeys et al., "Layout techniques to enhance the radiation tolerance of standard CMOS technologies demonstrated on a pixel detector readout chip", Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, Volume 439, Issues 2-3, pp. 349-360, January 2000.
- [6] F. Faccio, K. Kloukinas, G. Magazzu and A. Marchioro, "SEU effects in registers and in a Dual-Ported Static RAM designed in a 0.25um CMOS technology for applications in the LHC", *in the proceedings of the Fifth Workshop on Electronics for LHC Experiments*, Snowmass, September 1999.
- [7] R. Velazco, D. Bessot, S. Duzellier, R. Ecoffet and R. Koga, "Two CMOS Memory Cells Suitable for the Design of SEU-Tolerant VLSI Circuits", *IEEE Trans. Nucl. Science*, Vol. 41, No. 6, pp. 2229, December 1994.
- [8] T. Calin, M. Nicolaidis and R. Velazco, "Upset Hardened Memory Design for Submicron CMOS Technology", *IEEE Trans. Nucl. Science*, Vol. 43, No. 6, pp. 2874, December 1996.
- [9] R. Velazco, P. Fouillat and R. Reis, "Radiation Effects on Embedded Systems", Springer, 2010.
- [10] L. Entrena, C. Lopez and E. Olias, "Automatic insertion of fault-tolerant structures at the RT level", *Seventh International On-Line Testing Workshop Proceedings*, pp. 48-50, 2001.
- [11] H. Liang, P. Mishra and K. Wu, "Error Correction On-Demand: A Low Power Register Transfer Level Concurrent Error Correction Technique", *IEEE Transactions on Computers*, Vol. 56, No. 2, pp. 243-252, February 2007.
- [12] S.S. Mukherjee, M. Kontz and S.K. Reinhardt, "Detailed design and evaluation of redundant multi-threading alternatives", 29th Annual International Symposium on Computer Architecture Proceedings, pp. 99-110, 2002.
- [13] C. Wang, H. Kim, Y. Wu and V. Ying, "Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection", *International Symposium on Code Generation and Optimization CGO '07*, pp. 244-258, March 2007.
- [14] M. Rebaudengo, M. Sonza Reorda, M. Torchiano and M. Violante, "Soft-error detection through software fault-tolerance techniques", *Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems*, pp. 210-218, 1999.
- [15] N. Oh and E.J. McCluskey, "Error detection by selective procedure call duplication for low energy consumption", *IEEE Transactions on Reliability*, pp. 392-402, 2002.
- [16] H. Engel, "Data flow transformations to detect results which are corrupted by hardware faults", *Proceedings of the IEEE High-Assurance System Engineering Workshop*, pp. 279-285, 1997.
- [17] M.P. Baze, J.C. Killens, R.A. Paup and W.P. Snapp, "SEU Hardening Techniques for Retargetable, Scalable, Sub-Micron Digital Circuits and Libraries", Thirteenth Biennial Single Event Effects Symposium, Manhattan Beach, Available: www.klabs.org/DEI/References/Radiation/baze\_see\_mit\_seesymp02.pdf, April 2002 [March 2011].
- [18] M. Haghi, J. Draper, "The 90 nm Double-DICE storage element to reduce Single-Event upsets", 52nd IEEE International Midwest Symposium on Circuits and Systems, pp. 463-466, August 2009.