Epipe: A low-cost fault-tolerance technique considering WCET constraints

https://doi.org/10.1016/j.sysarc.2013.06.003

Abstract

Transient faults will soon become a critical reliability concern for processors used in mainstream computing. As the mainstream commodity market accepts only low-cost solutions for transient-fault tolerance, traditional high-end solutions are not acceptable due to their prohibitive costs. This paper presents Epipe, a hybrid software/hardware solution that provides sufficient fault coverage with affordable overhead for mainstream commodity systems. Given a program, Epipe uses compile-time analysis to identify its vulnerable instructions (VIs), i.e., the ones that may cause silent data corruptions (SDCs), and selects a subset of VIs to protect considering worst-case execution time (WCET) constraints in the fault-free execution. During program execution on a modified superscalar processor which incurs minimal hardware overhead, Epipe relies on selective instruction replication to handle the VI-induced SDCs and an existing exception detector to tolerate the remaining faults that manifest as system exceptions. Our experimental results show that Epipe provides sufficient fault coverage under some tight WCET constraints and increasingly higher coverage under more relaxed WCET constraints. As the WCET allowance increases from 5% to 15% and then to 25%, the average coverage increases from 70.8% to 80% and then to 86.6%. Unlike existing hybrid solutions, Epipe is the first to respect WCET constraints, which are an important concern for real-time systems.

Introduction

Architectural trends toward smaller transistors, lower core voltage and higher frequency make transient faults a more critical reliability concern than ever before. Transient faults, also known as soft errors, that occur during the execution of a program can have many causes: external events, such as high-energy particle strikes, that change the logic values of latches or logic structures, as well as internal events, including coupling, leakage, power supply noise, and temporal circuit variations. Although the duration of a transient fault is very short, it may lead to disastrous consequences. In 2000, Sun Microsystems acknowledged that cosmic rays interfered with cache memories and caused crashes in server systems at major customer sites, including America Online, eBay, and dozens of others [1]. In 2004, Cypress Semiconductor reported a number of incidents arising from soft errors [2]. A recent study shows that a BlueGene/L machine with 104 nodes deployed in Lawrence Livermore National Labs experiences soft errors once every 4 h [3].

Historically, transient faults were of concern for those designing (high availability) systems used in electronics-hostile environments such as outer space. Traditional solutions include employing hardware redundancy or special hardware checkers (or watchdog mechanisms) to detect faults. The gold standards in this space have been the IBM S/360 (now Z-series servers) [4] and the HP NonStop systems [5]. There are also many others such as Boeing 777 airplanes [6] and DIVA [7]. While these techniques can provide high reliability, they also introduce excessive overheads in terms of both chip area and power required for redundant computation.

Given that the reliability per bit is estimated to drop 8% per generation of processors [8], transient faults are also forecast to be a problem for the mainstream commodity market, including industries such as mobile electronic transactions and mobile navigation. The design constraints of computer systems in this market differ substantially from those in high-end systems [9], [10]. There is a growing recognition that a wide spectrum of the commodity space will only accept much lower cost solutions (in area, power, and performance). In this context, traditional high-end solutions are not acceptable for the mainstream commodity market due to their prohibitive costs. On the other hand, this market does not pursue perfect reliability, regularly tolerating occasional crashes of systems. The key challenge facing the mainstream commodity market is to provide just enough coverage of transient faults at low cost while guaranteeing certain performance constraints.

Existing fault-tolerance techniques for dealing with transient faults are either hardware-based [5], [11], [7], [12], [13], which can be costly in dollars, or software-based [14], [15], [16], which can cause significant performance slowdowns, or hybrids [17], [18], [19]. There is also research [20] on application-level correctness, which attempts to reduce the fault-tolerance overhead by relaxing reliability requirements in specific applications for which 100% numerically correct results are not necessary (such as artificial intelligence applications). Despite advances made on achieving low-cost fault tolerance, these solutions cannot yet be deployed in commodity systems because they provide limited reliability, incur considerable hardware cost or performance penalty, or apply to specific applications only. Moreover, many mainstream commodity products are real-time systems, which also require constraints on worst-case execution time (WCET) to be respected. Examples of such systems include multimedia systems, monitoring apparatuses, virtual reality systems, telecommunication networks, and interactive computer games. To the best of our knowledge, such constraints have never been taken into account in these existing techniques.

This paper presents a hybrid approach called Epipe, which aims at addressing transient faults that occur in single-threaded programs running on commodity systems while respecting WCET constraints. According to prior work [13], [21], [22], [23], most of the faults are inherently masked at various levels of the hardware-software hierarchy, from the circuit and micro-architectural level up through the application level, and thus will not affect program outputs. For example, a fault occurring in one register operand ($r1 or $r2) of a branch instruction bne $r1, $r2, L1 can be masked by its value comparison operation with high probability. The unmasked faults mainly induce system exceptions (such as memory protection exception and illegal opcode) or silently impact program outputs in the form of silent data corruption (SDC) errors, where an error induces erroneous outputs without any error being logged. The masked faults can be ignored since they do not ultimately propagate to user-visible corruptions. The unmasked faults that eventually trigger system exceptions can be handled easily in hardware.
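The masking effect in the `bne` example can be illustrated with a small fault-injection sketch. It is only an illustration under simplifying assumptions: it models a single bit flip in one 32-bit operand and nothing else, and the helper names `bne_taken` and `masked_fraction` are hypothetical, not part of Epipe.

```python
def bne_taken(r1: int, r2: int) -> bool:
    """Outcome of `bne $r1, $r2, L1`: the branch is taken when the operands differ."""
    return r1 != r2


def masked_fraction(r1: int, r2: int, bits: int = 32) -> float:
    """Flip each bit of r1 in turn and return the fraction of flips that
    leave the branch outcome unchanged, i.e. the fault is masked."""
    golden = bne_taken(r1, r2)  # fault-free outcome
    masked = sum(bne_taken(r1 ^ (1 << b), r2) == golden for b in range(bits))
    return masked / bits


# Operands that differ in two bits: no single flip can make them equal.
print(masked_fraction(5, 9))   # 1.0
# Operands that differ in exactly one bit: only flipping that bit changes the outcome.
print(masked_fraction(5, 4))   # 0.96875
# Equal operands: every flip turns "not taken" into "taken", so nothing is masked.
print(masked_fraction(7, 7))   # 0.0
```

When the operands already differ, almost every single-bit fault leaves the comparison result unchanged, which is the high-probability masking described above. The equal-operand case shows that masking is not guaranteed.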

Therefore, the key insight behind Epipe is that (unmasked) faults can be addressed more efficiently using different strategies according to their manifested behaviors. The faults that cause system exceptions can be detected and restored with little cost by leveraging existing exception detection and checkpoint mechanisms provided by modern superscalar processors. To deal with SDCs effectively, Epipe performs compile-time analysis to identify the vulnerable instructions (VIs), i.e., the ones that are likely to cause SDCs in a program, and solves an integer linear programming (ILP) formulation to select a set of VIs to protect under WCET constraints. During program execution on a modified superscalar processor which incurs minimal hardware overhead, Epipe relies on selective instruction replication to detect and recover from the VI-induced SDCs. As a result, Epipe can provide sufficient fault coverage with affordable overhead for low-end systems while satisfying some given WCET constraints in the fault-free execution. For simplicity, in the remainder of the paper, we refer to WCET constraints in the fault-free execution as WCET constraints.
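The selection step can be pictured as a 0/1 optimization: choose the VIs whose protection covers the most SDC risk without pushing the WCET past its allowance. The sketch below is not the paper's ILP formulation, which this excerpt does not show. It assumes that each VI adds a fixed, additive cost to the WCET when replicated, uses exhaustive search instead of an ILP solver, and runs on made-up numbers. The names `VI`, `sdc_weight`, `wcet_cost` and `select_vis` are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class VI(NamedTuple):
    name: str
    sdc_weight: int  # assumed share of SDC risk covered if protected (percentage points)
    wcet_cost: int   # assumed extra worst-case cycles if replicated


def select_vis(vis, baseline_wcet, allowance_pct):
    """Pick the subset of VIs with maximum total sdc_weight whose total
    wcet_cost fits within allowance_pct percent of the baseline WCET."""
    budget = baseline_wcet * allowance_pct // 100
    best, best_weight = (), 0
    for k in range(len(vis) + 1):
        for subset in combinations(vis, k):
            cost = sum(v.wcet_cost for v in subset)
            weight = sum(v.sdc_weight for v in subset)
            if cost <= budget and weight > best_weight:
                best, best_weight = subset, weight
    return [v.name for v in best], best_weight


# Illustrative data only; not taken from the paper's experiments.
vis = [VI("add1", 40, 30), VI("mul1", 30, 20), VI("ld1", 20, 15), VI("sub1", 10, 40)]
for pct in (5, 15, 25):
    print(pct, select_vis(vis, baseline_wcet=400, allowance_pct=pct))
# 5 (['mul1'], 30)
# 15 (['add1', 'mul1'], 70)
# 25 (['add1', 'mul1', 'ld1'], 90)
```

Even on this toy input, relaxing the allowance lets more VIs fit within the budget, so the covered weight grows. Exhaustive search is exponential in the number of VIs, which is why a solver-based formulation is the practical choice for real programs.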

This paper makes the following contributions:

  • A hybrid software/hardware approach that provides sufficient fault coverage with affordable overhead for commodity processors, by handling SDCs using selective instruction replication in both software and hardware and exception-causing faults in hardware.

  • A superscalar pipeline modified to support selective instruction replication and efficient checkpointing-based program execution and recovery.

  • An ILP-based formulation that selects SDC-inducing VIs to protect by trading off reliability and performance overhead subject to WCET constraints, representing the first to do so in handling SDCs.

  • An experimental evaluation demonstrating that Epipe can provide sufficient fault coverage under some tight WCET constraints and increasingly higher coverage under more relaxed WCET constraints. In particular, as the WCET allowance increases from 5% to 15% and then to 25%, the fault coverage achieved on average increases from 70.8% to 80% and then to 86.6%.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 states the assumptions made. Section 4 describes our Epipe approach. Section 5 presents our experimental results and analysis. Section 6 concludes the paper.

Related work

This section examines Epipe in the context of the prior research in the area. As listed in Table 1, existing representative fault-tolerance techniques fall broadly into three categories: hardware-based, software-based and hybrids. The solutions in each category are reviewed separately below.

The Epipe system assumptions

There have been many effective techniques to protect on-chip memory structures, such as ECC [27] and parity checks. However, how to protect the unstructured control logic that exists within a modern processor pipeline remains an open problem. The amount of chip area devoted to such general logic increases with chip complexity, and consequently, the effect of transient faults through combinational logic networks and pipeline latches is of particular concern [13]. The sphere of replication (SoR) [28] is

The Epipe approach

This work aims to provide a hybrid fault-tolerance approach with affordable performance overhead for commodity systems by considering WCET constraints. To this end, we distinguish exception-causing faults from SDCs so that the former can be handled efficiently in hardware and the latter efficiently in both software and hardware combined.
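The split between the two fault classes can be mimicked in software. The sketch below is an analogy only, not the modified pipeline: an exception-causing fault triggers a rollback to a checkpoint, while a protected operation is executed twice and rolled back when the two results disagree. Treating a replica mismatch as a rollback trigger is an assumption made here, and all names (`FaultException`, `run_with_recovery`, `make_faulty_add`) are hypothetical.

```python
class FaultException(Exception):
    """Stands in for a hardware exception such as an illegal opcode."""


def run_with_recovery(op, state, protected, max_retries=3):
    """Run op on a copy of the checkpointed state.
    Exceptions and (for protected ops) replica mismatches cause a rollback."""
    checkpoint = dict(state)  # snapshot taken before execution
    for _ in range(max_retries):
        try:
            result = op(dict(checkpoint))
            if protected and op(dict(checkpoint)) != result:
                continue  # replicas disagree: possible SDC, retry from checkpoint
            return result
        except FaultException:
            continue  # exception-causing fault: retry from checkpoint
    raise RuntimeError("fault persisted across retries")


def make_faulty_add(fault_on_call, trap=False):
    """Build r3 = r1 + r2 with one injected transient fault on a chosen call."""
    calls = 0

    def add(state):
        nonlocal calls
        calls += 1
        if calls == fault_on_call and trap:
            raise FaultException("illegal opcode")
        state["r3"] = state["r1"] + state["r2"]
        if calls == fault_on_call:
            state["r3"] ^= 1 << 4  # injected bit flip in the result
        return state

    return add


regs = {"r1": 2, "r2": 3}
print(run_with_recovery(make_faulty_add(1), regs, protected=True)["r3"])    # 5
print(run_with_recovery(make_faulty_add(1), regs, protected=False)["r3"])   # 21
print(run_with_recovery(make_faulty_add(1, trap=True), regs, protected=False)["r3"])  # 5
```

The unprotected run returns the corrupted value without any signal, which is what makes SDCs the costly case: they are caught only where replication is applied, whereas the exception path needs no replication at all.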

We restrict ourselves to single-threaded programs. Like the prior work on instruction replication [14], all modifications to a program at compile time are made

Experimental evaluation

This section presents an evaluation of Epipe on a set of nine benchmarks selected from the Mälardalen WCET benchmark suite [40]. We first describe the experimental setup, then present and analyze runtime overheads, and finally, evaluate the effectiveness of Epipe.

Our experiments have validated a few hypotheses about this work. First, replicating all VIs in a program can be expensive for some applications, making it impossible to satisfy some WCET constraints. Second, Epipe can provide

Conclusion

With the continued evolution of hardware toward smaller feature size, lower voltage, and higher frequency, the reliability challenge today pervades almost the entire computing market. This paper presents Epipe, a hybrid software/hardware reliability solution for mainstream commodity processors that considers WCET constraints. Epipe detects and recovers from transient faults based on hardware mechanisms in a modern superscalar processor with fault-tolerant extensions and achieves fault

Acknowledgements

This research is supported by Australian Research Council Grants (DP 110104628 and DP130101970) and National Natural Science Foundation of China (Grant No. 61202116).

Jianli Li is a Ph.D. student in the School of Computer at the National University of Defense Technology. His research interests are in fault tolerance, embedded computing and compiler optimizations.

References (43)

  • X. Li et al., Chronos: a timing analyzer for embedded software, Science of Computer Programming (2007)
  • R. Baumann, Soft errors in commercial semiconductor technology: overview and scaling trends, IEEE Reliability Physics Tutorial Notes, Reliability Fundamentals (2002)
  • J. Ziegler et al., SER – History, Trends and Challenges: a Guide for Designing with Memory ICs (2004)
  • G. Bronevetsky, B.R. de Supinski, M. Schulz, A foundation for the accurate prediction of the soft error vulnerability...
  • L. Spainhower et al., IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective, IBM Journal of Research and Development (1999)
  • D. Bernick, B. Bruckert, P.D. Vigna, D. Garcia, R. Jardine, J. Klecka, J. Smullen, NonStop advanced architecture, in:...
  • Y. Yeh, Triple–triple redundant 777 primary flight computer, in: AERO 1996, pp....
  • T.M. Austin, DIVA: a reliable substrate for deep submicron microarchitecture design, in: MICRO 1999, pp....
  • S. Borkar, Microarchitecture and design challenges for gigascale integration, in: MICRO...
  • X. Yang et al., The reliability wall for exascale supercomputing, IEEE Transactions on Computers (2012)
  • Y. Zhu, Y. Li, J. Xue, T. Tan, J. Shi, Y. Shen, C. Ma, What is system hang and how to handle it, in: ISSRE, 2012, pp....
  • E. Rotenberg, AR-SMT: a microarchitectural approach to fault tolerance in microprocessors, in: FTCS 1999, pp....
  • A. Meixner, M. Bauer, D. Sorin, Argus: low-cost, comprehensive error detection in simple cores, in: MICRO 2007, pp....
  • N.J. Wang, S.J. Patel, ReStore: symptom based soft error detection in microprocessors, in: DSN 2005, pp....
  • N. Oh et al., Error detection by duplicated instructions in super-scalar processors, IEEE Transactions on Reliability (2002)
  • G. Reis, J. Chang, N. Vachharajani, R. Rangan, D. August, SWIFT: software implemented fault tolerance, in: CGO 2005,...
  • J. Yu, M.J. Garzaran, M. Snir, ESoftCheck: Removal of non-vital checks for fault tolerance, in: CGO 2009, pp....
  • N. Nakka, K. Pattabiraman, R. Iyer, Processor-level selective replication, in: DSN 2007, pp....
  • S. Feng, S. Gupta, A. Ansari, S. Mahlke, Shoestring: probabilistic soft error reliability on the cheap, in: ASPLOS...
  • Y. Zhang, S. Ghosh, J. Huang, J.W. Lee, S.A. Mahlke, D.I. August, Runtime asynchronous fault tolerance via speculation,...
  • X. Li, D. Yeung, Application-level correctness and its impact on fault tolerance, in: HPCA 2007, pp....
    Jingling Xue is a professor in the School of Computer Science and Engineering at the University of New South Wales. He received his B.Eng and M.Eng degrees in Computer Science and Engineering from Tsinghua University in 1984 and 1987, respectively. He received his Ph.D. in Computer Science and Engineering from Edinburgh University in 1992. Professor Xue leads the Programming Languages and Compilers Group and its subgroup Compiler Research Group (CORG) at UNSW. His research interests are programming languages, compiler optimizations, computer architecture, parallel computing, distributed systems and cluster computing, and embedded systems.

    Xinwei Xie is a Ph.D. student in the School of Computer Science and Engineering at the University of New South Wales. His research interests are in parallel computing, advanced compilation, computer architecture and systems, and operating systems.

    Qing Wan is a Ph.D. student in the School of Computer Science and Engineering at the University of New South Wales. His research interests are in software-managed memory allocation.

    Qingping Tan is a professor in the School of Computer at the National University of Defense Technology. His research interests are in software engineering, distributed computing, software reliability, compiler optimizations.

    Lanfang Tan is a Ph.D. student in the School of Computer at the National University of Defense Technology. Her research interests are in software test and fault tolerance verification.