Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach

Li, Xiaowei; Yan, Guihai; Ye, Jing; Wang, Ying

doi:10.1007/s11432-017-9290-4

Research Paper
Published: 24 May 2018

Volume 61, article number 112102, (2018)
Science China Information Sciences

Xiaowei Li^1,2, Guihai Yan^1,2, Jing Ye^1 & Ying Wang^1

Abstract

If your computer crashes, you can usually revive it by rebooting, an empirical fix that tends to work. The rationale is that transient faults, whether in hardware or software, can be cleared by refreshing the machine state. This “silver bullet”, however, could become futile in the future, because the faults, especially those in hardware such as Integrated Circuit (IC) chips, cannot be eliminated by refreshing. What we need is a more sophisticated mechanism to steer the system back onto the right track. The “magic cure” is the Fault Tolerance On-Chip (FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, covering fault detection, fault diagnosis, and error recovery, that works in a self-supportive manner. We have used FTOC to build a holistic solution, ranging from on-chip fault detection to error recovery, that addresses faults caused by progressive chip aging. Beyond fault detection, the FTOC paradigm offers further benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving chip yield.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61532017, 61572470, 61521092, 61522406, 61432017, 61376043), and in part by the Youth Innovation Promotion Association, CAS (Grant No. Y404441000).

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Xiaowei Li, Guihai Yan, Jing Ye & Ying Wang
University of Chinese Academy of Sciences, Beijing, 100049, China
Xiaowei Li & Guihai Yan

Corresponding author

Correspondence to Guihai Yan.

About this article

Cite this article

Li, X., Yan, G., Ye, J. et al. Fault tolerance on-chip: a reliable computing paradigm using self-test, self-diagnosis, and self-repair (3S) approach. Sci. China Inf. Sci. 61, 112102 (2018). https://doi.org/10.1007/s11432-017-9290-4

Received: 03 June 2017
Revised: 06 September 2017
Accepted: 01 November 2017
Published: 24 May 2018
DOI: https://doi.org/10.1007/s11432-017-9290-4
