Skip to main content
Log in

Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

  • Published:
Journal of Electronic Testing Aims and scope Submit manuscript

Abstract

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that achieves fault detection/tolerance/diagnosis properties by means of thread replication and re-execution mechanisms. The layer applies the most convenient hardening mechanism to achieve the desired trade-off between reliability and performance by adapting at run-time to the changes of the working scenario. The proposed strategy has been applied in a set of experimental sessions considering a real-world parallel application, to evaluate its benefits on the final system with respect to various strategies selected at design time.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

References

  1. Accelera Systems Initiative: http://www.accellera.org. Accessed 27 Mar 2013

  2. Aggarwal N, Ranganathan P, Jouppi NP, Smith JE (2007) Configurable isolation: building high availability systems with commodity multi-core processors. In: Proceeding international symposium on computer architecture, pp 470–481

  3. Auslander M, Dasilva D, Edelsohn D, Krieger O, Ostrowski M, Rosenburg B, Wisniewski RW, Xenidis J (2002) K42 overview. Tech. rep., IBM T. J. Watson Research Center

  4. Baumann A, Barham P, Dagand PE, Harris T, Isaacs R, Peter S, Roscoe T, Schüpbach A, Singhania A (2009) The multikernel: a new OS architecture for scalable multicore systems. In: Proceeding ACM symposium on operating systems principles (SOSP), pp 29–44, New York

  5. Bolchini C, Miele A, Sciuto D (2012) An adaptive approach for online fault management in many-core architectures. In: Proceeding conference on design, automation and test in Europe (DATE), pp 1429–1432

  6. Chen Z, Yang M, Francia G, Dongarra J (2007) Self adaptive application level fault tolerance for parallel and distributed computing. In: Proceeding international parallel and distributed processing symposium (IPDPS), pp 1–8

  7. ECSS: Methods for the calculation of radiation received and its effects andapolicyfordesignmargins. Tech. Rep. ECSS-E-ST-10-12C European Cooperation for Space Standardization (2008)

  8. Gizopoulos D, Psarakis M, Adve S, Ramachandran P, Hari S, Sorin D, Meixner A, Biswas A, Vera X (2011) Architectures for online error detection and recovery in multicore processors. In: Proceeding conference on design, automation and test in europe (DATE), pp 533–538

  9. Horn P (2001) Autonomic Computing: IBM’s Perspective on the State of Information Technology

  10. Huang J, Blech J, Raabe A, Buckl C, Knoll A (2011) Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems. In: Proceeding international conference Hw/Sw codesign and system synthesis, pp 247–256

  11. International Technology Roadmap for Semiconductors–Emerging Research Devices Section (2010) http://public.itrs.net/. Accessed 27 Mar 2013

  12. Kephart JO, Chess DM (2003) The vision of autonomic computing. IEEE Comput 36:41–50

    Article  Google Scholar 

  13. Kouadri A, Heron O, Montagne R (2011) A lightweight API for an adaptive software fault tolerance using POSIX-thread replication. In: Proceeding international conference on architecture of computing systems (ARCS), pp 16–19

  14. LaFrieda C, Ipek E, Martinez JF, Manohar R (2007) Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: Proceeding conference dependable systems and networks (DSN), pp 317–326

  15. Lattuada M, Pilato C, Tumeo A, Ferrandi F (2009) Performance modeling of parallel applications on MPSoCs. In: Proceeding 11th international conference on system-on-chip (SoC), pp 64–67

  16. Meloni P, Tuveri G, Raffo L, Cannella E, Stefanov T, Derin O, Fiorin L, Sami M (2012) System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach. In: Proceeding EUROMICRO conference digital system design (DSD), pp 517–524

  17. Mukherjee S, Kontz M, Reinhardt S (2002) Detailed design and evaluation of redundant multi-threading alternatives. In: Proc Intl Symp Comput Architecture. 99–110

  18. Normand E (1996) Single event upset at ground level. IEEE Trans Nuclear Sci 43(6):2742–2750

    Article  Google Scholar 

  19. Politecnico di Milano: ReSP web site. http://code.google.com/p/resp-sim/. Accessed 27 Mar 2013

  20. Salehie M, Tahvildari L (2009) Self-adaptive software: Landscape and research challenges. ACM Trans Autonomous and Adaptive Systems 4:14:1–14:42

    Google Scholar 

  21. STMicroelectronics and CEA (2010) Platform 2012: A many-core programmable accelerator for ultra-efficient embedded computing in nanometer technology. In: Research workshop on STMicroelectronics Platform 2012

  22. Teraflux (2011) Definition of ISA extensions, custom devices and external COTSon API extensions. In: Teraflux: Exploiting dataflow parallelism in Tera-device computing

  23. The OpenMP API specification for parallel programming (2011). http://openmp.org/wp/. Accessed 27 Mar 2013

  24. Various Authors (2011) The MIT Angstrom Project: Universal Technologies for Exascale Computing. http://projects.csail.mit.edu/angstrom/. Accessed 27 Mar 2013

  25. Weis S, Garbade A, Wolf J, Fechner B, Mendelson A, Giorgi R, Ungerer T (2011) A fault detection and recovery architecture for a teradevice dataflow system. In: Workshop on data-flow execution models for extreme scale computing (DFM), pp 38–44

  26. Wells PM, Chakraborty K, Sohi GS (2009) Mixed-mode multicore reliability. In: Proceeding international conference architectural support for programming languages and operating systems, pp 169–180

  27. Wirthlin M, Johnson E, Rollins N, Caffrey M, Graham P (2003) The reliability of FPGA circuit designs in the presence of radiation induced configuration upsets. In: Proceeding symposium field-programmable custom computing machines (FCCM), pp 133–142

Download references

Acknowledgment

This work is partially supported by EU-ARTEMIS SMECY project, grant no. 100230.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Antonio Miele.

Additional information

Responsible Editor: D. Gizopoulos

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bolchini, C., Carminati, M. & Miele, A. Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems. J Electron Test 29, 159–175 (2013). https://doi.org/10.1007/s10836-013-5367-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10836-013-5367-y

Keywords

Navigation