Skip to main content

Advertisement

Log in

Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

  • Published:
Journal of Electronic Testing Aims and scope Submit manuscript

Abstract

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that achieves fault detection/tolerance/diagnosis properties by means of thread replication and re-execution mechanisms. The layer applies the most convenient hardening mechanism to achieve the desired trade-off between reliability and performance by adapting at run-time to the changes of the working scenario. The proposed strategy has been applied in a set of experimental sessions considering a real-world parallel application, to evaluate its benefits on the final system with respect to various strategies selected at design time.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

References

  1. Accelera Systems Initiative: http://www.accellera.org. Accessed 27 Mar 2013

  2. Aggarwal N, Ranganathan P, Jouppi NP, Smith JE (2007) Configurable isolation: building high availability systems with commodity multi-core processors. In: Proceeding international symposium on computer architecture, pp 470–481

  3. Auslander M, Dasilva D, Edelsohn D, Krieger O, Ostrowski M, Rosenburg B, Wisniewski RW, Xenidis J (2002) K42 overview. Tech. rep., IBM T. J. Watson Research Center

  4. Baumann A, Barham P, Dagand PE, Harris T, Isaacs R, Peter S, Roscoe T, Schüpbach A, Singhania A (2009) The multikernel: a new OS architecture for scalable multicore systems. In: Proceeding ACM symposium on operating systems principles (SOSP), pp 29–44, New York

  5. Bolchini C, Miele A, Sciuto D (2012) An adaptive approach for online fault management in many-core architectures. In: Proceeding conference on design, automation and test in Europe (DATE), pp 1429–1432

  6. Chen Z, Yang M, Francia G, Dongarra J (2007) Self adaptive application level fault tolerance for parallel and distributed computing. In: Proceeding international parallel and distributed processing symposium (IPDPS), pp 1–8

  7. ECSS: Methods for the calculation of radiation received and its effects andapolicyfordesignmargins. Tech. Rep. ECSS-E-ST-10-12C European Cooperation for Space Standardization (2008)

  8. Gizopoulos D, Psarakis M, Adve S, Ramachandran P, Hari S, Sorin D, Meixner A, Biswas A, Vera X (2011) Architectures for online error detection and recovery in multicore processors. In: Proceeding conference on design, automation and test in europe (DATE), pp 533–538

  9. Horn P (2001) Autonomic Computing: IBM’s Perspective on the State of Information Technology

  10. Huang J, Blech J, Raabe A, Buckl C, Knoll A (2011) Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems. In: Proceeding international conference Hw/Sw codesign and system synthesis, pp 247–256

  11. International Technology Roadmap for Semiconductors–Emerging Research Devices Section (2010) http://public.itrs.net/. Accessed 27 Mar 2013

  12. Kephart JO, Chess DM (2003) The vision of autonomic computing. IEEE Comput 36:41–50

    Article  Google Scholar 

  13. Kouadri A, Heron O, Montagne R (2011) A lightweight API for an adaptive software fault tolerance using POSIX-thread replication. In: Proceeding international conference on architecture of computing systems (ARCS), pp 16–19

  14. LaFrieda C, Ipek E, Martinez JF, Manohar R (2007) Utilizing dynamically coupled cores to form a resilient chip multiprocessor. In: Proceeding conference dependable systems and networks (DSN), pp 317–326

  15. Lattuada M, Pilato C, Tumeo A, Ferrandi F (2009) Performance modeling of parallel applications on MPSoCs. In: Proceeding 11th international conference on system-on-chip (SoC), pp 64–67

  16. Meloni P, Tuveri G, Raffo L, Cannella E, Stefanov T, Derin O, Fiorin L, Sami M (2012) System adaptivity and fault-tolerance in NoC-based MPSoCs: the MADNESS project approach. In: Proceeding EUROMICRO conference digital system design (DSD), pp 517–524

  17. Mukherjee S, Kontz M, Reinhardt S (2002) Detailed design and evaluation of redundant multi-threading alternatives. In: Proc Intl Symp Comput Architecture. 99–110

  18. Normand E (1996) Single event upset at ground level. IEEE Trans Nuclear Sci 43(6):2742–2750

    Article  Google Scholar 

  19. Politecnico di Milano: ReSP web site. http://code.google.com/p/resp-sim/. Accessed 27 Mar 2013

  20. Salehie M, Tahvildari L (2009) Self-adaptive software: Landscape and research challenges. ACM Trans Autonomous and Adaptive Systems 4:14:1–14:42

    Google Scholar 

  21. STMicroelectronics and CEA (2010) Platform 2012: A many-core programmable accelerator for ultra-efficient embedded computing in nanometer technology. In: Research workshop on STMicroelectronics Platform 2012

  22. Teraflux (2011) Definition of ISA extensions, custom devices and external COTSon API extensions. In: Teraflux: Exploiting dataflow parallelism in Tera-device computing

  23. The OpenMP API specification for parallel programming (2011). http://openmp.org/wp/. Accessed 27 Mar 2013

  24. Various Authors (2011) The MIT Angstrom Project: Universal Technologies for Exascale Computing. http://projects.csail.mit.edu/angstrom/. Accessed 27 Mar 2013

  25. Weis S, Garbade A, Wolf J, Fechner B, Mendelson A, Giorgi R, Ungerer T (2011) A fault detection and recovery architecture for a teradevice dataflow system. In: Workshop on data-flow execution models for extreme scale computing (DFM), pp 38–44

  26. Wells PM, Chakraborty K, Sohi GS (2009) Mixed-mode multicore reliability. In: Proceeding international conference architectural support for programming languages and operating systems, pp 169–180

  27. Wirthlin M, Johnson E, Rollins N, Caffrey M, Graham P (2003) The reliability of FPGA circuit designs in the presence of radiation induced configuration upsets. In: Proceeding symposium field-programmable custom computing machines (FCCM), pp 133–142

Download references

Acknowledgment

This work is partially supported by EU-ARTEMIS SMECY project, grant no. 100230.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Antonio Miele.

Additional information

Responsible Editor: D. Gizopoulos

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bolchini, C., Carminati, M. & Miele, A. Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems. J Electron Test 29, 159–175 (2013). https://doi.org/10.1007/s10836-013-5367-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10836-013-5367-y

Keywords