Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems

Bolchini, Cristiana; Carminati, Matteo; Miele, Antonio

doi:10.1007/s10836-013-5367-y

Published: 17 April 2013

Volume 29, pages 159–175, (2013)

Journal of Electronic Testing

Cristiana Bolchini¹,
Matteo Carminati¹ &
Antonio Miele¹

Abstract

This paper presents a novel approach to the design of multi-/many-core systems with an adaptive level of reliability. The approach defines a layer at the operating system level that achieves fault detection/tolerance/diagnosis properties by means of thread replication and re-execution mechanisms. The layer applies the most convenient hardening mechanism to achieve the desired trade-off between reliability and performance by adapting at run-time to the changes of the working scenario. The proposed strategy has been applied in a set of experimental sessions considering a real-world parallel application, to evaluate its benefits on the final system with respect to various strategies selected at design time.
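The replication and re-execution mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' operating-system layer: the two schemes shown (duplication with comparison followed by re-execution, and triplication with majority voting), the function names, and the `FlakySquare` fault model are all assumptions introduced here for illustration, and results are assumed to be hashable and comparable with `==`.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicas(task, args, n):
    """Run n replicas of task(*args) in separate threads and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(task, *args) for _ in range(n)]
        return [f.result() for f in futures]


def duplicate_and_reexecute(task, args, max_retries=3):
    """Fault detection: compare two replicas and re-execute both on a mismatch."""
    for _ in range(max_retries + 1):
        first, second = run_replicas(task, args, 2)
        if first == second:
            return first
    raise RuntimeError("replicas kept disagreeing; the fault may be permanent")


def triplicate_and_vote(task, args):
    """Fault tolerance: mask one faulty replica by majority voting over three."""
    results = run_replicas(task, args, 3)
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


class FlakySquare:
    """Hypothetical task: squares x, but corrupts the result on chosen call indices."""

    def __init__(self, faulty_calls):
        self._faulty = set(faulty_calls)
        self._calls = 0
        self._lock = threading.Lock()

    def __call__(self, x):
        # The lock makes the call counter safe when replicas run concurrently.
        with self._lock:
            call = self._calls
            self._calls += 1
        return x * x + 1 if call in self._faulty else x * x


if __name__ == "__main__":
    # One corrupted replica is outvoted by the two correct ones.
    print(triplicate_and_vote(FlakySquare({0}), (4,)))
    # The first pair disagrees, so both replicas are executed again.
    print(duplicate_and_reexecute(FlakySquare({0}), (4,)))
```

The two functions expose the trade-off the abstract refers to: duplication costs two executions in the fault-free case but pays extra latency when a mismatch forces re-execution, whereas triplication always costs three executions and masks a single fault without retrying. A run-time layer could switch between such schemes as the working scenario changes; the selection policy itself is not specified in the abstract and is left out of this sketch.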

Acknowledgment

This work is partially supported by the EU-ARTEMIS SMECY project, grant no. 100230.

Author information

Authors and Affiliations

Dipartimento di Elettronica, Informatica e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milano, Italy
Cristiana Bolchini, Matteo Carminati & Antonio Miele

Corresponding author

Correspondence to Antonio Miele.

Additional information

Responsible Editor: D. Gizopoulos

About this article

Cite this article

Bolchini, C., Carminati, M. & Miele, A. Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems. J Electron Test 29, 159–175 (2013). https://doi.org/10.1007/s10836-013-5367-y

Received: 02 October 2012
Accepted: 07 March 2013
Published: 17 April 2013
Issue Date: April 2013
DOI: https://doi.org/10.1007/s10836-013-5367-y
