research-article

Software Approaches for In-time Resilience

Authors: Aviral Shrivastava (Arizona State University), Moslem Didehban (Cadence Design Systems)

DAC '19: Proceedings of the 56th Annual Design Automation Conference 2019, June 2019, Article No.: 197, Pages 1–4, https://doi.org/10.1145/3316781.3323487

Published: 02 June 2019


ABSTRACT

Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, due to unabated scaling, the unreliability of the underlying hardware is only getting worse. For many applications, just recovering from errors is not enough: the latency from the occurrence of a fault to its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of the correctness of the application. Software techniques for resilience are highly desirable since they can be flexibly applied, but achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience, and points out future challenges.



Published in

DAC '19: Proceedings of the 56th Annual Design Automation Conference 2019
June 2019
1378 pages
ISBN:9781450367257
DOI:10.1145/3316781

Copyright © 2019 ACM
Publisher: Association for Computing Machinery, New York, NY, United States