ABSTRACT
Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. However, with continued technology scaling, the underlying hardware is becoming increasingly unreliable. For many applications, merely recovering from errors is not enough: the latency from the occurrence of a fault to its detection and recovery, i.e., in-time error resilience, is of vital importance. This is especially true for real-time applications, where the timing of application events is a crucial part of the correctness of the application. Software techniques for resilience are highly desirable because they can be applied flexibly, but achieving reliable, in-time software resilience is still an elusive goal. A new class of recent techniques has started to tackle this problem. This paper presents a succinct overview of existing software resilience techniques from the point of view of in-time resilience, and points out future challenges.
- Software Approaches for In-time Resilience