ABSTRACT
Applying error recovery uniformly to every instruction can either violate the real-time constraint or exceed the power/energy envelope. Neither violation is acceptable in embedded system design, which demands a highly efficient realization of a given application. In this paper, we propose a HW/SW methodology that exploits both application-specific characteristics and spatial/temporal redundancy. Our methodology combines design-time and runtime optimizations so that the resulting embedded processor can perform runtime-adaptive error recovery that precisely targets the instruction executions most critical to reliability. The proposed error recovery functionality can dynamically 1) evaluate the cost of reliability (in terms of execution time and dynamic power), 2) determine the most profitable scheme, and 3) adapt to the corresponding error recovery scheme, which is composed of spatial and temporal redundancy-based error recovery operations. Experimental results show that, compared to the state of the art, our methodology achieves up to fifty times greater reliability while still meeting the execution-time and power constraints.
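The three runtime steps named in the abstract (evaluate cost, determine the most profitable scheme, adapt) can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Scheme` fields, the `pick_scheme` function, the numeric overheads, and the choice of "highest coverage among schemes that fit the remaining time and power slack" as the profitability criterion are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Scheme:
    """Hypothetical description of one error recovery scheme."""
    name: str
    time_overhead: float   # extra execution time (arbitrary units, assumed)
    power_overhead: float  # extra dynamic power (arbitrary units, assumed)
    coverage: float        # assumed fraction of errors recovered, 0..1


def pick_scheme(
    schemes: Sequence[Scheme], time_slack: float, power_slack: float
) -> Optional[Scheme]:
    """Return the feasible scheme with the highest coverage, or None.

    Step 1 (evaluate): keep only schemes whose overheads fit the slack.
    Step 2 (determine): among those, take the highest assumed coverage.
    Step 3 (adapt) is left to the caller, which switches to the result.
    """
    feasible = [
        s for s in schemes
        if s.time_overhead <= time_slack and s.power_overhead <= power_slack
    ]
    if not feasible:
        return None  # no scheme fits the current budgets
    return max(feasible, key=lambda s: s.coverage)


# Illustrative values only: temporal redundancy costs time, spatial costs power.
SCHEMES = [
    Scheme("temporal", time_overhead=2.0, power_overhead=0.2, coverage=0.90),
    Scheme("spatial", time_overhead=0.1, power_overhead=1.5, coverage=0.95),
]
```

With these assumed numbers, a tight time slack and a generous power slack select the spatial scheme, while the opposite budgets select the temporal one; when neither fits, the function returns `None` and the caller would run without recovery.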
Index Terms
RASTER: runtime adaptive spatial/temporal error resiliency for embedded processors