ABSTRACT
As voltages decrease, soft errors are expected to become an increasing problem in maintaining program correctness. Unfortunately, previous mechanisms to improve processor reliability protect all processor instructions equally, causing such approaches to suffer from significant performance degradation and/or substantial hardware overhead. However, recent research has shown that in multimedia applications such as photography, video, and audio, not all instructions are created equal: many operations prove to be far more tolerant to faults than others [1].
This observation can be leveraged to limit the cost of reliable computing by protecting only those instructions that are critical to correct execution. We propose a mechanism that protects against soft errors through selective instruction replication. We begin with a dynamic instruction replication framework that replicates every instruction, checks the results at commit, and rolls back on any mismatch. Rather than replicating the entire program, we leave unprotected those instructions that the compiler identifies as error-tolerant. While full replication incurs 40% to 100% overhead, our mechanism incurs only 30% to 75%, a 15-33% reduction, and requires minimal additional hardware. This approach costs only 0.5-1% in fidelity.
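As a rough illustration of the idea, the following toy Python model executes each protected instruction twice, compares the two results at "commit", and retries on a mismatch, while instructions tagged as tolerant run only once. It is a minimal sketch under stated assumptions, not the paper's hardware mechanism: the names `Instr`, `execute`, and `run`, the `tolerant` tag, the single-bit fault model, and the use of re-execution to stand in for rollback are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Instr:
    name: str
    op: Callable[[int], int]  # pure function applied to the running value
    tolerant: bool            # hypothetical compiler tag: True = leave unprotected


def execute(op: Callable[[int], int], value: int,
            rng: random.Random, fault_rate: float) -> int:
    """Run op once; with probability fault_rate, flip one of the low 8 result bits."""
    result = op(value)
    if rng.random() < fault_rate:
        result ^= 1 << rng.randrange(8)
    return result


def run(program: List[Instr], value: int,
        rng: random.Random, fault_rate: float) -> Tuple[int, int]:
    """Return (final value, number of executions performed)."""
    executions = 0
    for instr in program:
        if instr.tolerant:
            # Unprotected: a fault here silently perturbs the output.
            value = execute(instr.op, value, rng, fault_rate)
            executions += 1
            continue
        while True:
            # Protected: execute twice and compare before committing.
            first = execute(instr.op, value, rng, fault_rate)
            second = execute(instr.op, value, rng, fault_rate)
            executions += 2
            if first == second:
                # Two identical faults would still slip through this check.
                value = first
                break
            # Mismatch: discard both results and retry (stands in for rollback).
    return value, executions


program = [
    Instr("add", lambda v: v + 3, tolerant=False),
    Instr("mul", lambda v: v * 2, tolerant=False),
    Instr("pixel", lambda v: v + 1, tolerant=True),
    Instr("sub", lambda v: v - 4, tolerant=False),
]

fault_free_value, fault_free_execs = run(program, 5, random.Random(0), 0.0)
overhead = fault_free_execs / len(program) - 1.0
```

In this toy model, leaving one of four instructions unprotected lowers the fault-free execution overhead from 100% to 75%. These figures come from the sketch's instruction counts alone and are not a reproduction of the measured overheads reported above.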
- D. Thaker, D. Franklin, V. Akella, and F. T. Chong, "Reliability requirements of control, address, and data operations in error-tolerant applications," Proceedings of the Workshop on Architectural Reliability, held in conjunction with MICRO-2005, December 2005.
- Cohen, T. S. Sriram, N. Leland, D. Moyer, S. Butler, and R. Flatley, "Soft error considerations for deep-submicron CMOS circuit applications," IEEE International Electron Devices Meeting: Technical Digest, pp. 315--319, December 1999.
- F. Ziegler, "Terrestrial cosmic rays," IBM Journal of Research and Development, vol. 40, pp. 19--39, January 1996.
- V. Neumann, "Probabilistic logic and the synthesis of reliable organisms from unreliable components," Automata Studies, Ann. of Math. Studies, vol. 34, pp. 43--98, 1956.
- Li and D. Yeung. Application-level correctness and its impact on fault tolerance. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture, February 2007.
- G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee. Software-controlled fault tolerance. ACM Transactions on Architecture and Code Optimization, 2(4):366--396, Dec 2005.
- D. Thaker, D. Franklin, J. Oliver, S. Biswas, D. Lockhart, T. Metodi, and F. Chong. Characterization of error-tolerant applications when protecting control data. In IISWC '06: Proceedings of the IEEE International Symposium on Workload Characterization, San Jose, CA, USA, 2006. IEEE Computer Society.
- Reinhardt, S. K. and Mukherjee, S. S. 2000. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual international Symposium on Computer Architecture (Vancouver, British Columbia, Canada). ISCA '00. ACM Press, New York, NY, 25--36.
- Gomaa, M., Scarbrough, C., Vijaykumar, T. N., and Pomeranz, I. 2003. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th Annual international Symposium on Computer Architecture (San Diego, California, June 09 - 11, 2003). ISCA '03. ACM Press, New York, NY, 98--109.
- Sundaramoorthy, K., Purser, Z., and Rotenberg, E. 2000. Slipstream processors: improving both performance and fault tolerance. In Proceedings of the Ninth international Conference on Architectural Support For Programming Languages and Operating Systems (Cambridge, Massachusetts, United States). ASPLOS-IX. ACM Press, New York, NY, 257--268.
- Reddy, V. K., Rotenberg, E., and Parthasarathy, S. 2006. Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance. SIGARCH Comput. Archit. News 34, 5 (Oct. 2006), 83--94.
- Weaver, C. and Austin, T. M. 2001. A Fault Tolerant Approach to Microprocessor Design. In Proceedings of the 2001 international Conference on Dependable Systems and Networks (Formerly: Ftcs) (July 01 - 04, 2001).
- Vijaykumar, T. N., Pomeranz, I., and Cheng, K. 2002. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual international Symposium on Computer Architecture (Anchorage, Alaska, May 25 - 29, 2002). International Conference on Computer Architecture. IEEE Computer Society, Washington, DC, 87--98.
- Mukherjee, S. S., Kontz, M., and Reinhardt, S. K. 2002. Detailed design and evaluation of redundant multithreading alternatives. SIGARCH Comput. Archit. News 30, 2 (May. 2002), 99--110.
- Rotenberg, E. 1999. AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors. In Proceedings of the Twenty-Ninth Annual international Symposium on Fault-Tolerant Computing (June 15 - 18, 1999). FTCS. IEEE Computer Society, Washington, DC, 84.
- Ray, J., Hoe, J. C., and Falsafi, B. 2001. Dual use of superscalar datapath for transient-fault detection and recovery. In Proceedings of the 34th Annual ACM/IEEE international Symposium on Microarchitecture (Austin, Texas, December 01 - 05, 2001). International Symposium on Microarchitecture. IEEE Computer Society, Washington, DC, 214--224.
- D. C. Burger and T. M. Austin, "The simplescalar tool set, version 2.0," Technical Report CS-TR-1997--1342, University of Wisconsin, Madison, June 1997.
- Jedidiah R. Crandall, Frederic T. Chong: "Minos: Control Data Attack Prevention Orthogonal to Memory Model." MICRO 2004: 221--232
- Randy Allen, Ken Kennedy, "Optimizing Compilers for Modern Architectures: A Dependence-based Approach", Morgan Kaufmann, 2001.
- Sumeet Kumar and Aneesh Aggarwal, "Self-Checking Instructions: Reducing Instruction Redundancy for Concurrent Error Detection". PACT'06.
- M. Gomaa and T. N. Vijaykumar. Opportunistic transient-fault detection. 32nd International Symposium on Computer Architecture, pp. 172--183, June 2005.
Index Terms
- Efficient fault tolerance in multi-media applications through selective instruction replication