research-article

Efficient fault tolerance in multi-media applications through selective instruction replication

Published: 05 May 2008

ABSTRACT

As voltages decrease, soft errors are expected to become an increasing problem in maintaining program correctness. Unfortunately, previous mechanisms to improve processor reliability protect all processor instructions equally, causing such approaches to suffer from significant performance degradation and/or substantial hardware overhead. However, recent research has shown that in multimedia applications such as photography, video, and audio, not all instructions are created equal: many operations prove to be far more tolerant to faults than others [1].

This observation can be leveraged to limit the cost of reliable computing by protecting only those instructions that are critical to correct execution. We propose a mechanism that protects against soft errors through selective instruction replication. We begin with a dynamic instruction replication framework that replicates every instruction and checks the results upon commit, rolling back whenever the results are inconsistent. Rather than replicating the entire program, we leave unprotected those instructions that the compiler identifies as tolerant to error. While full replication incurs 40% to 100% overhead, our mechanism incurs only 30% to 75% overhead, reducing the overhead by 15-33% with minimal additional hardware. The cost of this approach is only 0.5-1% fidelity degradation.
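The replicate, compare-on-commit, and roll-back flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware mechanism: the `Instr` record, its `tolerant` flag (standing in for the compiler's marking), the XOR-based fault model, and the `run` function are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from operator import add, mul
from typing import Callable, Dict, List, Tuple


@dataclass
class Instr:
    """One toy instruction. `tolerant` is a hypothetical compiler-provided tag."""
    op: Callable[[int, int], int]
    a: int
    b: int
    tolerant: bool


def execute(instr: Instr, fault_mask: int = 0) -> int:
    """Run the instruction once; XOR-ing a mask into the result models a bit flip."""
    return instr.op(instr.a, instr.b) ^ fault_mask


def run(program: List[Instr],
        faults: Dict[Tuple[int, int], int]) -> Tuple[List[int], int]:
    """Execute `program`, replicating only the non-tolerant instructions.

    `faults` maps (instruction index, copy number) to a bit mask applied to
    that copy's first execution. Returns the committed results and the total
    number of executions, a rough proxy for replication overhead.
    """
    results: List[int] = []
    executions = 0
    for i, instr in enumerate(program):
        primary = execute(instr, faults.get((i, 0), 0))
        executions += 1
        if not instr.tolerant:
            replica = execute(instr, faults.get((i, 1), 0))
            executions += 1
            if primary != replica:
                # Mismatch at commit: roll back and re-execute both copies.
                # Assumption: the fault is transient, so the retry is clean.
                primary = execute(instr)
                replica = execute(instr)
                executions += 2
        results.append(primary)
    return results, executions


program = [
    Instr(add, 2, 3, tolerant=False),  # protected: replicated and checked
    Instr(mul, 4, 5, tolerant=True),   # unprotected: a fault slips through
    Instr(add, 1, 1, tolerant=False),
]
# Inject one fault into a protected instruction and one into a tolerant one.
committed, executions = run(program, {(0, 0): 0b100, (1, 0): 0b1})
print(committed, executions)  # [5, 21, 2] 7
```

In this toy run the fault in the protected instruction is detected and corrected, while the fault in the tolerant instruction commits a slightly wrong value (21 instead of 20). That is the trade the abstract describes: fewer replicated executions in exchange for a small loss of output fidelity.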

References

  1. D. Thaker, D. Franklin, V. Akella, and F. T. Chong, "Reliability requirements of control, address, and data operations in error-tolerant applications," Proceedings of the Workshop on Architectural Reliability, held in conjunction with MICRO-2005, December 2005.
  2. Cohen, T. S. Sriram, N. Leland, D. Moyer, S. Butler, and R. Flatley, "Soft error considerations for deep-submicron CMOS circuit applications," IEEE International Electron Devices Meeting: Technical Digest, pp. 315--319, December 1999.
  3. F. Ziegler, "Terrestrial cosmic rays," IBM Journal of Research and Development, vol. 40, pp. 19--39, January 1996.
  4. J. von Neumann, "Probabilistic logic and the synthesis of reliable organisms from unreliable components," Automata Studies, Ann. of Math. Studies, vol. 34, pp. 43--98, 1956.
  5. Li and D. Yeung. Application-level correctness and its impact on fault tolerance. In Proceedings of the 13th International Symposium on High-Performance Computer Architecture, February 2007.
  6. G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, D. I. August, and S. S. Mukherjee. Software-controlled fault tolerance. ACM Transactions on Architecture and Code Optimization, 2(4):366--396, Dec 2005.
  7. Thaker, D. Franklin, J. Oliver, S. Biswas, D. Lockhart, T. Metodi, and F. Chong. Characterization of error-tolerant applications when protecting control data. In IISWC '06: Proceedings of the IEEE International Symposium on Workload Characterization, San Jose, CA, USA, 2006. IEEE Computer Society.
  8. Reinhardt, S. K. and Mukherjee, S. S. 2000. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture (Vancouver, British Columbia, Canada). ISCA '00. ACM Press, New York, NY, 25--36.
  9. Gomaa, M., Scarbrough, C., Vijaykumar, T. N., and Pomeranz, I. 2003. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture (San Diego, California, June 09 - 11, 2003). ISCA '03. ACM Press, New York, NY, 98--109.
  10. Sundaramoorthy, K., Purser, Z., and Rotenberg, E. 2000. Slipstream processors: improving both performance and fault tolerance. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (Cambridge, Massachusetts, United States). ASPLOS-IX. ACM Press, New York, NY, 257--268.
  11. Reddy, V. K., Rotenberg, E., and Parthasarathy, S. 2006. Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance. SIGARCH Comput. Archit. News 34, 5 (Oct. 2006), 83--94.
  12. Weaver, C. and Austin, T. M. 2001. A Fault Tolerant Approach to Microprocessor Design. In Proceedings of the 2001 International Conference on Dependable Systems and Networks (Formerly: FTCS) (July 01 - 04, 2001).
  13. Vijaykumar, T. N., Pomeranz, I., and Cheng, K. 2002. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture (Anchorage, Alaska, May 25 - 29, 2002). International Conference on Computer Architecture. IEEE Computer Society, Washington, DC, 87--98.
  14. Mukherjee, S. S., Kontz, M., and Reinhardt, S. K. 2002. Detailed design and evaluation of redundant multithreading alternatives. SIGARCH Comput. Archit. News 30, 2 (May 2002), 99--110.
  15. Rotenberg, E. 1999. AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors. In Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (June 15 - 18, 1999). FTCS. IEEE Computer Society, Washington, DC, 84.
  16. Ray, J., Hoe, J. C., and Falsafi, B. 2001. Dual use of superscalar datapath for transient-fault detection and recovery. In Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture (Austin, Texas, December 01 - 05, 2001). International Symposium on Microarchitecture. IEEE Computer Society, Washington, DC, 214--224.
  17. D. C. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," Technical Report CS-TR-1997--1342, University of Wisconsin, Madison, June 1997.
  18. Jedidiah R. Crandall, Frederic T. Chong: "Minos: Control Data Attack Prevention Orthogonal to Memory Model." MICRO 2004: 221--232.
  19. Randy Allen, Ken Kennedy, "Optimizing Compilers for Modern Architectures: A Dependence-based Approach", Morgan Kaufmann, 2001.
  20. Sumeet Kumar and Aneesh Aggarwal, "Self-Checking Instructions: Reducing Instruction Redundancy for Concurrent Error Detection". PACT '06.
  21. Gomaa and T. N. Vijaykumar. Opportunistic transient-fault detection. 32nd International Symposium on Computer Architecture, pp. 172--183, June 2005.

      • Published in

        WREFT '08: Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies
        May 2008
        46 pages
        ISBN:9781605580920
        DOI:10.1145/1366224

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        WREFT '08 Paper Acceptance Rate: 5 of 7 submissions, 71%. Overall Acceptance Rate: 5 of 7 submissions, 71%.
