Fault tolerance for multi-threaded applications by leveraging hardware transactional memory

Published: 14 May 2013

Abstract

Providing fault tolerance, especially for mission-critical applications, in order to detect transient and permanent faults and recover from them is one of the main necessities for processor designers. However, fault tolerance for multi-threaded applications incurs high performance degradation due to comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors to an agreed state. In this study, we present FaulTM-multi, a fault tolerance scheme for multi-threaded applications running on transactional memory hardware that reduces this performance degradation. FaulTM-multi decreases the performance degradation of lockstepping, a conventional fault detection scheme, from 23% and 9% to 10% and 2% for lock-based parallel and TM applications, respectively. Also, FaulTM-multi creates 28% fewer checkpoints than Rebound, the state-of-the-art checkpointing scheme.

        Published In

        CF '13: Proceedings of the ACM International Conference on Computing Frontiers
        May 2013
        302 pages
        ISBN: 9781450320535
        DOI: 10.1145/2482767

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. HTM
        2. error detection
        3. error recovery
        4. redundancy

        Qualifiers

        • Research-article

        Conference

        CF'13
        Sponsor:
        CF'13: Computing Frontiers Conference
        May 14 - 16, 2013
        Ischia, Italy

        Acceptance Rates

        CF '13 Paper Acceptance Rate 26 of 49 submissions, 53%;
        Overall Acceptance Rate 273 of 785 submissions, 35%

        Cited By

        • (2021) BROFY: Towards Essential Integrity Protection for Microservices. 2021 40th International Symposium on Reliable Distributed Systems (SRDS), pages 154-163. DOI: 10.1109/SRDS53918.2021.00024. Online publication date: Sep-2021
        • (2019) IgnoreTM: Opportunistically Ignoring Timing Violations for Energy Savings using HTM. 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1571-1574. DOI: 10.23919/DATE.2019.8715139. Online publication date: Mar-2019
        • (2017) Edge-TM. ACM Transactions on Embedded Computing Systems, 16(5s), pages 1-18. DOI: 10.1145/3126556. Online publication date: 27-Sep-2017
        • (2017) Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support. Architecture of Computing Systems - ARCS 2017, pages 16-30. DOI: 10.1007/978-3-319-54999-6_2. Online publication date: 4-Mar-2017
        • (2016) HAFT. Proceedings of the Eleventh European Conference on Computer Systems, pages 1-17. DOI: 10.1145/2901318.2901339. Online publication date: 18-Apr-2016
        • (2015) Playing with Fire. Proceedings of the 25th edition on Great Lakes Symposium on VLSI, pages 9-14. DOI: 10.1145/2742060.2742090. Online publication date: 20-May-2015
        • (2015) Transactional Memory for Reliability. Transactional Memory. Foundations, Algorithms, Tools, and Applications, pages 268-282. DOI: 10.1007/978-3-319-14720-8_13. Online publication date: 2015
        • (2014) Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal, 1(1), pages 5-28. DOI: 10.14529/jsfi140101. Online publication date: 6-Apr-2014
        • (2014) Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margins. Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, pages 248-255. DOI: 10.1109/PDP.2014.61. Online publication date: 12-Feb-2014
