Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support

  • Conference paper
Architecture of Computing Systems - ARCS 2017 (ARCS 2017)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10172)

Abstract

The demand for fault-tolerant execution on high-performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a software/hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family. We leverage hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback. Redundant execution of processes and signature-based comparison of their computations provide error detection, and transactional wrapping enables error recovery. Existing applications are enhanced towards fault-tolerant redundant execution by post-link binary instrumentation. Hardware enhancements to further increase the applicability of the approach are proposed and evaluated with SPEC CPU 2006 benchmarks. The resulting performance overhead is 47% on average, assuming the existence of the proposed hardware support.
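The detect-and-recover cycle described in the abstract can be illustrated with a small software-only simulation. This is a sketch under stated assumptions, not the authors' implementation: the paper relies on Intel TSX transactions and post-link binary instrumentation of redundant processes, whereas the code below only imitates the idea with copies of a Python dictionary. All names (`signature`, `run_block`, `execute_redundantly`, the `faults` parameter) are hypothetical, and CRC32 merely stands in for whatever signature the real system computes.

```python
import copy
import zlib


def signature(state):
    """CRC32 over a canonical rendering of the state (stand-in for a real signature)."""
    return zlib.crc32(repr(sorted(state.items())).encode())


def run_block(state, block, flip_bit=False):
    """Execute one block on a private copy of the state; optionally corrupt the result."""
    working = copy.deepcopy(state)
    block(working)
    if flip_bit:
        # Simulated transient fault: flip the lowest bit of one value.
        key = sorted(working)[0]
        working[key] ^= 1
    return working


def execute_redundantly(state, blocks, faults=frozenset(), max_retries=3):
    """Run every block twice; commit on matching signatures, else roll back and retry.

    `faults` holds (block_index, attempt) pairs at which the trailing copy is
    corrupted. This is an assumption made purely for fault injection in the demo.
    """
    rollbacks = 0
    for index, block in enumerate(blocks):
        for attempt in range(max_retries + 1):
            leading = run_block(state, block)
            trailing = run_block(state, block, flip_bit=(index, attempt) in faults)
            if signature(leading) == signature(trailing):
                state = leading  # commit: the checkpoint advances
                break
            rollbacks += 1  # abort: `state` was never modified, so this is the rollback
        else:
            raise RuntimeError(f"block {index} failed after {max_retries} retries")
    return state, rollbacks


def add_five(s):
    s["x"] += 5


def double(s):
    s["x"] *= 2


if __name__ == "__main__":
    final, rollbacks = execute_redundantly({"x": 1}, [add_five, double], faults={(1, 0)})
    print(final, rollbacks)  # prints: {'x': 12} 1
```

Because each block runs on a private copy, the last committed state plays the role of the implicit checkpoint: a signature mismatch simply discards both copies and re-executes the block, loosely mirroring an aborted transaction.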

Corresponding author

Correspondence to Florian Haas.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Haas, F., Weis, S., Ungerer, T., Pokam, G., Wu, Y. (2017). Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support. In: Knoop, J., Karl, W., Schulz, M., Inoue, K., Pionteck, T. (eds) Architecture of Computing Systems - ARCS 2017. ARCS 2017. Lecture Notes in Computer Science, vol 10172. Springer, Cham. https://doi.org/10.1007/978-3-319-54999-6_2

  • DOI: https://doi.org/10.1007/978-3-319-54999-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-54998-9

  • Online ISBN: 978-3-319-54999-6

  • eBook Packages: Computer Science, Computer Science (R0)
