Abstract
The demand for fault-tolerant execution on high-performance computer systems is increasing because smaller structure sizes lead to higher fault rates. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a software/hardware hybrid approach that targets Intel's current x86 multi-core platforms of the Core and Xeon families. We leverage hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback. Redundant execution of processes and signature-based comparison of their computations provide error detection, and transactional wrapping enables error recovery. Existing applications are adapted for fault-tolerant redundant execution through post-link binary instrumentation. Hardware enhancements that further increase the applicability of the approach are proposed and evaluated with the SPEC CPU 2006 benchmarks. The resulting performance overhead is 47% on average, assuming the existence of the proposed hardware support.
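The control flow summarized in the abstract (redundant execution, signature comparison, rollback on mismatch) can be illustrated with a small software emulation. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses no Intel TSX instructions, it models the implicit checkpoint simply as the untouched input state, and all names (`signature`, `run_replica`, `execute_redundantly`, `step`, `flip_bit`) are hypothetical. The CRC-32 signature is likewise an assumption chosen for illustration.

```python
import copy
import zlib


def signature(state):
    """Condense a state dict into a 32-bit CRC signature (illustrative choice)."""
    data = repr(sorted(state.items())).encode()
    return zlib.crc32(data)


def run_replica(state, block, fault=None):
    """Execute one block on a private copy of the state (one redundant replica)."""
    result = block(copy.deepcopy(state))
    if fault is not None:
        fault(result)  # emulate a transient fault that hits only this replica
    return result


def execute_redundantly(state, block, faults=(), max_retries=3):
    """Run `block` twice per attempt and commit only if both signatures agree.

    `state` is never modified, so it plays the role of the checkpoint:
    a mismatch discards both results and re-executes from it.
    """
    pending = list(faults)
    for attempt in range(max_retries + 1):
        fault = pending.pop(0) if pending else None
        leading = run_replica(state, block)
        trailing = run_replica(state, block, fault)
        if signature(leading) == signature(trailing):
            return leading, attempt  # commit; attempt equals the rollback count
        # Signatures differ: fall through and retry from the checkpoint.
    raise RuntimeError("persistent mismatch, giving up")


def step(s):
    """Hypothetical workload: one arithmetic update of an accumulator."""
    s["acc"] = s["acc"] * 3 + 1
    return s


def flip_bit(s):
    """Hypothetical fault: flip the lowest bit of the accumulator."""
    s["acc"] ^= 1


checkpoint = {"acc": 7}
final, rollbacks = execute_redundantly(checkpoint, step, faults=[flip_bit])
print(final, rollbacks)
```

In this run the first attempt is corrupted in one replica, the signatures disagree, and the second attempt commits. On real hardware the copy-based isolation above would be replaced by a hardware transaction, whose abort discards speculative writes without an explicit copy.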
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Haas, F., Weis, S., Ungerer, T., Pokam, G., Wu, Y. (2017). Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support. In: Knoop, J., Karl, W., Schulz, M., Inoue, K., Pionteck, T. (eds) Architecture of Computing Systems - ARCS 2017. ARCS 2017. Lecture Notes in Computer Science, vol 10172. Springer, Cham. https://doi.org/10.1007/978-3-319-54999-6_2
DOI: https://doi.org/10.1007/978-3-319-54999-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54998-9
Online ISBN: 978-3-319-54999-6
eBook Packages: Computer Science (R0)