Abstract
A plethora of resilience techniques have been investigated to protect application kernels. When such techniques are combined and interact across kernels, however, new windows of vulnerability are created. This work contributes the idea of end-to-end resilience by protecting the windows of vulnerability between kernels that are guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data structure. The work further promotes end-to-end application protection across kernels via a pragma-based specification for diverse resilience schemes with minimal programming effort. This lifts the data protection burden from application programmers, allowing them to focus solely on algorithms and performance, while resilience is specified and subsequently embedded into the code through the compiler/library and supported by the runtime system. In experiments with case studies and benchmarks, end-to-end resilience has an overhead over kernel-specific resilience of less than \(3\%\) on average and increases protection against bit flips by a factor of three to four.
This work was supported in part by a subcontract from Lawrence Berkeley National Laboratory and NSF grants 1525609, 1058779, and 0958311. This manuscript has three authors of Lawrence Berkeley National Laboratory under Contract No. DE-AC02-05CH11231 with the U.S. Department of Energy. The U.S. Government retains, and the publisher, by accepting the article for publication, acknowledges, that the U.S. Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for U.S. Government purposes.
Notes
1. Bit flips in code (instruction bits) create unpredictable outcomes (most of the time segmentation faults or crashes, but sometimes also incorrect but legal jumps) and are out of the scope of this work.
2. Extra checks are added to guarantee the correctness of data stored in a safe region. A safe region is assumed to be subject to neither bit flips nor data corruption from the application viewpoint, yet the techniques that make the region safe remain transparent to the programmer. In other words, a safe region is simply one subject to data protection/verification via checking.
3. Inputs are read from disk and stored in globals or on the heap, but may be recovered by re-reading from disk. Globals are calculated in the program and can only be recovered by re-calculation or ABFT schemes.
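The notion of a safe region in Note 2 (data whose correctness is verified by extra checks) can be illustrated with a minimal sketch. This is not the paper's implementation: the class `CheckedRegion`, the helper `flip_bit`, and the choice of a CRC32 checksum are all assumptions made here purely for illustration.

```python
import zlib


class CheckedRegion:
    """Hypothetical 'safe region': a buffer stored together with a CRC32 checksum."""

    def __init__(self, data: bytes):
        self._buf = bytearray(data)
        self._crc = zlib.crc32(self._buf)

    def write(self, data: bytes) -> None:
        # Every legitimate update recomputes the checksum.
        self._buf = bytearray(data)
        self._crc = zlib.crc32(self._buf)

    def read(self) -> bytes:
        # Verify before handing data to the next kernel.
        if zlib.crc32(self._buf) != self._crc:
            raise ValueError("checksum mismatch: data corrupted")
        return bytes(self._buf)


def flip_bit(region: CheckedRegion, byte_index: int, bit: int) -> None:
    """Simulate a soft error by flipping one bit behind the region's back."""
    region._buf[byte_index] ^= 1 << bit


region = CheckedRegion(b"kernel output")
clean = region.read()          # passes verification
flip_bit(region, 3, 0)         # inject a single bit flip
try:
    region.read()
    detected = False
except ValueError:
    detected = True            # the check catches the corruption
```

In this sketch the check only detects corruption; recovery would then follow one of the paths described in Note 3 (re-reading inputs from disk, re-calculation, or an ABFT scheme).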
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Rezaei, A., Khetawat, H., Patil, O., Mueller, F., Hargrove, P., Roman, E. (2019). End-to-End Resilience for HPC Applications. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_14
DOI: https://doi.org/10.1007/978-3-030-20656-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20655-0
Online ISBN: 978-3-030-20656-7
eBook Packages: Computer Science (R0)