Abstract
DISC is a newly proposed parallel programming paradigm that models many classes of iterative scientific applications through the specification of a domain and the interactions among domain elements. Together with its associated runtime, it hides the details of inter-process communication and work partitioning (including partitioning in the presence of heterogeneous processing elements) from programmers. In this paper, we show how these abstractions, particularly the concepts of compute-function and computation-space objects, can also be used to provide low-overhead fault-tolerance support. Computation-space objects enable automated application-level checkpointing, while replicated execution of compute-functions helps detect soft errors with low overhead. Experimental results show the effectiveness of the proposed solutions.
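The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the DISC runtime or its API: every name here (`replicated_step`, `save_checkpoint`, `load_checkpoint`, `smooth`, `run`) is hypothetical, the computation space is assumed to be a plain list of floats, and the CRC32 digest and in-memory pickle checkpoint are stand-ins chosen only for illustration.

```python
import pickle
import struct
import zlib


def checksum(values):
    """CRC32 over the raw bytes of the floats; a stand-in for any cheap digest."""
    return zlib.crc32(struct.pack(f"{len(values)}d", *values))


def replicated_step(compute_fn, space):
    """Run compute_fn twice on the same input and compare result digests.

    A mismatch signals a possible soft error in one of the two runs.
    """
    first = compute_fn(space)
    second = compute_fn(space)
    if checksum(first) != checksum(second):
        raise RuntimeError("replica mismatch: possible soft error")
    return first


def save_checkpoint(space, iteration):
    """Serialize the computation-space state in memory (hypothetical format)."""
    return pickle.dumps({"iteration": iteration, "space": space})


def load_checkpoint(blob):
    """Return (space, iteration) from a checkpoint produced above."""
    state = pickle.loads(blob)
    return state["space"], state["iteration"]


def smooth(space):
    """Toy compute-function: 3-point average with fixed boundary values."""
    inner = [(space[i - 1] + space[i] + space[i + 1]) / 3.0
             for i in range(1, len(space) - 1)]
    return [space[0]] + inner + [space[-1]]


def run(space, iterations, checkpoint_every=2):
    """Iterate with replicated execution, checkpointing periodically."""
    blob = save_checkpoint(space, 0)
    for it in range(1, iterations + 1):
        space = replicated_step(smooth, space)
        if it % checkpoint_every == 0:
            blob = save_checkpoint(space, it)
    return space, blob
```

Because the checkpoint is taken from the computation-space state alone, the application code in `smooth` needs no checkpointing logic of its own. Restarting after a failure would mean calling `load_checkpoint` and resuming the loop from the returned iteration.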
Notes
1. Please see https://software.sandia.gov/mantevo.
Acknowledgments
This work was supported by the National Science Foundation under award CCF-1319420, and by the Department of Energy, Office of Science, under award DE-SC0014135.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kurt, M.C., Ren, B., Agrawal, G. (2016). Low-Overhead Fault-Tolerance Support Using DISC Programming Model. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_2
DOI: https://doi.org/10.1007/978-3-319-29778-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29777-4
Online ISBN: 978-3-319-29778-1
eBook Packages: Computer Science, Computer Science (R0)