Low-Overhead Fault-Tolerance Support Using DISC Programming Model

Kurt, Mehmet Can; Ren, Bin; Agrawal, Gagan

doi:10.1007/978-3-319-29778-1_2

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9519)

Included in the following conference series:

Languages and Compilers for Parallel Computing

Abstract

DISC is a newly proposed parallel programming paradigm that models many classes of iterative scientific applications through the specification of a domain and the interactions among domain elements. Accompanied by an associated runtime, it hides the details of inter-process communication and work partitioning (including partitioning in the presence of heterogeneous processing elements) from programmers. In this paper, we show how these abstractions, particularly the concepts of compute-function and computation-space objects, can also be used to provide low-overhead fault-tolerance support. Computation-space objects enable automated application-level checkpointing, while replicated execution of compute-functions helps detect soft errors with low overhead. Experimental results show the effectiveness of the proposed solutions.
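The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the DISC runtime or its API, which this preview does not describe. It assumes a hypothetical compute-function (`stencil_step`) over a list of cells that stands in for a computation-space object, runs that function twice per iteration, compares checksums of the two results to flag a soft error, and rolls back to the last in-memory checkpoint on a mismatch. All names, the checkpoint interval, and the fault-injection hook are assumptions made for illustration.

```python
import copy
import hashlib
import pickle


def checksum(state):
    """Digest of the serialized state (illustrative stand-in for a result checksum)."""
    return hashlib.sha256(pickle.dumps(state)).hexdigest()


def stencil_step(cells):
    """Hypothetical compute-function: average each cell with its two ring neighbours."""
    n = len(cells)
    return [(cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]) / 3.0
            for i in range(n)]


def flip(cells):
    """Simulated soft error: corrupt one value in a replica's output."""
    corrupted = list(cells)
    corrupted[0] += 1.0
    return corrupted


def replicated_step(compute, state, inject_fault=None):
    """Run compute twice on copies of the same input and compare checksums.

    Returns (result_of_first_replica, replicas_agree).
    """
    first = compute(copy.deepcopy(state))
    if inject_fault is not None:
        first = inject_fault(first)
    second = compute(copy.deepcopy(state))
    return first, checksum(first) == checksum(second)


def run(state, iterations, checkpoint_every=2, fault_at=None):
    """Iterate with periodic checkpoints; roll back when the replicas disagree.

    fault_at (assumption): iteration index at which one transient fault is injected.
    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = (0, pickle.dumps(state))  # (iteration, serialized state)
    rollbacks = 0
    it = 0
    while it < iterations:
        inject = None
        if fault_at == it:
            inject = flip
            fault_at = None  # the fault is transient: it occurs only once
        new_state, ok = replicated_step(stencil_step, state, inject)
        if not ok:
            # Mismatch detected: restore the last checkpoint and redo the work.
            it, state = checkpoint[0], pickle.loads(checkpoint[1])
            rollbacks += 1
            continue
        state = new_state
        it += 1
        if it % checkpoint_every == 0:
            checkpoint = (it, pickle.dumps(state))
    return state, rollbacks
```

Calling `run([1.0, 2.0, 3.0, 4.0], 5, fault_at=3)` injects a single fault, which the replica comparison catches before the corrupted state is accepted. Full duplication as shown doubles the compute cost, so it should be read only as the baseline idea; how the paper keeps the overhead low is not visible in this preview.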

Notes

1. Please see https://software.sandia.gov/mantevo.

Acknowledgments

This work was supported by the National Science Foundation under award CCF-1319420, and by the Department of Energy, Office of Science, under award DE-SC0014135.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
Mehmet Can Kurt & Gagan Agrawal
Pacific Northwest National Laboratory, Richland, WA, USA
Bin Ren

Corresponding author

Correspondence to Mehmet Can Kurt.

Editor information

Editors and Affiliations

North Carolina State University, Raleigh, North Carolina, USA
Xipeng Shen
North Carolina State University, Raleigh, North Carolina, USA
Frank Mueller
North Carolina State University, Raleigh, North Carolina, USA
James Tuck

About this paper

Cite this paper

Kurt, M.C., Ren, B., Agrawal, G. (2016). Low-Overhead Fault-Tolerance Support Using DISC Programming Model. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_2

DOI: https://doi.org/10.1007/978-3-319-29778-1_2
Published: 20 February 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29777-4
Online ISBN: 978-3-319-29778-1
eBook Packages: Computer Science, Computer Science (R0)
