Low-Overhead Fault-Tolerance Support Using DISC Programming Model

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2015)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9519)

Abstract

DISC is a recently proposed parallel programming paradigm that models many classes of iterative scientific applications through the specification of a domain and the interactions among domain elements. Together with its associated runtime, it hides the details of inter-process communication and work partitioning (including partitioning in the presence of heterogeneous processing elements) from programmers. In this paper, we show how these abstractions, particularly the concepts of compute-function and computation-space objects, can also be used to provide low-overhead fault-tolerance support. Computation-space objects enable automated application-level checkpointing, while replicated execution of compute-functions helps detect soft errors with low overhead. Experimental results show the effectiveness of the proposed solutions.
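The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the DISC API and not the authors' implementation: `ComputationSpace`, `replicated_step`, `smooth`, `corrupted`, and `run` are hypothetical names chosen for illustration, and the fault injection is simulated. The sketch assumes that all iteration state lives in one object that the runtime can serialize, and that a compute function is deterministic, so two executions on the same input must agree.

```python
import hashlib
import pickle


class ComputationSpace:
    """Hypothetical stand-in for a computation-space object (not the DISC API)."""

    def __init__(self, values):
        self.values = list(values)

    def checkpoint(self):
        # Application-level checkpoint: serialize only the iteration state.
        return pickle.dumps(self.values)

    def restore(self, blob):
        self.values = pickle.loads(blob)


def checksum(values):
    # Digest of the output, so replicas are compared without an element-wise scan.
    return hashlib.sha256(repr(values).encode("utf-8")).hexdigest()


def replicated_step(space, compute_fn, replica_fn=None):
    """Run the compute function twice on the same input; commit only on agreement."""
    replica_fn = replica_fn or compute_fn
    first = compute_fn(list(space.values))
    second = replica_fn(list(space.values))
    if checksum(first) != checksum(second):
        return False  # mismatch: treat as a possible soft error
    space.values = first
    return True


def smooth(values):
    """Example compute function: 1-D three-point average with fixed boundaries."""
    out = list(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3.0
    return out


def corrupted(compute_fn):
    """Simulated soft error: perturb one element of the replica's output."""
    def wrapper(values):
        out = compute_fn(values)
        out[1] += 1.0
        return out
    return wrapper


def run(space, compute_fn, iterations, checkpoint_every=2, faulty_at=None):
    """Iterate with periodic checkpoints; roll back when the replicas disagree."""
    saved_iter, saved = 0, space.checkpoint()
    detected, it = 0, 0
    while it < iterations:
        replica_fn = None
        if it == faulty_at:
            # Inject the simulated fault once, at the requested iteration.
            replica_fn, faulty_at = corrupted(compute_fn), None
        if replicated_step(space, compute_fn, replica_fn):
            it += 1
            if it % checkpoint_every == 0:
                saved_iter, saved = it, space.checkpoint()
        else:
            detected += 1
            space.restore(saved)  # roll back to the last checkpoint
            it = saved_iter
    return detected
```

Calling `run(ComputationSpace([0.0, 0.0, 9.0, 0.0, 0.0]), smooth, 4, faulty_at=3)` detects the injected mismatch, rolls back to the checkpoint taken after iteration 2, and finishes with the same state as a fault-free run. Full duplication, as sketched here, doubles the compute cost; it is shown only to make the detection idea concrete and says nothing about how the paper keeps overheads low.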

Notes

  1. Please see https://software.sandia.gov/mantevo.

Acknowledgments

This work was supported by the National Science Foundation under award CCF-1319420 and by the Department of Energy, Office of Science, under award DE-SC0014135.

Corresponding author

Correspondence to Mehmet Can Kurt.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kurt, M.C., Ren, B., Agrawal, G. (2016). Low-Overhead Fault-Tolerance Support Using DISC Programming Model. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_2

  • DOI: https://doi.org/10.1007/978-3-319-29778-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29777-4

  • Online ISBN: 978-3-319-29778-1

  • eBook Packages: Computer Science, Computer Science (R0)