Reliability-aware performance model for optimal GPU-enabled cluster environment

The Journal of Supercomputing

Abstract

Because the reliability of a very large-scale system is inversely related to its number of computing elements, fault tolerance has become a major concern in high-performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures in such systems. However, fault tolerance mechanisms incur additional costs, which may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. The work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted, and the effect of a checkpoint/restart mechanism on application performance is studied and discussed.

Acknowledgments

This work was partially supported by the grants CNS-0834483, EPS-1003897 and TE97/2010.

Author information

Corresponding author

Correspondence to Mihaela Paun.

About this article

Cite this article

Laosooksathit, S., Nassar, R., Leangsuksun, C. et al. Reliability-aware performance model for optimal GPU-enabled cluster environment. J Supercomput 68, 1630–1651 (2014). https://doi.org/10.1007/s11227-014-1128-7

  • DOI: https://doi.org/10.1007/s11227-014-1128-7
