Abstract
We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory (LANL). GPUs offer a very high theoretical FLOPS rate and are reasonably inexpensive and readily available, but they are relatively new to HPC, which demands both consistently high performance across nodes and a consistently low error rate.
We modified a standard acceptance procedure to test GPU performance, error rate, and reliability characteristics, and ran the test suite on a Fermi HPC cluster at LANL. We describe our methodology for this testing and present results relevant to the deployment of GPUs in an HPC environment.
We show performance variability, power-usage variability (possibly related to the performance variability), and some reliability concerns on the GPUs tested. We argue for rigorous testing of these devices in deployment as a way of characterizing their behavior.
© 2014 Springer-Verlag Berlin Heidelberg
DeBardeleben, N. et al. (2014). GPU Behavior on a Large HPC Cluster. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_66
DOI: https://doi.org/10.1007/978-3-642-54420-0_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science (R0)