GPU Behavior on a Large HPC Cluster

DeBardeleben, Nathan; Blanchard, Sean; Monroe, Laura; Romero, Phil; Grunau, Daryl; Idler, Craig; Wright, Cornell

doi:10.1007/978-3-642-54420-0_66

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8374)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

We discuss observed characteristics of GPUs deployed as accelerators in an HPC cluster at Los Alamos National Laboratory. GPUs offer a high theoretical FLOPS rate and are reasonably inexpensive and readily available, but they are relatively new to HPC, which demands both consistently high performance across nodes and a consistently low error rate.

We modified a standard acceptance procedure to test GPU performance, error rate and reliability characteristics, and ran the test suite on a Fermi HPC cluster at LANL. We discuss here our methodology for this testing, and present results relevant to the deployment of GPUs in an HPC environment.

In this paper we show performance variability, power usage variability (possibly related to the performance variability), and some reliability concerns on the GPUs tested. We argue for rigorous testing of these devices in deployment as a way of characterizing their behavior.

Author information

Authors and Affiliations

Los Alamos National Laboratory, High Performance Computing Division, Los Alamos, NM, 87544, USA
Nathan DeBardeleben, Sean Blanchard, Laura Monroe, Phil Romero, Daryl Grunau, Craig Idler & Cornell Wright

Editor information

Editors and Affiliations

Rechen- und Kommunikationszentrum, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey
TU Vienna, 1040, Vienna, Austria
Michael Alexander
RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Paolo Bientinesi & Carsten Clauss
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan & Christine Morin
University of Innsbruck, 6020, Innsbruck, Austria
Gabor Kecskemeti
Department of Computer Science, University of Pisa, 56126, Pisa, Italy
Laura Ricci
Universitat Politècnica de València, 46022, València, Spain
Julio Sahuquillo
LLNL, USA
Martin Schulz
Dipartimento di Informatica, Università di Salerno, 84084, Salerno, Italy
Vittorio Scarano
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
Technische Universität München, 80333, Munich, Germany
Josef Weidendorfer

About this paper

Cite this paper

DeBardeleben, N. et al. (2014). GPU Behavior on a Large HPC Cluster. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_66

DOI: https://doi.org/10.1007/978-3-642-54420-0_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)