Abstract
Motivated by the high computational power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high-performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster in which each node has multiple GPUs. The basic computations of the solver are performed on the GPUs, while communication is managed by the CPU. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver selects the fastest among several high-performance kernels running on the GPUs. Scalability is harder to obtain in a GPU-extended cluster than in a traditional CPU cluster: because GPUs compute much faster than CPUs, GPU-extended clusters demand faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing in parallel sparse iterative solvers. We implement a hierarchical partitioning model that is better suited to the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops of double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes.
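The structure described in the abstract, a CG iteration whose cost is dominated by sparse matrix-vector multiplication, with the multiplication kernel chosen by speed, can be illustrated with a small sketch. This is not the authors' implementation: it runs on a single CPU in pure Python, the matrix is assumed to be symmetric positive definite and stored in CSR format, and all function names (`spmv_csr`, `spmv_csr_zip`, `pick_fastest`, `conjugate_gradient`) are hypothetical. The two kernels stand in for the several GPU kernels the paper mentions, and timing them on the actual matrix is one assumed way of selecting the fastest.

```python
import time


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A x for A in CSR format, looping over rows and nonzeros."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def spmv_csr_zip(row_ptr, col_idx, vals, x):
    """Same product as spmv_csr, written with slices and sum()."""
    return [
        sum(v * x[j] for v, j in zip(vals[s:e], col_idx[s:e]))
        for s, e in zip(row_ptr, row_ptr[1:])
    ]


def pick_fastest(kernels, args, repeats=3):
    """Time each candidate kernel on the given arguments; return the quickest."""
    best, best_time = None, float("inf")
    for kernel in kernels:
        start = time.perf_counter()
        for _ in range(repeats):
            kernel(*args)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = kernel, elapsed
    return best


def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(p * q for p, q in zip(a, b))


def conjugate_gradient(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A."""
    x = [0.0] * len(b)
    # Assumed selection strategy: benchmark the kernels once, then reuse the winner.
    spmv = pick_fastest([spmv_csr, spmv_csr_zip], (row_ptr, col_idx, vals, x))
    r = list(b)  # residual b - A x, with the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:  # stop once the residual norm is small
            break
        q = spmv(row_ptr, col_idx, vals, p)  # the dominant cost per iteration
        alpha = rs_old / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Usage on the 2x2 system [[4, 1], [1, 3]] x = [1, 2]
row_ptr, col_idx, vals = [0, 2, 4], [0, 1, 0, 1], [4.0, 1.0, 1.0, 3.0]
solution = conjugate_gradient(row_ptr, col_idx, vals, [1.0, 2.0])
```

In a distributed setting, each process would hold only some rows of the matrix, so the multiplication step needs entries of `p` owned by other processes. That exchange is the communication which, according to the abstract, hypergraph partitioning is used to reduce.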
Cite this article
Cevahir, A., Nukada, A. & Matsuoka, S. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning. Comput Sci Res Dev 25, 83–91 (2010). https://doi.org/10.1007/s00450-010-0112-6
DOI: https://doi.org/10.1007/s00450-010-0112-6