
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning

Special Issue Paper · Computer Science - Research and Development

Abstract

Motivated by the high computational power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster, where each cluster node has multiple GPUs. The basic computations of the solver are performed on the GPUs, while communication is managed by the CPUs. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver selects the fastest among several high performance kernels running on the GPUs. Scalability is harder to obtain on a GPU-extended cluster than on a traditional CPU cluster: because computation on the GPUs is so much faster, the cluster demands correspondingly faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing in parallel sparse iterative solvers. We implement a hierarchical partitioning model which better optimizes for the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops of double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes.
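
To make the kernel-selection step concrete, the following is a minimal sketch rather than the authors' implementation: it times several interchangeable SpMV variants on the actual matrix once at setup and uses the winner inside a plain CG loop. CPU-side SciPy formats stand in for the GPU kernels the paper chooses among, and every name below is illustrative.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_fastest_spmv(A, x, trials=5):
    # Candidate SpMV implementations; in the paper these would be
    # GPU kernels for different sparse formats, chosen per matrix.
    candidates = {"csr": A.tocsr(), "csc": A.tocsc(), "coo": A.tocoo()}
    best, best_dt = None, float("inf")
    for name, M in candidates.items():
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        dt = (time.perf_counter() - t0) / trials
        if dt < best_dt:
            best, best_dt = M, dt
    return best

def cg(A, b, tol=1e-8, max_iter=1000):
    # Plain (unpreconditioned) CG; A must be symmetric positive definite.
    A = pick_fastest_spmv(A, b)        # select the SpMV variant once, up front
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage: a small diagonally dominant (hence SPD) test matrix.
n = 2000
A = sp.random(n, n, density=0.002, format="csr", random_state=0)
A = A + A.T + 4.0 * n * sp.identity(n, format="csr")
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```

In the paper's setting the candidates would be GPU kernels for different sparse formats, each process would time them on its own local submatrix, and the CPU would handle the inter-node communication between iterations; the selection logic itself is unchanged.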

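The hypergraph model can be sketched in the same spirit. In the standard column-net model for 1D rowwise partitioning, each matrix row is a vertex weighted by its nonzero count, and each column is a net connecting the rows that have a nonzero in that column; a partitioner then balances vertex weight across parts while minimizing the connectivity of cut nets, which measures the communication volume of the parallel sparse matrix-vector multiply. The sketch below, again illustrative and not the paper's code, writes this hypergraph in hMETIS-style input format for an external partitioner.

```python
import scipy.sparse as sp

def write_column_net_hypergraph(A, path):
    # Column-net model for 1D rowwise partitioning:
    #   vertex i = row i, weighted by nnz(row i) (a proxy for SpMV work),
    #   net j    = column j, connecting the rows with a nonzero in column j.
    A_csc = A.tocsc()
    n_rows = A_csc.shape[0]
    nets = [A_csc.indices[A_csc.indptr[j]:A_csc.indptr[j + 1]]
            for j in range(A_csc.shape[1])]
    nets = [rows for rows in nets if len(rows) > 0]  # empty columns induce no communication
    weights = A.tocsr().getnnz(axis=1)
    with open(path, "w") as f:
        f.write(f"{len(nets)} {n_rows} 10\n")        # nets, vertices, fmt=10: vertex weights
        for rows in nets:
            f.write(" ".join(str(i + 1) for i in rows) + "\n")  # hMETIS is 1-indexed
        for w in weights:
            f.write(f"{max(int(w), 1)}\n")           # partitioners dislike zero weights
```

A tool such as hMETIS or PaToH can then partition the resulting hypergraph into one part per compute unit; the hierarchical scheme described in the abstract would apply such partitioning first across the nodes and then, within each node, across its GPUs.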


Author information

Correspondence to Ali Cevahir.


About this article

Cite this article

Cevahir, A., Nukada, A. & Matsuoka, S. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning. Comput Sci Res Dev 25, 83–91 (2010). https://doi.org/10.1007/s00450-010-0112-6
