
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning

Special Issue Paper · Computer Science - Research and Development

Abstract

Motivated by the high computational power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster, where each cluster node has multiple GPUs. The basic computations of the solver are performed on the GPUs, while communication is managed by the CPUs. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver selects the fastest among several high performance kernels running on the GPUs. Scalability is harder to obtain on a GPU-extended cluster than on a traditional CPU cluster: because computation on the GPUs is so much faster, the cluster demands correspondingly faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing in parallel sparse iterative solvers. We implement a hierarchical partitioning model which better optimizes for the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops of double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes.
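
To make the kernel-selection step concrete, the following is a minimal sketch rather than the authors' implementation: it times several interchangeable SpMV variants on the actual matrix once at setup and uses the winner inside a plain CG loop. CPU-side SciPy formats stand in for the GPU kernels the paper chooses among, and every name below is illustrative.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_fastest_spmv(A, x, trials=5):
    # Candidate SpMV implementations; in the paper these would be
    # GPU kernels for different sparse formats, chosen per matrix.
    candidates = {"csr": A.tocsr(), "csc": A.tocsc(), "coo": A.tocoo()}
    best, best_dt = None, float("inf")
    for name, M in candidates.items():
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        dt = (time.perf_counter() - t0) / trials
        if dt < best_dt:
            best, best_dt = M, dt
    return best

def cg(A, b, tol=1e-8, max_iter=1000):
    # Plain (unpreconditioned) CG; A must be symmetric positive definite.
    A = pick_fastest_spmv(A, b)        # select the SpMV variant once, up front
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage: a small diagonally dominant (hence SPD) test matrix.
n = 2000
A = sp.random(n, n, density=0.002, format="csr", random_state=0)
A = A + A.T + 4.0 * n * sp.identity(n, format="csr")
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```

In the paper's setting the candidates would be GPU kernels for different sparse formats, each process would time them on its own local submatrix, and the CPU would handle the inter-node communication between iterations; the selection logic itself is unchanged.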

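The hypergraph model can be sketched in the same spirit. In the standard column-net model for 1D rowwise partitioning, each matrix row is a vertex weighted by its nonzero count, and each column is a net connecting the rows that have a nonzero in that column; a partitioner then balances vertex weight across parts while minimizing the connectivity of cut nets, which measures the communication volume of the parallel sparse matrix-vector multiply. The sketch below, again illustrative and not the paper's code, writes this hypergraph in hMETIS-style input format for an external partitioner.

```python
import scipy.sparse as sp

def write_column_net_hypergraph(A, path):
    # Column-net model for 1D rowwise partitioning:
    #   vertex i = row i, weighted by nnz(row i) (a proxy for SpMV work),
    #   net j    = column j, connecting the rows with a nonzero in column j.
    A_csc = A.tocsc()
    n_rows = A_csc.shape[0]
    nets = [A_csc.indices[A_csc.indptr[j]:A_csc.indptr[j + 1]]
            for j in range(A_csc.shape[1])]
    nets = [rows for rows in nets if len(rows) > 0]  # empty columns induce no communication
    weights = A.tocsr().getnnz(axis=1)
    with open(path, "w") as f:
        f.write(f"{len(nets)} {n_rows} 10\n")        # nets, vertices, fmt=10: vertex weights
        for rows in nets:
            f.write(" ".join(str(i + 1) for i in rows) + "\n")  # hMETIS is 1-indexed
        for w in weights:
            f.write(f"{max(int(w), 1)}\n")           # partitioners dislike zero weights
```

A tool such as hMETIS or PaToH can then partition the resulting hypergraph into one part per compute unit; the hierarchical scheme described in the abstract would apply such partitioning first across the nodes and then, within each node, across its GPUs.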


Author information

Correspondence to Ali Cevahir.


About this article

Cite this article

Cevahir, A., Nukada, A. & Matsuoka, S. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning. Comput Sci Res Dev 25, 83–91 (2010). https://doi.org/10.1007/s00450-010-0112-6
