From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices

Abstract

This paper presents an efficient algorithmic approach to the GPU-based parallel resolution of dense linear systems of extremely large size. A formal transformation of the code of the Gauss method allows us to develop, for matrix calculations, the concept of a stripe algorithm, as opposed to that of a tile algorithm. Our stripe algorithm is based on partitioning the linear system's matrix into stripes of rows and is well suited to efficient implementation on a GPU, using the cublasDgemm function of the CUBLAS library as its main building block. It is also well adapted to storing the linear system on an array of solid state devices, with the PC main memory used as a cache between the SSDs and the GPU memory. We demonstrate experimentally that our code efficiently solves dense linear systems of size up to 400,000 (160 billion matrix elements) using an NVIDIA C2050 GPU and six 240 GB SSDs.
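To make the role of cublasDgemm concrete, the sketch below shows the kind of update a stripe algorithm performs at each elimination step: subtracting the product of a multiplier panel and the pivot stripe from a trailing stripe resident in GPU memory. This is a minimal sketch under assumed column-major layout; the names (stripe_update, d_T, d_M, d_P) and sizes are illustrative, not the paper's actual code.

/* Minimal sketch of one stripe elimination update, assuming the stripes
   are already resident in GPU memory in column-major order. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* T (s x n, trailing stripe) <- T - M * P, where M (s x b) holds the
   multipliers of the current pivot block and P (b x n) is the pivot
   stripe. cublasDgemm computes C = alpha*op(A)*op(B) + beta*C, so
   alpha = -1, beta = 1 folds the subtraction into a single GEMM. */
static void stripe_update(cublasHandle_t h,
                          double *d_T, int ldT,
                          const double *d_M, int ldM,
                          const double *d_P, int ldP,
                          int s, int n, int b)
{
    const double alpha = -1.0, beta = 1.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, s, n, b,
                &alpha, d_M, ldM, d_P, ldP, &beta, d_T, ldT);
}

int main(void)
{
    const int s = 256, n = 512, b = 64;   /* illustrative stripe sizes */
    cublasHandle_t h;
    cublasCreate(&h);

    double *d_T, *d_M, *d_P;
    cudaMalloc((void **)&d_T, sizeof(double) * s * n);
    cudaMalloc((void **)&d_M, sizeof(double) * s * b);
    cudaMalloc((void **)&d_P, sizeof(double) * b * n);
    cudaMemset(d_T, 0, sizeof(double) * s * n);
    cudaMemset(d_M, 0, sizeof(double) * s * b);
    cudaMemset(d_P, 0, sizeof(double) * b * n);

    stripe_update(h, d_T, s, d_M, s, d_P, b, s, n, b);
    cudaDeviceSynchronize();

    cudaFree(d_T); cudaFree(d_M); cudaFree(d_P);
    cublasDestroy(h);
    return 0;
}

Expressing the bulk of the elimination as one large GEMM per stripe is what lets the GPU run close to its peak double-precision throughput.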

Abbreviations

GPU: Graphical processing unit
CUDA: Compute unified device architecture
CUBLAS: CUDA basic linear algebra subroutines
SSD: Solid state device
L_SSD: 1st memory level: array of SSDs
L_PC: 2nd memory level: main memory of the PC
L_GPU: 3rd memory level: global memory of the GPU
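The three memory levels above can be illustrated with a hypothetical staging routine: a stripe is read from an SSD-resident file into pinned PC memory, which acts as the cache for the transfer into GPU global memory. This is a sketch under the assumption that the stripe lies contiguously in a single file; the paper actually distributes the matrix over an array of six SSDs, and the function name load_stripe is invented for illustration.

/* Hypothetical staging of one stripe through the three memory levels:
   L_SSD (file on an SSD) -> L_PC (pinned host buffer) -> L_GPU. */
#include <stdio.h>
#include <cuda_runtime.h>

/* Read `bytes` at `offset` of an SSD-resident file into a pinned host
   buffer, then copy the buffer into preallocated GPU memory. */
static int load_stripe(const char *path, long offset, size_t bytes,
                       double *d_stripe)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    double *h_buf;                        /* L_PC: pinned for fast DMA */
    cudaMallocHost((void **)&h_buf, bytes);

    fseek(f, offset, SEEK_SET);
    size_t got = fread(h_buf, 1, bytes, f);          /* L_SSD -> L_PC */
    fclose(f);
    if (got != bytes) { cudaFreeHost(h_buf); return -1; }

    cudaMemcpy(d_stripe, h_buf, bytes,
               cudaMemcpyHostToDevice);              /* L_PC -> L_GPU */
    cudaFreeHost(h_buf);
    return 0;
}

Pinned (page-locked) host memory is what makes L_PC an effective cache here: it allows full-bandwidth DMA transfers to the GPU while stripes stream in from the SSD array.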


Author information

Correspondence to Manuel Carcenac.


Cite this article

Carcenac, M. From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices. J Supercomput 68, 365–413 (2014). https://doi.org/10.1007/s11227-013-1043-3
