
Performance Study of LU Decomposition on the Programmable GPU

Conference paper in High Performance Computing – HiPC 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3769)

Abstract

With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To achieve this, we have developed and evaluated several methods with different implementation approaches in terms of (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four key findings: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be handled more efficiently by the CPU than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those by CPUs, mainly due to floating-point division error, which accumulates as the decomposition progresses.
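For readers unfamiliar with the kernel under study: the paper targets the classic in-place LU decomposition, whose outer loop is the "dependent loop" of point (1). The sketch below is not the paper's GPU shader implementation but a minimal CPU version in NumPy (Doolittle form, no pivoting, on a diagonally dominant matrix so pivoting is unnecessary); running it in single and double precision also illustrates the precision gap of point (4). All names here are illustrative, not from the paper.

```python
import numpy as np

def lu_decompose(a):
    """In-place Doolittle LU decomposition without pivoting.

    Returns a matrix whose strict lower triangle holds L (unit
    diagonal implied) and whose upper triangle holds U.
    """
    a = a.copy()
    n = a.shape[0]
    for k in range(n):                  # dependent outer loop: step k needs step k-1
        for i in range(k + 1, n):
            a[i, k] /= a[k, k]          # the division whose rounding error compounds
            for j in range(k + 1, n):
                a[i, j] -= a[i, k] * a[k, j]
    return a

rng = np.random.default_rng(0)
# Diagonally dominant test matrix: safe to factor without pivoting.
m = rng.random((64, 64)) + 64.0 * np.eye(64)

lu32 = lu_decompose(m.astype(np.float32))   # GPU-era single precision
lu64 = lu_decompose(m.astype(np.float64))   # CPU double precision
print(np.max(np.abs(lu32 - lu64)))          # single- vs double-precision gap
```

Because each outer iteration reads the panel produced by the previous one, a GPU implementation must keep the partially factored matrix resident in VRAM between passes, which is why the paper finds render textures preferable to copy-based approaches.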

This work was partly supported by JSPS Grant-in-Aid for Scientific Research on Priority Areas (16016254).



References

  1. Fernando, R. (ed.): GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Addison-Wesley, Reading (2004)
  2. Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware (GH 2004), pp. 133–137 (2004)
  3. Thompson, C.J., Hahn, S., Oskin, M.: Using modern graphics architectures for general-purpose computing: A framework and analysis. In: Proc. 35th IEEE/ACM Int'l Symp. on Microarchitecture (MICRO 2002), pp. 306–317 (2002)
  4. Larsen, E.S., McAllister, D.: Fast matrix multiplies using graphics hardware. In: Proc. High Performance Networking and Computing Conf. (SC 2001) (2001)
  5. Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 3–35 (2001)
  6. Hall, J.D., Carr, N.A., Hart, J.C.: Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois (2003)
  7. Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graphics 22, 908–916 (2003)
  8. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)
  9. Moravánszky, A.: Dense Matrix Algebra on the GPU (2003), http://www.shaderx2.com/shaderx.PDF
  10. Moreland, K., Angel, E.: The FFT on a GPU. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware (GH 2003), pp. 112–119 (2003)
  11. Fernando, R., Harris, M., Wloka, M., Zeller, C.: Programming graphics hardware. EUROGRAPHICS 2004 Tutorial Notes (2004), http://download.nvidia.com/developer/presentations/2004/Eurographics/EG_04_TutorialNotes.pdf
  12. Pharr, M., Fernando, R. (eds.): GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Reading (2005)
  13. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003)
  14. Shreiner, D., Woo, M., Neider, J., Davis, T. (eds.): OpenGL Programming Guide, 4th edn. Addison-Wesley, Reading (2003)
  15. Microsoft Corporation: DirectX (2005), http://www.microsoft.com/directx/
  16. Stevenson, D.: A proposed standard for binary floating-point arithmetic. IEEE Computer 14, 51–62 (1981)
  17. Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A. (eds.): Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991)
  18. Mark, W.R., Glanville, R.S., Akeley, K., Kilgard, M.J.: Cg: A system for programming graphics hardware in a C-like language. ACM Trans. Graphics 22, 896–907 (2003)
  19. Naruse, A., Sumimoto, S., Kumon, K.: Optimization and evaluation of the LINPACK benchmark for the Xeon processor. IPSJ Trans. Advanced Computing Systems 45, 62–70 (2004) (in Japanese)
  20. Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report CS-TR-02-55, The University of Texas at Austin (2002)
  21. Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: Past, present and future. Concurrency and Computation: Practice and Experience 15, 803–820 (2003)
  22. Hillesland, K.E., Lastra, A.: GPU floating point paranoia. In: Proc. 1st ACM Workshop on General-Purpose Computing on Graphics Processors (GP2 2004), C-8 (2004), http://www.cs.unc.edu/~ibr/projects/paranoia/
  23. Moore, G.E.: Cramming more components onto integrated circuits. Electronics 38, 114–117 (1965)


Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ino, F., Matsui, M., Goda, K., Hagihara, K. (2005). Performance Study of LU Decomposition on the Programmable GPU. In: Bader, D.A., Parashar, M., Sridhar, V., Prasanna, V.K. (eds) High Performance Computing – HiPC 2005. HiPC 2005. Lecture Notes in Computer Science, vol 3769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11602569_13

  • DOI: https://doi.org/10.1007/11602569_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30936-9

  • Online ISBN: 978-3-540-32427-0
