Abstract
With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To achieve this, we have developed and evaluated several methods with different implementation approaches in terms of (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four important findings: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be handled more efficiently by the CPU than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those obtained by CPUs, mainly due to floating-point division error, which increases the numerical error as the decomposition progresses.
This work was partly supported by JSPS Grant-in-Aid for Scientific Research on Priority Areas (16016254).
References
Fernando, R. (ed.): GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Addison-Wesley, Reading (2004)
Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2004), pp. 133–137 (2004)
Thompson, C.J., Hahn, S., Oskin, M.: Using modern graphics architectures for general-purpose computing: A framework and analysis. In: Proc. 35th IEEE/ACM Int’l Symp. Microarchitecture (MICRO 2002), pp. 306–317 (2002)
Larsen, E.S., McAllister, D.: Fast matrix multiplies using graphics hardware. In: Proc. High Performance Networking and Computing Conf., SC 2001 (2001)
Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 3–35 (2001)
Hall, J.D., Carr, N.A., Hart, J.C.: Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois (2003)
Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graphics 22, 908–916 (2003)
Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)
Moravánszky, A.: Dense Matrix Algebra on the GPU (2003), http://www.shaderx2.com/shaderx.PDF
Moreland, K., Angel, E.: The FFT on a GPU. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2003), pp. 112–119 (2003)
Fernando, R., Harris, M., Wloka, M., Zeller, C.: Programming graphics hardware. In: EUROGRAPHICS 2004 Tutorial Note, (2004), http://download.nvidia.com/developer/presentations/2004/Eurographics/EG_04_TutorialNotes.pdf
Pharr, M., Fernando, R. (eds.): GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Reading (2005)
Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003)
Shreiner, D., Woo, M., Neider, J., Davis, T. (eds.): OpenGL Programming Guide, 4th edn. Addison-Wesley, Reading (2003)
Microsoft Corporation: DirectX (2005), http://www.microsoft.com/directx/
Stevenson, D.: A proposed standard for binary floating-point arithmetic. IEEE Computer 14, 51–62 (1981)
Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A. (eds.): Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991)
Mark, W.R., Glanville, R.S., Akeley, K., Kilgard, M.J.: Cg: A system for programming graphics hardware in a C-like language. ACM Trans. Graphics 22, 896–897 (2003)
Naruse, A., Sumimoto, S., Kumon, K.: Optimization and evaluation of Linpack benchmark for Xeon processor. IPSJ Trans. Advanced Computing Systems 45, 62–70 (2004) (in Japanese)
Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report CS-TR-02-55, The University of Texas at Austin (2002)
Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience 15, 803–820 (2003)
Hillesland, K.E., Lastra, A.: GPU floating point paranoia. In: Proc. 1st ACM Workshop General-Purpose Computing on Graphics Processors (GP2 2004), vol. C–8 (2004), http://www.cs.unc.edu/~ibr/projects/paranoia/
Moore, G.E.: Cramming more components onto integrated circuits. Electronics 38, 114–117 (1965)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Ino, F., Matsui, M., Goda, K., Hagihara, K. (2005). Performance Study of LU Decomposition on the Programmable GPU. In: Bader, D.A., Parashar, M., Sridhar, V., Prasanna, V.K. (eds) High Performance Computing – HiPC 2005. HiPC 2005. Lecture Notes in Computer Science, vol 3769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11602569_13
DOI: https://doi.org/10.1007/11602569_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30936-9
Online ISBN: 978-3-540-32427-0