Abstract
With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To achieve this, we have developed and evaluated several methods with different implementation approaches in terms of (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four important findings: (1) dependent loops must be implemented through the use of a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be handled more efficiently by the CPU than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, we find that GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained by GPUs usually differ from those obtained by CPUs, mainly due to floating-point division error, which increases the numerical error as the decomposition progresses.
This work was partly supported by JSPS Grant-in-Aid for Scientific Research on Priority Areas (16016254).
References
Fernando, R. (ed.): GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics. Addison-Wesley, Reading (2004)
Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2004), pp. 133–137 (2004)
Thompson, C.J., Hahn, S., Oskin, M.: Using modern graphics architectures for general-purpose computing: A framework and analysis. In: Proc. 35th IEEE/ACM Int’l Symp. Microarchitecture (MICRO 2002), pp. 306–317 (2002)
Larsen, E.S., McAllister, D.: Fast matrix multiplies using graphics hardware. In: Proc. High Performance Networking and Computing Conf., SC 2001 (2001)
Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 3–35 (2001)
Hall, J.D., Carr, N.A., Hart, J.C.: Cache and bandwidth aware matrix multiplication on the GPU. Technical Report UIUCDCS-R-2003-2328, University of Illinois (2003)
Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graphics 22, 908–916 (2003)
Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)
Moravánszky, A.: Dense Matrix Algebra on the GPU (2003), http://www.shaderx2.com/shaderx.PDF
Moreland, K., Angel, E.: The FFT on a GPU. In: Proc. SIGGRAPH/EUROGRAPHICS Workshop Graphics Hardware (GH 2003), pp. 112–119 (2003)
Fernando, R., Harris, M., Wloka, M., Zeller, C.: Programming graphics hardware. In: EUROGRAPHICS 2004 Tutorial Note, (2004), http://download.nvidia.com/developer/presentations/2004/Eurographics/EG_04_TutorialNotes.pdf
Pharr, M., Fernando, R. (eds.): GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation. Addison-Wesley, Reading (2005)
Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003)
Shreiner, D., Woo, M., Neider, J., Davis, T. (eds.): OpenGL Programming Guide, 4th edn. Addison-Wesley, Reading (2003)
Microsoft Corporation: DirectX (2005), http://www.microsoft.com/directx/
Stevenson, D.: A proposed standard for binary floating-point arithmetic. IEEE Computer 14, 51–62 (1981)
Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A. (eds.): Solving Linear Systems on Vector and Shared Memory Computers. SIAM, Philadelphia (1991)
Mark, W.R., Glanville, R.S., Akeley, K., Kilgard, M.J.: Cg: A system for programming graphics hardware in a C-like language. ACM Trans. Graphics 22, 896–897 (2003)
Naruse, A., Sumimoto, S., Kumon, K.: Optimization and evaluation of Linpack benchmark for Xeon processor. IPSJ Trans. Advanced Computing Systems 45, 62–70 (2004) (in Japanese)
Goto, K., van de Geijn, R.: On reducing TLB misses in matrix multiplication. Technical Report CS-TR-02-55, The University of Texas at Austin (2002)
Dongarra, J.J., Luszczek, P., Petitet, A.: The LINPACK benchmark: past, present and future. Concurrency and Computation: Practice and Experience 15, 803–820 (2003)
Hillesland, K.E., Lastra, A.: GPU floating point paranoia. In: Proc. 1st ACM Workshop General-Purpose Computing on Graphics Processors (GP2 2004), vol. C–8 (2004), http://www.cs.unc.edu/~ibr/projects/paranoia/
Moore, G.E.: Cramming more components onto integrated circuits. Electronics 38, 114–117 (1965)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Ino, F., Matsui, M., Goda, K., Hagihara, K. (2005). Performance Study of LU Decomposition on the Programmable GPU. In: Bader, D.A., Parashar, M., Sridhar, V., Prasanna, V.K. (eds) High Performance Computing – HiPC 2005. HiPC 2005. Lecture Notes in Computer Science, vol 3769. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11602569_13
DOI: https://doi.org/10.1007/11602569_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30936-9
Online ISBN: 978-3-540-32427-0