A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems

Tumeo, Antonino; Gawande, Nitin; Villa, Oreste

doi:10.1007/978-3-319-06548-9_5

Abstract

This chapter presents the implementation of a batched CUDA solver based on LU factorization for small linear systems. The solver may be used in applications such as reactive flow transport models, which apply the Newton–Raphson technique to linearize and iteratively solve the sets of nonlinear equations that represent the reactions at tens of thousands to millions of physical locations. The implementation exploits somewhat counterintuitive GPGPU programming techniques: it assigns the solution of a matrix (representing a system) to a single CUDA thread, does not exploit shared memory, and employs dynamic memory allocation on the GPU. These techniques enable our implementation to simultaneously solve sets of systems with over 100 equations and to employ LU decomposition with complete pivoting, providing the higher numerical accuracy required by certain applications. Other currently available batched linear solvers are limited in system size and support only partial pivoting, although they may be faster under certain conditions. We discuss the code of our implementation and compare it with the other implementations, examining the tradeoffs in performance and flexibility. This work will enable developers who need batched linear solvers to choose the implementation best suited to the features and requirements of their applications, and even to implement dynamic switching approaches that select the best implementation depending on the input data.
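
To make the per-system work concrete, the following is a minimal CPU-side sketch in C++ of LU-style elimination with complete pivoting applied to a batch of small systems. It is not the chapter's CUDA code. The function names (`solve_complete_pivoting`, `solve_batch`), the row-major back-to-back storage layout, and the use of `double` are all assumptions made for illustration. Each iteration of the batch loop touches only its own matrix, which is the role a single CUDA thread would play in the one-thread-per-system mapping described above. The per-system `std::vector` scratch buffers loosely stand in for per-thread dynamic allocation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: solve one dense n x n system A x = b by elimination
// with complete pivoting. A is row-major and is overwritten with the
// factors; on success b is overwritten with the solution x.
// Returns false if the matrix is exactly singular.
bool solve_complete_pivoting(double* A, double* b, std::size_t n) {
    // col[j] = original index of the unknown currently held in column j.
    std::vector<std::size_t> col(n);
    for (std::size_t j = 0; j < n; ++j) col[j] = j;

    for (std::size_t k = 0; k < n; ++k) {
        // Complete pivoting: search the whole trailing submatrix.
        std::size_t pr = k, pc = k;
        double best = 0.0;
        for (std::size_t i = k; i < n; ++i)
            for (std::size_t j = k; j < n; ++j)
                if (std::fabs(A[i * n + j]) > best) {
                    best = std::fabs(A[i * n + j]);
                    pr = i;
                    pc = j;
                }
        if (best == 0.0) return false;  // no usable pivot left

        if (pr != k) {  // row swap, applied to b as well
            for (std::size_t j = 0; j < n; ++j)
                std::swap(A[k * n + j], A[pr * n + j]);
            std::swap(b[k], b[pr]);
        }
        if (pc != k) {  // column swap, recorded in col
            for (std::size_t i = 0; i < n; ++i)
                std::swap(A[i * n + k], A[i * n + pc]);
            std::swap(col[k], col[pc]);
        }
        // Eliminate below the pivot; multipliers (L) are stored in place.
        for (std::size_t i = k + 1; i < n; ++i) {
            const double m = A[i * n + k] / A[k * n + k];
            A[i * n + k] = m;
            for (std::size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }

    // Back substitution on the upper triangle (U y = b).
    std::vector<double> y(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = b[k];
        for (std::size_t j = k + 1; j < n; ++j) s -= A[k * n + j] * y[j];
        y[k] = s / A[k * n + k];
    }
    // Undo the column permutation so x is in the original unknown order.
    for (std::size_t j = 0; j < n; ++j) b[col[j]] = y[j];
    return true;
}

// Hypothetical batch driver: systems are stored back to back. On a GPU,
// each iteration of this loop would correspond to one thread.
// Returns the number of systems solved successfully.
std::size_t solve_batch(std::vector<double>& As, std::vector<double>& bs,
                        std::size_t n, std::size_t batch) {
    assert(As.size() == batch * n * n && bs.size() == batch * n);
    std::size_t solved = 0;
    for (std::size_t s = 0; s < batch; ++s)
        if (solve_complete_pivoting(&As[s * n * n], &bs[s * n], n)) ++solved;
    return solved;
}
```

As a usage illustration, calling `solve_batch` with `n = 2`, `batch = 2`, matrices `{2,1,1,3}` and `{0,1,1,0}`, and right-hand sides `{3,5}` and `{2,3}` overwrites the right-hand sides with the two solutions. The second matrix has a zero in its leading position, so it can only be solved after a pivot exchange. Searching the full trailing submatrix costs more per step than a column-only search. That is one side of the accuracy versus speed tradeoff the abstract mentions.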

Author information

Authors and Affiliations

Pacific Northwest National Laboratory, Richland, WA, USA
Antonino Tumeo & Nitin Gawande
NVIDIA, Santa Clara, CA, USA
Oreste Villa

Corresponding author

Correspondence to Antonino Tumeo.

Editor information

Editors and Affiliations

National Center for Supercomputing Applications, University of Illinois, Urbana, Illinois, USA
Volodymyr Kindratenko

About this chapter

Cite this chapter

Tumeo, A., Gawande, N., Villa, O. (2014). A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems. In: Kindratenko, V. (eds) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_5

DOI: https://doi.org/10.1007/978-3-319-06548-9_5
Published: 09 June 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06547-2
Online ISBN: 978-3-319-06548-9
eBook Packages: Computer Science, Computer Science (R0)
