A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems

Abstract

This chapter presents the implementation of a batched CUDA solver based on LU factorization for small linear systems. This solver may be used in applications such as reactive flow transport models, which apply the Newton–Raphson technique to linearize and iteratively solve the sets of nonlinear equations that represent the reactions for tens of thousands to millions of physical locations. The implementation exploits somewhat counterintuitive GPGPU programming techniques: it assigns the solution of a matrix (representing a system) to a single CUDA thread, does not exploit shared memory, and employs dynamic memory allocation on the GPU. These techniques enable our implementation to simultaneously solve sets of systems with over 100 equations and to employ LU decomposition with complete pivoting, providing the higher numerical accuracy required by certain applications. Other currently available batched linear solvers are limited in system size and support only partial pivoting, although they may be faster under certain conditions. We discuss the code of our implementation and compare it with the other implementations, examining the tradeoffs in performance and flexibility. This work will enable developers who need batched linear solvers to choose the implementation most appropriate to the features and requirements of their applications, and even to implement dynamic switching approaches that select the best implementation depending on the input data.
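The one-matrix-per-thread scheme with complete pivoting can be illustrated with a minimal host-side sketch. This is not the chapter's code: it is plain C++ rather than a CUDA kernel, the names `solve_complete_pivoting` and `solve_batch` are hypothetical, and the row-major, back-to-back storage of the systems is an assumption. The `std::vector` workspaces stand in for the per-thread dynamic allocation mentioned in the abstract, and each iteration of the batch loop corresponds to the work a single GPU thread would perform on its own system.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve one dense n x n system A x = b using LU factorization with complete
// pivoting. A is row-major and is overwritten by its LU factors; b is
// overwritten by the solution x. Returns false if no nonzero pivot is found.
// Hypothetical helper: stands in for the work assigned to one thread.
bool solve_complete_pivoting(double* A, double* b, std::size_t n) {
    // Records the column swaps; assumed analogue of per-thread dynamic memory.
    std::vector<std::size_t> col(n);
    for (std::size_t i = 0; i < n; ++i) col[i] = i;

    for (std::size_t k = 0; k < n; ++k) {
        // Complete pivoting: search the whole trailing submatrix.
        std::size_t pr = k, pc = k;
        double best = 0.0;
        for (std::size_t i = k; i < n; ++i)
            for (std::size_t j = k; j < n; ++j) {
                double v = std::fabs(A[i * n + j]);
                if (v > best) { best = v; pr = i; pc = j; }
            }
        if (best == 0.0) return false;  // singular system

        if (pr != k) {  // row swap, mirrored on the right-hand side
            for (std::size_t j = 0; j < n; ++j) std::swap(A[k * n + j], A[pr * n + j]);
            std::swap(b[k], b[pr]);
        }
        if (pc != k) {  // column swap, which reorders the unknowns
            for (std::size_t i = 0; i < n; ++i) std::swap(A[i * n + k], A[i * n + pc]);
            std::swap(col[k], col[pc]);
        }
        // Eliminate below the pivot; forward substitution on b is fused in.
        for (std::size_t i = k + 1; i < n; ++i) {
            double m = A[i * n + k] / A[k * n + k];
            A[i * n + k] = m;  // store the multiplier (entry of L)
            for (std::size_t j = k + 1; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }

    // Back substitution in the permuted ordering of the unknowns.
    std::vector<double> y(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = b[k];
        for (std::size_t j = k + 1; j < n; ++j) s -= A[k * n + j] * y[j];
        y[k] = s / A[k * n + k];
    }
    // Undo the column permutation: position k holds original unknown col[k].
    for (std::size_t k = 0; k < n; ++k) b[col[k]] = y[k];
    return true;
}

// Solve `batch` independent systems stored back to back (assumed layout).
// Returns the number of systems reported as singular.
std::size_t solve_batch(double* As, double* bs, std::size_t n, std::size_t batch) {
    std::size_t failures = 0;
    for (std::size_t s = 0; s < batch; ++s)  // one iteration = one thread's work
        if (!solve_complete_pivoting(As + s * n * n, bs + s * n, n)) ++failures;
    return failures;
}
```

Because every system is handled independently and needs no cooperation between threads, the loop body maps directly onto a kernel indexed by the global thread ID. The price is a pivot search over the full trailing submatrix at each step, which is more work than the single-column search of partial pivoting.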

Author information

Corresponding author

Correspondence to Antonino Tumeo.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Tumeo, A., Gawande, N., Villa, O. (2014). A Flexible CUDA LU-Based Solver for Small, Batched Linear Systems. In: Kindratenko, V. (ed.) Numerical Computations with GPUs. Springer, Cham. https://doi.org/10.1007/978-3-319-06548-9_5

  • DOI: https://doi.org/10.1007/978-3-319-06548-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06547-2

  • Online ISBN: 978-3-319-06548-9

  • eBook Packages: Computer Science (R0)
