
Fast algorithm for parallel solving inversion of large scale small matrices based on GPU

Published in: The Journal of Supercomputing

Abstract

Matrix inversion is time-consuming, and many works focus on accelerating the inversion of a single large matrix on a GPU. However, parallelizing the inversion of a large number of small matrices has received little attention, even though such workloads arise widely in computer science, for example in accelerating cryptographic and image processing algorithms. In this paper, we propose a Revised In-Place Inversion algorithm for inverting a large number of small matrices on the CUDA platform; it adopts a more refined parallelization scheme and outperforms other algorithms, achieving a speedup of up to 20.9572 times over the batched matrix inverse kernel in cuBLAS. Additionally, we found that each GPU device has an upper bound on the input data size it handles efficiently, beyond which performance degrades. Based on this finding, we propose the Saturation Size Curve, which divides the matrices into batches to improve performance. Experimental results show that this strategy increases the algorithm's performance by 1.75 times and effectively alleviates the performance degradation.
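The in-place scheme underlying such batched kernels can be illustrated with a short serial sketch. The snippet below is an illustrative Python reference, not the authors' CUDA implementation: it performs in-place Gauss-Jordan elimination (without pivoting, for clarity), overwriting each small matrix with its inverse, and loops over a batch of matrices in the way a GPU kernel would parallelize across thread blocks.

```python
def invert_in_place(a):
    """Overwrite the n x n matrix `a` (a list of row lists) with its
    inverse via in-place Gauss-Jordan elimination (no pivoting)."""
    n = len(a)
    for k in range(n):
        pivot = a[k][k]           # assumed nonzero (no pivoting here)
        a[k][k] = 1.0
        for j in range(n):        # scale the pivot row
            a[k][j] /= pivot
        for i in range(n):        # eliminate column k from every other row
            if i == k:
                continue
            f = a[i][k]
            a[i][k] = 0.0
            for j in range(n):
                a[i][j] -= f * a[k][j]
    return a

# A "batch" of independent small matrices -- the loop a batched GPU
# kernel would distribute over thread blocks, one matrix per block.
batch = [[[4.0, 7.0], [2.0, 6.0]],
         [[2.0, 0.0], [0.0, 5.0]]]
inverses = [invert_in_place(m) for m in batch]
```

In the paper's CUDA setting each matrix is instead handled in parallel on the device, and the Saturation Size Curve caps how many matrices are shipped per batch so the input size stays below the device's efficiency bound.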


Data availability

All data sets used in this paper were randomly generated with MATLAB, and there are no specific restrictions on the data sets.

Notes

  1. https://github.com/XFastDataLab/inverse_simple.



Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61673186 and 61972010) and the Natural Science Foundation of Fujian Province, China (No. 2021J01317). We sincerely acknowledge this financial support.

Author information

Authors and Affiliations

Authors

Contributions

XJ designed the main algorithm, drafted the manuscript, analyzed and interpreted the experiments, and prepared the figures. YC, WF, YZ, and JD revised the manuscript critically for important intellectual content.

Corresponding author

Correspondence to Chen Yewang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest that might be perceived to influence the results or discussion reported in this paper.

Ethical approval

Ethical approval is not applicable to this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xuebin, J., Yewang, C., Wentao, F. et al. Fast algorithm for parallel solving inversion of large scale small matrices based on GPU. J Supercomput 79, 18313–18339 (2023). https://doi.org/10.1007/s11227-023-05336-7
