
Fast algorithm for parallel solving inversion of large scale small matrices based on GPU

Published in: The Journal of Supercomputing

Abstract

Matrix inversion is time-consuming, and many works focus on accelerating the inversion of a single large matrix on a GPU. However, parallelizing the inversion of a large number of small matrices has received little attention, even though such workloads arise widely in computer science, for example in accelerating cryptographic and image processing algorithms. In this paper, we propose a Revised In-Place Inversion algorithm for inverting a large number of small matrices on the CUDA platform; it adopts a more refined parallelization scheme and outperforms other algorithms, achieving a speedup of up to 20.9572 times over the batched matrix inverse kernel in cuBLAS. Additionally, we found that each GPU device has an upper bound on the input data size it handles efficiently, beyond which performance degrades. Based on this finding, we propose the Saturation Size Curve, which divides the matrices into batches to improve performance. Experimental results show that this strategy increases the algorithm's performance by 1.75 times and effectively alleviates the performance degradation.
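The in-place scheme underlying such batched kernels can be illustrated with a short serial sketch. The snippet below is an illustrative Python reference, not the authors' CUDA implementation: it performs in-place Gauss-Jordan elimination (without pivoting, for clarity), overwriting each small matrix with its inverse, and loops over a batch of matrices in the way a GPU kernel would parallelize across thread blocks.

```python
def invert_in_place(a):
    """Overwrite the n x n matrix `a` (a list of row lists) with its
    inverse via in-place Gauss-Jordan elimination (no pivoting)."""
    n = len(a)
    for k in range(n):
        pivot = a[k][k]           # assumed nonzero (no pivoting here)
        a[k][k] = 1.0
        for j in range(n):        # scale the pivot row
            a[k][j] /= pivot
        for i in range(n):        # eliminate column k from every other row
            if i == k:
                continue
            f = a[i][k]
            a[i][k] = 0.0
            for j in range(n):
                a[i][j] -= f * a[k][j]
    return a

# A "batch" of independent small matrices -- the loop a batched GPU
# kernel would distribute over thread blocks, one matrix per block.
batch = [[[4.0, 7.0], [2.0, 6.0]],
         [[2.0, 0.0], [0.0, 5.0]]]
inverses = [invert_in_place(m) for m in batch]
```

In the paper's CUDA setting each matrix is instead handled in parallel on the device, and the Saturation Size Curve caps how many matrices are shipped per batch so the input size stays below the device's efficiency bound.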


Data availability

All data sets used in this paper were randomly generated with MATLAB, and there are no specific restrictions on the data sets.

Notes

  1. https://github.com/XFastDataLab/inverse_simple.



Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61673186 and 61972010) and the Natural Science Foundation of Fujian Province, China (No. 2021J01317). We sincerely acknowledge this financial support.

Author information

Authors and Affiliations

Authors

Contributions

XJ designed the main algorithm, drafted the manuscript, analyzed and interpreted the experiments, and prepared the figures. YC, WF, YZ, and JD revised the manuscript critically for important intellectual content.

Corresponding author

Correspondence to Chen Yewang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest that might be perceived to influence the results or discussion reported in this paper.

Ethical approval

Ethical approval is not applicable to this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xuebin, J., Yewang, C., Wentao, F. et al. Fast algorithm for parallel solving inversion of large scale small matrices based on GPU. J Supercomput 79, 18313–18339 (2023). https://doi.org/10.1007/s11227-023-05336-7
