Towards efficient tile low-rank GEMM computation on sunway many-core processors

Han, Qingchang; Yang, Hailong; Dun, Ming; Luan, Zhongzhi; Gan, Lin; Yang, Guangwen; Qian, Depei

doi:10.1007/s11227-020-03444-2

Published: 15 October 2020

Volume 77, pages 4533–4564 (2021)

The Journal of Supercomputing

Qingchang Han¹,
Hailong Yang ORCID: orcid.org/0000-0003-1101-7927^1,2,
Ming Dun³,
Zhongzhi Luan¹,
Lin Gan⁴,
Guangwen Yang⁴ &
…
Depei Qian¹

Abstract

Tile low-rank general matrix multiplication (TLR GEMM) is a novel method of matrix multiplication on large data-sparse matrices that can significantly reduce storage footprint and arithmetic complexity for a given accuracy. To implement high-performance TLR GEMM on the Sunway many-core processor, two challenges remain to be addressed: 1) designing an efficient parallel scheme; 2) providing an efficient kernel library of the math functions commonly used in TLR GEMM. This paper proposes swTLR GEMM, an efficient implementation of TLR GEMM. We assign each LR GEMM computation to a single computing processing element (CPE) and use a grouped task queue to process the different data tiles of the TLR matrix. Moreover, we implement an efficient kernel library (swLR Kernels) for low-rank matrix operations. To scale to a large number of core groups (CGs), we organize the CGs into a CG grid and partition the matrices into blocks accordingly. We also apply Cannon’s algorithm to enable efficient communication when processing the matrix blocks across CGs simultaneously. The experimental results show that the DGEMM kernel in swLR Kernels achieves a 102\(\times\) speedup on average. In terms of overall performance, swTLR GEMM-LLD and swTLR GEMM-LLL achieve 91\(\times\) and 20.1\(\times\) speedups on average, respectively. In addition, our implementation of swTLR GEMM exhibits good scalability when running on 1,024 CGs of Sunway processors (66,560 cores in total).
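The two ideas named in the abstract, multiplying tiles kept in low-rank form and moving blocks with Cannon’s algorithm, can be illustrated with a small sketch. It is not the swTLR GEMM implementation. It assumes each tile is stored as a pair of factors (U, V) with tile = U Vᵀ, accumulates the result tiles densely, and simulates the q × q grid inside one process by rotating Python lists where a real system would communicate between CGs. All function names (`lr_tile_product`, `cannon`, `demo`) are hypothetical.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]
LowRankTile = Tuple[Matrix, Matrix]  # (U, V) with tile = U @ V^T (assumed layout)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense product of two row-major matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def lr_tile_product(a: LowRankTile, b: LowRankTile) -> Matrix:
    """Dense result of (Ua Va^T)(Ub Vb^T), formed via the small core Va^T Ub."""
    (ua, va), (ub, vb) = a, b
    core = matmul(transpose(va), ub)  # k_a x k_b, cheap when ranks are small
    return matmul(matmul(ua, core), transpose(vb))


def cannon(a_blocks, b_blocks, q: int,
           multiply: Callable, zero: Callable[[], Matrix]):
    """Cannon's algorithm on a simulated q x q grid of blocks."""
    # Initial skew: row i of A moves left by i, column j of B moves up by j.
    a = [[a_blocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[b_blocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[zero() for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Every grid position multiplies the pair of blocks it currently holds.
        for i in range(q):
            for j in range(q):
                c[i][j] = add(c[i][j], multiply(a[i][j], b[i][j]))
        # Shift A one step left and B one step up (stand-in for communication).
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c


def assemble(blocks: List[List[Matrix]]) -> Matrix:
    """Glue a grid of dense tiles back into one dense matrix."""
    return [sum((tile[r] for tile in row), [])
            for row in blocks for r in range(len(row[0]))]


def demo():
    """Compare the tiled low-rank product with a dense reference product."""
    q, n = 2, 2  # 2 x 2 grid of 2 x 2 tiles, every tile of rank 1
    a_lr = [[([[float(i + 1)], [float(j + 2)]], [[1.0], [float(i + j)]])
             for j in range(q)] for i in range(q)]
    b_lr = [[([[float(j + 1)], [float(i)]], [[float(i + 2)], [1.0]])
             for j in range(q)] for i in range(q)]
    c = cannon(a_lr, b_lr, q, lr_tile_product,
               lambda: [[0.0] * n for _ in range(n)])

    def dense(grid):
        return assemble([[matmul(u, transpose(v)) for (u, v) in row]
                         for row in grid])

    return assemble(c), matmul(dense(a_lr), dense(b_lr))
```

Calling `demo()` returns the tiled result and a dense reference computed from the expanded tiles. The point of `lr_tile_product` is that the only product between the two operands is the small core Vaᵀ Ub, whose size depends on the ranks and not on the tile dimension.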


Acknowledgements

The authors would like to thank all anonymous reviewers for their insightful comments and suggestions. This work is supported by the National Key R&D Program of China (Grant No. 2020YFB150001), the National Natural Science Foundation of China (Grant No. 62072018), the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (Grant No. 2019A12), and the Center for High Performance Computing and System Simulation, Pilot National Laboratory for Marine Science and Technology (Qingdao). Hailong Yang is the corresponding author.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Beihang University, Beijing, 100191, China
Qingchang Han, Hailong Yang, Zhongzhi Luan & Depei Qian
State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, China
Hailong Yang
School of Cyber Science and Technology, Beihang University, Beijing, 100191, China
Ming Dun
Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Lin Gan & Guangwen Yang

Corresponding author

Correspondence to Hailong Yang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Han, Q., Yang, H., Dun, M. et al. Towards efficient tile low-rank GEMM computation on sunway many-core processors. J Supercomput 77, 4533–4564 (2021). https://doi.org/10.1007/s11227-020-03444-2

Accepted: 29 September 2020
Published: 15 October 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11227-020-03444-2
