Abstract
High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications, which typically spend most of their time in a set of common mathematical computations. To take full advantage of high-performance processors, these functions must be parallelized and intensively optimized. Processor vendors commonly supply highly optimized commercial math libraries; for example, Intel maintains oneMKL, and NVIDIA provides cuBLAS, cuSolver, and cuFFT. In this paper, we release a new-generation high-performance extended math library, xMath 2.0, specifically designed for the SW26010-Pro many-core processor. It comprises four major modules: BLAS, LAPACK, FFT, and SPARSE. Each module is optimized for the domestic SW26010-Pro processor, leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping. In xMath 2.0, the BLAS module achieves an average speedup of 146.02x over the MPE version of GotoBLAS2, and the BLAS level 3 functions are 393.95x faster. The LAPACK module (calling xMath BLAS) outperforms LAPACK calling GotoBLAS2 by 233.44x, and the FFT module is 47.63x faster than FFTW 3.3.2. The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer, where it is used by dozens of users.
Change history
07 December 2022
A Correction to this paper has been published: https://doi.org/10.1007/s42514-022-00130-y
Acknowledgements
The authors would like to thank the manufacturer of the Sunway many-core processors and the Pilot National Laboratory for Marine Science and Technology (Qingdao) for providing the resources and facilities. This work was supported in part by the Special Project on High-Performance Computing under the National Key R&D Program (2020YFB0204601).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
The original online version of this article was revised: unfortunately, two author names were deleted during the typesetting process. The authors WanWang Yin and Xinhui Yuan have been added to the author group.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, F., Ma, W., Zhao, Y. et al. xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor. CCF Trans. HPC 5, 56–71 (2023). https://doi.org/10.1007/s42514-022-00126-8