Abstract
High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications, which typically spend most of their time in a set of common mathematical computations. To take full advantage of high-performance processors, these functions must be parallelized and intensively optimized. Processor vendors commonly supply highly optimized commercial math libraries; for example, Intel maintains oneMKL, and NVIDIA provides cuBLAS, cuSolver, and cuFFT. In this paper, we release a new-generation high-performance extended math library, xMath 2.0, specifically designed for the SW26010-Pro many-core processor. It comprises four major modules: BLAS, LAPACK, FFT, and SPARSE. Each module is optimized for the domestic SW26010-Pro processor, leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping. In xMath 2.0, the BLAS module achieves an average speedup of 146.02x over the MPE version of GotoBLAS2, and the BLAS level 3 functions are 393.95x faster. The LAPACK module (calling xMath BLAS) outperforms LAPACK calling GotoBLAS2 by 233.44x, and the FFT module is 47.63x faster than FFTW 3.3.2. The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer, where it is used by dozens of users.
Change history
07 December 2022
A Correction to this paper has been published: https://doi.org/10.1007/s42514-022-00130-y
Acknowledgements
The authors would like to thank the manufacturer of the Sunway many-core processors and the Pilot National Laboratory for Marine Science and Technology (Qingdao) for providing the resources and facilities. This work was supported in part by the Special Project on High-Performance Computing under the National Key R&D Program (2020YFB0204601).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
The original online version of this article was revised: unfortunately, two author names were deleted during the typesetting process. The authors WanWang Yin and Xinhui Yuan have been added to the author group.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, F., Ma, W., Zhao, Y. et al. xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor. CCF Trans. HPC 5, 56–71 (2023). https://doi.org/10.1007/s42514-022-00126-8