
xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor

  • Regular Paper
  • Published in: CCF Transactions on High Performance Computing

A Publisher Correction to this article was published on 07 December 2022.

Abstract

High-performance extended math libraries are used by many scientific, engineering, and artificial intelligence applications; they provide the common mathematical computations that typically account for the most time-consuming parts of these workloads. To take full advantage of high-performance processors, these functions need to be parallelized and optimized intensively. It is common for processor vendors to supply highly optimized commercial math libraries: for example, Intel maintains oneMKL, and NVIDIA provides cuBLAS, cuSOLVER, and cuFFT. In this paper, we release a new-generation high-performance extended math library, xMath2.0, specifically designed for the SW26010-Pro many-core processor, which includes four major modules: BLAS, LAPACK, FFT, and SPARSE. Each module is optimized for the domestic SW26010-Pro processor, leveraging parallelization on the many-core CPE mesh and optimization techniques such as assembly instruction rearrangement and computation-communication overlapping. In xMath2.0, the BLAS module achieves an average speedup of 146.02x over the MPE version of GotoBLAS2, and the BLAS level-3 functions are 393.95x faster. The LAPACK module (calling xMath BLAS) is 233.44x faster than LAPACK calling GotoBLAS2, and the FFT module is 47.63x faster than FFTW 3.3.2. The library has been deployed on the domestic Sunway TaihuLight Pro supercomputer, where it is used by dozens of users.




References

  • Ali, A., Johnsson, L., Subhlok, J.: Scheduling FFT computation on SMP and multicore systems. In: Proceedings of the 21st Annual International Conference on Supercomputing, pp. 293–301 (2007)

  • Demmel, J., et al.: Communication-avoiding parallel and sequential QR factorizations. (2008)

  • Demmel, J., Grigori, L., Hoemmen, M., et al.: Communication-optimal parallel and sequential QR and LU factorizations: theory and practice (2008)

  • Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.S.: A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS) 16(1), 1–17 (1990)


  • Karakasis, G., et al.: An extended compression format for the optimization of sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems (2013)

  • Goto, K., Geijn, R.A.V.D.: Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software (TOMS) 34(3), 1–25 (2008)


  • Hu, Y., Chen, D.K., Yang, C., Liu, F.F., Ma, W.J., Yin, W.W., Yuan, X.H., Lin, R.F.: Many-core optimization of level 1 and level 2 BLAS routines on the new domestic SW26010-Pro processor. Ruan Jian Xue Bao/J. Software (2021) (in Chinese)

  • Dongarra, J., et al.: HPC programming on Intel many-integrated-core hardware with MAGMA port to Xeon Phi. Scient. Program. (2015)

  • Dongarra, J., Gates, M., Haidar, A., et al.: Accelerating Numerical Dense Linear Algebra Calculations with GPUs. Springer International Publishing (2014)

  • Jiang, L., Yang, C., Ao, Y., et al.: Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor. Int. Confer. Parallel Proc. IEEE (2017)

  • Liang, G., Li, X., Siegel, J.: An empirically tuned 2D and 3D FFT library on CUDA GPU, International Conference on Supercomputing DBLP (2010)

  • Liu, X., Smelyanskiy, M., Chow, E., et al.: Efficient sparse matrix-vector multiplication on x86-based many-core processors, Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ACM (2013)

  • Liu, W., Vinter, B.: CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication, The 29th ACM Int. Confer. Supercomput. (ICS ’15). ACM, (2015)

  • Liu, Y., et al.: Memory efficient two-pass 3D FFT algorithm for Intel Xeon Phi coprocessor. J. Comput. Sci. Technol. 29(6), 989–1002 (2014)


  • Liu, F., Yang, C., Yuan, X., Wu, C., Ao, Y.: A General SpMV Implementation in Many-Core Domestic Sunway 26010 Processor. J. Software 29(12), 3921–3932 (2018)


  • Liu, F., Chen, D., Yang, C., Zhao, Y.: Research on heterogeneous many-core fully-implicit solver for MHD dynamical equations. J. Numer. Methods Comput. Appl. 40(1), 34–50 (2019)


  • Püschel, M., et al.: SPIRAL: code generation for DSP transforms. Proc. IEEE 93(2), 232–275 (2005)


  • Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)


  • Monakov, A., Lokhmotov, A., Avetisyan, A.: Automatically tuning sparse matrix-vector multiplication for GPU architectures. In International Conference on High-Performance Embedded Architectures and Compilers pp. 111-125. Springer, Berlin, Heidelberg (2010)

  • Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Conference on High Performance Computing Networking. ACM (2009)

  • Nath, R., Tomov, S., Dongarra, J.: An improved MAGMA GEMM for Fermi graphics processing units. Int. J. High Perform. Comput. Appl. (2010)

  • Tomas, A., Bai, Z., Hernández, V.: Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors, International Conference on High Performance Computing for Computational Science. Springer, Berlin, Heidelberg, (2012)

  • Wang, Q., et al.: AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs. Storage & Analysis IEEE. High Perform. Comput. Netw. (2013)

  • Williams, S., Vuduc, R., Oliker, L., et al.: Optimizing sparse matrix-vector multiply on emerging multicore platforms. Parallel Computing 35(3), 178–194 (2009)

  • Wu, J., Jaja, J.: High performance FFT based poisson solver on a CPU-GPU heterogeneous platform. In: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, ser. IPDPS ’13, pp. 115–125. IEEE Computer Society, Washington, DC, USA (2013). https://doi.org/10.1109/IPDPS.2013.18

  • Yan, S., Li, C., et al.: YaSpMV: Yet another SpMV framework on GPUs, ACM SIGPLAN Notices (2014)

  • Zhao, Y., Ao, Y., Yang, C., Yin, W., Lin, R.: A general implementation of 1-d fft on the sunway 26010 processor. J. Software 31(10), 3184–3196 (2020)



Acknowledgements

The authors would like to thank the manufacturer of Sunway many-core processors and the Pilot National Laboratory for Marine Science and Technology (Qingdao) for the resources and site. This work was supported in part by the Special Project on High-Performance Computing under the National Key R&D Program (2020YFB0204601).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Fangfang Liu, Yuwen Zhao or Chao Yang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

The original online version of this article was revised: unfortunately, two author names were deleted during the typesetting process. The authors WanWang Yin and Xinhui Yuan have been added to the author group.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Liu, F., Ma, W., Zhao, Y. et al. xMath2.0: a high-performance extended math library for SW26010-Pro many-core processor. CCF Trans. HPC 5, 56–71 (2023). https://doi.org/10.1007/s42514-022-00126-8

