
A simple and efficient storage format for SIMD-accelerated SpMV


Abstract

SpMV (sparse matrix-vector multiplication) is an essential kernel in scientific computing and has drawn sustained attention from researchers worldwide. As matrix data continue to grow, efficient parallel SpMV algorithms have become an active research topic. The compressed storage format of a sparse matrix is a critical factor in computing performance: a well-designed format saves storage space and exploits the processor architecture to realize its full potential. This paper proposes a new sparse matrix storage format, CSR2 (Compressed Sparse Row 2). It is a single format suited to processor platforms with SIMD (Single Instruction, Multiple Data) vector units, and its format conversion is easy to implement with low overhead. We compared the CSR2-based SpMV algorithm with the state-of-the-art single format CSR5 (Compressed Sparse Row 5) and Intel MKL (Intel Math Kernel Library) on a mainstream high-performance Intel Xeon E5-2670 v3 CPU, using 48 matrices as a benchmark suite. Experimental results show that CSR2 achieves a remarkable performance improvement over both CSR5 and MKL: an average speedup of 1.401× (up to 1.861×) over CSR5 and an average speedup of 1.261× (up to 5.921×) over MKL. In practice, for applications with many SpMV iterations, CSR2 offers low-overhead format conversion and high-throughput computing performance.
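To ground the discussion, the sketch below shows a conventional CSR SpMV kernel (y = A·x) in C with an OpenMP outer loop and an AVX2 inner loop. It is background only and not the CSR2 kernel: the function and array names (spmv_csr, row_ptr, col_idx, val) are generic CSR conventions rather than identifiers from the CSR2 code base, whose actual layout and kernels are in the repository given in Note 1. The per-row gather on x visible here is the irregular-access bottleneck that SIMD-oriented single formats such as CSR5 and CSR2 are designed to mitigate.

/* Minimal sketch of a baseline CSR SpMV kernel (y = A*x); illustrative only,
 * not the CSR2 implementation. Compile with, e.g., gcc -O3 -fopenmp -march=haswell */
#include <immintrin.h>

void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < m; ++i) {
        int j = row_ptr[i];
        const int end = row_ptr[i + 1];
        __m256d acc = _mm256_setzero_pd();

        /* 4-wide AVX2 loop over the row's nonzeros; the gather from x is
         * the classic irregular-memory bottleneck of CSR-based SpMV. */
        for (; j + 4 <= end; j += 4) {
            __m256d v   = _mm256_loadu_pd(&val[j]);                      /* 4 nonzeros   */
            __m128i idx = _mm_loadu_si128((const __m128i *)&col_idx[j]); /* 4 column ids */
            __m256d xv  = _mm256_i32gather_pd(x, idx, 8);                /* gather x     */
            acc = _mm256_fmadd_pd(v, xv, acc);
        }

        /* Horizontal sum of the vector accumulator, then a scalar tail. */
        double tmp[4];
        _mm256_storeu_pd(tmp, acc);
        double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        for (; j < end; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

Because short rows leave vector lanes idle and long rows can dominate a thread, row-major kernels like this one often vectorize and load-balance poorly; the comparison against CSR5 and MKL in the abstract measures how much a redesigned layout recovers, at the cost of converting from the plain CSR arrays above.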


Notes

  1. The source code of CSR2 is downloadable at https://github.com/nulidangxueshen/CSR2.

  2. The source code of CSR5 is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5.

  3. The benchmark matrices are available from the SuiteSparse Matrix Collection at https://sparse.tamu.edu/.


Acknowledgements

The authors are grateful to the reviewers for their valuable comments, which have greatly improved the paper. This work was partially supported by the National Natural Science Foundation of China (Nos. 62062059 and 61962051) and the Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034).

Author information


Corresponding author

Correspondence to Jianqiang Huang.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bian, H., Huang, J., Dong, R. et al. A simple and efficient storage format for SIMD-accelerated SpMV. Cluster Comput 24, 3431–3448 (2021). https://doi.org/10.1007/s10586-021-03340-1
