
A simple and efficient storage format for SIMD-accelerated SpMV


Abstract

SpMV (sparse matrix-vector multiplication) is an essential kernel in scientific computing and has drawn sustained attention from researchers worldwide. As matrix data continue to grow, efficient parallel SpMV algorithms have become an active research topic. The compressed storage format of a sparse matrix is a critical factor in computing performance: a well-designed format saves storage space and exploits the processor architecture to realize its full potential. This paper proposes a new sparse matrix storage format, CSR2 (Compressed Sparse Row 2). It is a single format suited to processor platforms with SIMD (Single Instruction, Multiple Data) vector units, and its format conversion is easy to implement with low overhead. We compared the CSR2-based SpMV algorithm with the state-of-the-art single format CSR5 (Compressed Sparse Row 5) and Intel MKL (Intel Math Kernel Library) on a mainstream high-performance Intel Xeon E5-2670 v3 CPU, using 48 matrices as a benchmark suite. Experimental results show that CSR2 achieves a remarkable performance improvement over both CSR5 and MKL: an average speedup of 1.401× (up to 1.861×) over CSR5 and an average speedup of 1.261× (up to 5.921×) over MKL. In practice, for applications with many SpMV iterations, CSR2 offers low-overhead format conversion and high-throughput computing performance.
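To ground the discussion, the sketch below shows a conventional CSR SpMV kernel (y = A·x) in C with an OpenMP outer loop and an AVX2 inner loop. It is background only and not the CSR2 kernel: the function and array names (spmv_csr, row_ptr, col_idx, val) are generic CSR conventions rather than identifiers from the CSR2 code base, whose actual layout and kernels are in the repository given in Note 1. The per-row gather on x visible here is the irregular-access bottleneck that SIMD-oriented single formats such as CSR5 and CSR2 are designed to mitigate.

/* Minimal sketch of a baseline CSR SpMV kernel (y = A*x); illustrative only,
 * not the CSR2 implementation. Compile with, e.g., gcc -O3 -fopenmp -march=haswell */
#include <immintrin.h>

void spmv_csr(int m, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < m; ++i) {
        int j = row_ptr[i];
        const int end = row_ptr[i + 1];
        __m256d acc = _mm256_setzero_pd();

        /* 4-wide AVX2 loop over the row's nonzeros; the gather from x is
         * the classic irregular-memory bottleneck of CSR-based SpMV. */
        for (; j + 4 <= end; j += 4) {
            __m256d v   = _mm256_loadu_pd(&val[j]);                      /* 4 nonzeros   */
            __m128i idx = _mm_loadu_si128((const __m128i *)&col_idx[j]); /* 4 column ids */
            __m256d xv  = _mm256_i32gather_pd(x, idx, 8);                /* gather x     */
            acc = _mm256_fmadd_pd(v, xv, acc);
        }

        /* Horizontal sum of the vector accumulator, then a scalar tail. */
        double tmp[4];
        _mm256_storeu_pd(tmp, acc);
        double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
        for (; j < end; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}

Because short rows leave vector lanes idle and long rows can dominate a thread, row-major kernels like this one often vectorize and load-balance poorly; the comparison against CSR5 and MKL in the abstract measures how much a redesigned layout recovers, at the cost of converting from the plain CSR arrays above.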


Notes

  1. The source code of CSR2 is downloadable at https://github.com/nulidangxueshen/CSR2.

  2. The source code of CSR5 is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5.

  3. The benchmark matrices are available from the SuiteSparse Matrix Collection at https://sparse.tamu.edu/.


Acknowledgements

The authors are grateful to the reviewers for their valuable comments, which have greatly improved the paper. This work was partially supported by the National Natural Science Foundation of China (Nos. 62062059 and 61962051) and the Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034).

Author information


Corresponding author

Correspondence to Jianqiang Huang.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bian, H., Huang, J., Dong, R. et al. A simple and efficient storage format for SIMD-accelerated SpMV. Cluster Comput 24, 3431–3448 (2021). https://doi.org/10.1007/s10586-021-03340-1
