ABSTRACT
Sparse matrix-vector multiplication (SpMV) is a fundamental kernel used by a large class of numerical algorithms. Emerging big-data and machine learning applications are propelling renewed interest in SpMV algorithms that can tackle massive amounts of unstructured data---rapidly approaching the terabyte range---with predictable, high performance. In this paper we describe a new methodology for designing SpMV algorithms for shared-memory multiprocessors (SMPs) that organizes the original SpMV computation into two distinct phases. In the first phase we build a scaled matrix that is reduced in the second phase, providing numerous opportunities to exploit memory locality. Using this methodology, we have designed two algorithms. Our experiments on irregular big-data matrices (an order of magnitude larger than the current state of the art) show quasi-optimal scaling on a large-scale POWER8 SMP system, with an average speedup of 3.8x over an equally optimized version of the CSR algorithm. In terms of absolute performance, with our implementation the POWER8 SMP system is comparable to a 256-node cluster. In terms of size, it can process matrices with up to 68 billion edges, an order of magnitude larger than state-of-the-art clusters.
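To make the two-phase structure concrete, the sketch below contrasts a baseline CSR SpMV with a naive two-phase variant in which phase 1 materializes the "scaled matrix" (each stored nonzero multiplied by its corresponding x entry) and phase 2 performs the per-row reduction. This is only an illustration of the idea stated in the abstract, under assumed CSR inputs (`row_ptr`, `col_idx`, `vals`); the paper's actual algorithms and data layouts are more elaborate.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Baseline CSR SpMV, y = A @ x, with scale and reduce fused per row."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[k] * x[col_idx[k]]
    return y

def spmv_two_phase(row_ptr, col_idx, vals, x):
    """Two-phase SpMV sketch: scale every nonzero, then reduce per row."""
    # Phase 1: one sweep over the nonzeros builds the "scaled matrix".
    # (The gather x[col_idx[k]] is the irregular access; separating it out
    # is what creates opportunities to reorganize for memory locality.)
    scaled = [vals[k] * x[col_idx[k]] for k in range(len(vals))]
    # Phase 2: each output entry is a reduction over a contiguous segment.
    return [sum(scaled[row_ptr[r]:row_ptr[r + 1]])
            for r in range(len(row_ptr) - 1)]
```

Both functions compute the same result; the point of the split is that phase 1's output can be laid out to turn phase 2 into streaming reductions, which is where the locality benefits described in the abstract come from.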