DOI: 10.1145/3079079.3079085

Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU

Published: 14 June 2017

Abstract

We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses a reduced bit length for addressing entries, which yields a smaller memory footprint than standard formats. Executing algorithms on a hierarchical structure on the GPU usually entails significant synchronization and management overhead, or slowdowns due to diverging execution paths and memory access patterns. We address these issues by means of a dynamic scheduling strategy specifically designed for executing algorithms on top of a hierarchical matrix on the GPU. An evaluation of our implementation of basic linear algebra routines suggests that our hierarchical format is competitive with highly optimized standard libraries and significantly outperforms them for transpose matrix operations. These results point towards the viability of hierarchical matrix formats on massively parallel devices such as the GPU.
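The abstract does not spell out the data layout, but the "reduced bit length for addressing" idea can be illustrated with a generic two-level blocked format. The sketch below is an illustration under our own assumptions, not the paper's actual HiSparse structure; all type and field names are hypothetical. With 256×256 tiles, a local (row, column) coordinate fits in 16 bits, instead of the 64 bits a global 32-bit coordinate pair would take in a plain COO format.

```cpp
// Hypothetical two-level blocked sparse layout (illustration only, not HiSparse).
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t TILE = 256;  // tile side length; 8 bits per local coordinate

struct Tile {
    uint32_t tileRow, tileCol;         // position of the tile in the tile grid
    std::vector<uint16_t> localCoord;  // packed (8-bit local row << 8) | 8-bit local col
    std::vector<double> values;        // nonzero values, same order as the coordinates
};

struct BlockedSparseMatrix {
    uint32_t rows, cols;
    std::vector<Tile> tiles;           // only tiles that contain nonzeros are stored
};

// Recover the global coordinates of the i-th entry of a tile.
inline void globalCoord(const Tile& t, std::size_t i, uint32_t& r, uint32_t& c) {
    r = t.tileRow * TILE + (t.localCoord[i] >> 8);
    c = t.tileCol * TILE + (t.localCoord[i] & 0xFFu);
}
```

The scheduling side is likewise only hinted at in the abstract. A common GPU pattern for irregular, hierarchical workloads is a persistent kernel that drains a global work queue, so that whole warps pick up one task at a time and divergence stays bounded. This CUDA sketch is generic background under stated assumptions, not the paper's scheduler.

```cpp
// Generic persistent-threads work queue (background sketch, not the paper's scheduler).
#include <cstdint>

struct Task { uint32_t tileIndex; };  // hypothetical unit of work: one tile

__global__ void drainQueue(const Task* queue, unsigned* head, unsigned count) {
    // Assumes blockDim.x is a multiple of the warp size (32).
    for (;;) {
        unsigned idx = 0;
        if (threadIdx.x % 32 == 0)               // lane 0 claims the next task
            idx = atomicAdd(head, 1u);
        idx = __shfl_sync(0xFFFFFFFFu, idx, 0);  // broadcast to the whole warp
        if (idx >= count) return;                // queue exhausted, warp retires
        // ... process queue[idx] here, e.g. operate on one matrix tile ...
    }
}
```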


Cited By

  • (2021) Are van Emde Boas trees viable on the GPU? 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. DOI: 10.1109/HPEC49654.2021.9622837
  • (2020) Analysis of Schedule and Layout Tuning for Sparse Matrices With Compound Entries on GPUs. Computer Graphics Forum, vol. 39, no. 6, pp. 133-143. DOI: 10.1111/cgf.13957
  • (2018) Regularizing irregularity. Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), pp. 1-8. DOI: 10.1145/3210259.3210263
  • (2018) On Dynamic Scheduling for the GPU and its Applications in Computer Graphics and Beyond. IEEE Computer Graphics and Applications, vol. 38, no. 3, pp. 119-130. DOI: 10.1109/MCG.2018.032421659
  • (2018) TTLG - An Efficient Tensor Transposition Library for GPUs. 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 578-588. DOI: 10.1109/IPDPS.2018.00067

Published In

ICS '17: Proceedings of the International Conference on Supercomputing
June 2017
300 pages
ISBN: 978-1-4503-5020-4
DOI: 10.1145/3079079
  • General Chairs: William D. Gropp, Pete Beckman
  • Program Chairs: Zhiyuan Li, Francisco J. Cazorla

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU
  2. hierarchical
  3. linear algebra
  4. sparse matrix

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
