DOI: 10.1145/3079079.3079085

Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU

Published: 14 June 2017

Abstract

We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses a reduced bit length for addressing entries, which yields a smaller memory footprint than standard formats. Executing algorithms on a hierarchical structure on the GPU usually entails significant synchronization and management overhead, or slowdowns due to diverging execution paths and memory access patterns. We address these issues by means of a dynamic scheduling strategy specifically designed for executing algorithms on top of a hierarchical matrix on the GPU. An evaluation of our implementation of basic linear algebra routines suggests that our hierarchical format is competitive with highly optimized standard libraries and significantly outperforms them for transpose matrix operations. These results point towards the viability of hierarchical matrix formats on massively parallel devices such as the GPU.
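The abstract does not spell out the data layout, but the "reduced bit length for addressing" idea can be illustrated with a generic two-level blocked format. The sketch below is an illustration under our own assumptions, not the paper's actual HiSparse structure; all type and field names are hypothetical. With 256×256 tiles, a local (row, column) coordinate fits in 16 bits, instead of the 64 bits a global 32-bit coordinate pair would take in a plain COO format.

```cpp
// Hypothetical two-level blocked sparse layout (illustration only, not HiSparse).
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t TILE = 256;  // tile side length; 8 bits per local coordinate

struct Tile {
    uint32_t tileRow, tileCol;         // position of the tile in the tile grid
    std::vector<uint16_t> localCoord;  // packed (8-bit local row << 8) | 8-bit local col
    std::vector<double> values;        // nonzero values, same order as the coordinates
};

struct BlockedSparseMatrix {
    uint32_t rows, cols;
    std::vector<Tile> tiles;           // only tiles that contain nonzeros are stored
};

// Recover the global coordinates of the i-th entry of a tile.
inline void globalCoord(const Tile& t, std::size_t i, uint32_t& r, uint32_t& c) {
    r = t.tileRow * TILE + (t.localCoord[i] >> 8);
    c = t.tileCol * TILE + (t.localCoord[i] & 0xFFu);
}
```

The scheduling side is likewise only hinted at in the abstract. A common GPU pattern for irregular, hierarchical workloads is a persistent kernel that drains a global work queue, so that whole warps pick up one task at a time and divergence stays bounded. This CUDA sketch is generic background under stated assumptions, not the paper's scheduler.

```cpp
// Generic persistent-threads work queue (background sketch, not the paper's scheduler).
#include <cstdint>

struct Task { uint32_t tileIndex; };  // hypothetical unit of work: one tile

__global__ void drainQueue(const Task* queue, unsigned* head, unsigned count) {
    // Assumes blockDim.x is a multiple of the warp size (32).
    for (;;) {
        unsigned idx = 0;
        if (threadIdx.x % 32 == 0)               // lane 0 claims the next task
            idx = atomicAdd(head, 1u);
        idx = __shfl_sync(0xFFFFFFFFu, idx, 0);  // broadcast to the whole warp
        if (idx >= count) return;                // queue exhausted, warp retires
        // ... process queue[idx] here, e.g. operate on one matrix tile ...
    }
}
```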


Cited By

  • (2021) Are van Emde Boas trees viable on the GPU? 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. DOI: 10.1109/HPEC49654.2021.9622837
  • (2020) Analysis of Schedule and Layout Tuning for Sparse Matrices With Compound Entries on GPUs. Computer Graphics Forum, vol. 39, no. 6, pp. 133-143. DOI: 10.1111/cgf.13957
  • (2018) Regularizing irregularity. Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), pp. 1-8. DOI: 10.1145/3210259.3210263
  • (2018) On Dynamic Scheduling for the GPU and its Applications in Computer Graphics and Beyond. IEEE Computer Graphics and Applications, vol. 38, no. 3, pp. 119-130. DOI: 10.1109/MCG.2018.032421659
  • (2018) TTLG - An Efficient Tensor Transposition Library for GPUs. 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 578-588. DOI: 10.1109/IPDPS.2018.00067

Published In

ICS '17: Proceedings of the International Conference on Supercomputing
June 2017
300 pages
ISBN: 978-1-4503-5020-4
DOI: 10.1145/3079079
  • General Chairs: William D. Gropp, Pete Beckman
  • Program Chairs: Zhiyuan Li, Francisco J. Cazorla

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPU
  2. hierarchical
  3. linear algebra
  4. sparse matrix

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
