DOI: 10.1145/3524059.3532369
Research article · Public Access

Dense dynamic blocks: optimizing SpMM for processors with vector and matrix units using machine learning techniques

Published: 28 June 2022

Abstract

Recent processors have been augmented with matrix-multiply units that operate on small matrices, creating a functional-unit-rich environment. These units have been successfully employed for dense matrix operations such as those found in the Basic Linear Algebra Subprograms (BLAS). In this work, we exploit these new matrix-multiply facilities to speed up sparse matrix-dense matrix multiplication (SpMM) for highly sparse matrices.
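To make the target operation concrete, the following is a minimal sketch of SpMM, computing C = A·B with a sparse A stored in CSR form and a dense B. It is purely illustrative; the paper's kernels are hand-tuned for POWER10 rather than built on SciPy, and the shapes and density below are arbitrary assumptions.

```python
# Minimal SpMM sketch: C = A @ B, with sparse A (CSR) and dense B.
# Shapes and density are arbitrary assumptions, not from the paper.
import numpy as np
import scipy.sparse as sp

m, k, n = 10_000, 10_000, 32          # A is m x k, B is k x n
A = sp.random(m, k, density=1e-4, format="csr", dtype=np.float64)
B = np.random.rand(k, n)

C = A @ B                             # the SpMM operation itself
print(C.shape)                        # (10000, 32)
```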
SpMM is hard to optimize. Sparsity patterns lead to highly irregular memory access behavior. Additionally, each sparse matrix has unique characteristics, making it difficult to find a single SpMM strategy that works well for all sparse matrices. The addition of matrix-multiply units makes this even more challenging.
In this paper, we address these challenges. First, we design Dense Dynamic Blocks (DDB), a method that exploits the new matrix units. DDB has two specialized versions: DDB-MM and DDB-HYB. DDB-MM is a strategy that uses only the matrix-multiply facilities. DDB-HYB is a hybrid approach that maximizes floating-point throughput by using both vector and matrix units. Furthermore, we design SpMM-OPT, a prediction mechanism that identifies the best SpMM strategy for a given sparse matrix and dense matrix pair. SpMM-OPT selects among vector-unit-oriented, matrix-unit-oriented, and hybrid strategies for the highest floating-point throughput while taking cache optimizations into account.
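As a rough illustration of the dense-block idea (our reading of the abstract, not the paper's actual DDB algorithm), the sketch below packs the nonzero columns of each small row block of A into a dense tile and multiplies it against the matching rows of B with a dense micro-GEMM, the role a matrix unit such as POWER10's MMA plays in a real kernel. The tile size, the packing scheme, and the function name spmm_dense_blocks are assumptions.

```python
# Hedged sketch of the dense-block idea: pack the nonzero columns of a
# small row block of A into a dense tile, then run a dense micro-GEMM
# against the matching rows of B. Tile size and packing are assumptions.
import numpy as np
import scipy.sparse as sp

def spmm_dense_blocks(A_csr, B, row_tile=4):
    m = A_csr.shape[0]
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for r0 in range(0, m, row_tile):
        r1 = min(r0 + row_tile, m)
        block = A_csr[r0:r1]                 # small sparse row block
        cols = np.unique(block.indices)      # columns holding any nonzero
        if cols.size == 0:
            continue                         # all-zero block: nothing to do
        tile = block[:, cols].toarray()      # packed dense tile
        C[r0:r1] = tile @ B[cols]            # dense micro-GEMM step
    return C

A = sp.random(256, 256, density=0.02, format="csr", dtype=np.float64)
B = np.random.rand(256, 8)
assert np.allclose(spmm_dense_blocks(A, B), A @ B)
```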
We experiment with 440 matrices from the well-known SuiteSparse matrix collection on a POWER10 system with vector and matrix units. We show that DDB-MM and DDB-HYB achieve floating-point throughput of up to 1.1 and 2.5 TFLOP/s on a POWER10 single-chip module for double- and single-precision SpMM, respectively. Our analysis also shows that SpMM-OPT effectively chooses the best SpMM strategy and achieves an average speedup of up to 2X over an optimized CSR baseline.
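The strategy-selection step can be pictured as a lightweight classifier mapping cheap sparsity features of a matrix to the kernel expected to be fastest, in the spirit of SpMM-OPT. In the sketch below the feature set, the strategy labels, and the training data are fabricated for illustration; this page does not describe the paper's actual predictor.

```python
# Hedged sketch of SpMM-OPT-style strategy selection: a decision tree
# maps cheap sparsity features to a kernel choice. Features, labels,
# and training data are fabricated here purely for illustration.
import numpy as np
import scipy.sparse as sp
from sklearn.tree import DecisionTreeClassifier

def features(A_csr):
    nnz_per_row = np.diff(A_csr.indptr)
    m, k = A_csr.shape
    return [m,                                # number of rows
            A_csr.nnz / m,                    # mean nonzeros per row
            float(nnz_per_row.std()),         # row-length irregularity
            A_csr.nnz / (m * k)]              # overall density

# Toy training set: each matrix labeled with the strategy assumed fastest.
mats = [sp.random(512, 512, density=d, format="csr", random_state=i)
        for i, d in enumerate((1e-4, 1e-3, 1e-2, 5e-2))]
X = [features(A) for A in mats]
y = ["csr-vector", "csr-vector", "ddb-hyb", "ddb-mm"]

model = DecisionTreeClassifier(max_depth=3).fit(X, y)
test = sp.random(512, 512, density=2e-2, format="csr", random_state=42)
print("predicted strategy:", model.predict([features(test)])[0])
```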




Published In
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022, 514 pages
ISBN: 9781450392815
DOI: 10.1145/3524059

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. IBM POWER10
2. SpMM
3. matrix-multiply assist
4. sparse matrix-matrix multiply

Qualifiers

• Research-article

Conference

ICS '22

Acceptance Rates

Overall Acceptance Rate: 629 of 2,180 submissions, 29%


Cited By

• (2025) LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Approach for Parallel SpMM on PEZY-SC3s. Algorithms and Architectures for Parallel Processing, 78-97. DOI: 10.1007/978-981-96-1551-3_7. Online publication date: 17-Feb-2025.
• (2024) PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures. Proceedings of the ACM on Measurement and Analysis of Computing Systems 8:3, 1-36. DOI: 10.1145/3700434. Online publication date: 13-Dec-2024.
• (2024) Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 1200-1217. DOI: 10.1145/3620665.3640427. Online publication date: 27-Apr-2024.
• (2024) HotTiles: Accelerating SpMM with Heterogeneous Accelerator Architectures. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1012-1028. DOI: 10.1109/HPCA57654.2024.00081. Online publication date: 2-Mar-2024.
• (2024) HA-SpMM: A Hybrid Feature-Based Adaptive SpMM Algorithm on GPU. 2024 China Automation Congress (CAC), 1231-1236. DOI: 10.1109/CAC63892.2024.10864726. Online publication date: 1-Nov-2024.
• (2023) SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM. Proceedings of the 50th Annual International Symposium on Computer Architecture, 1-15. DOI: 10.1145/3579371.3589054. Online publication date: 17-Jun-2023.