Abstract
Sparse operators, i.e., operators that take sparse tensors as input, are of great importance in deep learning models. Due to the diverse sparsity patterns across sparse tensors, it is challenging to optimize a sparse operator by seeking the optimal sparse format, i.e., the one yielding the lowest operator latency. Existing works decompose a sparse tensor into several parts and search for a hybrid of sparse formats to handle the diverse sparsity patterns. However, they trade search space against search time: their search spaces are limited in some cases, which caps the operator efficiency they can achieve. In this paper, we extend the search space in breadth (via flexible sparse tensor transformations) and in depth (via multi-level decomposition). We formally define the multi-level sparse format decomposition problem, which is NP-hard, and propose STile, a framework that solves it. To search efficiently, STile uses a greedy algorithm guided by a cost model that estimates the latency of computing a sub-task of the original operator after the sparse tensor is decomposed. Experiments on two common kinds of sparse operators, SpMM and SDDMM, cover various sparsity patterns: we achieve 2.1-18.0× speedup over cuSPARSE on SpMM and 1.5-6.9× speedup over DGL on SDDMM. The search time is under one hour for every tested sparse operator and can be amortized.
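To make the idea of hybrid-format decomposition concrete, the following is a minimal sketch, not STile's actual implementation: a sparse matrix is partitioned into tiles, and a toy cost model chooses per tile whether a dense sub-task (touching every cell of the tile) or a COO sub-task (touching only stored nonzeros, with indexing overhead) is predicted to be cheaper for SpMM. The function name, tile size, and cost constants are illustrative assumptions.

```python
def spmm_hybrid(A, shape, B, block=2):
    """Compute C = A @ B where A is sparse (dict {(i, j): value}) and B is a
    dense list-of-lists. A is decomposed into block x block tiles; each tile
    is executed either as a dense sub-task or as a COO sub-task, whichever a
    toy cost model predicts to be cheaper."""
    n_rows, n_cols = shape
    k = len(B[0])
    C = [[0.0] * k for _ in range(n_rows)]

    # Group the nonzeros of A into block x block tiles.
    tiles = {}
    for (i, j), v in A.items():
        tiles.setdefault((i // block, j // block), {})[(i, j)] = v

    for (bi, bj), nnz in tiles.items():
        # Toy cost model: a dense tile pays for every cell of the tile;
        # a COO tile pays per nonzero plus per-nonzero index handling.
        dense_cost = block * block
        coo_cost = 2 * len(nnz)
        if dense_cost <= coo_cost:
            # Dense sub-task: iterate the whole tile, zeros included.
            for i in range(bi * block, min((bi + 1) * block, n_rows)):
                for j in range(bj * block, min((bj + 1) * block, n_cols)):
                    v = nnz.get((i, j), 0.0)
                    if v:
                        for c in range(k):
                            C[i][c] += v * B[j][c]
        else:
            # COO sub-task: iterate only the stored nonzeros.
            for (i, j), v in nnz.items():
                for c in range(k):
                    C[i][c] += v * B[j][c]
    return C
```

On a real GPU the dense sub-tasks map naturally to tensor-core-friendly tiles while the scattered remainder stays in a nonzero-oriented format; the sketch only shows the per-tile format choice that a cost-model-guided greedy search would make.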
Index Terms
- STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically