Abstract
Sparse operators, i.e., operators that take sparse tensors as input, are of great importance in deep learning models. Due to the diverse sparsity patterns across sparse tensors, it is challenging to optimize a sparse operator by seeking the optimal sparse format, i.e., the one yielding the lowest operator latency. Existing works decompose a sparse tensor into several parts and search for a hybrid of sparse formats to handle the diverse sparsity patterns. However, they trade search space against search time: their search spaces are limited in some cases, which caps the operator efficiency they can achieve. In this paper, we extend the search space in breadth (via flexible sparse tensor transformations) and in depth (via multi-level decomposition). We formally define the multi-level sparse format decomposition problem, which is NP-hard, and propose STile, a framework that solves it. To search efficiently, STile uses a greedy algorithm guided by a cost model that estimates the latency of computing a sub-task of the original operator after the sparse tensor is decomposed. Experiments on two common kinds of sparse operators, SpMM and SDDMM, cover various sparsity patterns: we achieve 2.1-18.0× speedup over cuSPARSE on SpMM and 1.5-6.9× speedup over DGL on SDDMM. The search time is under one hour for every tested sparse operator and can be amortized.
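To make the idea of hybrid-format decomposition concrete, the following is a minimal sketch, not STile's actual implementation: a sparse matrix is partitioned into tiles, and a toy cost model chooses per tile whether a dense sub-task (touching every cell of the tile) or a COO sub-task (touching only stored nonzeros, with indexing overhead) is predicted to be cheaper for SpMM. The function name, tile size, and cost constants are illustrative assumptions.

```python
def spmm_hybrid(A, shape, B, block=2):
    """Compute C = A @ B where A is sparse (dict {(i, j): value}) and B is a
    dense list-of-lists. A is decomposed into block x block tiles; each tile
    is executed either as a dense sub-task or as a COO sub-task, whichever a
    toy cost model predicts to be cheaper."""
    n_rows, n_cols = shape
    k = len(B[0])
    C = [[0.0] * k for _ in range(n_rows)]

    # Group the nonzeros of A into block x block tiles.
    tiles = {}
    for (i, j), v in A.items():
        tiles.setdefault((i // block, j // block), {})[(i, j)] = v

    for (bi, bj), nnz in tiles.items():
        # Toy cost model: a dense tile pays for every cell of the tile;
        # a COO tile pays per nonzero plus per-nonzero index handling.
        dense_cost = block * block
        coo_cost = 2 * len(nnz)
        if dense_cost <= coo_cost:
            # Dense sub-task: iterate the whole tile, zeros included.
            for i in range(bi * block, min((bi + 1) * block, n_rows)):
                for j in range(bj * block, min((bj + 1) * block, n_cols)):
                    v = nnz.get((i, j), 0.0)
                    if v:
                        for c in range(k):
                            C[i][c] += v * B[j][c]
        else:
            # COO sub-task: iterate only the stored nonzeros.
            for (i, j), v in nnz.items():
                for c in range(k):
                    C[i][c] += v * B[j][c]
    return C
```

On a real GPU the dense sub-tasks map naturally to tensor-core-friendly tiles while the scattered remainder stays in a nonzero-oriented format; the sketch only shows the per-tile format choice that a cost-model-guided greedy search would make.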
Index Terms
- STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically