
STile: Searching Hybrid Sparse Formats for Sparse Deep Learning Operators Automatically

Published: 26 March 2024

Abstract

Sparse operators, i.e., operators that take sparse tensors as input, are of great importance in deep learning models. Because different sparse tensors exhibit diverse sparsity patterns, it is challenging to optimize a sparse operator by selecting a single sparse format that yields the lowest operator latency. Existing works therefore decompose a sparse tensor into several parts and search for a hybrid of sparse formats to handle the diverse patterns. However, they trade search space for search time: their search spaces are limited in some cases, which caps the operator efficiency they can achieve. In this paper, we extend the search space in both breadth (through flexible sparse tensor transformations) and depth (by enabling multi-level decomposition). We formally define the multi-level sparse format decomposition problem, which is NP-hard, and propose the framework STile to solve it. To search efficiently, STile uses a greedy algorithm guided by a cost model that estimates the latency of computing a sub-task of the original operator after the sparse tensor is decomposed. Experiments on two common kinds of sparse operators, SpMM and SDDMM, over various sparsity patterns show 2.1-18.0x speedup over cuSPARSE on SpMM and 1.5-6.9x speedup over DGL on SDDMM. The search time is under one hour for every tested sparse operator and can be amortized.
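
For context, here is a minimal sketch of what the two benchmarked operators compute, written with NumPy and SciPy on the CPU purely for illustration. It is not STile's GPU implementation or its hybrid-format search; the matrix shapes, density, and variable names are arbitrary assumptions.

    # Illustrative CPU definitions of SpMM and SDDMM (not STile's GPU kernels).
    # Shapes, density, and names are arbitrary assumptions for this sketch.
    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)

    # A sparse matrix (e.g., a graph adjacency matrix) and two dense feature matrices.
    A = sp.random(1024, 1024, density=0.01, format="csr", random_state=0)
    B = rng.standard_normal((1024, 64))
    C = rng.standard_normal((1024, 64))

    # SpMM: sparse-dense matrix multiplication, Y = A @ B.
    Y = A @ B  # dense (1024, 64) result

    # SDDMM: sampled dense-dense matrix multiplication. The dense product C @ B.T
    # is evaluated only at the nonzero positions of A and scaled by A's values.
    A_coo = A.tocoo()
    samples = np.einsum("ij,ij->i", C[A_coo.row], B[A_coo.col])  # one dot product per nonzero of A
    S = sp.csr_matrix((A_coo.data * samples, (A_coo.row, A_coo.col)), shape=A.shape)

In both cases the work is determined by the nonzero structure of A, which is why the paper searches for a hybrid of sparse formats over different parts of the tensor rather than a single format.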



      • Published in

        Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 1, February 2024, 1874 pages
        EISSN: 2836-6573
        DOI: 10.1145/3654807
        Copyright © 2024 ACM


        Publisher: Association for Computing Machinery, New York, NY, United States

