STCO: Enhancing Training Efficiency via Structured Sparse Tensor Compilation Optimization

Published: 09 November 2024

Abstract

Network sparsification is an effective technique for accelerating Deep Neural Network (DNN) inference. However, existing sparsification techniques that rely on structured sparsity often yield limited benefits, primarily because the sparse storage formats they depend on introduce significant memory and computational overhead during address generation and gradient updates. In addition, many of these solutions target only the inference phase and neglect the crucial training phase.
In this article, we introduce STCO, a Sparse Tensor Compilation Optimization technique that significantly enhances training efficiency through structured sparse tensor compilation. Central to STCO is the Tensorization-aware Index Entity (TIE) format, which represents structured sparse tensors compactly by eliminating redundant indices and minimizing storage overhead. The TIE format underpins the Address-Carry flow (AC flow) pass, which optimizes data layout at the computational-graph level and enables more compact and efficient sparse tensor storage. A shape inference pass then uses the AC flow to derive optimized tensor shapes, further refining the performance of sparse tensor operations. Moreover, the Address-Carry TIE flow dynamically tracks nonzero addresses, extending the benefits of sparse optimization to both forward and backward propagation; this allows a smooth transition to sparse tensor compilation without significant modifications to existing codebases. To further boost training performance, we implement an operator-level AC flow optimization pass tailored to structured sparse tensors, which generates efficient addresses and keeps the computational overhead of sparse tensor operations minimal. The flexibility of STCO allows it to be integrated into various frameworks and compilers, providing a robust solution for enhancing training efficiency with structured sparse tensors. Experiments show that STCO achieves speedups of 3.64×, 5.43×, 4.89×, and 3.91× over state-of-the-art sparse formats on VGG16, ResNet-18, MobileNetV1, and MobileNetV2, respectively. These results underscore the efficiency of the proposed approach in leveraging structured sparsity to accelerate DNN training.
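
To make the idea of compact index storage and cheap address regeneration concrete, the sketch below stores a structured sparse weight matrix as compressed values plus a per-group offset array from which full nonzero addresses are regenerated on the fly. This is a hypothetical NumPy illustration using 2:4 structured sparsity; it is not the article's TIE format or AC flow implementation, and the function names (compress_2to4, regenerate_addresses) are invented for this example.

# Illustrative sketch only: a hypothetical 2:4 structured-sparse encoding in
# NumPy, not the TIE format or AC flow pass described in the article.
import numpy as np

def compress_2to4(w):
    """Keep the 2 largest-magnitude values in every group of 4 columns.

    Returns (values, idx): 'values' holds the kept entries (half the columns
    of w); 'idx' holds their in-group offsets, the only index data stored.
    """
    rows, cols = w.shape
    assert cols % 4 == 0
    groups = w.reshape(rows, cols // 4, 4)
    idx = np.argsort(-np.abs(groups), axis=-1)[..., :2]   # top-2 per group
    idx.sort(axis=-1)                                      # ascending offsets
    values = np.take_along_axis(groups, idx, axis=-1)
    return values.reshape(rows, cols // 2), idx.reshape(rows, cols // 2)

def regenerate_addresses(idx):
    """Recompute full column addresses from the compact per-group offsets.

    column = 4 * group_id + offset, so no per-element coordinate list
    (as in COO/CSR) needs to be stored or updated during training.
    """
    _, half = idx.shape
    group_id = np.repeat(np.arange(half // 2), 2)[None, :]  # broadcast over rows
    return 4 * group_id + idx

# Usage: compress a weight matrix, then scatter it back to dense form.
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 8))
values, idx = compress_2to4(w)
cols = regenerate_addresses(idx)
dense = np.zeros_like(w)
np.put_along_axis(dense, cols, values, axis=1)   # pruned copy of w

At this toy scale, storing only in-group offsets mirrors the abstract's point that eliminating redundant indices keeps nonzero addresses cheap to track in both the forward and backward passes.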


    Published In

    ACM Transactions on Design Automation of Electronic Systems, Volume 30, Issue 1
    January 2025, 360 pages
    EISSN: 1557-7309
    DOI: 10.1145/3697150
    • Editor: Jiang Hu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 November 2024
    Online AM: 21 October 2024
    Accepted: 02 October 2024
    Revised: 29 September 2024
    Received: 24 April 2024
    Published in TODAES Volume 30, Issue 1

    Author Tags

    1. Sparsity
    2. training
    3. compilation optimization

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
