DOI: 10.1145/3410463.3414648

Accelerating Sparse CNN Inference on GPUs with Performance-Aware Weight Pruning

Published: 30 September 2020

Abstract

Weight pruning is a popular technique for reducing the size and computational complexity of convolutional neural networks (CNNs). Despite its success in reducing model size, weight pruning has brought limited benefit to CNN inference performance because of the irregularity introduced into the sparse convolution operations. In this work, we aim to improve the performance of sparse convolutions on GPUs by mitigating this irregularity. We find that existing performance optimization techniques for sparse matrix computations fail to accelerate sparse convolutions, and we observe that the main performance bottleneck is the heavy control-flow instructions. Based on this observation, we propose a new GEMM-based implementation of sparse convolutions. Our main idea is to extract dense blocks of non-zeros from the sparse convolution kernels and to use dense matrix-matrix multiplication on these blocks to achieve high throughput. For cases where many non-zero weights cannot be grouped into dense blocks, we propose a performance-aware re-pruning strategy that removes the least important weights in the sparse kernels to further improve throughput. Experimental results with five real-world pruned CNN models show that our techniques significantly improve both the layer-wise performance of sparse convolution operations and the end-to-end performance of CNN inference.
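As a rough illustration of the two ideas named in the abstract, the NumPy sketch below shows (1) how a GEMM-lowered sparse weight matrix might be partitioned into column blocks, with sufficiently dense blocks routed to an ordinary dense GEMM and the leftover columns handled by an irregular fallback path, and (2) a toy variant of the re-pruning step that drops the least important leftover weights. This is a minimal CPU sketch under assumed parameters (block_cols, density_threshold, drop_ratio) and hypothetical function names; it is not the authors' CUDA implementation, and "importance" is approximated here by weight magnitude, which the abstract does not specify.

```python
# Minimal NumPy sketch of the block-extraction idea, NOT the paper's GPU code.
# All names (extract_dense_blocks, block_sparse_conv_gemm, reprune_leftover_weights,
# density_threshold, drop_ratio, ...) are hypothetical illustrations.
import numpy as np

def extract_dense_blocks(W, block_cols=8, density_threshold=0.7):
    """Partition the columns of a GEMM-lowered sparse weight matrix W
    (out_channels x K, K = kh*kw*in_channels) into fixed-width column blocks
    and report which blocks are dense enough to run as a dense GEMM."""
    K = W.shape[1]
    dense_blocks, sparse_cols = [], []
    for start in range(0, K, block_cols):
        cols = slice(start, min(start + block_cols, K))
        block = W[:, cols]
        if np.count_nonzero(block) / block.size >= density_threshold:
            dense_blocks.append((cols, block))                 # dense GEMM path
        else:
            sparse_cols.extend(range(cols.start, cols.stop))   # irregular path
    return dense_blocks, sparse_cols

def reprune_leftover_weights(W, sparse_cols, drop_ratio=0.5):
    """Toy re-pruning: among weights that did not fit into dense blocks, zero
    out the smallest-magnitude fraction (magnitude is used here as a stand-in
    for the paper's importance criterion)."""
    W = W.copy()
    leftover = W[:, sparse_cols]
    nonzeros = np.abs(leftover[leftover != 0])
    if nonzeros.size:
        cutoff = np.quantile(nonzeros, drop_ratio)
        leftover[np.abs(leftover) < cutoff] = 0.0
        W[:, sparse_cols] = leftover
    return W

def block_sparse_conv_gemm(W, X_lowered, block_cols=8, density_threshold=0.7):
    """Compute Y = W @ X_lowered as dense GEMMs over the dense column blocks
    plus a scalar-style fallback over the remaining sparse columns."""
    dense_blocks, sparse_cols = extract_dense_blocks(W, block_cols, density_threshold)
    Y = np.zeros((W.shape[0], X_lowered.shape[1]), dtype=W.dtype)
    for cols, block in dense_blocks:
        Y += block @ X_lowered[cols, :]                        # high-throughput path
    for c in sparse_cols:                                      # leftover non-zeros
        nz = np.nonzero(W[:, c])[0]
        if nz.size:
            Y[nz, :] += np.outer(W[nz, c], X_lowered[c, :])
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 288)) * (rng.random((64, 288)) < 0.3)  # ~70% sparse
    X = rng.standard_normal((288, 1024))             # im2col-lowered input patches
    Y = block_sparse_conv_gemm(W, X)
    assert np.allclose(Y, W @ X)                     # block decomposition is exact
    W_repruned = reprune_leftover_weights(W, extract_dense_blocks(W)[1])
```

On a GPU, the dense-block path would presumably map to dense GEMM kernels (e.g., via cuBLAS), while the fallback path over ungrouped non-zeros is the kind of control-flow-heavy code the abstract identifies as the bottleneck; the NumPy version only illustrates the decomposition, not the performance behavior.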

Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN:9781450380751
DOI:10.1145/3410463
  • General Chair: Vivek Sarkar
  • Program Chair: Hyesoon Kim

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cnn pruning
  2. gpus
  3. sparse convolution

Qualifiers

  • Research-article

Conference

PACT '20

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

