DOI: 10.1145/3410463.3414648

Accelerating Sparse CNN Inference on GPUs with Performance-Aware Weight Pruning

Published: 30 September 2020

Abstract

Weight pruning is a popular technique for reducing the size and computational complexity of convolutional neural networks (CNNs). Despite its success in reducing model size, weight pruning has brought limited benefit to CNN inference performance because of the irregularity introduced into the sparse convolution operations. In this work, we aim to improve the performance of sparse convolutions on GPUs by mitigating this irregularity. We find that existing performance optimization techniques for sparse matrix computations fail to accelerate sparse convolutions, and we observe that the main performance bottleneck is the heavy control-flow instructions. Based on this observation, we propose a new GEMM-based implementation of sparse convolutions. Our main idea is to extract dense blocks of non-zeros from the sparse convolution kernels and to use dense matrix-matrix multiplication on these blocks to achieve high throughput. For cases where many non-zero weights cannot be grouped into dense blocks, we propose a performance-aware re-pruning strategy that removes the least important weights in the sparse kernels to further improve throughput. Experimental results with five real-world pruned CNN models show that our techniques significantly improve both the layer-wise performance of sparse convolution operations and the end-to-end performance of CNN inference.
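As a rough illustration of the two ideas named in the abstract, the NumPy sketch below shows (1) how a GEMM-lowered sparse weight matrix might be partitioned into column blocks, with sufficiently dense blocks routed to an ordinary dense GEMM and the leftover columns handled by an irregular fallback path, and (2) a toy variant of the re-pruning step that drops the least important leftover weights. This is a minimal CPU sketch under assumed parameters (block_cols, density_threshold, drop_ratio) and hypothetical function names; it is not the authors' CUDA implementation, and "importance" is approximated here by weight magnitude, which the abstract does not specify.

```python
# Minimal NumPy sketch of the block-extraction idea, NOT the paper's GPU code.
# All names (extract_dense_blocks, block_sparse_conv_gemm, reprune_leftover_weights,
# density_threshold, drop_ratio, ...) are hypothetical illustrations.
import numpy as np

def extract_dense_blocks(W, block_cols=8, density_threshold=0.7):
    """Partition the columns of a GEMM-lowered sparse weight matrix W
    (out_channels x K, K = kh*kw*in_channels) into fixed-width column blocks
    and report which blocks are dense enough to run as a dense GEMM."""
    K = W.shape[1]
    dense_blocks, sparse_cols = [], []
    for start in range(0, K, block_cols):
        cols = slice(start, min(start + block_cols, K))
        block = W[:, cols]
        if np.count_nonzero(block) / block.size >= density_threshold:
            dense_blocks.append((cols, block))                 # dense GEMM path
        else:
            sparse_cols.extend(range(cols.start, cols.stop))   # irregular path
    return dense_blocks, sparse_cols

def reprune_leftover_weights(W, sparse_cols, drop_ratio=0.5):
    """Toy re-pruning: among weights that did not fit into dense blocks, zero
    out the smallest-magnitude fraction (magnitude is used here as a stand-in
    for the paper's importance criterion)."""
    W = W.copy()
    leftover = W[:, sparse_cols]
    nonzeros = np.abs(leftover[leftover != 0])
    if nonzeros.size:
        cutoff = np.quantile(nonzeros, drop_ratio)
        leftover[np.abs(leftover) < cutoff] = 0.0
        W[:, sparse_cols] = leftover
    return W

def block_sparse_conv_gemm(W, X_lowered, block_cols=8, density_threshold=0.7):
    """Compute Y = W @ X_lowered as dense GEMMs over the dense column blocks
    plus a scalar-style fallback over the remaining sparse columns."""
    dense_blocks, sparse_cols = extract_dense_blocks(W, block_cols, density_threshold)
    Y = np.zeros((W.shape[0], X_lowered.shape[1]), dtype=W.dtype)
    for cols, block in dense_blocks:
        Y += block @ X_lowered[cols, :]                        # high-throughput path
    for c in sparse_cols:                                      # leftover non-zeros
        nz = np.nonzero(W[:, c])[0]
        if nz.size:
            Y[nz, :] += np.outer(W[nz, c], X_lowered[c, :])
    return Y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 288)) * (rng.random((64, 288)) < 0.3)  # ~70% sparse
    X = rng.standard_normal((288, 1024))             # im2col-lowered input patches
    Y = block_sparse_conv_gemm(W, X)
    assert np.allclose(Y, W @ X)                     # block decomposition is exact
    W_repruned = reprune_leftover_weights(W, extract_dense_blocks(W)[1])
```

On a GPU, the dense-block path would presumably map to dense GEMM kernels (e.g., via cuBLAS), while the fallback path over ungrouped non-zeros is the kind of control-flow-heavy code the abstract identifies as the bottleneck; the NumPy version only illustrates the decomposition, not the performance behavior.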

Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
September 2020
505 pages
ISBN:9781450380751
DOI:10.1145/3410463
  • General Chair: Vivek Sarkar
  • Program Chair: Hyesoon Kim

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cnn pruning
  2. gpus
  3. sparse convolution

Qualifiers

  • Research-article

Conference

PACT '20

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

