
cuSCNN: An Efficient CUDA Implementation of Sparse CNNs

Published: 19 July 2023. DOI: 10.1145/3597031.3597057

Abstract

Deep neural network models are becoming much larger, which greatly increases their computation and memory requirements. Sparsity offers great opportunities to reduce unnecessary data transfers and computations. However, exploiting sparsity in CNN inference presents challenges such as irregular memory access patterns. To overcome this challenge, we propose cuSCNN, an efficient sparse CNN inference engine that leverages the sparsity of both models and activations using optimized sparse-sparse matrix convolution kernels with compressed operands. cuSCNN is motivated by the concepts introduced by the SCNN hardware accelerator [21], but modifies them appropriately to achieve an efficient software implementation for GPUs. We develop GPU optimizations that boost execution performance and reduce the required memory size and bandwidth. Without batching, cuSCNN achieves a speedup of up to 171× over an efficient CPU implementation and up to 30× over a multi-threaded CPU implementation, enabling inexpensive, memory-constrained low-end GPUs to run large networks with near-real-time latency. Although GPU throughput can benefit from larger batch sizes, a batch size of 1 achieves the lowest latency, so we focus on that case.
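
To illustrate the approach the abstract describes, the sketch below mimics the SCNN-style sparse-sparse convolution dataflow [21] on a GPU: for a given input channel, every nonzero activation is paired with every nonzero weight, and each product is scatter-accumulated into the output position it contributes to, so zero-valued operands never consume multiplies or bandwidth. This is a minimal illustration written for this summary, not cuSCNN's actual kernel; the struct layouts, names, and launch configuration are assumptions.

#include <cuda_runtime.h>

// Compressed (nonzero-only) operands for one input channel.
struct SparseAct { float val; int x, y; };    // activation value and spatial position
struct SparseWgt { float val; int r, s, k; }; // weight value, kernel offset, output channel

// One thread per (activation, weight) pair in the Cartesian product:
// every multiply performed is guaranteed useful, which is the core idea
// behind sparse-sparse convolution with compressed operands.
__global__ void sparseConvPairs(const SparseAct* acts, int nActs,
                                const SparseWgt* wgts, int nWgts,
                                float* out, int outW, int outH)
{
    long long pair = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (pair >= (long long)nActs * nWgts) return;

    SparseAct a = acts[pair / nWgts];
    SparseWgt w = wgts[pair % nWgts];

    // Output coordinate for a unit-stride convolution; contributions that
    // fall outside the output extent are discarded.
    int ox = a.x - w.r;
    int oy = a.y - w.s;
    if (ox < 0 || oy < 0 || ox >= outW || oy >= outH) return;

    // Scatter-accumulate. Collisions on the same output element are resolved
    // with an atomic add here; the SCNN accelerator instead routes products
    // through a crossbar into banked accumulator buffers.
    atomicAdd(&out[((long long)w.k * outH + oy) * outW + ox], a.val * w.val);
}

A production kernel would additionally loop over input channels, tile the work, and privatize partial sums in shared memory to avoid global atomics; those GPU optimizations are where the paper's contribution lies, and the sketch above deliberately omits them.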

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2]
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[3]
Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the Devil in the Details: Delving Deep into Convolutional Nets. CoRR abs/1405.3531 (2014).
[4]
Leiyu Chen, Shaobo Li, Qiang Bai, Jing Yang, Sanlong Jiang, and Yanming Miao. 2021. Review of image classification algorithms based on convolutional neural networks. Remote Sensing 13, 22 (2021), 4712.
[5]
Xuhao Chen. 2019. Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs. arXiv preprint arXiv:1802.10280 (2019).
[6]
Thibaud Ehret and Gabriele Facciolo. 2019. A study of two CNN demosaicking algorithms. Image Processing On Line 9 (2019), 220–230.
[7]
Trevor Gale, Erich Elsen, and Sara Hooker. 2019. The State of Sparsity in Deep Neural Networks. Technical Report. arXiv:1902.09574v1. https://bit.ly/2T8hBGn
[8]
Mathew Hall and Vaughn Betz. 2020. HPIPE: Heterogeneous layer-pipelined and sparse-aware CNN inference for FPGAs. arXiv preprint arXiv:2007.10451 (2020).
[9]
Ademola E Ilesanmi and Taiwo O Ilesanmi. 2021. Methods for image denoising using convolutional neural network: a review. Complex & Intelligent Systems 7, 5 (2021), 2179–2198.
[10]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia. 675–678.
[11]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States. 1106–1114. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
[12]
D. Li and Z. Wang. 2017. Video Superresolution via Motion Compensation and Deep Residual Learning. IEEE Transactions on Computational Imaging 3, 4 (Dec 2017), 749–762. https://doi.org/10.1109/TCI.2017.2671360
[13]
D. Martin, C. Fowlkes, D. Tal, and J. Malik. 2001. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vol. 2. 416–423. https://doi.org/10.1109/ICCV.2001.937655
[14]
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2016. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440 (2016).
[15]
Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, and Hiroshi Saruwatari. 2023. vTTS: visual-text to speech. In 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 936–942.
[16]
Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G Okuno, and Tetsuya Ogata. 2015. Audio-visual speech recognition using deep learning. Applied Intelligence 42 (2015), 722–737.
[17]
NVIDIA. 2008. cuBLAS Library. NVIDIA Corporation, Santa Clara, California.
[18]
NVIDIA. 2014. cuSPARSE Library. NVIDIA Corporation, Santa Clara, California.
[19]
Daniel W Otter, Julian R Medina, and Jugal K Kalita. 2020. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems 32, 2 (2020), 604–624.
[20]
Niall O’Mahony, Sean Campbell, Anderson Carvalho, Suman Harapanahalli, Gustavo Velasco Hernandez, Lenka Krpalkova, Daniel Riordan, and Joseph Walsh. 2020. Deep learning vs. traditional computer vision. In Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1. Springer, 128–144.
[21]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture (Toronto, ON, Canada) (ISCA ’17). ACM, New York, NY, USA, 27–40. https://doi.org/10.1145/3079856.3080254
[22]
Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[23]
Masuma Akter Rumi, Xiaolong Ma, Yanzhi Wang, and Peng Jiang. 2020. Accelerating sparse CNN inference on GPUs with performance-aware weight pruning. In Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 267–278.
[24]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs] (Sept. 2014).
[25]
Marius Octavian Stan. 2022. HPIPE-NX: Leveraging tensor blocks for high-performance CNN inference acceleration on FPGAs. Ph.D. Dissertation. University of Toronto (Canada).
[26]
Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. 2020. A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526 (2020).
[27]
Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, and Lanshun Nie. 2019. Balanced sparsity for efficient DNN inference on GPU. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5676–5683.
[28]
Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. 2017. Learning Deep CNN Denoiser Prior for Image Restoration. In IEEE Conference on Computer Vision and Pattern Recognition. 3929–3938.
[29]
Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 1–12.
[30]
Hongyu Zhu, Chao Xie, Yeqi Fei, and Huanjie Tao. 2021. Attention mechanisms in CNN-based single image super-resolution: A brief review and a new perspective. Electronics 10, 10 (2021), 1187.


          Published In

          HEART '23: Proceedings of the 13th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies
June 2023, 127 pages
ISBN: 9798400700439
DOI: 10.1145/3597031

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Author Tags

1. Accelerator
2. CUDA
3. Graphics processing unit (GPU)
4. Sparse Convolutional Neural Network (SCNN)


          Conference

          HEART 2023

          Acceptance Rates

          Overall Acceptance Rate 22 of 50 submissions, 44%
