Abstract
Convolutional neural networks (CNNs) are widely employed in image recognition applications, and with the proliferation of embedded and mobile devices, such applications increasingly run on them. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized sparse matrix multiplication algorithms that make inference on mobile devices faster and more energy efficient. Specifically, we propose a Block Compressed Sparse Column (BCSC) algorithm and a bit-representation-based algorithm (BitsGEMM), both of which exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms on real-world object classification and object detection applications. Experiments show consistent speedups over the baseline cuBLAS implementation: on object detection CNNs, an average speedup of 1.82× is obtained for the fully connected layer of the VGG model, whereas on classification CNNs, an average speedup of 1.51× is achieved for the fully connected layer of the pruned-VGG model. An energy consumption reduction of 43--46% is also observed, owing to the decreased computation and memory bandwidth demands.
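To illustrate how a column-oriented sparse format can exploit pruned FC-layer weights, the sketch below shows a plain CSC-style sparse matrix-vector multiplication (y = Wx) as a CUDA kernel. This is a minimal sketch only, not the paper's BCSC or BitsGEMM implementation: the block structure, the bit-level weight encoding, and all Jetson TK1 tuning are omitted, and every name here (cscSpmv, colPtr, rowIdx, and so on) is an assumption for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per weight-matrix column: each thread scatters the
// contribution of its column of the pruned weight matrix W (stored in
// CSC form) into the output activations, i.e., y += W[:, j] * x[j].
// Pruned (zero) weights are simply absent, so no work is done for them.
__global__ void cscSpmv(const int *colPtr, const int *rowIdx,
                        const float *val, const float *x,
                        float *y, int numCols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numCols) return;
    float xj = x[j];
    for (int k = colPtr[j]; k < colPtr[j + 1]; ++k)
        atomicAdd(&y[rowIdx[k]], val[k] * xj);  // rows collide across columns
}

int main() {
    // A toy 3x4 weight matrix with half of the entries pruned to zero:
    //     [ 1 0 0 2 ]
    // W = [ 0 3 0 0 ]
    //     [ 4 0 5 6 ]
    const int rows = 3, cols = 4;
    int   hColPtr[] = {0, 2, 3, 4, 6};        // cols + 1 entries
    int   hRowIdx[] = {0, 2, 1, 2, 0, 2};     // row index of each nonzero
    float hVal[]    = {1, 4, 3, 5, 2, 6};     // nonzeros, column by column
    float hX[]      = {1, 1, 1, 1};           // dense input activations
    float hY[rows]  = {0, 0, 0};

    int *dColPtr, *dRowIdx; float *dVal, *dX, *dY;
    cudaMalloc(&dColPtr, sizeof(hColPtr));
    cudaMalloc(&dRowIdx, sizeof(hRowIdx));
    cudaMalloc(&dVal, sizeof(hVal));
    cudaMalloc(&dX, sizeof(hX));
    cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dColPtr, hColPtr, sizeof(hColPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dRowIdx, hRowIdx, sizeof(hRowIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, hY, sizeof(hY), cudaMemcpyHostToDevice);

    cscSpmv<<<1, 32>>>(dColPtr, dRowIdx, dVal, dX, dY, cols);
    cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);
    printf("y = [%g %g %g]\n", hY[0], hY[1], hY[2]);  // expect [3 3 15]
    return 0;
}
```

A blocked variant in the spirit of BCSC would group adjacent columns so that a thread block can reuse a tile of x and amortize index storage, while BitsGEMM, per the abstract, instead encodes the sparsity pattern in a bit representation; both refinements are omitted from this sketch.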