Abstract
Convolutional neural networks (CNNs) are widely employed in image recognition applications, and with the proliferation of embedded and mobile devices, such applications increasingly run on them. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized sparse matrix multiplication algorithms that make inference on mobile devices faster and more energy efficient. Specifically, we propose a Block Compressed Sparse Column (BCSC) algorithm and a bit-representation-based algorithm (BitsGEMM), both of which exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms on real-world object classification and object detection applications. Experiments show consistent speedups over the baseline cuBLAS implementation: on object detection CNNs, an average speedup of 1.82× is obtained for the fully connected layer of the VGG model, whereas on classification CNNs, an average speedup of 1.51× is achieved for the fully connected layer of the pruned-VGG model. An energy consumption reduction of 43--46% is also observed, owing to the decreased computation and memory bandwidth demands.
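To illustrate how a column-oriented sparse format can exploit pruned FC-layer weights, the sketch below shows a plain CSC-style sparse matrix-vector multiplication (y = Wx) as a CUDA kernel. This is a minimal sketch only, not the paper's BCSC or BitsGEMM implementation: the block structure, the bit-level weight encoding, and all Jetson TK1 tuning are omitted, and every name here (cscSpmv, colPtr, rowIdx, and so on) is an assumption for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per weight-matrix column: each thread scatters the
// contribution of its column of the pruned weight matrix W (stored in
// CSC form) into the output activations, i.e., y += W[:, j] * x[j].
// Pruned (zero) weights are simply absent, so no work is done for them.
__global__ void cscSpmv(const int *colPtr, const int *rowIdx,
                        const float *val, const float *x,
                        float *y, int numCols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= numCols) return;
    float xj = x[j];
    for (int k = colPtr[j]; k < colPtr[j + 1]; ++k)
        atomicAdd(&y[rowIdx[k]], val[k] * xj);  // rows collide across columns
}

int main() {
    // A toy 3x4 weight matrix with half of the entries pruned to zero:
    //     [ 1 0 0 2 ]
    // W = [ 0 3 0 0 ]
    //     [ 4 0 5 6 ]
    const int rows = 3, cols = 4;
    int   hColPtr[] = {0, 2, 3, 4, 6};        // cols + 1 entries
    int   hRowIdx[] = {0, 2, 1, 2, 0, 2};     // row index of each nonzero
    float hVal[]    = {1, 4, 3, 5, 2, 6};     // nonzeros, column by column
    float hX[]      = {1, 1, 1, 1};           // dense input activations
    float hY[rows]  = {0, 0, 0};

    int *dColPtr, *dRowIdx; float *dVal, *dX, *dY;
    cudaMalloc(&dColPtr, sizeof(hColPtr));
    cudaMalloc(&dRowIdx, sizeof(hRowIdx));
    cudaMalloc(&dVal, sizeof(hVal));
    cudaMalloc(&dX, sizeof(hX));
    cudaMalloc(&dY, sizeof(hY));
    cudaMemcpy(dColPtr, hColPtr, sizeof(hColPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(dRowIdx, hRowIdx, sizeof(hRowIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, hVal, sizeof(hVal), cudaMemcpyHostToDevice);
    cudaMemcpy(dX, hX, sizeof(hX), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, hY, sizeof(hY), cudaMemcpyHostToDevice);

    cscSpmv<<<1, 32>>>(dColPtr, dRowIdx, dVal, dX, dY, cols);
    cudaMemcpy(hY, dY, sizeof(hY), cudaMemcpyDeviceToHost);
    printf("y = [%g %g %g]\n", hY[0], hY[1], hY[2]);  // expect [3 3 15]
    return 0;
}
```

A blocked variant in the spirit of BCSC would group adjacent columns so that a thread block can reuse a tile of x and amortize index storage, while BitsGEMM, per the abstract, instead encodes the sparsity pattern in a bit representation; both refinements are omitted from this sketch.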