Exploiting Sparsity to Accelerate Fully Connected Layers of CNN-Based Applications on Mobile SoCs

Published: 07 December 2017

Abstract

Convolutional neural networks (CNNs) are widely employed in image recognition applications, and with the proliferation of embedded and mobile devices, such applications are becoming commonplace on these platforms. Network pruning is a commonly used strategy to reduce the memory and storage footprints of CNNs on mobile devices. In this article, we propose customized sparse matrix multiplication algorithms to speed up inference on mobile devices and make it more energy efficient. Specifically, we propose a Block Compressed Sparse Column algorithm and a bit-representation-based algorithm (BitsGEMM) that exploit sparsity to accelerate the fully connected layers of a network on the NVIDIA Jetson TK1 platform. We evaluate the proposed algorithms using real-world object classification and object detection applications. Experiments show that both algorithms achieve speedups over the baseline cuBLAS implementation. On object detection CNNs, an average speedup of 1.82× is obtained over baseline cuBLAS in the fully connected layer of the VGG model, whereas on classification CNNs, an average speedup of 1.51× is achieved for the fully connected layer of the pruned-VGG model. An energy consumption reduction of 43--46% is also observed, owing to decreased computation and memory bandwidth demands.
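The core computation in a pruned fully connected layer is a sparse matrix-vector product y = Wx + b between the compressed weight matrix W and the input activations x. As a point of reference, the C++ sketch below shows a plain Compressed Sparse Column (CSC) formulation of such a layer; the type and function names (CscMatrix, fc_forward_csc) are illustrative, and this is only the unblocked baseline that blocked-CSC and bit-representation schemes refine, not the authors' implementation.

    #include <vector>

    // Pruned weight matrix stored in Compressed Sparse Column (CSC) form.
    // Illustrative layout only; the article's GPU kernels build on this idea.
    struct CscMatrix {
        int rows = 0;
        int cols = 0;
        std::vector<int>   col_ptr;  // size cols + 1; column j owns nonzeros [col_ptr[j], col_ptr[j+1])
        std::vector<int>   row_idx;  // row index of each stored nonzero
        std::vector<float> val;      // value of each stored nonzero
    };

    // Fully connected layer y = W * x + b using the CSC weights.
    // The work scales with the number of surviving (unpruned) weights rather than rows * cols.
    std::vector<float> fc_forward_csc(const CscMatrix& W,
                                      const std::vector<float>& x,
                                      const std::vector<float>& b) {
        std::vector<float> y(b);                   // start from the bias vector
        for (int j = 0; j < W.cols; ++j) {
            const float xj = x[j];
            if (xj == 0.0f) continue;              // zero activations (e.g., after ReLU) contribute nothing
            for (int k = W.col_ptr[j]; k < W.col_ptr[j + 1]; ++k) {
                y[W.row_idx[k]] += W.val[k] * xj;  // scatter column j scaled by x[j]
            }
        }
        return y;
    }

On a GPU such as the Jetson TK1, the cost of a kernel like this is dominated by fetching index metadata and by irregular accesses to y; reducing that memory traffic is consistent with the bandwidth savings the abstract reports. One common way to realize a bit representation of sparsity (again an illustrative sketch, not the published BitsGEMM kernel) is to replace each column's explicit row indices with a bitmask, so that one 32-bit word encodes the nonzero pattern of 32 consecutive rows:

    #include <bit>      // std::countr_zero (C++20)
    #include <cstdint>
    #include <vector>

    // One weight column with its nonzero pattern stored as a bitmask.
    struct BitmaskColumn {
        std::vector<uint32_t> mask;  // bit (r % 32) of word (r / 32) is set if row r holds a nonzero
        std::vector<float>    val;   // nonzero values, in increasing row order
    };

    // y += column * xj, decoding the bitmask with count-trailing-zeros.
    void axpy_bitmask_column(const BitmaskColumn& c, float xj, std::vector<float>& y) {
        std::size_t k = 0;                                     // running index into c.val
        for (std::size_t w = 0; w < c.mask.size(); ++w) {
            uint32_t bits = c.mask[w];
            while (bits != 0) {
                int r = static_cast<int>(w) * 32 + std::countr_zero(bits);  // row of next nonzero
                y[r] += c.val[k++] * xj;
                bits &= bits - 1;                              // clear the lowest set bit
            }
        }
    }

Relative to a 32-bit index per nonzero, the bitmask costs one bit per matrix entry, so at the moderate sparsity levels produced by pruning it can shrink the index metadata and, with it, the memory traffic on a bandwidth-limited mobile GPU.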

          • Published in

            ACM Transactions on Embedded Computing Systems, Volume 17, Issue 2
            Special Issue on MEMOCODE 2015 and Regular Papers (Diamonds)
            March 2018, 640 pages
            ISSN: 1539-9087
            EISSN: 1558-3465
            DOI: 10.1145/3160927

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 7 December 2017
          • Accepted: 1 June 2017
          • Revised: 1 March 2017
          • Received: 1 November 2016
          Published in TECS Volume 17, Issue 2

          Qualifiers

          • research-article
          • Research
          • Refereed
