
Accelerating Sparse Convolutional Neural Networks Based on Dataflow Architecture

  • Conference paper
  • In: Algorithms and Architectures for Parallel Processing (ICA3PP 2020)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12453)

Abstract

Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in a wide range of applications, including image recognition, speech recognition, and natural language processing. Large-scale CNNs are constrained by limited computing and storage resources, and sparse CNNs have emerged as an effective way to reduce the amount of computation and memory required. Although existing neural network accelerators can process sparse networks efficiently, the strong coupling between algorithm and structure makes them inflexible. Dataflow architecture can implement different neural network applications through flexible instruction scheduling, but it must be initialized at execution time to load instructions into the computing array. A dense convolutional layer needs to be initialized only once because its computation is regular. A sparse convolutional layer, however, requires multiple initializations; repeatedly fetching instructions from memory takes a long time, leaving the computing array idle and degrading performance. In this paper, we propose an instruction sharing strategy based on instruction field content, which reduces initialization time and improves performance. Moreover, we use an extended instruction sharing strategy that exploits the static nature of filters to remove filter-related instructions, further reducing initialization time. Experiments show that our strategies achieve 1.69x (AlexNet) and 1.45x (VGG-16) speedup and 37.2% (AlexNet) and 34.26% (VGG-16) energy reduction compared with the dense networks. They also achieve, on average, 2.34x (AlexNet) and 2.12x (VGG-16) speedup over a Titan Xp GPU, and 1.75x (AlexNet) and 1.49x (VGG-16) speedup over Cambricon-X, on our benchmarks.
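The instruction-sharing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the instruction format, the names `Inst` and `init_array`, and the toy tiles are all illustrative assumptions. The sketch only shows the core mechanism: keying instruction blocks by their field content so that a block already resident in the computing array is not fetched from memory again when a sparse layer is re-initialized.

```python
# Hypothetical sketch of field-content-based instruction sharing.
# All names and the instruction format are illustrative, not from the paper.
from collections import namedtuple

# A toy "instruction"; the sharing key is built purely from its field content.
Inst = namedtuple("Inst", ["opcode", "src", "dst"])

def init_array(tiles, share=True):
    """Return how many instruction blocks must be fetched from memory
    to initialize the computing array for the given sparse tiles."""
    loaded = set()          # field-content keys of blocks already resident
    fetches = 0
    for block in tiles:     # one instruction block per sparse tile
        key = tuple(block)  # key derived from the instructions' field content
        if share and key in loaded:
            continue        # identical block already loaded: share it
        loaded.add(key)
        fetches += 1        # otherwise pay the fetch / initialization cost
    return fetches

# Two sparse tiles whose surviving (non-zero-weight) instructions coincide
# share a single load; the third tile still needs its own fetch.
t0 = [Inst("MAC", "w0", "acc0"), Inst("MAC", "w2", "acc0")]
t1 = [Inst("MAC", "w0", "acc0"), Inst("MAC", "w2", "acc0")]  # same content as t0
t2 = [Inst("MAC", "w1", "acc1")]

print(init_array([t0, t1, t2], share=False))  # 3 fetches without sharing
print(init_array([t0, t1, t2], share=True))   # 2 fetches with sharing
```

Under this toy model, the extended strategy in the abstract would go one step further: because the filters are static across an inference, filter-related instructions need not be reloaded at all between initializations, shrinking each block before the sharing key is even computed.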


References

  1. Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N.E., Moshovos, A.: Cnvlutin: ineffectual-neuron-free deep neural network computing. ACM SIGARCH Comput. Arch. News 44(3), 1–13 (2016)

  2. Carter, N.P., et al.: Runnemede: an architecture for ubiquitous high-performance computing. In: 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 198–209. IEEE (2013)

  3. Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127–138 (2016)

  4. Chen, Y., Chen, T., Xu, Z., Sun, N., Temam, O.: Diannao family: energy-efficient hardware accelerators for machine learning. Commun. ACM 59(11), 105–112 (2016)

  5. Chen, Y., et al.: DaDianNao: a machine-learning supercomputer. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. IEEE Computer Society (2014)

  6. Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2012)

  7. Denil, M., Shakibi, B., Dinh, L., Ranzato, M., De Freitas, N.: Predicting parameters in deep learning. In: Advances in Neural Information Processing Systems, pp. 2148–2156 (2013)

  8. Fan, D., et al.: SmarCo: an efficient many-core processor for high-throughput applications in datacenters. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 596–607. IEEE (2018)

  9. Gao, G.R., Suetterlein, J., Zuckerman, S.: Toward an execution model for extreme-scale systems: Runnemede and beyond. CAPSL Technical Memo 104, Department of Electrical and Computer Engineering, University of Delaware (2011)

  10. Giorgi, R.: TERAFLUX: harnessing dataflow in next generation teradevices. Microprocess. Microsyst. 38(8), 976–990 (2014)

  11. Han, S., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)

  12. Han, S., et al.: EIE: efficient inference engine on compressed deep neural network. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. IEEE (2016)

  13. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)

  14. Holi, J.L., Hwang, J.N.: Finite precision error analysis of neural network hardware implementations. IEEE Trans. Comput. 42(3), 281–290 (1993)

  15. Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)

  16. Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)

  17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

  18. Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, p. 75. IEEE Computer Society (2004)

  19. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

  20. Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: cuSPARSE library. In: GPU Technology Conference (2010)

  21. Oriato, D., Tilbury, S., Marrocu, M., Pusceddu, G.: Acceleration of a meteorological limited area model with dataflow engines. In: 2012 Symposium on Application Accelerators in High Performance Computing, pp. 129–132. IEEE (2012)

  22. Parashar, A., et al.: SCNN: an accelerator for compressed-sparse convolutional neural networks. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. IEEE (2017)

  23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  24. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  25. Xiang, T., et al.: Accelerating CNN algorithm with fine-grained dataflow architectures. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 243–251. IEEE (2018)

  26. Ye, X., Fan, D., Sun, N., Tang, S., Zhang, M., Zhang, H.: SimICT: a fast and flexible framework for performance and power evaluation of large-scale architecture. In: Proceedings of the 2013 International Symposium on Low Power Electronics and Design, pp. 273–278. IEEE Press (2013)

  27. Ye, X., et al.: An efficient dataflow accelerator for scientific applications. Future Gener. Comput. Syst. 112, 580–588 (2020)

  28. Ye, X.: Applying CNN on a scientific application accelerator based on dataflow architecture. CCF Trans. High Perform. Comput. 1(3–4), 177–195 (2019)

  29. Zhang, S., et al.: Cambricon-X: an accelerator for sparse neural networks. In: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 20. IEEE Press (2016)

  30. Zhou, X., et al.: Cambricon-S: addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In: 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 15–28. IEEE (2018)


Acknowledgement

This work was supported by the National Key Research and Development Program (2018YFB1003501), the National Natural Science Foundation of China (61732018, 61872335, 61802367, 61672499), Austrian-Chinese Cooperative R&D Project (FFG and CAS) Grant No. 171111KYSB20170032, the Strategic Priority Research Program of Chinese Academy of Sciences, Grant No. XDC05000000, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A07).

Author information

Correspondence to Xinxin Wu.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, X., et al. (2020). Accelerating Sparse Convolutional Neural Networks Based on Dataflow Architecture. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science, vol 12453. Springer, Cham. https://doi.org/10.1007/978-3-030-60239-0_2
