Abstract
Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in a wide range of applications, including image recognition, speech recognition, and natural language processing. Large-scale CNNs are generally constrained by computing and storage resources, and sparse CNNs have emerged as an effective way to reduce the computation and memory they require. Although existing neural network accelerators can process sparse networks efficiently, the tight coupling of algorithm and structure makes them inflexible. Dataflow architectures can implement different neural network applications through flexible instruction scheduling, but they must be initialized at execution time to load instructions into the computing array. A dense convolutional layer needs to be initialized only once because its computation is regular. A sparse convolutional layer, however, requires multiple initializations, and fetching instructions from memory each time leaves the computing array idle and degrades performance. In this paper, we propose an instruction sharing strategy based on the field content of instructions, which reduces initialization time and improves performance. Moreover, we use an extended instruction sharing strategy that exploits the static nature of filters to remove filter-related instructions, further reducing initialization time. Experiments show that our strategies achieve 1.69x (AlexNet) and 1.45x (VGG-16) speedup and 37.2\(\%\) (AlexNet) and 34.26\(\%\) (VGG-16) energy reduction compared with dense networks. They also achieve average speedups of 2.34x (AlexNet) and 2.12x (VGG-16) over a Titan Xp GPU, and 1.75x (AlexNet) and 1.49x (VGG-16) over Cambricon-X, on our benchmarks.
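The core idea of the abstract, sharing instructions whose field content matches so that repeated initializations of a sparse layer avoid redundant memory fetches, can be illustrated with a minimal sketch. This is a simplified model, not the paper's actual ISA or cost figures: the instruction fields (`opcode`, `src`, `dst`) and the cycle costs `MEM_FETCH_COST` and `SHARE_COST` are assumptions for illustration only.

```python
from collections import namedtuple

# Hypothetical three-field instruction; field names are illustrative.
Instruction = namedtuple("Instruction", ["opcode", "src", "dst"])

MEM_FETCH_COST = 10  # assumed cycles to fetch an instruction from memory
SHARE_COST = 1       # assumed cycles to reuse an already-resident instruction

def initialize(instructions, resident):
    """Load instructions into the computing array, sharing by field content.

    Instructions whose fields match one already resident in the array are
    shared instead of re-fetched from memory.
    """
    cycles = 0
    for inst in instructions:
        key = (inst.opcode, inst.src, inst.dst)  # share key = field content
        if key in resident:
            cycles += SHARE_COST       # reuse the resident copy
        else:
            resident.add(key)
            cycles += MEM_FETCH_COST   # must fetch from memory
    return cycles

# Two initializations of a sparse layer that repeat the same instructions:
batch = [Instruction("MAC", 0, 4), Instruction("MAC", 1, 4),
         Instruction("ST", 4, 8)]
resident = set()
first = initialize(batch, resident)   # all misses: 3 * 10 = 30 cycles
second = initialize(batch, resident)  # all shared: 3 * 1 = 3 cycles
```

Under this model, every initialization after the first pays only the sharing cost for repeated instructions, which is the source of the reported reduction in initialization time.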
References
Albericio, J., Judd, P., Hetherington, T., Aamodt, T., Jerger, N.E., Moshovos, A.: Cnvlutin: ineffectual-neuron-free deep neural network computing. ACM SIGARCH Comput. Archit. News 44(3), 1–13 (2016)
Carter, N.P., et al.: Runnemede: an architecture for ubiquitous high-performance computing. In: 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pp. 198–209. IEEE (2013)
Chen, Y.H., Krishna, T., Emer, J.S., Sze, V.: Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J. Solid-State Circuits 52(1), 127–138 (2016)
Chen, Y., Chen, T., Xu, Z., Sun, N., Temam, O.: Diannao family: energy-efficient hardware accelerators for machine learning. Commun. ACM 59(11), 105–112 (2016)
Chen, Y., et al.: DaDianNao: a machine-learning supercomputer. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622. IEEE Computer Society (2014)
Dean, J., et al.: Large scale distributed deep networks. In: Advances in Neural Information Processing Systems, pp. 1223–1231 (2012)
Denil, M., Shakibi, B., Dinh, L., Ranzato, M., De Freitas, N.: Predicting parameters in deep learning. In: Advances in Neural Information Processing Systems, pp. 2148–2156 (2013)
Fan, D., et al.: SmarCo: an efficient many-core processor for high-throughput applications in datacenters. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 596–607. IEEE (2018)
Gao, G.R., Suetterlein, J., Zuckerman, S.: Toward an execution model for extreme-scale systems-runnemede and beyond. CAPSL Technical Memo 104, Department of Electrical and Computer Engineering, University of Delaware (2011)
Giorgi, R.: TERAFLUX: harnessing dataflow in next generation teradevices. Microprocess. Microsyst. 38(8), 976–990 (2014)
Han, S., et al.: ESE: efficient speech recognition engine with sparse LSTM on FPGA. In: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84. ACM (2017)
Han, S., et al.: EIE: efficient inference engine on compressed deep neural network. In: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254. IEEE (2016)
Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)
Holi, J.L., Hwang, J.N.: Finite precision error analysis of neural network hardware implementations. IEEE Trans. Comput. 42(3), 281–290 (1993)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014)
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 1–12. IEEE (2017)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Lattner, C., Adve, V.: LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, p. 75. IEEE Computer Society (2004)
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
Naumov, M., Chien, L., Vandermersch, P., Kapasi, U.: Cusparse library. In: GPU Technology Conference (2010)
Oriato, D., Tilbury, S., Marrocu, M., Pusceddu, G.: Acceleration of a meteorological limited area model with dataflow engines. In: 2012 Symposium on Application Accelerators in High Performance Computing, pp. 129–132. IEEE (2012)
Parashar, A., et al.: SCNN: an accelerator for compressed-sparse convolutional neural networks. In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 27–40. IEEE (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Xiang, T., et al.: Accelerating CNN algorithm with fine-grained dataflow architectures. In: 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 243–251. IEEE (2018)
Ye, X., Fan, D., Sun, N., Tang, S., Zhang, M., Zhang, H.: SimICT: a fast and flexible framework for performance and power evaluation of large-scale architecture. In: Proceedings of the 2013 International Symposium on Low Power Electronics and Design, pp. 273–278. IEEE Press (2013)
Ye, X., et al.: An efficient dataflow accelerator for scientific applications. Future Gener. Comput. Syst. 112, 580–588 (2020)
Ye, X.: Applying CNN on a scientific application accelerator based on dataflow architecture. CCF Trans. High Perform. Comput. 1(3–4), 177–195 (2019)
Zhang, S., et al.: Cambricon-X: an accelerator for sparse neural networks. In: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 20. IEEE Press (2016)
Zhou, X., et al.: Cambricon-S: addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In: 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 15–28. IEEE (2018)
Acknowledgement
This work was supported by the National Key Research and Development Program (2018YFB1003501), the National Natural Science Foundation of China (61732018, 61872335, 61802367, 61672499), Austrian-Chinese Cooperative R&D Project (FFG and CAS) Grant No. 171111KYSB20170032, the Strategic Priority Research Program of Chinese Academy of Sciences, Grant No. XDC05000000, and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A07).
© 2020 Springer Nature Switzerland AG
Wu, X. et al. (2020). Accelerating Sparse Convolutional Neural Networks Based on Dataflow Architecture. In: Qiu, M. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2020. Lecture Notes in Computer Science(), vol 12453. Springer, Cham. https://doi.org/10.1007/978-3-030-60239-0_2
Print ISBN: 978-3-030-60238-3
Online ISBN: 978-3-030-60239-0