Abstract
Convolutional neural networks (CNNs) are widely used in applications such as face recognition, intelligent monitoring, image recognition and text recognition. Because of their high computational complexity, many efficient hardware accelerators have been proposed to exploit the high degree of parallelism in CNNs. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption, while more general accelerators such as GPUs pay for their flexibility with higher power consumption. Fine-grained dataflow architectures, which break with the conventional Von Neumann model, show natural advantages in processing scientific applications. Meanwhile, CNNs share many vital characteristics with scientific applications, including high parallelism, simple loops and regular memory access patterns. In this paper, we propose a scheme for implementing and optimizing CNNs on a fine-grained dataflow architecture designed for scientific applications, namely the Scientific Processing Unit (SPU). The experimental results reveal that with our scheme, AlexNet and VGG-19 run on average \(2.29\,\times\) faster on the SPU than on an NVIDIA Titan Xp, while consuming on average \(5.76\,\times\) less energy.
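To make the shared characteristics concrete, the sketch below (illustrative only, not code from the paper) writes one convolutional layer as a direct C loop nest; the layer sizes OC, IC, OH, OW and K are hypothetical stand-ins for a real network layer. Every loop has a fixed trip count, every array index is affine in the loop variables, and all output pixels are independent, which is exactly the combination of high parallelism, simple loops and regular memory access patterns described above.

```c
/* Illustrative sketch only: one convolutional layer as a plain loop
 * nest. OC, IC, OH, OW and K are hypothetical layer sizes. */

#define OC 64  /* output channels */
#define IC 3   /* input channels  */
#define OH 32  /* output height   */
#define OW 32  /* output width    */
#define K  3   /* kernel size (stride 1, no padding) */

void conv2d(const float in[IC][OH + K - 1][OW + K - 1],
            const float w[OC][IC][K][K],
            float out[OC][OH][OW])
{
    /* Every output pixel is independent: high parallelism. */
    for (int oc = 0; oc < OC; ++oc)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float acc = 0.0f;
                /* Fixed trip counts and affine indices: simple loops
                 * with a regular memory access pattern. */
                for (int ic = 0; ic < IC; ++ic)
                    for (int ky = 0; ky < K; ++ky)
                        for (int kx = 0; kx < K; ++kx)
                            acc += in[ic][oy + ky][ox + kx]
                                 * w[oc][ic][ky][kx];
                out[oc][oy][ox] = acc;
            }
}
```

On a dataflow machine, a nest like this can be unrolled into a static graph of multiply-accumulate nodes whose execution is triggered purely by operand arrival, which is why such regular kernels map naturally onto an architecture built for scientific codes.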
Acknowledgements
This work was supported by the National Key Research and Development Plan of China under Grant no. 2017YFC0803401, the National Natural Science Foundation of China under Grant nos. 61872335 and 61732018, and the International Partnership Program of the Chinese Academy of Sciences under Grant no. 171111KYSB20170032.
Cite this article
Ye, X., Xiang, T., Tan, X. et al. Applying CNN on a scientific application accelerator based on dataflow architecture. CCF Trans. HPC 1, 177–195 (2019). https://doi.org/10.1007/s42514-019-00015-7