Abstract
Deep neural networks (DNNs), as a popular machine learning (ML) algorithm, have been deployed on a wide range of devices owing to the growth of the Internet of Things (IoT), data mining in cloud computing, and web search engines; ML has had a notable impact on IoT edge-level nodes. Deploying DNN-based applications raises memory-access problems, including communication delay, energy efficiency, and bandwidth requirements. We propose a bus-scheduling scheme for data placement on distributed local buffers in a deep learning accelerator (DLA). The contributions of this paper are: (1) a data-flow mapping method between off-chip DRAM and distributed local buffers, together with a flow-mapping approach for data transfer between the distributed local buffers and processing elements (PEs); (2) distributed local buffers in four directions that spread traffic over a mesh according to the memory-access mechanism; and (3) bus scheduling for data placement on the distributed local buffers. Simulated experiments on typical DNN workloads (AlexNet, VGG-16, and GoogLeNet) demonstrate the effectiveness of the design: (1) the scheduling and mapping methods improve total runtime and bandwidth requirement by approximately 42.29% and 88.95%, respectively, compared with the TPU; and (2) our methods reduce the total runtime of row-column stationary plus by approximately 99% compared with weight-stationary data-flow in CONV1 and CONV11 of VGG-16. This work reports simulation results for distributing the traffic of AlexNet, VGG-16, and GoogLeNet as popular CNN and DNN models; the method's efficiency for other trained models remains to be investigated.
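The placement and scheduling idea summarized above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes a simple round-robin placement of data tiles onto four directional local buffers (north, east, south, west) and a TDM-style bus schedule that interleaves one tile per bus cycle; buffer capacity, tile names, and the scheduling policy are all illustrative assumptions.

```python
# Illustrative sketch only: round-robin tile placement onto four
# directional local buffers, plus a simple time-division bus schedule.
# Capacities, tile granularity, and policy are assumptions, not the
# authors' actual DLA design.

DIRECTIONS = ["north", "east", "south", "west"]  # four distributed local buffers


def place_tiles(tiles, capacity):
    """Round-robin placement of data tiles onto the four directional
    buffers, skipping any buffer that is already full."""
    buffers = {d: [] for d in DIRECTIONS}
    idx = 0
    for tile in tiles:
        for _ in range(len(DIRECTIONS)):  # try each buffer once
            d = DIRECTIONS[idx % len(DIRECTIONS)]
            idx += 1
            if len(buffers[d]) < capacity:
                buffers[d].append(tile)
                break
        else:
            raise RuntimeError("all local buffers are full")
    return buffers


def schedule_bus(buffers):
    """Interleave transfers from the four buffers onto a shared bus,
    one tile per bus cycle (simple TDM-style schedule)."""
    schedule = []
    queues = {d: list(ts) for d, ts in buffers.items()}
    while any(queues.values()):
        for d in DIRECTIONS:
            if queues[d]:
                schedule.append((d, queues[d].pop(0)))
    return schedule


tiles = [f"tile{i}" for i in range(8)]
bufs = place_tiles(tiles, capacity=2)
sched = schedule_bus(bufs)
print(bufs["north"])  # ['tile0', 'tile4']
print(sched[0])       # ('north', 'tile0')
```

With eight tiles and a per-buffer capacity of two, the tiles spread evenly over the four directions and the bus serves one tile from each direction per round, which is the intuition behind distributing traffic across the mesh rather than streaming everything from a single buffer.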
Funding
No funding was received.
Contributions
SYHM, MR, NB, and AK contributed equally. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Mirmahaleh, S.Y.H., Reshadi, M., Bagherzadeh, N. et al. Data scheduling and placement in deep learning accelerator. Cluster Comput 24, 3651–3669 (2021). https://doi.org/10.1007/s10586-021-03355-8