Abstract
Deep neural networks (DNNs), as a popular machine learning (ML) algorithm, have been deployed on a wide range of devices owing to the growth of the Internet of Things (IoT), data mining in cloud computing, and web search engines; ML has had a notable impact on IoT edge-level nodes. Deploying DNN-based applications raises memory-access problems, including communication delay, energy efficiency, and bandwidth requirements. We propose a bus-scheduling scheme for data placement on distributed local buffers in a deep learning accelerator (DLA). The contributions of this paper are: (1) a data-flow mapping method between off-chip DRAM and distributed local buffers, together with a flow-mapping approach for data transfer between the distributed local buffers and processing elements (PEs); (2) distributed local buffers in four directions that spread traffic over a mesh according to the memory-access mechanism; and (3) bus scheduling for data placement on the distributed local buffers. Simulated experiments on typical DNN workloads (AlexNet, VGG-16, and GoogLeNet) demonstrate the effectiveness of the design: (1) the scheduling and mapping methods improve total runtime and bandwidth requirement by approximately 42.29% and 88.95%, respectively, compared with the TPU; and (2) our methods reduce the total runtime of row-column stationary plus by approximately 99% compared with weight-stationary data-flow in CONV1 and CONV11 of VGG-16. This work reports simulation results for distributing the traffic of AlexNet, VGG-16, and GoogLeNet as popular CNN and DNN models; the method's efficiency for other trained models remains to be investigated.
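The placement and scheduling idea summarized above can be illustrated with a minimal sketch. The code below is not the paper's implementation: it assumes a simple round-robin placement of data tiles onto four directional local buffers (north, east, south, west) and a TDM-style bus schedule that interleaves one tile per bus cycle; buffer capacity, tile names, and the scheduling policy are all illustrative assumptions.

```python
# Illustrative sketch only: round-robin tile placement onto four
# directional local buffers, plus a simple time-division bus schedule.
# Capacities, tile granularity, and policy are assumptions, not the
# authors' actual DLA design.

DIRECTIONS = ["north", "east", "south", "west"]  # four distributed local buffers


def place_tiles(tiles, capacity):
    """Round-robin placement of data tiles onto the four directional
    buffers, skipping any buffer that is already full."""
    buffers = {d: [] for d in DIRECTIONS}
    idx = 0
    for tile in tiles:
        for _ in range(len(DIRECTIONS)):  # try each buffer once
            d = DIRECTIONS[idx % len(DIRECTIONS)]
            idx += 1
            if len(buffers[d]) < capacity:
                buffers[d].append(tile)
                break
        else:
            raise RuntimeError("all local buffers are full")
    return buffers


def schedule_bus(buffers):
    """Interleave transfers from the four buffers onto a shared bus,
    one tile per bus cycle (simple TDM-style schedule)."""
    schedule = []
    queues = {d: list(ts) for d, ts in buffers.items()}
    while any(queues.values()):
        for d in DIRECTIONS:
            if queues[d]:
                schedule.append((d, queues[d].pop(0)))
    return schedule


tiles = [f"tile{i}" for i in range(8)]
bufs = place_tiles(tiles, capacity=2)
sched = schedule_bus(bufs)
print(bufs["north"])  # ['tile0', 'tile4']
print(sched[0])       # ('north', 'tile0')
```

With eight tiles and a per-buffer capacity of two, the tiles spread evenly over the four directions and the bus serves one tile from each direction per round, which is the intuition behind distributing traffic across the mesh rather than streaming everything from a single buffer.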
Funding
No funding was received.
Contributions
SYHM, MR, NB, and AK contributed equally. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Mirmahaleh, S.Y.H., Reshadi, M., Bagherzadeh, N. et al. Data scheduling and placement in deep learning accelerator. Cluster Comput 24, 3651–3669 (2021). https://doi.org/10.1007/s10586-021-03355-8