
Data scheduling and placement in deep learning accelerator


Abstract

Deep neural networks (DNNs), as a popular class of machine learning (ML) algorithms, have been deployed on a wide range of devices owing to the spread of the Internet of Things (IoT), data mining in cloud computing, and web search engines, and ML has had an impressive effect on IoT edge-level nodes. Deploying DNN-based applications leads to memory-access problems, including communication delay, energy efficiency, and bandwidth requirements. We propose a bus-scheduling method for data placement on distributed local buffers in a deep learning accelerator (DLA). The contributions of this paper are: (1) a data-flow mapping method between off-chip DRAM and distributed local buffers, together with a flow-mapping approach for data transfer between distributed local buffers and processing elements (PEs); (2) the use of distributed local buffers in four directions to distribute traffic over a mesh based on the memory-access mechanism; and (3) bus scheduling for data placement on the distributed local buffers. Simulated experiments on typical DNN workloads (AlexNet, VGG-16, and GoogLeNet) demonstrate the effectiveness of the design: (1) the scheduling and mapping methods improve total runtime and bandwidth requirement by approximately 42.29% and 88.95%, respectively, compared with the TPU; and (2) our methods reduce total runtime by approximately 99% for row-column stationary plus compared with weight-stationary data flow in CONV1 and CONV11 of VGG-16. This work reports simulation results based on distributing the traffic of AlexNet, VGG-16, and GoogLeNet as popular CNN and DNN models; the method's efficiency for other trained models can be investigated in the same way.
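To make the bus-scheduling idea concrete, the following is a minimal, hypothetical Python sketch of placing layer tiles onto four direction-local buffers (N, S, E, W) of a mesh-based DLA and serializing the DRAM-to-buffer transfers on a shared bus. The buffer capacities, tile sizes, round-robin placement policy, and single-bus cost model are illustrative assumptions only, not the paper's actual mechanism or data flow.

# Illustrative sketch (not the paper's implementation): round-robin placement of
# layer tiles onto four direction-local buffers of a mesh-based DLA, plus a
# simple schedule for DRAM -> buffer transfers over one shared bus.
# Buffer sizes, tile sizes, and the bus model are hypothetical parameters.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class LocalBuffer:
    direction: str              # one of "N", "S", "E", "W"
    capacity: int               # capacity in bytes (hypothetical)
    used: int = 0
    tiles: list = field(default_factory=list)

    def fits(self, size: int) -> bool:
        return self.used + size <= self.capacity

    def place(self, tile_id: str, size: int) -> None:
        self.tiles.append(tile_id)
        self.used += size

def schedule_transfers(tiles, buffers, bus_bytes_per_cycle=64):
    """Greedily place tiles on direction buffers in round-robin order and
    count bus cycles, assuming all transfers serialize on one DRAM bus."""
    order = deque(buffers)
    schedule, cycle = [], 0
    for tile_id, size in tiles:
        # Try each buffer once, starting from the next one in round-robin order.
        for _ in range(len(order)):
            buf = order[0]
            order.rotate(-1)
            if buf.fits(size):
                buf.place(tile_id, size)
                cycles = -(-size // bus_bytes_per_cycle)   # ceiling division
                schedule.append((cycle, tile_id, buf.direction, cycles))
                cycle += cycles
                break
        else:
            raise RuntimeError(f"no buffer can hold tile {tile_id}")
    return schedule, cycle

if __name__ == "__main__":
    buffers = [LocalBuffer(d, capacity=32 * 1024) for d in ("N", "S", "E", "W")]
    # Toy tiles standing in for CONV-layer weight/ifmap blocks.
    tiles = [(f"tile{i}", 4096) for i in range(8)]
    schedule, total = schedule_transfers(tiles, buffers)
    for start, tile, direction, cycles in schedule:
        print(f"cycle {start:4d}: {tile} -> buffer {direction} ({cycles} bus cycles)")
    print("total bus cycles:", total)

Under these assumptions the sketch spreads consecutive tiles across the four buffers, which mirrors the traffic-distribution intent described in contribution (2), while the cycle counter gives a first-order view of the bus occupancy that the proposed scheduling aims to reduce.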





Funding

No funding was received.

Author information


Contributions

SYHM and MR read and approved the final manuscript. SYHM, MR, NB, and AK contributed equally.

Corresponding author

Correspondence to Midia Reshadi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Mirmahaleh, S.Y.H., Reshadi, M., Bagherzadeh, N. et al. Data scheduling and placement in deep learning accelerator. Cluster Comput 24, 3651–3669 (2021). https://doi.org/10.1007/s10586-021-03355-8

