
The Data Flow and Architectural Optimizations for a Highly Efficient CNN Accelerator Based on the Depthwise Separable Convolution


Abstract

This paper presents the design and implementation of a convolutional neural network (CNN) accelerator for embedded and edge computing systems. Specifically, a novel processing flow is proposed so that data already stored in the accelerator are maximally reused. This greatly reduces the on-chip storage requirements and the number of off-chip memory accesses, yielding significant reductions in memory-access delay and area complexity. Based on the proposed data processing flow, a highly efficient VLSI architecture is designed and implemented. The architecture is built on a pipelined structure and maximizes the utilization efficiency of the hardware components. The implemented circuit is synthesized, placed, and routed with TSMC 90 nm technology, and its performance and area complexity are evaluated based on post-layout estimations. The experimental results show that the proposed CNN accelerator achieves a throughput of 44.06 Giga-MAC/s with a complexity of 5909 KGEs. Furthermore, the design delivers 79.1 frames per second (fps) at a clock frequency of 250 MHz. Compared to state-of-the-art accelerators, the proposed architecture achieves significantly higher efficiency.
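
As background for the abstract, the following short Python sketch contrasts the multiply-accumulate (MAC) counts of a standard convolution layer and its depthwise separable factorization, the operation targeted by the proposed accelerator. It is a minimal illustration only: the layer dimensions are hypothetical, not taken from the paper, and the code does not represent the authors' hardware design.

    # Hypothetical layer dimensions for illustration (not taken from the paper)
    H, W = 112, 112       # output feature-map height and width
    C_in, C_out = 32, 64  # input and output channels
    K = 3                 # kernel size (K x K)

    # Standard convolution: every output channel filters all input channels.
    macs_standard = H * W * C_in * C_out * K * K

    # Depthwise separable convolution factors this into
    #  (1) a depthwise K x K convolution applied to each input channel, and
    #  (2) a 1 x 1 pointwise convolution that mixes the channels.
    macs_separable = H * W * C_in * K * K + H * W * C_in * C_out

    print(f"standard : {macs_standard:,} MACs")
    print(f"separable: {macs_separable:,} MACs")
    print(f"reduction: {macs_standard / macs_separable:.1f}x")

With these assumed dimensions the factorization cuts the MAC count by roughly a factor of 7.9. As a rough reading of the reported figures, a throughput of 44.06 Giga-MAC/s at a 250 MHz clock corresponds to about 176 MAC operations per cycle.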




Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.



Acknowledgements

This work was supported in part by the Ministry of Science and Technology, Taiwan, under grants MOST 109-2221-E-011-142 and 110-2221-E-011-155. The authors would like to thank Prof. Gerd Ascheid and Dr. Andreas Bytyn of RWTH Aachen University for their valuable input regarding the design of the CNN accelerator.

Author information


Corresponding author

Correspondence to Chung-An Shen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Lin, HJ., Shen, CA. The Data Flow and Architectural Optimizations for a Highly Efficient CNN Accelerator Based on the Depthwise Separable Convolution. Circuits Syst Signal Process 41, 3547–3569 (2022). https://doi.org/10.1007/s00034-022-01952-5


