Abstract
Lightweight convolutional neural networks (CNNs) reduce computational workloads compared to conventional CNNs, making them suitable for embedded devices with limited hardware resources. Depthwise separable convolution (DSC) is the fundamental convolution unit of lightweight CNNs. This paper introduces an application-specific integrated circuit (ASIC) hardware accelerator tailored for DSC, featuring a unified engine that supports both depthwise convolution (DWC) and pointwise convolution (PWC) with high hardware utilization: it sustains 100% processing element (PE) array utilization for DWC and up to 98% for PWC while minimizing latency. Partitioning the input feature map (ifmap) static random-access memory (SRAM) into three banks streamlines memory access. Furthermore, a data scheduling strategy, together with a multiplexed register (MR) bank-based first-in-first-out (FIFO) system between adjacent PEs, maximizes data reuse and reduces latency. The design is implemented in a 22 nm FDSOI technology and validated on the CIFAR-10 dataset using the MobileNetV1 architecture. The proposed DSC accelerator operates at 1 GHz, achieving an energy efficiency of 5.07 (3.96) TOPS/W and an area efficiency of 519.2 (461.52) GOPS/mm\(^{2}\) for DWC (PWC) at 0.8 V. Scaling the supply voltage down to 0.5 V increases the energy efficiency to 13.64 TOPS/W for DWC and 10.64 TOPS/W for PWC.
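To make the DWC/PWC terminology concrete, the following is a minimal NumPy sketch of the two stages that make up a depthwise separable convolution: DWC applies one spatial filter per input channel, and PWC then mixes channels with a 1x1 convolution. This is an illustrative reference model only, not the accelerator's datapath; all tensor shapes and function names here are our own choices for illustration.

```python
import numpy as np

def depthwise_conv(ifmap, dw_kernels):
    """Depthwise convolution: one K x K filter per channel, no channel mixing.
    ifmap: (C, H, W), dw_kernels: (C, K, K) -> output: (C, H-K+1, W-K+1)."""
    C, H, W = ifmap.shape
    _, K, _ = dw_kernels.shape
    out = np.zeros((C, H - K + 1, W - K + 1))
    for c in range(C):
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[c, i, j] = np.sum(ifmap[c, i:i+K, j:j+K] * dw_kernels[c])
    return out

def pointwise_conv(fmap, pw_kernels):
    """Pointwise (1x1) convolution: a per-pixel matmul that mixes channels.
    fmap: (C, H, W), pw_kernels: (M, C) -> output: (M, H, W)."""
    return np.tensordot(pw_kernels, fmap, axes=([1], [0]))

# A depthwise separable convolution simply chains the two stages.
x = np.random.rand(8, 6, 6)                    # 8-channel 6x6 ifmap
y = pointwise_conv(depthwise_conv(x, np.random.rand(8, 3, 3)),
                   np.random.rand(16, 8))      # 16 output channels
print(y.shape)  # (16, 4, 4)
```

Because the spatial filtering (DWC) and the channel mixing (PWC) have such different compute and data-reuse patterns, a unified engine that keeps the PE array busy for both stages, as targeted in this work, is the key design challenge.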
Acknowledgments
This work is partially funded by the Federal Ministry of Education and Research (BMBF, Germany) under the projects NEUROTEC II (project number 16ME0399) and NeuroSys (project number 03ZU1106CA).
Copyright information
© 2024 IFIP International Federation for Information Processing
Cite this paper
Chen, Y., Lou, J., Lanius, C., Freye, F., Loh, J., Gemmeke, T. (2024). A Unified and Energy-Efficient Depthwise Separable Convolution Accelerator. In: Elfadel, I.M., Albasha, L. (eds.) VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence. VLSI-SoC 2023. IFIP Advances in Information and Communication Technology, vol 680. Springer, Cham. https://doi.org/10.1007/978-3-031-70947-0_7
Print ISBN: 978-3-031-70946-3
Online ISBN: 978-3-031-70947-0