High performance dilated convolutions on multi-core DSPs

Wang, Yang; Wang, Qinglin; Pei, Xiangdong; Mei, Songzhu; Li, Rongchun; Liu, Jie

doi:10.1007/s42514-023-00166-8

Regular Paper
Published: 09 September 2023

Volume 6, pages 78–93 (2024)

CCF Transactions on High Performance Computing

Yang Wang^3,
Qinglin Wang^1,2 (ORCID: orcid.org/0000-0002-8286-6566),
Xiangdong Pei^1,2,
Songzhu Mei^1,2,
Rongchun Li^1,2 &
Jie Liu^1,2

Abstract

Dilated convolutions are widely used to obtain wide receptive fields while preserving the resolution of feature maps in deep learning applications such as semantic segmentation and object detection. However, data locality in dilated convolutions deteriorates rapidly as the dilation rate increases, which poses a great challenge for high-performance direct implementations of convolutions. Multi-core digital signal processors (DSPs) with software-controlled on-chip memories allow programmers to move data between on-chip and off-chip memories explicitly, which may make them well suited to the direct implementation of dilated convolutions. In this paper, we introduce a high-performance parallel direct implementation of dilated convolutions on multi-core DSPs in a CPU-DSP heterogeneous prototype processor, which can effectively capture the data locality in dilated convolutions. The experimental results demonstrate that the direct implementation achieves much better performance than GEMM-based ones on multi-core DSPs for all the tested layers, and achieves much higher efficiency than the high-performance libraries on three other architectures in cases with large feature maps. In addition, the direct implementation exhibits good scalability.
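
To make the locality issue concrete, the following is a minimal single-channel sketch of a direct dilated convolution in plain Python. It is an illustration only and not the parallel DSP implementation the paper describes; the function names `effective_kernel_size` and `dilated_conv2d` are hypothetical, and the sketch assumes stride 1 and no padding, so the output shrinks here (in practice padding is typically added to keep the feature-map resolution).

```python
def effective_kernel_size(k, d):
    """Span covered by a k-tap kernel with dilation rate d (hypothetical helper)."""
    return d * (k - 1) + 1


def dilated_conv2d(x, w, d):
    """Direct 2D dilated convolution (cross-correlation form).

    Assumptions of this sketch: single channel, stride 1, no padding.
    x: H x W input as a list of lists
    w: K x K kernel as a list of lists
    d: dilation rate (d = 1 gives an ordinary convolution)
    """
    H, W = len(x), len(x[0])
    K = len(w)
    span = effective_kernel_size(K, d)
    out_h, out_w = H - span + 1, W - span + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for r in range(K):
                for s in range(K):
                    # Neighbouring kernel taps read inputs that are d elements
                    # apart in each dimension, so accesses become strided
                    # rather than contiguous as d grows.
                    acc += w[r][s] * x[i + r * d][j + s * d]
            out[i][j] = acc
    return out


# 7 x 7 input holding 0..48 in row-major order, and a 3 x 3 kernel of ones.
x = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
w = [[1.0] * 3 for _ in range(3)]

y1 = dilated_conv2d(x, w, 1)  # each output sees a 3 x 3 input window
y2 = dilated_conv2d(x, w, 2)  # each output sees a 5 x 5 input window
```

With the same nine weights, raising the dilation rate from 1 to 2 widens the window each output depends on from 3 x 3 to 5 x 5, while the inputs it reads are spread further apart in memory.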

Data availability

Data will be made available on request.

Funding

This work was supported by the National Natural Science Foundation of China under Grant nos. 62002365 and 62025208, and the National Key Research and Development Program of China under Grant no. 2021YFB0300101.

Author information

Authors and Affiliations

National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha, 410073, Hunan, China
Qinglin Wang, Xiangdong Pei, Songzhu Mei, Rongchun Li & Jie Liu
College of Computer, National University of Defense Technology, Changsha, 410073, Hunan, China
Qinglin Wang, Xiangdong Pei, Songzhu Mei, Rongchun Li & Jie Liu
Beijing Institute of Astronautical Systems Engineering, Beijing, 100076, China
Yang Wang

Corresponding author

Correspondence to Qinglin Wang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Wang, Q., Pei, X. et al. High performance dilated convolutions on multi-core DSPs. CCF Trans. HPC 6, 78–93 (2024). https://doi.org/10.1007/s42514-023-00166-8

Received: 08 July 2023
Accepted: 28 August 2023
Published: 09 September 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s42514-023-00166-8
