High performance dilated convolutions on multi-core DSPs

  • Regular Paper
  • CCF Transactions on High Performance Computing

Abstract

Dilated convolutions are widely used in deep learning applications such as semantic segmentation and object detection because they enlarge the receptive field while preserving the resolution of feature maps. However, data locality in dilated convolutions deteriorates rapidly as the dilation rate increases, which makes a high-performance direct implementation challenging. Multi-core digital signal processors (DSPs) with software-controlled on-chip memories let programmers explicitly move data between on-chip and off-chip memory, which may make them well suited to the direct implementation of dilated convolutions. In this paper, we introduce a high-performance parallel direct implementation of dilated convolutions on multi-core DSPs in a CPU-DSP heterogeneous prototype processor, which effectively captures the data locality in dilated convolutions. Experimental results demonstrate that the direct implementation achieves much better performance than GEMM-based implementations on multi-core DSPs for all tested layers, and much higher efficiency than the high-performance libraries on three other architectures for layers with large feature maps. The direct implementation also exhibits good scalability.
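
To make the locality issue concrete, the following is a minimal Python sketch of a direct dilated convolution on a single-channel 2D input with stride 1 and no padding. It is an illustration only, not the DSP implementation described in the paper; the function names `effective_kernel_size` and `dilated_conv2d` are hypothetical. A K-tap kernel with dilation rate d spans d(K-1)+1 input elements, and neighbouring taps read input values that are d elements apart, so larger dilation rates spread each output's reads over a wider region of memory.

```python
def effective_kernel_size(k, d):
    """Number of input elements spanned by a k-tap kernel with dilation rate d."""
    return d * (k - 1) + 1


def dilated_conv2d(x, w, d):
    """Direct dilated convolution (cross-correlation, as used in deep learning).

    Assumptions for this sketch: single channel, square kernel, stride 1,
    no padding. x and w are lists of lists of numbers.
    """
    height, width = len(x), len(x[0])
    k = len(w)
    span = effective_kernel_size(k, d)
    out_h, out_w = height - span + 1, width - span + 1
    y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0
            for r in range(k):
                for s in range(k):
                    # Neighbouring kernel taps read inputs d elements apart.
                    acc += w[r][s] * x[i + r * d][j + s * d]
            y[i][j] = acc
    return y


# 5x5 input holding the values 0..24 in row-major order, 3x3 all-ones kernel.
x = [[row * 5 + col for col in range(5)] for row in range(5)]
w = [[1] * 3 for _ in range(3)]

y1 = dilated_conv2d(x, w, d=1)  # dense 3x3 window, 3x3 output
y2 = dilated_conv2d(x, w, d=2)  # same 9 taps spread over a 5x5 window, 1x1 output
```

With d=1 each output reads a contiguous 3x3 patch, while with d=2 the same nine multiply-accumulates touch every other row and column of a 5x5 patch.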

Data availability

Data will be made available on request.

Funding

This work was supported by the National Natural Science Foundation of China under Grant nos. 62002365 and 62025208, and the National Key Research and Development Program of China under Grant no. 2021YFB0300101.

Author information

Corresponding author

Correspondence to Qinglin Wang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Wang, Q., Pei, X. et al. High performance dilated convolutions on multi-core DSPs. CCF Trans. HPC 6, 78–93 (2024). https://doi.org/10.1007/s42514-023-00166-8

  • DOI: https://doi.org/10.1007/s42514-023-00166-8
