Abstract
Dilated convolutions are widely used to achieve large receptive fields while preserving the resolution of feature maps in deep learning applications such as semantic segmentation and object detection. However, data locality in dilated convolutions deteriorates rapidly as the dilation rate increases, which makes a high-performance direct implementation of these convolutions challenging. Multi-core digital signal processors (DSPs) with software-controlled on-chip memories let programmers move data between on-chip and off-chip memories explicitly, which may make them well suited to the direct implementation of dilated convolutions. In this paper, we introduce a high-performance parallel direct implementation of dilated convolutions on the multi-core DSPs of a CPU-DSP heterogeneous prototype processor, which effectively captures the data locality in dilated convolutions. The experimental results demonstrate that the direct implementation achieves much better performance than GEMM-based implementations on multi-core DSPs for all the tested layers, and much higher efficiency than the high-performance libraries on three other architectures for layers with large feature maps. In addition, the direct implementation exhibits good scalability.
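To make the locality problem concrete, the following is a minimal reference sketch of a direct (loop-nest) dilated convolution in plain Python. It is not the paper's DSP implementation: it assumes a single channel, stride 1, and no padding, and the function names `dilated_conv2d` and `effective_kernel_size` are hypothetical names chosen for illustration. It shows that each output element reads input elements spaced `dilation` apart, so the span of input touched by one output grows with the dilation rate.

```python
def effective_kernel_size(k, dilation):
    """Input extent covered by a k-tap kernel whose taps are `dilation` apart."""
    return dilation * (k - 1) + 1


def dilated_conv2d(x, w, dilation=1):
    """Direct dilated 2D convolution (illustrative assumptions: single channel,
    stride 1, no padding). `x` and `w` are lists of lists of numbers."""
    in_h, in_w = len(x), len(x[0])
    k_h, k_w = len(w), len(w[0])

    # The output shrinks by the effective (dilated) kernel extent.
    out_h = in_h - effective_kernel_size(k_h, dilation) + 1
    out_w = in_w - effective_kernel_size(k_w, dilation) + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("input is smaller than the dilated kernel")

    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for r in range(k_h):
                for s in range(k_w):
                    # Neighbouring kernel taps read inputs `dilation` apart,
                    # which is why locality degrades as dilation grows.
                    acc += x[i + r * dilation][j + s * dilation] * w[r][s]
            out[i][j] = acc
    return out


if __name__ == "__main__":
    x = [[float(i * 5 + j) for j in range(5)] for i in range(5)]
    w = [[1.0] * 3 for _ in range(3)]
    print(dilated_conv2d(x, w, dilation=2))
    print(effective_kernel_size(3, 4))
```

With a 3-tap kernel, the effective extent is 3 at dilation 1, 5 at dilation 2, and 9 at dilation 4, so successive taps fall on increasingly distant input rows and columns.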
Data availability
Data will be made available on request.
Funding
This work was supported by the National Natural Science Foundation of China under Grant nos. 62002365 and 62025208, and the National Key Research and Development Program of China under Grant no. 2021YFBO300101.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Wang, Y., Wang, Q., Pei, X. et al. High performance dilated convolutions on multi-core DSPs. CCF Trans. HPC 6, 78–93 (2024). https://doi.org/10.1007/s42514-023-00166-8
DOI: https://doi.org/10.1007/s42514-023-00166-8