Abstract
Convolution operations are essential components of modern CNNs (Convolutional Neural Networks) and are also the most time-consuming. Several fast convolution algorithms, including FFT-based and Winograd convolution, have been proposed to address this problem. Winograd convolution improves the inference performance of convolution operators with small kernels, which are the mainstream in current popular CNNs. However, the implementations of Winograd convolution in many highly optimized deep neural network libraries and deep learning compilers are inefficient: the complex data dependencies among the four stages of Winograd convolution make it very challenging to optimize. In this paper, we improve the inference performance of the Winograd convolution operator on GPUs. We propose a sync-free implementation of the computation stage of Winograd convolution and further propose PKF (Partial Kernel Fusion) methods that utilize the different memory levels of GPUs. We implemented PKF-Reconstructor, based on TVM, for PKF Winograd convolution. Evaluations on convolution operators from real-world CNNs show that our method achieves a speedup of 8.22×–13.69× over cuDNN and 4.89×–9.10× over the fastest vanilla TVM Winograd implementation.
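To make the four stages concrete, below is a minimal NumPy sketch of single-tile Winograd convolution for F(2×2, 3×3), the small-kernel case targeted here. The transform matrices follow Lavin and Gray's formulation; the function and tile handling are illustrative assumptions, not the paper's fused GPU implementation.

```python
# Minimal single-tile sketch of the four Winograd stages for F(2x2, 3x3).
# Illustrative only: real implementations batch over tiles and channels and,
# as in this paper, fuse stages to keep intermediates in fast GPU memory.
import numpy as np

# Transform matrices for F(2x2, 3x3) (Lavin & Gray):
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """Apply the four stages to one 4x4 input tile d and one 3x3 kernel g."""
    U = G @ g @ G.T       # stage 1: kernel transform
    V = Bt @ d @ Bt.T     # stage 2: input transform
    M = U * V             # stage 3: element-wise multiplication
    return At @ M @ At.T  # stage 4: output transform -> 2x2 output tile

# Sanity check against direct 2D cross-correlation on the same tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)
direct = np.array([[(d[i:i+3, j:j+3] * g).sum() for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct, atol=1e-4)
```

When the stages run as separate GPU kernels, each intermediate (U, V, M) makes a round trip through global memory; the partial kernel fusion proposed here instead keeps such intermediates in faster on-chip memory levels between stages.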
This work is supported in part by NSFC (No. 61872374, 62090023, 62172430), NSFHN (No. 2022JJ10064, 2021JJ10052) and NKRDP (No. 2021YFB0300300).
Copyright information
© 2022 IFIP International Federation for Information Processing
Cite this paper
Tong, G. et al. (2022). Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion. In: Liu, S., Wei, X. (eds) Network and Parallel Computing. NPC 2022. Lecture Notes in Computer Science, vol 13615. Springer, Cham. https://doi.org/10.1007/978-3-031-21395-3_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21394-6
Online ISBN: 978-3-031-21395-3