
Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion

  • Conference paper
Network and Parallel Computing (NPC 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13615)


Abstract

Convolution operations are the essential components of modern CNNs (Convolutional Neural Networks), and they are also the most time-consuming. Several fast convolution algorithms, including FFT and Winograd, have been proposed to address this problem. Winograd convolution improves the inference performance of convolution operators with small kernels, which dominate the current popular CNNs. However, the implementations of Winograd convolution in many highly optimized deep neural network libraries and deep learning compilers are inefficient. Due to the complex data dependencies among the four stages of Winograd convolution, it is very challenging to optimize. In this paper, we improve the inference performance of the Winograd convolution operator on GPUs. We propose a sync-free implementation of the calculation stage of Winograd convolution, and we further propose PKF (Partial Kernel Fusion) methods that exploit the different memory levels of GPUs. We implemented PKF-Reconstructor based on TVM for PKF Winograd convolution. Evaluations on convolution operators from real-world CNNs show that our method achieves a speedup of 8.22×–13.69× over cuDNN and 4.89×–9.10× over the fastest vanilla TVM Winograd implementation.

This work is supported in part by NSFC (No. 61872374, 62090023, 62172430), NSFHN (No. 2022JJ10064, 2021JJ10052) and NKRDP (No. 2021YFB0300300).
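To make the four-stage pipeline mentioned in the abstract concrete, below is a minimal single-tile NumPy sketch of Winograd F(2×2, 3×3), using the standard transform matrices from Lavin and Gray. This is only an illustration of the stage structure, not the paper's PKF implementation: in a real GPU kernel the element-wise stage becomes a batched GEMM across channels and tiles, and it is the dependencies among these four stages that PKF fuses across memory levels.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (Lavin & Gray).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]], dtype=np.float32)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
    U = G @ g @ G.T      # stage 1: filter transform (4x4)
    V = BT @ d @ BT.T    # stage 2: input transform (4x4)
    M = U * V            # stage 3: element-wise product (a batched GEMM over channels/tiles in practice)
    return AT @ M @ AT.T # stage 4: output transform (2x2)

# Sanity check against direct (correlation-style) convolution.
d = np.random.rand(4, 4).astype(np.float32)
g = np.random.rand(3, 3).astype(np.float32)
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), ref, atol=1e-4)
```

Each stage here consumes the full output of the previous one, which is why a naive GPU mapping launches one kernel per stage and pays for global-memory round trips and synchronization between launches; keeping intermediate tiles in registers or shared memory is the motivation for fusing stages.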



Author information


Corresponding author

Correspondence to Libo Huang.



Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper


Cite this paper

Tong, G. et al. (2022). Optimizing Winograd Convolution on GPUs via Partial Kernel Fusion. In: Liu, S., Wei, X. (eds) Network and Parallel Computing. NPC 2022. Lecture Notes in Computer Science, vol 13615. Springer, Cham. https://doi.org/10.1007/978-3-031-21395-3_2


  • DOI: https://doi.org/10.1007/978-3-031-21395-3_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21394-6

  • Online ISBN: 978-3-031-21395-3

  • eBook Packages: Computer Science, Computer Science (R0)
