DOI: 10.1145/3332466.3374520

Optimizing batched Winograd convolution on GPUs

Published: 19 February 2020

Abstract

In this paper, we present an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to a 2.13X speedup on Volta V100 and up to a 2.65X speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation reaches up to 93% of device peak performance.
Apart from analyzing and benchmarking different high-level optimization options, we also build TuringAs, a SASS assembler for Volta and Turing that enables performance tuning at the native-assembly level. The new optimization opportunities uncovered by TuringAs not only improve our Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, it is the first publicly available assembler for Volta and Turing GPUs.
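
To make the arithmetic saving behind the abstract concrete, below is a minimal sketch of the standard 1D Winograd minimal-filtering algorithm F(2,3) as a CUDA kernel: each tile produces two outputs of a 3-tap convolution with 4 multiplications instead of 6, and the 2D variants used in practice nest this transform. This is a textbook illustration only; the kernel name winograd_f23, the launch configuration, and the host driver are hypothetical, not the paper's optimized batched implementation.

    #include <cstdio>
    #include <cuda_runtime.h>

    // F(2,3) minimal filtering (Winograd): each tile reads 4 inputs
    // d[2t..2t+3] and writes 2 outputs, using 4 multiplications where
    // direct cross-correlation would use 6.
    __global__ void winograd_f23(const float *d, const float *g,
                                 float *y, int n_tiles) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n_tiles) return;
        const float *x = d + 2 * t;
        // Filter transform (in real kernels this is precomputed per filter).
        float G0 = g[0];
        float G1 = 0.5f * (g[0] + g[1] + g[2]);
        float G2 = 0.5f * (g[0] - g[1] + g[2]);
        float G3 = g[2];
        // Element-wise products in the transformed domain.
        float m0 = (x[0] - x[2]) * G0;
        float m1 = (x[1] + x[2]) * G1;
        float m2 = (x[2] - x[1]) * G2;
        float m3 = (x[1] - x[3]) * G3;
        // Inverse (output) transform.
        y[2 * t]     = m0 + m1 + m2;   // == x[0]g[0] + x[1]g[1] + x[2]g[2]
        y[2 * t + 1] = m1 - m2 - m3;   // == x[1]g[0] + x[2]g[1] + x[3]g[2]
    }

    int main() {
        const int N = 10;              // input length
        const int T = (N - 2) / 2;     // number of 2-wide output tiles
        float h_d[N], h_g[3] = {1.f, 2.f, 3.f}, h_y[2 * T];
        for (int i = 0; i < N; ++i) h_d[i] = (float)i;

        float *d_d, *d_g, *d_y;
        cudaMalloc(&d_d, sizeof(h_d));
        cudaMalloc(&d_g, sizeof(h_g));
        cudaMalloc(&d_y, sizeof(h_y));
        cudaMemcpy(d_d, h_d, sizeof(h_d), cudaMemcpyHostToDevice);
        cudaMemcpy(d_g, h_g, sizeof(h_g), cudaMemcpyHostToDevice);

        winograd_f23<<<1, 32>>>(d_d, d_g, d_y, T);
        cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);

        // Each y[i] should equal the direct result
        // h_d[i]*1 + h_d[i+1]*2 + h_d[i+2]*3.
        for (int i = 0; i < 2 * T; ++i) printf("y[%d] = %g\n", i, h_y[i]);

        cudaFree(d_d); cudaFree(d_g); cudaFree(d_y);
        return 0;
    }

Nesting the same transform in two dimensions gives F(2x2, 3x3): 16 multiplications per 2x2 output tile instead of 36, the 2.25x arithmetic reduction that Winograd-based convolution kernels such as the ones studied here exploit.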

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI: 10.1145/3332466

Publisher

Association for Computing Machinery, New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. GPU
  2. convolution
  3. performance

Qualifiers

  • Research-article

Funding Sources

  • HK Research Grants Council

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%).
Overall acceptance rate: 230 of 1,014 submissions (23%).
