DOI: 10.1145/3332466.3374520

Optimizing batched Winograd convolution on GPUs

Published: 19 February 2020

Abstract

In this paper, we present an optimized implementation of single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to a 2.13X speedup on Volta V100 and up to a 2.65X speedup on Turing RTX2070. On both Volta and Turing GPUs, our implementation reaches up to 93% of device peak performance.
Apart from analyzing and benchmarking different high-level optimization options, we also build TuringAs, a SASS assembler for Volta and Turing that enables performance tuning at the native-assembly level. The new optimization opportunities uncovered by TuringAs not only improve our Winograd convolution but can also benefit CUDA compilers and native assembly programming. We have released TuringAs as open-source software. To the best of our knowledge, it is the first publicly available assembler for Volta and Turing GPUs.
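
To make the arithmetic saving behind the abstract concrete, below is a minimal sketch of the standard 1D Winograd minimal-filtering algorithm F(2,3) as a CUDA kernel: each tile produces two outputs of a 3-tap convolution with 4 multiplications instead of 6, and the 2D variants used in practice nest this transform. This is a textbook illustration only; the kernel name winograd_f23, the launch configuration, and the host driver are hypothetical, not the paper's optimized batched implementation.

    #include <cstdio>
    #include <cuda_runtime.h>

    // F(2,3) minimal filtering (Winograd): each tile reads 4 inputs
    // d[2t..2t+3] and writes 2 outputs, using 4 multiplications where
    // direct cross-correlation would use 6.
    __global__ void winograd_f23(const float *d, const float *g,
                                 float *y, int n_tiles) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= n_tiles) return;
        const float *x = d + 2 * t;
        // Filter transform (in real kernels this is precomputed per filter).
        float G0 = g[0];
        float G1 = 0.5f * (g[0] + g[1] + g[2]);
        float G2 = 0.5f * (g[0] - g[1] + g[2]);
        float G3 = g[2];
        // Element-wise products in the transformed domain.
        float m0 = (x[0] - x[2]) * G0;
        float m1 = (x[1] + x[2]) * G1;
        float m2 = (x[2] - x[1]) * G2;
        float m3 = (x[1] - x[3]) * G3;
        // Inverse (output) transform.
        y[2 * t]     = m0 + m1 + m2;   // == x[0]g[0] + x[1]g[1] + x[2]g[2]
        y[2 * t + 1] = m1 - m2 - m3;   // == x[1]g[0] + x[2]g[1] + x[3]g[2]
    }

    int main() {
        const int N = 10;              // input length
        const int T = (N - 2) / 2;     // number of 2-wide output tiles
        float h_d[N], h_g[3] = {1.f, 2.f, 3.f}, h_y[2 * T];
        for (int i = 0; i < N; ++i) h_d[i] = (float)i;

        float *d_d, *d_g, *d_y;
        cudaMalloc(&d_d, sizeof(h_d));
        cudaMalloc(&d_g, sizeof(h_g));
        cudaMalloc(&d_y, sizeof(h_y));
        cudaMemcpy(d_d, h_d, sizeof(h_d), cudaMemcpyHostToDevice);
        cudaMemcpy(d_g, h_g, sizeof(h_g), cudaMemcpyHostToDevice);

        winograd_f23<<<1, 32>>>(d_d, d_g, d_y, T);
        cudaMemcpy(h_y, d_y, sizeof(h_y), cudaMemcpyDeviceToHost);

        // Each y[i] should equal the direct result
        // h_d[i]*1 + h_d[i+1]*2 + h_d[i+2]*3.
        for (int i = 0; i < 2 * T; ++i) printf("y[%d] = %g\n", i, h_y[i]);

        cudaFree(d_d); cudaFree(d_g); cudaFree(d_y);
        return 0;
    }

Nesting the same transform in two dimensions gives F(2x2, 3x3): 16 multiplications per 2x2 output tile instead of 36, the 2.25x arithmetic reduction that Winograd-based convolution kernels such as the ones studied here exploit.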

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI: 10.1145/3332466

Publisher

Association for Computing Machinery, New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. GPU
  2. convolution
  3. performance

Qualifiers

  • Research-article

Funding Sources

  • HK Research Grants Council

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions (23%).
Overall acceptance rate: 230 of 1,014 submissions (23%).
