DOI: 10.1145/3493229.3493305
Efficient Application of Tensor Core Units for Convolving Images

Published: 13 November 2021

Abstract

Tensor Core Units (TCUs) are a domain-specific architecture capable of executing small Matrix Multiply-Accumulates (MMAs) in a single clock cycle, showing significant performance improvements over other optimized implementations. When Convolutional Neural Networks (CNNs) are accelerated using TCUs, the layout of the input image is transformed so that a large number of filters can be applied to an image using a single large matrix-matrix multiplication. However, applications in other domains require only a small number of filters. To accommodate such applications, we first show that this standard technique of transforming the data layout is ill-suited to them. Subsequently, we propose an approach that uses TCUs to convolve one filter with an image, and we introduce several optimizations of this method. Finally, we evaluate the performance of our approach and its optimizations by comparing it to code generated using a state-of-the-art image processing language.
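
The layout transformation mentioned in the abstract can be illustrated with a small sketch. It assumes the transformation is the one commonly called im2col (the abstract does not name it), and it uses plain Python lists instead of TCU hardware; the function names `im2col`, `matmul`, and `convolve_filters` are hypothetical and chosen only for this illustration. Every k x k patch of the image becomes one matrix row and every filter becomes one matrix column, so all filters are applied with a single matrix product.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image (list of rows) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            rows.append([image[y + dy][x + dx]
                         for dy in range(k) for dx in range(k)])
    return rows


def matmul(a, b):
    """Plain matrix product; this is the step a TCU would execute in tiles."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def convolve_filters(image, filters):
    """Apply several k x k filters at once (cross-correlation, no padding)."""
    k = len(filters[0])
    patches = im2col(image, k)  # shape: (number of patches, k*k)
    # One column per flattened filter, shape: (k*k, number of filters).
    weights = [[f[i // k][i % k] for f in filters] for i in range(k * k)]
    return matmul(patches, weights)  # shape: (number of patches, number of filters)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1], [1, 1]]    # sums each 2x2 patch
diag = [[1, 0], [0, -1]]  # top-left minus bottom-right
result = convolve_filters(image, [box, diag])
print(result)  # [[12, -4], [16, -4], [24, -4], [28, -4]]
```

With a single filter, the weight matrix in this sketch shrinks to one column. A fixed-size MMA tile would then be filled mostly with padding, which is one plausible reading of why the abstract calls the standard transformation inappropriate for applications with few filters.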

Published In

SCOPES '21: Proceedings of the 24th International Workshop on Software and Compilers for Embedded Systems
November 2021
48 pages
ISBN: 9781450391665
DOI: 10.1145/3493229
Editor: Sander Stuijk
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Convolution
  2. Image processing
  3. Parallel algorithm
  4. Tensor core unit

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SCOPES '21

Acceptance Rates

SCOPES '21 Paper Acceptance Rate 7 of 15 submissions, 47%;
Overall Acceptance Rate 38 of 79 submissions, 48%

Article Metrics

  • Downloads (last 12 months): 31
  • Downloads (last 6 weeks): 2
Reflects downloads up to 14 Feb 2025

Cited By

  • (2024) Exploring Hardware Fault Impacts on Different Real Number Representations of the Structural Resilience of TCUs in GPUs. Electronics 13:3 (578). DOI: 10.3390/electronics13030578. Online publication date: 31-Jan-2024
  • (2024) SoK: Fully Homomorphic Encryption Accelerators. ACM Computing Surveys 56:12 (1-32). DOI: 10.1145/3676955. Online publication date: 5-Jul-2024
  • (2024) Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats. VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence (149-176). DOI: 10.1007/978-3-031-70947-0_8. Online publication date: 29-Dec-2024
  • (2023) Analyzing the Impact of Different Real Number Formats on the Structural Reliability of TCUs in GPUs. 2023 IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC) (1-6). DOI: 10.1109/VLSI-SoC57769.2023.10321881. Online publication date: 16-Oct-2023
  • (2023) TensorCV: Accelerating Inference-Adjacent Computation Using Tensor Processors. 2023 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (1-6). DOI: 10.1109/ISLPED58423.2023.10244461. Online publication date: 7-Aug-2023
