DOI: 10.1145/3489517.3530584

GTuner: tuning DNN computations on GPU via graph attention network

Published: 23 August 2022

Abstract

Compiling DNN models for GPUs and improving their performance remains an open problem. We propose GTuner, a novel framework that jointly learns from the structures of computational graphs and the statistical features of the codes to find optimal code implementations. A Graph ATtention network (GAT) is designed as the performance estimator in GTuner: graph neural layers propagate information through the computational graph, and a multi-head self-attention module learns the complicated relationships between the features. Under the guidance of GAT, GPU codes are generated through auto-tuning. Experimental results demonstrate that our method remarkably outperforms previous approaches.
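
To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of a GAT-style performance estimator of the kind the abstract outlines: graph neural layers propagate node information over the computational graph, a multi-head self-attention module models interactions between node features, and the pooled embedding is mapped to a scalar performance score used to rank candidate GPU code configurations. This is not the authors' implementation; the layer sizes, the use of nn.MultiheadAttention, the mean pooling, and all class and parameter names are assumptions for illustration only.

```python
# Hypothetical sketch (not the GTuner implementation) of a GAT-style
# performance estimator: graph layers + multi-head self-attention.
import torch
import torch.nn as nn


class GraphLayer(nn.Module):
    """One message-passing step: aggregate neighbor features through the
    (row-normalized) adjacency matrix, then apply a linear transform."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (batch, nodes, in_dim) node features
        # adj: (batch, nodes, nodes) adjacency, assumed row-normalized
        return torch.relu(self.linear(adj @ x))


class PerformanceEstimator(nn.Module):
    """Predicts a scalar performance score for a candidate configuration
    from the features of its computational graph (illustrative only)."""

    def __init__(self, feat_dim, hidden_dim=64, heads=4):
        super().__init__()
        self.gnn = nn.ModuleList([
            GraphLayer(feat_dim, hidden_dim),
            GraphLayer(hidden_dim, hidden_dim),
        ])
        # Multi-head self-attention over node embeddings, modeling
        # relationships between features beyond graph neighborhoods.
        self.attn = nn.MultiheadAttention(hidden_dim, heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x, adj):
        for layer in self.gnn:
            x = layer(x, adj)          # propagate information in the graph
        x, _ = self.attn(x, x, x)      # self-attention across nodes
        x = x.mean(dim=1)              # pool node embeddings to one vector
        return self.head(x).squeeze(-1)  # predicted performance score


# Usage sketch: score a batch of candidate configurations during tuning.
model = PerformanceEstimator(feat_dim=16)
x = torch.randn(8, 32, 16)                  # 8 candidates, 32 nodes each
adj = torch.rand(8, 32, 32)
adj = adj / adj.sum(dim=-1, keepdim=True)   # row-normalize adjacency
scores = model(x, adj)                      # shape: (8,)
```

In an auto-tuning loop of this kind, such an estimator would be trained on measured runtimes and then used to rank unmeasured candidates so that only the most promising ones are compiled and profiled on the GPU.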




Published In

DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference
July 2022, 1462 pages
ISBN: 9781450391429
DOI: 10.1145/3489517

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

  • Research-article

Conference

DAC '22: 59th ACM/IEEE Design Automation Conference
July 10-14, 2022
San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%


Article Metrics

  • Downloads (last 12 months): 51
  • Downloads (last 6 weeks): 3
Reflects downloads up to 05 Mar 2025


Cited By

View all
  • (2024)Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2023.332563035:1(20-33)Online publication date: 1-Jan-2024
  • (2024)GPU Performance Optimization via Intergroup Cache CooperationIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2024.344370743:11(4142-4153)Online publication date: Nov-2024
  • (2024)ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag ArrayIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.333719243:5(1429-1441)Online publication date: May-2024
  • (2024)Convergence-aware operator-wise mixed-precision trainingCCF Transactions on High Performance Computing10.1007/s42514-024-00208-97:1(43-57)Online publication date: 31-Dec-2024
  • (2023)GTCO: Graph and Tensor Co-Design for Transformer-Based Image Recognition on Tensor CoresIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2023.331716943:2(586-599)Online publication date: 19-Sep-2023
