research-article

Break down GPU execution time with an analytical method

Authors:

Junjie Lai,

André Seznec

RAPIDO '12: Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools

Pages 33 - 39

https://doi.org/10.1145/2162131.2162136

Published: 23 January 2012

Abstract

Modern GPGPUs provide significant computing power and very high memory bandwidth, and developer-friendly programming interfaces such as CUDA have been introduced. As a result, GPGPUs are increasingly accepted in the HPC research area. Much research has been done to help developers better optimize GPU applications, but fully understanding GPU performance behavior remains a hot research topic.

We developed an analytical tool called TEG (Timing Estimation tool for GPU) to estimate GPU performance. Previous work shows that TEG approximates execution time well and can help quantify the performance effects of bottlenecks. We have made some improvements to the tool, and in this paper we use TEG to analyze GPU performance scaling behavior. TEG takes the disassembly output of the CUDA kernel binary code and an instruction trace as input. It does not execute the code, but tries to model the execution of CUDA code with timing information. Because TEG takes the native GPU assembly code as input, it can estimate the execution time with a small error, and it gives more insight into GPU performance results.
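To make the idea of trace-driven timing estimation concrete, here is a minimal Python sketch. It is not TEG's model: the function name `estimate_cycles`, the opcode names, the latency values, and the scheduling policy (one shared issue slot per cycle, lowest-numbered ready warp first, no overlap of instructions within a warp) are all assumptions chosen for illustration. It only shows how a timing estimate and a stall breakdown can be derived from an instruction trace without executing the kernel.

```python
# Hypothetical per-opcode latencies in cycles. These are illustrative
# placeholders, not measurements of any real GPU and not values used by TEG.
LATENCY = {"FADD": 4, "FMUL": 4, "LD_GLOBAL": 400, "ST_GLOBAL": 400}


def estimate_cycles(trace, num_warps, latency=LATENCY):
    """Estimate cycles for num_warps warps that each replay `trace` in order.

    Toy assumptions: a single issue slot per cycle is shared by all warps,
    and a warp cannot issue its next instruction until its previous one has
    completed. Returns (total_cycles, cycles_with_no_warp_ready).
    """
    ready_at = [0] * num_warps  # cycle at which each warp may issue again
    pc = [0] * num_warps        # index of the next instruction per warp
    cycle = 0
    stall_cycles = 0
    finish = 0
    remaining = num_warps * len(trace)

    while remaining:
        # Warps that still have work and whose previous instruction is done.
        ready = [w for w in range(num_warps)
                 if pc[w] < len(trace) and ready_at[w] <= cycle]
        if ready:
            w = ready[0]  # lowest-numbered ready warp issues this cycle
            ready_at[w] = cycle + latency[trace[pc[w]]]
            finish = max(finish, ready_at[w])
            pc[w] += 1
            remaining -= 1
            cycle += 1
        else:
            # Nothing can issue: skip ahead to the next completion time and
            # account the gap as stall cycles.
            nxt = min(ready_at[w] for w in range(num_warps)
                      if pc[w] < len(trace))
            stall_cycles += nxt - cycle
            cycle = nxt
    return finish, stall_cycles


if __name__ == "__main__":
    trace = ["LD_GLOBAL", "FADD", "ST_GLOBAL"]
    for warps in (1, 2, 8):
        total, stalls = estimate_cycles(trace, warps)
        print(f"{warps} warps: {total} cycles, {stalls} stall cycles")
```

Running the sketch with more warps shows stall cycles shrinking as other warps fill the issue slot while one waits on memory, which is the kind of scaling behavior an analytical timing model is meant to expose.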

Index Terms

Break down GPU execution time with an analytical method
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies
    2. Simulation evaluation

Published In

RAPIDO '12: Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools

January 2012

44 pages

ISBN:9781450311144

DOI:10.1145/2162131

Conference Chairs:
Daniel Gracia Pérez (CEA LIST, France),
Smail Niar (University of Valenciennes/INRIA, France),
Cristina Silvano (Politecnico di Milano, Italy),
Morteza Biglari Abhari (The University of Auckland, New Zealand)

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

RAPIDO '12

Sponsor:

HiPEAC

RAPIDO '12: Methods and Tools

January 23, 2012

Paris, France

Acceptance Rates

Overall Acceptance Rate 14 of 28 submissions, 50%
