DOI: 10.1145/3470496.3527384

GCoM: a detailed GPU core model for accurate analytical modeling of modern GPUs

Published: 11 June 2022

Abstract

Analytical models can help computer architects perform early-stage design space exploration orders of magnitude faster than cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models which capture the first-order stall events that degrade performance; however, the existing analytical models cannot accurately model modern GPUs because their assumptions about the GPU core microarchitecture are outdated and highly abstract. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model which accurately captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs.
We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by the existing GPU analytical models. First, GCoM identifies the compute structural stall events caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by sectored L1 data caches, which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves modeling accuracy, achieving a mean absolute error of 10.0% against the Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
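The sectored-cache behavior the abstract refers to can be illustrated with a minimal sketch (this is not the paper's model, and the class, parameters, and return labels below are hypothetical): each cache line holds one tag plus per-sector valid bits, so a tag match on an unfetched sector still costs a memory fetch, which is the kind of memory data stall GCoM accounts for.

```python
# Illustrative sketch (hypothetical, not from the paper): a minimal
# sectored-cache model. A line is split into sectors sharing one tag;
# a tag hit with the sector's valid bit unset still fetches from memory.

class SectoredCache:
    def __init__(self, line_size=128, sector_size=32):
        self.line_size = line_size
        self.sector_size = sector_size
        self.lines = {}  # tag -> set of valid sector indices

    def access(self, addr):
        tag = addr // self.line_size
        sector = (addr % self.line_size) // self.sector_size
        valid = self.lines.get(tag)
        if valid is None:
            # Line miss: allocate the line but fill only the accessed sector.
            self.lines[tag] = {sector}
            return "line_miss"
        if sector not in valid:
            # Sector miss: the tag matches, but this sector must be fetched.
            valid.add(sector)
            return "sector_miss"
        return "hit"

cache = SectoredCache()
# Even a warp touching one full 128 B line pays one fetch per sector:
results = [cache.access(a) for a in (0, 32, 64, 96)]
# results -> ["line_miss", "sector_miss", "sector_miss", "sector_miss"]
```

A non-sectored model would count a single miss for this access pattern, which is one way an abstract core model can underestimate memory data stalls.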



Published In

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496

In-Cooperation

  • IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. graphics processing units
  2. interval analysis
  3. performance modeling

Qualifiers

  • Research-article

Funding Sources

  • Yonsei Signature Research Cluster Program (2022-22-0002)
  • National Research Foundation of Korea (NRF)
  • Institute of Information & communications Technology Planning & Evaluation (IITP)
  • Ministry of Education (MOE) of Korea

Conference

ISCA '22

Acceptance Rates

ISCA '22 Paper Acceptance Rate: 67 of 400 submissions, 17%
Overall Acceptance Rate: 543 of 3,203 submissions, 17%


Cited By

  • HyFiSS: A Hybrid Fidelity Stall-Aware Simulator for GPGPUs. In Proc. 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, 168-185. DOI: 10.1109/MICRO61859.2024.00022
  • GCStack: A GPU Cycle Accounting Mechanism for Providing Accurate Insight Into GPU Performance. IEEE Computer Architecture Letters 23, 2 (2024), 235-238. DOI: 10.1109/LCA.2024.3476909
  • Zatel: Sample Complexity-Aware Scale-Model Simulation for Ray Tracing. In Proc. 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 156-166. DOI: 10.1109/ISPASS61541.2024.00024
  • AIO: An Abstraction for Performance Analysis Across Diverse Accelerator Architectures. In Proc. 51st ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), 2024, 487-500. DOI: 10.1109/ISCA59077.2024.00043
  • GhOST: A GPU Out-of-Order Scheduling Technique for Stall Reduction. In Proc. 51st ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), 2024, 1-16. DOI: 10.1109/ISCA59077.2024.00011
  • GPU Scale-Model Simulation. In Proc. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1125-1140. DOI: 10.1109/HPCA57654.2024.00088
  • Sieve: Stratified GPU-Compute Workload Sampling. In Proc. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 224-234. DOI: 10.1109/ISPASS57527.2023.00030
