DOI: 10.1145/3470496.3527384
research-article

GCoM: a detailed GPU core model for accurate analytical modeling of modern GPUs

Published: 11 June 2022

ABSTRACT

Analytical models help computer architects perform early-stage design space exploration orders of magnitude faster than cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models that capture the first-order stall events causing performance degradation. However, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract GPU core microarchitecture assumptions. Therefore, to accurately evaluate the performance of modern GPUs, we need a new GPU analytical model that captures the stall events incurred by the significant changes in modern GPU core microarchitectures.

We propose GCoM, an accurate GPU analytical model that faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by existing GPU analytical models. First, GCoM identifies the compute structural stalls caused by the limited number of per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by sectored L1 data caches, which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves modeling accuracy, achieving a mean absolute error of 10.0% against the Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
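
To make the sectored-cache point concrete, the sketch below models a single sectored L1 line in Python. It assumes 128-byte lines split into four 32-byte sectors; the sizes, class, and function names are illustrative assumptions for exposition, not GCoM's implementation. The key behavior is that a reference whose tag matches the resident line can still miss when its sector is invalid, which is precisely the memory data stall that a line-granularity cache model fails to predict.

```python
# Minimal sketch of a sectored cache line, assuming 128B lines and 32B
# sectors (illustrative sizes, not taken from the paper's configuration).
LINE_SIZE = 128
SECTOR_SIZE = 32
NUM_SECTORS = LINE_SIZE // SECTOR_SIZE  # 4 sectors sharing one tag


class SectoredLine:
    def __init__(self):
        self.tag = None
        self.valid = [False] * NUM_SECTORS  # per-sector valid bits


def access(line: SectoredLine, addr: int) -> str:
    """Classify one access as 'hit', 'sector_miss', or 'tag_miss'."""
    tag = addr // LINE_SIZE
    sector = (addr % LINE_SIZE) // SECTOR_SIZE
    if line.tag == tag:
        if line.valid[sector]:
            return "hit"
        # Tag matches but this 32B sector has not been fetched yet: only
        # the missing sector is brought in. A non-sectored model would
        # wrongly count this access as a hit on the resident line.
        line.valid[sector] = True
        return "sector_miss"
    # Different tag: the line is replaced, and only the requested sector
    # of the new line becomes valid.
    line.tag = tag
    line.valid = [False] * NUM_SECTORS
    line.valid[sector] = True
    return "tag_miss"


if __name__ == "__main__":
    line = SectoredLine()
    # Two accesses 64B apart fall in the same 128B line but in different
    # sectors, so the second access misses despite the matching tag.
    print(access(line, 0))   # tag_miss
    print(access(line, 64))  # sector_miss, not a hit
    print(access(line, 64))  # hit
```

Counting sector_miss events separately is what allows an analytical model to charge the additional memory traffic and latency that a line-granularity model would misclassify as hits.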


Published in

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022, 1097 pages
ISBN: 9781450386104
DOI: 10.1145/3470496
Copyright © 2022 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

ISCA '22 paper acceptance rate: 67 of 400 submissions, 17%. Overall acceptance rate: 543 of 3,203 submissions, 17%.
