ABSTRACT
Analytical models enable computer architects to perform early-stage design space exploration orders of magnitude faster than cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior studies have proposed GPU analytical models that capture the first-order stall events causing performance degradation; however, the existing analytical models cannot accurately model modern GPUs due to their outdated and highly abstract assumptions about the GPU core microarchitecture. Accurately evaluating the performance of modern GPUs therefore requires a new GPU analytical model that captures the stall events incurred by the significant changes in the core microarchitectures of modern GPUs.
We propose GCoM, an accurate GPU analytical model which faithfully captures the key core-side stall events of modern GPUs. Through detailed microarchitecture-driven GPU core modeling, GCoM accurately models modern GPUs by revealing the following key core-side stalls overlooked by the existing GPU analytical models. First, GCoM identifies the compute structural stalls caused by the limited per-sub-core functional units. Second, GCoM exposes the memory structural stalls due to the limited banks and shared nature of per-core L1 data caches. Third, GCoM correctly predicts the memory data stalls induced by sectored L1 data caches, which split a cache line into a set of sectors sharing the same tag. Fourth, GCoM captures the idle stalls incurred by inter- and intra-core load imbalances. Our experiments using an NVIDIA RTX 2060 configuration show that GCoM greatly improves the modeling accuracy, achieving a mean absolute error of 10.0% against the Accel-Sim cycle-level simulator, whereas the state-of-the-art GPU analytical model achieves a mean absolute error of 44.9%.
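To illustrate the sectored-cache behavior described above, here is a minimal sketch of one cache line in a sectored cache; the class name, sector count, and return labels are hypothetical illustrations, not GCoM's or NVIDIA's implementation. The key point is that a line keeps a single shared tag with per-sector valid bits, so a tag match alone is not enough for a hit: the requested sector must also have been filled, and a miss fetches only the missing sector rather than the whole line.

```python
class SectoredCacheLine:
    """One line of a sectored cache: a single tag shared by several sectors.

    A request hits only when the tag matches AND the requested sector has
    already been filled; on a tag match with an unfilled sector, only that
    sector is fetched (a "sector miss"), not the entire line.
    """

    def __init__(self, num_sectors=4):
        self.tag = None
        self.valid = [False] * num_sectors  # per-sector valid bits

    def access(self, tag, sector):
        if self.tag == tag and self.valid[sector]:
            return "hit"
        if self.tag != tag:
            # Tag mismatch: install the new tag, invalidate all sectors,
            # and fill only the requested one.
            self.tag = tag
            self.valid = [False] * len(self.valid)
            self.valid[sector] = True
            return "line miss"
        # Tag matches but this sector has not been filled yet.
        self.valid[sector] = True
        return "sector miss"


line = SectoredCacheLine()
print(line.access(tag=0x1A, sector=0))  # line miss: cold line
print(line.access(tag=0x1A, sector=1))  # sector miss: same tag, unfilled sector
print(line.access(tag=0x1A, sector=1))  # hit: tag and sector both present
```

Under this model, a warp whose accesses are scattered across the sectors of a few lines can suffer many sector misses even with high tag locality, which is precisely the class of memory data stalls the abstract says non-sectored cache models fail to predict.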
GCoM: A Detailed GPU Core Model for Accurate Analytical Modeling of Modern GPUs