Research article · DOI: 10.1145/3392717.3392761

Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles

Published: 29 June 2020

Abstract

In this paper, we introduce PPT-GPU-Mem, an accurate and scalable memory modeling framework for General Purpose Graphics Processing Units (GPGPUs): the Performance Prediction Toolkit for GPU cache memories. PPT-GPU-Mem predicts the performance of different GPUs' cache memory hierarchies (L1 & L2) based on reuse profiles. We extract a memory trace for each GPU kernel once in its lifetime using the recently released binary instrumentation tool NVBit. The memory trace extraction is architecture-independent and can be performed on any available NVIDIA GPU. PPT-GPU-Mem can then model the caches of any NVIDIA GPU given their parameters and the extracted memory trace. We model the Volta Tesla V100 and Turing TITAN RTX and validate our framework using different kernels from the Polybench and Rodinia benchmark suites, in addition to two deep learning applications from the Tango DNN benchmark suite. We provide two models, MBRDP (Multiple Block Reuse Distance Profile) and OBRDP (One Block Reuse Distance Profile), with varying assumptions, accuracy, and speed. Our accuracy ranges from 92% to 99% for the different cache levels compared to real hardware, while maintaining scalability in producing the results. Finally, we illustrate that PPT-GPU-Mem can be used for design space exploration and for predicting the cache performance of future GPUs.
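To make the abstract's core idea concrete, here is a minimal sketch of reuse-distance-based cache modeling: compute the LRU stack (reuse) distance of each access in a memory trace, then estimate the hit rate of a cache directly from that profile. This is an illustration only, not the paper's MBRDP/OBRDP models; the function names, the default 32-byte line size, and the fully associative LRU assumption are all illustrative choices.

```python
from collections import OrderedDict

def reuse_distance_profile(trace, line_size=32):
    """Compute LRU stack (reuse) distances for a memory address trace.

    Addresses are mapped to cache lines; the reuse distance of an access
    is the number of distinct lines touched since the previous access to
    the same line (inf for a first-time, i.e. cold, access).
    """
    stack = OrderedDict()  # keys ordered oldest -> most recently used
    profile = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            # distinct lines accessed after this line's previous use
            dist = len(stack) - list(stack).index(line) - 1
            del stack[line]
        else:
            dist = float("inf")  # cold miss
        stack[line] = None       # re-insert as most recently used
        profile.append(dist)
    return profile

def lru_hit_rate(profile, cache_lines):
    """A fully associative LRU cache of N lines hits exactly on those
    accesses whose reuse distance is strictly less than N."""
    hits = sum(1 for d in profile if d < cache_lines)
    return hits / len(profile)
```

The key property this sketch relies on is that for a fully associative LRU cache, hit/miss behavior for every capacity can be read off one reuse-distance histogram, which is why a single architecture-independent trace can be replayed against many cache configurations.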




Published In

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
June 2020
499 pages
ISBN:9781450379830
DOI:10.1145/3392717
  • General Chairs: Eduard Ayguadé, Wen-mei Hwu
  • Program Chairs: Rosa M. Badia, H. Peter Hofstee


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU computing
  2. NVIDIA NVBIT
  3. performance modeling
  4. reuse distance

Qualifiers

  • Research-article

Funding Sources

  • Triad National Security, LLC

Conference

ICS '20: 2020 International Conference on Supercomputing
June 29 - July 2, 2020
Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 94
  • Downloads (last 6 weeks): 10
Reflects downloads up to 12 Feb 2025


Cited By

  • (2024) Centimani. Proceedings of the 2024 USENIX Annual Technical Conference, 1203-1221. DOI: 10.5555/3691992.3692065. Online: 10-Jul-2024
  • (2024) Reuse distance-based shared LLC management mechanism for heterogeneous CPU-GPU systems. IEICE Electronics Express 21(4), 20230520. DOI: 10.1587/elex.21.20230520. Online: 25-Feb-2024
  • (2024) Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23(3), 1-24. DOI: 10.1145/3650729. Online: 7-Mar-2024
  • (2024) Low-Overhead Trace Collection and Profiling on GPU Compute Kernels. ACM Transactions on Parallel Computing 11(2), 1-24. DOI: 10.1145/3649510. Online: 8-Jun-2024
  • (2023) Modeling and Characterizing Shared and Local Memories of the Ampere GPUs. Proceedings of the International Symposium on Memory Systems, 1-3. DOI: 10.1145/3631882.3631891. Online: 2-Oct-2023
  • (2023) Fine-Grained Memory Profiling of GPGPU Kernels. Computer Graphics Forum 41(7), 227-235. DOI: 10.1111/cgf.14671. Online: 20-Mar-2023
  • (2022) Performance Modeling of Computer Vision-based CNN on Edge GPUs. ACM Transactions on Embedded Computing Systems 21(5), 1-33. DOI: 10.1145/3527169. Online: 26-Mar-2022
  • (2022) An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications. IEEE Transactions on Parallel and Distributed Systems 33(4), 854-865. DOI: 10.1109/TPDS.2021.3094169. Online: 1-Apr-2022
  • (2021) Hybrid, scalable, trace-driven performance modeling of GPGPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), 1-15. DOI: 10.1145/3458817.3476221. Online: 14-Nov-2021
  • (2021) Machine Learning-enabled Scalable Performance Prediction of Scientific Codes. ACM Transactions on Modeling and Computer Simulation 31(2), 1-28. DOI: 10.1145/3450264. Online: 30-Apr-2021
