Research article · DOI: 10.1145/3392717.3392761

Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles

Published: 29 June 2020

Abstract

In this paper, we introduce PPT-GPU-Mem, an accurate and scalable memory modeling framework for General Purpose Graphics Processing Units (GPGPUs): the Performance Prediction Toolkit for GPU cache memories. PPT-GPU-Mem predicts the performance of different GPUs' cache memory hierarchies (L1 & L2) based on reuse profiles. We extract a memory trace for each GPU kernel once in its lifetime using the recently released binary instrumentation tool NVBit. The memory trace extraction is architecture-independent and can be performed on any available NVIDIA GPU. PPT-GPU-Mem can then model the caches of any NVIDIA GPU given their parameters and the extracted memory trace. We model the Volta Tesla V100 and Turing TITAN RTX and validate our framework using different kernels from the Polybench and Rodinia benchmark suites, in addition to two deep learning applications from the Tango DNN benchmark suite. We provide two models, MBRDP (Multiple Block Reuse Distance Profile) and OBRDP (One Block Reuse Distance Profile), with varying assumptions, accuracy, and speed. Our accuracy ranges from 92% to 99% for the different cache levels compared to real hardware, while maintaining scalability in producing the results. Finally, we illustrate that PPT-GPU-Mem can be used for design space exploration and for predicting the cache performance of future GPUs.
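To make the abstract's core idea concrete, here is a minimal sketch of reuse-distance-based cache modeling: compute the LRU stack (reuse) distance of each access in a memory trace, then estimate the hit rate of a cache directly from that profile. This is an illustration only, not the paper's MBRDP/OBRDP models; the function names, the default 32-byte line size, and the fully associative LRU assumption are all illustrative choices.

```python
from collections import OrderedDict

def reuse_distance_profile(trace, line_size=32):
    """Compute LRU stack (reuse) distances for a memory address trace.

    Addresses are mapped to cache lines; the reuse distance of an access
    is the number of distinct lines touched since the previous access to
    the same line (inf for a first-time, i.e. cold, access).
    """
    stack = OrderedDict()  # keys ordered oldest -> most recently used
    profile = []
    for addr in trace:
        line = addr // line_size
        if line in stack:
            # distinct lines accessed after this line's previous use
            dist = len(stack) - list(stack).index(line) - 1
            del stack[line]
        else:
            dist = float("inf")  # cold miss
        stack[line] = None       # re-insert as most recently used
        profile.append(dist)
    return profile

def lru_hit_rate(profile, cache_lines):
    """A fully associative LRU cache of N lines hits exactly on those
    accesses whose reuse distance is strictly less than N."""
    hits = sum(1 for d in profile if d < cache_lines)
    return hits / len(profile)
```

The key property this sketch relies on is that for a fully associative LRU cache, hit/miss behavior for every capacity can be read off one reuse-distance histogram, which is why a single architecture-independent trace can be replayed against many cache configurations.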




Published In

ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing
June 2020
499 pages
ISBN:9781450379830
DOI:10.1145/3392717
  • General Chairs: Eduard Ayguadé, Wen-mei Hwu
  • Program Chairs: Rosa M. Badia, H. Peter Hofstee


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. GPU computing
  2. NVIDIA NVBIT
  3. performance modeling
  4. reuse distance

Qualifiers

  • Research-article

Funding Sources

  • Triad National Security, LLC

Conference

ICS '20: 2020 International Conference on Supercomputing
June 29 - July 2, 2020
Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%


Article Metrics

  • Downloads (last 12 months): 94
  • Downloads (last 6 weeks): 10
Reflects downloads up to 12 Feb 2025


Cited By

  • (2024) Centimani. Proceedings of the 2024 USENIX Annual Technical Conference, 1203-1221. DOI: 10.5555/3691992.3692065. Online: 10-Jul-2024
  • (2024) Reuse distance-based shared LLC management mechanism for heterogeneous CPU-GPU systems. IEICE Electronics Express 21(4), 20230520. DOI: 10.1587/elex.21.20230520. Online: 25-Feb-2024
  • (2024) Toward Energy-efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems. ACM Transactions on Embedded Computing Systems 23(3), 1-24. DOI: 10.1145/3650729. Online: 7-Mar-2024
  • (2024) Low-Overhead Trace Collection and Profiling on GPU Compute Kernels. ACM Transactions on Parallel Computing 11(2), 1-24. DOI: 10.1145/3649510. Online: 8-Jun-2024
  • (2023) Modeling and Characterizing Shared and Local Memories of the Ampere GPUs. Proceedings of the International Symposium on Memory Systems, 1-3. DOI: 10.1145/3631882.3631891. Online: 2-Oct-2023
  • (2023) Fine-Grained Memory Profiling of GPGPU Kernels. Computer Graphics Forum 41(7), 227-235. DOI: 10.1111/cgf.14671. Online: 20-Mar-2023
  • (2022) Performance Modeling of Computer Vision-based CNN on Edge GPUs. ACM Transactions on Embedded Computing Systems 21(5), 1-33. DOI: 10.1145/3527169. Online: 26-Mar-2022
  • (2022) An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications. IEEE Transactions on Parallel and Distributed Systems 33(4), 854-865. DOI: 10.1109/TPDS.2021.3094169. Online: 1-Apr-2022
  • (2021) Hybrid, scalable, trace-driven performance modeling of GPGPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), 1-15. DOI: 10.1145/3458817.3476221. Online: 14-Nov-2021
  • (2021) Machine Learning-enabled Scalable Performance Prediction of Scientific Codes. ACM Transactions on Modeling and Computer Simulation 31(2), 1-28. DOI: 10.1145/3450264. Online: 30-Apr-2021
