Statistical Pattern Based Modeling of GPU Memory Access Streams

Authors:

Andreas Gerstlauer,

Lizy K. John

DAC '17: Proceedings of the 54th Annual Design Automation Conference 2017

Article No.: 81, Pages 1 - 6

https://doi.org/10.1145/3061639.3062320

Published: 18 June 2017

Abstract

Recent studies have shown that modern GPU performance is often limited by the performance of the memory system. Optimizing the memory hierarchy requires GPU designers to draw design insights from the cache and memory behavior of end-user applications. Unfortunately, access to end-user workloads is often difficult to obtain because the software or data is confidential or proprietary. Furthermore, the efficiency of early design space exploration of cache and memory systems is often limited either by the slow speed of detailed simulation techniques or by the limited scope of state-of-the-art analytical cache models.

To enable efficient GPU memory system exploration, we present a novel methodology and framework that statistically models the locality of GPU memory access streams. The proposed G-MAP (GPU Memory Access Proxy) framework models the regularity in code-localized memory access patterns of GPGPU applications and the parallelism in the GPU's execution model to create miniaturized memory proxies. We evaluate G-MAP using 18 GPGPU benchmarks and show that G-MAP proxies can replicate the cache and memory performance of the original applications with over 90% accuracy across more than 5000 different L1/L2 cache, prefetcher, and memory configurations.
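
The abstract does not describe G-MAP's internals, so the following is only a minimal sketch of the general idea of a code-localized statistical memory proxy, not the authors' algorithm. It assumes a trace of (pc, address) pairs, records a stride histogram per static instruction (PC), and then generates a synthetic stream by sampling strides from those histograms. The function names `profile_strides` and `generate_proxy` are hypothetical, and GPU-specific aspects such as warps, coalescing, and thread-level parallelism are deliberately left out.

```python
import random
from collections import Counter, defaultdict


def profile_strides(trace):
    """Collect, per PC, the first address seen and a histogram of strides.

    `trace` is an iterable of (pc, address) pairs. This format is an
    assumption made for illustration, not G-MAP's actual trace format.
    """
    first_addr = {}
    last_addr = {}
    strides = defaultdict(Counter)
    for pc, addr in trace:
        if pc in last_addr:
            # Stride relative to the previous access issued by the same PC.
            strides[pc][addr - last_addr[pc]] += 1
        else:
            first_addr[pc] = addr
        last_addr[pc] = addr
    return first_addr, strides


def generate_proxy(first_addr, strides, pc_sequence, seed=0):
    """Emit a synthetic (pc, address) stream for the given PC sequence.

    Each PC starts at its profiled first address; later accesses advance
    by a stride sampled from that PC's histogram.
    """
    rng = random.Random(seed)
    current = {}
    proxy = []
    for pc in pc_sequence:
        if pc not in current:
            current[pc] = first_addr[pc]
        else:
            hist = strides[pc]
            if hist:  # a PC profiled only once has no stride samples
                values, weights = zip(*hist.items())
                current[pc] += rng.choices(values, weights=weights)[0]
        proxy.append((pc, current[pc]))
    return proxy


if __name__ == "__main__":
    # A perfectly regular access pattern: one PC walking 128-byte lines.
    trace = [(0x10, 128 * i) for i in range(4)]
    first, hist = profile_strides(trace)
    print(generate_proxy(first, hist, [pc for pc, _ in trace]))
```

For a perfectly regular trace like the one above, each histogram holds a single stride, so the generated stream reproduces the original addresses exactly. Irregular patterns are reproduced only in distribution, which is the trade-off a statistical proxy accepts in exchange for not shipping the original workload.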

Index terms: Computing methodologies > Modeling and simulation > Model development and analysis

Published In

DAC '17: Proceedings of the 54th Annual Design Automation Conference 2017

June 2017

533 pages

ISBN: 9781450349277

DOI: 10.1145/3061639

Copyright © 2017 ACM.

Sponsors

EDAC: Electronic Design Automation Consortium
SIGDA: ACM Special Interest Group on Design Automation
IEEE-CEDA

In-Cooperation

SIGBED: ACM Special Interest Group on Embedded Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

DAC '17

Sponsor:

EDAC
SIGDA

DAC '17: The 54th Annual Design Automation Conference 2017

June 18 - 22, 2017

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Cited By

Liang M, Fu W, Feng L, Lin Z, Panakanti P, Zheng S, Sridharan S, Delimitrou C, Solihin Y, Heinrich M (2023). Mystique: Enabling Accurate and Scalable Generation of Production AI Benchmarks. Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1-13. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589072
Kim H, Han H (2023). GPU thread throttling for page-level thrashing reduction via static analysis. The Journal of Supercomputing 80:7, pages 9829-9847. Online publication date: 16-Dec-2023.
https://doi.org/10.1007/s11227-023-05787-y
Kim H, Hong S, Lee H, Seo E, Han H (2019). Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10. Online publication date: 5-Aug-2019.
https://dl.acm.org/doi/10.1145/3337821.3337886
Wang X, Huang K, Knoll A, Qian X (2019). A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 506-518. Online publication date: Feb-2019.
https://doi.org/10.1109/HPCA.2019.00062
Kim H, Hong S, Park J, Han H (2019). Static code transformations for thread‐dense memory accesses in GPU computing. Concurrency and Computation: Practice and Experience 32:5. Online publication date: 18-Oct-2019.
https://doi.org/10.1002/cpe.5512
Diarra R, Huchard M, Kästner C, Fraser G (2018). Towards automatic restrictification of CUDA kernel arguments. Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 928-931. Online publication date: 3-Sep-2018.
https://dl.acm.org/doi/10.1145/3238147.3241533
Panda R, John L (2018). HALO. Proceedings of the 2018 International Conference on Supercomputing, pages 118-128. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3205289.3205323
