Research Article
DOI: 10.1145/2145816.2145820

Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors

Published: 25 February 2012

Abstract

With the emergence of highly multithreaded architectures, performance monitoring techniques face new challenges in efficiently locating the sources of performance discrepancies in program source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the overall occurrences of microarchitecture events at the end of program execution. Furthermore, even where supported, fine-grained sampling of performance counters distorts the actual program behavior and makes the sampled values inaccurate. Conversely, it is difficult to obtain high-resolution performance information at low sampling rates in the presence of thousands of concurrently running threads. In this paper, we present a novel software-based approach for monitoring memory hierarchy performance in highly multithreaded general-purpose graphics processors. The proposed analysis is based on memory traces collected for snapshots of an application's execution. A trace-based memory hierarchy model, combined with a Monte Carlo experimental methodology, generates statistical bounds on performance measures by studying the behavior of the overall system rather than the exact inter-thread ordering of individual events. This statistical approach overcomes the classical problem of execution timing disturbed by fine-grained instrumentation. The approach scales well, as we deploy an efficient parallel trace collection technique to reduce the trace generation overhead and a simple memory hierarchy model to reduce the simulation time. The proposed scheme also tracks individual memory operations in the source code and can quantify their efficiency with respect to the memory system. A cross-validation of our results shows close agreement with the values read from the hardware performance counters on an NVIDIA Tesla C2050 GPU.
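The core Monte Carlo idea can be sketched in a few lines: replay per-thread address traces through a simple cache model under many randomly sampled inter-thread interleavings, and report the spread of the resulting hit rates as statistical bounds. The sketch below is an illustrative toy, not the paper's actual model; the fully associative LRU cache, 128-byte lines, and uniform-random interleaving are all simplifying assumptions.

```python
import random

def lru_hit_rate(trace, cache_lines=64):
    """Hit rate of a flat address trace against a fully associative LRU cache."""
    cache, hits = [], 0
    for addr in trace:
        line = addr // 128  # assume 128-byte cache lines
        if line in cache:
            hits += 1
            cache.remove(line)        # refresh LRU position
        elif len(cache) >= cache_lines:
            cache.pop(0)              # evict least recently used line
        cache.append(line)
    return hits / len(trace)

def monte_carlo_bounds(thread_traces, trials=200, seed=0):
    """Sample random inter-thread interleavings; return (min, mean, max) hit rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        cursors = [0] * len(thread_traces)
        live = [t for t, tr in enumerate(thread_traces) if tr]
        merged = []
        while live:                   # build one random interleaving
            t = rng.choice(live)
            merged.append(thread_traces[t][cursors[t]])
            cursors[t] += 1
            if cursors[t] == len(thread_traces[t]):
                live.remove(t)
        rates.append(lru_hit_rate(merged))
    return min(rates), sum(rates) / len(rates), max(rates)
```

When the min and max coincide, the measure is insensitive to inter-thread ordering; a wide spread signals that ordering matters and more trials are needed for tight bounds.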
Based on the high-resolution profile data produced by our model, we optimized memory accesses in a sparse matrix-vector multiply kernel and achieved speedups ranging from 2.4 to 14.8, depending on the characteristics of the input matrices.
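To give a flavor of the kind of memory-access inefficiency such profile data can expose in sparse matrix-vector multiply, the toy model below counts the 128-byte memory transactions one 32-thread warp issues per step under a row-major CSR value layout versus a column-major (ELLPACK-style) layout. This is a hypothetical back-of-the-envelope sketch, not the paper's optimization; the matrix shape, 4-byte elements, and 128-byte segment size are all assumptions.

```python
ROWS, NNZ_PER_ROW, ELEM_BYTES, SEGMENT = 32, 4, 4, 128

def warp_transactions(addrs, seg=SEGMENT):
    """Distinct memory segments touched by one warp's lane addresses."""
    return len({a // seg for a in addrs})

def csr_step_txns(step):
    # Row-major layout: lane t reads element `step` of its own row, so
    # consecutive lanes are NNZ_PER_ROW elements apart (uncoalesced).
    return warp_transactions([(t * NNZ_PER_ROW + step) * ELEM_BYTES
                              for t in range(ROWS)])

def ell_step_txns(step):
    # Column-major layout: element `step` of every row is contiguous, so
    # consecutive lanes touch adjacent words (coalesced).
    return warp_transactions([(step * ROWS + t) * ELEM_BYTES
                              for t in range(ROWS)])
```

For this toy shape the row-major layout costs 4 transactions per step while the column-major layout costs 1, a 4x reduction in memory traffic for identical arithmetic, which is the general mechanism behind layout-driven SpMV speedups.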




Published In

PPoPP '12: Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
February 2012, 352 pages
ISBN: 9781450311601
DOI: 10.1145/2145816

Also in: ACM SIGPLAN Notices, Volume 47, Issue 8 (PPOPP '12)
August 2012, 334 pages
ISSN: 0362-1340, EISSN: 1558-1160
DOI: 10.1145/2370036
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU
  2. memory hierarchy
  3. performance evaluation


Conference

PPoPP '12

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 27
  • Downloads (last 6 weeks): 3
Reflects downloads up to 17 Jan 2025


Cited By

  • (2019) RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions. Journal of Circuits, Systems and Computers, 28(14):1950245. DOI: 10.1145/... — 10.1142/S0218126619502451. Online publication date: 15 Mar 2019.
  • (2019) GPUs Cache Performance Estimation using Reuse Distance Analysis. 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), pages 1-8. DOI: 10.1109/IPCCC47392.2019.8958760. Online publication date: Oct 2019.
  • (2018) Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis. ACM Transactions on Architecture and Code Optimization, 15(4):1-24. DOI: 10.1145/3291051. Online publication date: 19 Dec 2018.
  • (2018) Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory. ACM Transactions on Embedded Computing Systems, 17(4):1-25. DOI: 10.1145/3230643. Online publication date: 31 Jul 2018.
  • (2017) Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pages 109-122. DOI: 10.1145/3123939.3123950. Online publication date: 14 Oct 2017.
  • (2017) Locality-Aware CTA Clustering for Modern GPUs. ACM SIGARCH Computer Architecture News, 45(1):297-311. DOI: 10.1145/3093337.3037709. Online publication date: 4 Apr 2017.
  • (2017) Locality-Aware CTA Clustering for Modern GPUs. ACM SIGPLAN Notices, 52(4):297-311. DOI: 10.1145/3093336.3037709. Online publication date: 4 Apr 2017.
  • (2017) Locality-Aware CTA Clustering for Modern GPUs. ACM SIGOPS Operating Systems Review, 51(2):297-311. DOI: 10.1145/3093315.3037709. Online publication date: 4 Apr 2017.
  • (2017) Locality-Aware CTA Clustering for Modern GPUs. Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 297-311. DOI: 10.1145/3037697.3037709. Online publication date: 4 Apr 2017.
  • (2016) Prefetching Techniques for Near-memory Throughput Processors. Proceedings of the 2016 International Conference on Supercomputing, pages 1-14. DOI: 10.1145/2925426.2926282. Online publication date: 1 Jun 2016.
