Abstract
The achievable GPU performance of many scientific computations is determined not by a GPU’s peak floating-point rate, but by how fast data move through the stages of the memory hierarchy. We take low-order 3D stencil computations as a representative class and study the attainable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model that estimates the execution time by quantifying the data traffic volume at three stages: (1) between registers and on-streaming multiprocessor (SMX) storage, (2) between on-SMX storage and the L2 cache, and (3) between the L2 cache and the GPU’s device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four chosen 3D stencil computations, NVIDIA’s profiling tools are used to verify the accuracy of the quantified data traffic volumes by examining a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient and using the known realistic memory bandwidths, we can predict the execution time from the quantified data traffic volumes. For the four 3D stencils, the average error of the time predictions is 6.9% for a baseline implementation approach and 9.5% for a blocking implementation approach.
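The idea of turning per-stage traffic volumes into a time estimate can be illustrated with a minimal sketch. The abstract does not state how the three stages and the imbalance coefficient are combined, so the rule used here (the slowest stage bounds the run time, and the coefficient scales that bound) is an assumption made for illustration only, not the authors' formula. The names `StageTraffic` and `predict_time`, and all volumes and bandwidths, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageTraffic:
    """Data traffic at one stage of the memory hierarchy (hypothetical type)."""
    name: str
    volume_bytes: float           # quantified traffic volume at this stage
    bandwidth_bytes_per_s: float  # realistic (measured) bandwidth of this stage


def predict_time(stages, imbalance=1.0):
    """Estimate execution time from per-stage traffic volumes.

    ASSUMPTION: the slowest stage bounds the execution time and the
    imbalance coefficient multiplies that bound. The source abstract
    does not give the actual combination rule.
    """
    stage_times = {}
    for s in stages:
        if s.bandwidth_bytes_per_s <= 0:
            raise ValueError(f"bandwidth must be positive for stage {s.name!r}")
        stage_times[s.name] = s.volume_bytes / s.bandwidth_bytes_per_s
    bottleneck = max(stage_times, key=stage_times.get)
    return imbalance * stage_times[bottleneck], bottleneck


# Hypothetical volumes and bandwidths, chosen only to exercise the function.
stages = [
    StageTraffic("registers<->on-SMX storage", 8.0e9, 1.0e12),
    StageTraffic("on-SMX storage<->L2", 2.0e9, 5.0e11),
    StageTraffic("L2<->device memory", 1.5e9, 1.5e11),
]
predicted_s, bottleneck = predict_time(stages, imbalance=1.1)
print(f"predicted time: {predicted_s:.4f} s, bottleneck: {bottleneck}")
```

With these made-up inputs the L2-to-device-memory stage is the slowest, which matches the abstract's premise that data movement rather than peak floating-point rate limits performance.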





Acknowledgments
The authors gratefully acknowledge the support from the National Natural Science Foundation of China under NSFC Nos. 61033008, 61103080 and 61272145, SRFDP Nos. 20104307110002 and 20124307130004, Innovation in Graduate School of NUDT Nos. B100603 and B120605, and the FRINATEK program of the Research Council of Norway under No. 214113/F20.
Cite this article
Su, H., Cai, X., Wen, M. et al. An analytical GPU performance model for 3D stencil computations from the angle of data traffic. J Supercomput 71, 2433–2453 (2015). https://doi.org/10.1007/s11227-015-1392-1
DOI: https://doi.org/10.1007/s11227-015-1392-1