Abstract
Performance analysis is a daunting task, especially for rapidly evolving accelerator technologies. The Roofline Scaling Trajectories technique diagnoses a variety of performance bottlenecks in GPU programming models through visually intuitive Roofline plots. In this work, we use Roofline Scaling Trajectories to capture major performance bottlenecks on the NVIDIA Volta GPU architecture, such as low warp efficiency, limited occupancy, and poor locality. Using this analysis technique, we explain the performance characteristics of the NAS Parallel Benchmarks (NPB) written in two programming models, CUDA and OpenACC, and show how the choice of programming model influences performance and scaling behavior. We also leverage the insights from the Roofline Scaling Trajectory analysis to tune several of the NAS Parallel Benchmarks, achieving up to a 2\(\times\) speedup.
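For context, the ceilings in a Roofline plot follow the classic Roofline model: attainable performance is bounded by either the peak compute rate or the product of arithmetic intensity and peak memory bandwidth. A minimal sketch of that bound (the machine-specific peaks, e.g. those of a Volta V100, are treated as inputs):

\[
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \times B_{\text{peak}}\bigr),
\qquad
I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}
\]

A scaling trajectory then plots the measured \((I, P)\) point of a kernel at successive concurrency levels, so bottlenecks such as lost locality, limited occupancy, or low warp efficiency appear as characteristic shifts of the trajectory relative to these ceilings.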
Notes
1. The Volta configurable L1 cache/shared memory capacity is 128 KB per SM.
2. The Volta register file is 256 KB per SM.
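These per-SM resource limits feed directly into occupancy reasoning. The sketch below is an illustrative example rather than part of the paper's tooling: it queries the corresponding limits from the CUDA runtime. Note that the reported shared-memory figure is the largest shared-memory carve-out (96 KB on Volta), not the full 128 KB of combined L1/shared-memory capacity from Note 1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // Maximum shared memory per SM. On Volta this reports the largest
    // shared-memory carve-out (96 KB), not the 128 KB combined capacity.
    std::printf("Shared memory per SM: %zu KB\n",
                prop.sharedMemPerMultiprocessor / 1024);
    // Register file per SM, reported as a count of 32-bit registers
    // (65536 on Volta, i.e. 256 KB, matching Note 2).
    std::printf("Registers per SM:     %d (%d KB)\n",
                prop.regsPerMultiprocessor,
                prop.regsPerMultiprocessor * 4 / 1024);
    return 0;
}
```

Compiled with nvcc and run on a V100, this should report 96 KB of shared memory and 65536 registers (256 KB) per SM, consistent with the notes above.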
Acknowledgment
This material is based on work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under award number DE-AC02-05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ibrahim, K.Z., Williams, S., Oliker, L. (2020). Performance Analysis of GPU Programming Models Using the Roofline Scaling Trajectories. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds.) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol. 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_1
DOI: https://doi.org/10.1007/978-3-030-49556-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49555-8
Online ISBN: 978-3-030-49556-5