Abstract
Performance analysis is a daunting task, especially for rapidly evolving accelerator technologies. The Roofline Scaling Trajectories technique diagnoses a variety of performance bottlenecks in GPU programming models through visually intuitive Roofline plots. In this work, we use Roofline Scaling Trajectories to capture major performance bottlenecks on the NVIDIA Volta GPU architecture, such as low warp efficiency, limited occupancy, and poor locality. Using this analysis technique, we explain the performance characteristics of the NAS Parallel Benchmarks (NPB) written in two programming models, CUDA and OpenACC, and show how the choice of programming model influences performance and scaling behavior. We also leverage the insights from the Roofline Scaling Trajectory analysis to tune several of the NAS Parallel Benchmarks, achieving up to a 2\(\times\) speedup.
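For context, the ceilings in a Roofline plot follow the classic Roofline model: attainable performance is bounded by either the peak compute rate or the product of arithmetic intensity and peak memory bandwidth. A minimal sketch of that bound (the machine-specific peaks, e.g. those of a Volta V100, are treated as inputs):

\[
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \times B_{\text{peak}}\bigr),
\qquad
I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}
\]

A scaling trajectory then plots the measured \((I, P)\) point of a kernel at successive concurrency levels, so bottlenecks such as lost locality, limited occupancy, or low warp efficiency appear as characteristic shifts of the trajectory relative to these ceilings.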
Notes
1. The Volta configurable L1 cache/shared memory capacity is 128 KB per SM.
2. The Volta register file is 256 KB per SM.
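These per-SM resource limits feed directly into occupancy reasoning. The sketch below is an illustrative example rather than part of the paper's tooling: it queries the corresponding limits from the CUDA runtime. Note that the reported shared-memory figure is the largest shared-memory carve-out (96 KB on Volta), not the full 128 KB of combined L1/shared-memory capacity from Note 1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, /*device=*/0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    // Maximum shared memory per SM. On Volta this reports the largest
    // shared-memory carve-out (96 KB), not the 128 KB combined capacity.
    std::printf("Shared memory per SM: %zu KB\n",
                prop.sharedMemPerMultiprocessor / 1024);
    // Register file per SM, reported as a count of 32-bit registers
    // (65536 on Volta, i.e. 256 KB, matching Note 2).
    std::printf("Registers per SM:     %d (%d KB)\n",
                prop.regsPerMultiprocessor,
                prop.regsPerMultiprocessor * 4 / 1024);
    return 0;
}
```

Compiled with nvcc and run on a V100, this should report 96 KB of shared memory and 65536 registers (256 KB) per SM, consistent with the notes above.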
Acknowledgment
This material is based on work supported by the Advanced Scientific Computing Research Program in the U.S. Department of Energy, Office of Science, under award number DE-AC02-05CH11231. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Ibrahim, K.Z., Williams, S., Oliker, L. (2020). Performance Analysis of GPU Programming Models Using the Roofline Scaling Trajectories. In: Gao, W., Zhan, J., Fox, G., Lu, X., Stanzione, D. (eds.) Benchmarking, Measuring, and Optimizing. Bench 2019. Lecture Notes in Computer Science, vol. 12093. Springer, Cham. https://doi.org/10.1007/978-3-030-49556-5_1
DOI: https://doi.org/10.1007/978-3-030-49556-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49555-8
Online ISBN: 978-3-030-49556-5