DOI: 10.1145/2832105.2832113

Exploring dynamic parallelism in OpenMP

Published: 15 November 2015

Abstract

GPU devices are becoming a common element in current HPC platforms due to their high performance-per-watt ratio. However, developing applications that can exploit their performance is not a trivial task, and it becomes even harder when those applications have irregular data access patterns or control flows. Dynamic Parallelism (DP) has been introduced in the most recent GPU architectures as a mechanism to improve the applicability of GPU computing in these situations, as well as resource utilization and execution performance. DP allows a kernel to be launched from within another kernel, without intervention of the CPU. Current experience reveals that DP comes at the expense of excessive overhead which, together with its architecture dependency, makes its benefits difficult to realize in real applications.
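As context for what DP enables, the following is a minimal device-side-launch sketch; it is our own illustration (the CSR layout, kernel names, and threshold are hypothetical, not the paper's code), and it requires a DP-capable device (compute capability 3.5 or later):

```cuda
#define THRESHOLD 256  /* hypothetical cutoff for spawning a child grid */

// Child kernel: threads of a device-launched grid each accumulate one
// nonzero of a heavy CSR row into its output entry.
__global__ void child(const float *val, const int *col,
                      const float *x, float *row_sum, int lo, int hi)
{
    int j = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < hi)
        atomicAdd(row_sum, val[j] * x[col[j]]);
}

// Parent kernel: one thread per row. With DP, the launch of `child`
// below happens on the GPU itself -- no round trip to the CPU.
__global__ void parent(int nrows, const int *rowptr, const int *col,
                       const float *val, const float *x, float *y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nrows) return;
    int lo = rowptr[r], hi = rowptr[r + 1];
    if (hi - lo > THRESHOLD) {
        // Heavy row: spawn extra parallelism dynamically from the device.
        child<<<(hi - lo + 127) / 128, 128>>>(val, col, x, &y[r], lo, hi);
    } else {
        // Light row: not worth a child launch; compute serially.
        float sum = 0.0f;
        for (int j = lo; j < hi; j++) sum += val[j] * x[col[j]];
        y[r] = sum;
    }
}
```

The per-row branch is the crux: the overhead of a device-side launch only pays off when a row carries enough work, which is exactly the kind of decision the conditional clauses in the proposal below expose to the programmer.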
In this paper, we propose an extension of the current OpenMP accelerator model that makes the use of DP easy and effective. The proposal is based on the nesting of teams constructs and on conditional clauses, and we show how the compiler can generate code that is then efficiently executed under dynamic runtime scheduling. The proposal has been implemented in the MACC compiler, which supports the OmpSs task-based programming model, and evaluated using three kernels with data access and computation patterns commonly found in real applications: sparse matrix-vector multiplication, breadth-first search, and divide-and-conquer Mandelbrot. Performance results show speed-ups in the 40x range relative to versions not using DP.
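The flavor of the proposal can be sketched on a CSR sparse matrix-vector product with standard OpenMP 4.0 directives; the nesting and the if clause below are our reading of the abstract, not the paper's exact syntax (MACC/OmpSs may spell it differently), and the cutoff value is hypothetical:

```c
#define CUTOFF 8  /* hypothetical threshold for enabling the nested region */

/* y = A*x for a CSR matrix. The outer region distributes rows across
 * teams; the inner (nested) region adds parallelism within a row, but
 * only when the row carries enough nonzeros to amortize the cost --
 * the compiler/runtime can map such a conditional nested region onto
 * a dynamic device-side launch. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp target teams distribute
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        int nnz = rowptr[r + 1] - rowptr[r];
        /* Nested teams construct, guarded by a conditional clause. */
        #pragma omp teams distribute parallel for reduction(+:sum) if(nnz > CUTOFF)
        for (int j = rowptr[r]; j < rowptr[r + 1]; j++)
            sum += val[j] * x[col[j]];
        y[r] = sum;
    }
}
```

Without an OpenMP-enabled compiler the pragmas are ignored and the code runs serially with the same result, which makes the directive-based approach easy to prototype on the host.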


Cited By

  • (2022) Performant portable OpenMP. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 156-168. DOI: 10.1145/3497776.3517780. Online publication date: 19-Mar-2022.
  • (2016) POSTER. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pp. 423-424. DOI: 10.1145/2967938.2974056. Online publication date: 11-Sep-2016.
  • (2016) Multiple Target Task Sharing Support for the OpenMP Accelerator Model. OpenMP: Memory, Devices, and Tasks, pp. 268-280. DOI: 10.1007/978-3-319-45550-1_19. Online publication date: 21-Sep-2016.


Published In

WACCPD '15: Proceedings of the Second Workshop on Accelerator Programming using Directives
November 2015
68 pages
ISBN:9781450340144
DOI:10.1145/2832105
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CUDA
  2. GPGPU
  3. OmpSs
  4. OpenACC
  5. OpenMP
  6. compilers
  7. dynamic parallelism
  8. programming models

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

WACCPD '15 Paper Acceptance Rate: 7 of 14 submissions, 50%
Overall Acceptance Rate: 7 of 14 submissions, 50%
