DOI: 10.1145/2832105.2832113

Exploring dynamic parallelism in OpenMP

Published: 15 November 2015

Abstract

GPU devices are becoming a common element in current HPC platforms due to their high performance-per-watt ratio. However, developing applications that can exploit their performance is not a trivial task, and it becomes even harder when those applications have irregular data access patterns or control flows. Dynamic Parallelism (DP) has been introduced in the most recent GPU architectures as a mechanism to improve the applicability of GPU computing in these situations, as well as resource utilization and execution performance. DP allows a kernel to be launched from within another kernel, without intervention of the CPU. Current experience reveals that DP comes at the expense of excessive overhead which, together with its architecture dependency, makes its benefits difficult to realize in real applications.
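As context for what DP enables, the following is a minimal device-side-launch sketch; it is our own illustration (the CSR layout, kernel names, and threshold are hypothetical, not the paper's code), and it requires a DP-capable device (compute capability 3.5 or later):

```cuda
#define THRESHOLD 256  /* hypothetical cutoff for spawning a child grid */

// Child kernel: threads of a device-launched grid each accumulate one
// nonzero of a heavy CSR row into its output entry.
__global__ void child(const float *val, const int *col,
                      const float *x, float *row_sum, int lo, int hi)
{
    int j = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < hi)
        atomicAdd(row_sum, val[j] * x[col[j]]);
}

// Parent kernel: one thread per row. With DP, the launch of `child`
// below happens on the GPU itself -- no round trip to the CPU.
__global__ void parent(int nrows, const int *rowptr, const int *col,
                       const float *val, const float *x, float *y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nrows) return;
    int lo = rowptr[r], hi = rowptr[r + 1];
    if (hi - lo > THRESHOLD) {
        // Heavy row: spawn extra parallelism dynamically from the device.
        child<<<(hi - lo + 127) / 128, 128>>>(val, col, x, &y[r], lo, hi);
    } else {
        // Light row: not worth a child launch; compute serially.
        float sum = 0.0f;
        for (int j = lo; j < hi; j++) sum += val[j] * x[col[j]];
        y[r] = sum;
    }
}
```

The per-row branch is the crux: the overhead of a device-side launch only pays off when a row carries enough work, which is exactly the kind of decision the conditional clauses in the proposal below expose to the programmer.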
In this paper, we propose an extension of the current OpenMP accelerator model that makes the use of DP easy and effective. The proposal is based on the nesting of teams constructs and on conditional clauses, and we show how the compiler can generate code that is then efficiently executed under dynamic runtime scheduling. The proposal has been implemented in the MACC compiler, which supports the OmpSs task-based programming model, and evaluated using three kernels with data access and computation patterns commonly found in real applications: sparse matrix-vector multiplication, breadth-first search, and divide-and-conquer Mandelbrot. Performance results show speed-ups in the 40x range relative to versions not using DP.
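The flavor of the proposal can be sketched on a CSR sparse matrix-vector product with standard OpenMP 4.0 directives; the nesting and the if clause below are our reading of the abstract, not the paper's exact syntax (MACC/OmpSs may spell it differently), and the cutoff value is hypothetical:

```c
#define CUTOFF 8  /* hypothetical threshold for enabling the nested region */

/* y = A*x for a CSR matrix. The outer region distributes rows across
 * teams; the inner (nested) region adds parallelism within a row, but
 * only when the row carries enough nonzeros to amortize the cost --
 * the compiler/runtime can map such a conditional nested region onto
 * a dynamic device-side launch. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp target teams distribute
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        int nnz = rowptr[r + 1] - rowptr[r];
        /* Nested teams construct, guarded by a conditional clause. */
        #pragma omp teams distribute parallel for reduction(+:sum) if(nnz > CUTOFF)
        for (int j = rowptr[r]; j < rowptr[r + 1]; j++)
            sum += val[j] * x[col[j]];
        y[r] = sum;
    }
}
```

Without an OpenMP-enabled compiler the pragmas are ignored and the code runs serially with the same result, which makes the directive-based approach easy to prototype on the host.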


Cited By

  • (2022) Performant portable OpenMP. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction, pp. 156-168. DOI: 10.1145/3497776.3517780. Online publication date: 19-Mar-2022.
  • (2016) POSTER. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pp. 423-424. DOI: 10.1145/2967938.2974056. Online publication date: 11-Sep-2016.
  • (2016) Multiple Target Task Sharing Support for the OpenMP Accelerator Model. OpenMP: Memory, Devices, and Tasks, pp. 268-280. DOI: 10.1007/978-3-319-45550-1_19. Online publication date: 21-Sep-2016.


Published In

WACCPD '15: Proceedings of the Second Workshop on Accelerator Programming using Directives
November 2015
68 pages
ISBN:9781450340144
DOI:10.1145/2832105
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CUDA
  2. GPGPU
  3. OmpSs
  4. OpenACC
  5. OpenMP
  6. compilers
  7. dynamic parallelism
  8. programming models

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

WACCPD '15 Paper Acceptance Rate: 7 of 14 submissions, 50%
Overall Acceptance Rate: 7 of 14 submissions, 50%
