
CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. To leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement our proposed CUDA-NP framework using a directive-based compiler approach: for a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections, and our CUDA-NP compiler automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations into threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks, which have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves the performance by up to 6.69 times (2.01 times on average).
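The abstract mentions that the CUDA-NP compiler "explores different ways to distribute parallel loop iterations into threads." As a rough illustration (not the authors' code; the function names here are hypothetical), the two standard mappings from loop iterations to the slave threads of one master thread can be sketched as:

```python
def block_distribution(n_iters, n_slaves):
    """Each slave thread takes one contiguous chunk of iterations."""
    chunk = (n_iters + n_slaves - 1) // n_slaves  # ceiling division
    return [list(range(s * chunk, min((s + 1) * chunk, n_iters)))
            for s in range(n_slaves)]

def cyclic_distribution(n_iters, n_slaves):
    """Iterations are interleaved: slave s takes s, s + n_slaves, ..."""
    return [list(range(s, n_iters, n_slaves)) for s in range(n_slaves)]
```

On a GPU, the choice matters for memory coalescing: with a cyclic mapping, adjacent slave threads touch adjacent array elements in the same step, whereas a block mapping keeps each thread's accesses contiguous in its own chunk. Which mapping wins depends on the loop body's access pattern, which is presumably why the compiler explores both.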



Author information


Corresponding author

Correspondence to Yi Yang.

Additional information

This work was supported by the U.S. National Science Foundation under Grant No. CCF-1216569 and by an NSF CAREER award under Grant No. CCF-0968667.


About this article


Cite this article

Yang, Y., Li, C. & Zhou, H. CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications. J. Comput. Sci. Technol. 30, 3–19 (2015). https://doi.org/10.1007/s11390-015-1500-y

