NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8967)

Abstract

The broad adoption of accelerators has boosted interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted quite rapidly, but it is proprietary and applicable only to GPUs, and the difficulty of writing efficient CUDA code has created the need for higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC offer programmers an option to rapidly create prototype applications by adding annotations to guide compiler optimizations. In this paper we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing the NAS Parallel Benchmarks (NPB) on GPGPUs. We present the application of techniques such as array privatization, memory coalescing, and cache optimization, and examine their impact on the performance of the benchmarks. The right choice or combination of techniques and hints is crucial for compilers to generate highly efficient code tuned to a particular type of accelerator; a poorly selected choice or combination can lead to degraded performance. We also propose a new clause, ‘scan’, that handles scan operations for arbitrary input array sizes. We hope that the practices discussed in this paper will provide useful guidance to users to effectively migrate their sequential/CPU-parallel codes to GPGPU architectures and achieve optimal performance.
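
As an illustration of array privatization, one of the techniques named in the abstract, a minimal sketch in C with OpenACC directives might look like the following. The function name `sum_of_squares`, the constant `NCOMP`, and the data layout are hypothetical and are not taken from the paper; the sketch only shows the general idea that a small scratch array declared outside a parallel loop needs a `private` clause so that each iteration works on its own copy. A compiler without OpenACC support ignores the pragma and runs the loop sequentially.

```c
#include <assert.h>
#include <stddef.h>

#define NCOMP 5  /* hypothetical: number of components stored per grid point */

/* For each of n points, square the NCOMP components into a scratch array
 * and then sum them. The scratch array tmp is declared outside the loop,
 * so without private(tmp) all parallel iterations would share it and race. */
void sum_of_squares(const double *u, double *out, int n)
{
    assert(u != NULL && out != NULL && n >= 0);
    double tmp[NCOMP];

    #pragma acc parallel loop private(tmp) copyin(u[0:n*NCOMP]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        /* Fill this iteration's own copy of tmp. */
        for (int c = 0; c < NCOMP; ++c) {
            tmp[c] = u[i * NCOMP + c] * u[i * NCOMP + c];
        }
        /* Reduce the private scratch array to a single value. */
        double s = 0.0;
        for (int c = 0; c < NCOMP; ++c) {
            s += tmp[c];
        }
        out[i] = s;
    }
}
```

Whether privatization of this kind helps or hurts on a given accelerator depends on where the compiler places the private copies, which is consistent with the abstract's point that a poorly chosen combination of techniques can degrade performance.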

Author information

Corresponding author

Correspondence to Rengan Xu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B. (2015). NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_5

  • DOI: https://doi.org/10.1007/978-3-319-17473-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-17472-3

  • Online ISBN: 978-3-319-17473-0

  • eBook Packages: Computer Science, Computer Science (R0)
