Abstract
The broad adoption of accelerators has boosted interest in accelerator programming. Accelerators such as GPGPUs are optimized for throughput and offer high GFLOPS and memory bandwidth. CUDA has been adopted rapidly, but it is proprietary and applicable only to GPUs, and the difficulty of writing efficient CUDA code has motivated higher-level programming approaches such as OpenACC. Directive-based programming models such as OpenMP and OpenACC let programmers rapidly create prototype applications by adding annotations that guide compiler optimizations. In this paper we study the effectiveness of a high-level directive-based programming model, OpenACC, for parallelizing the NAS Parallel Benchmarks (NPB) on GPGPUs. We present the application of techniques such as array privatization, memory coalescing, and cache optimization, and examine their impact on the performance of the benchmarks. The right choice or combination of techniques and hints is crucial for compilers to generate highly efficient code tuned to a particular type of accelerator; a poorly selected choice or combination can degrade performance. We also propose a new clause, ‘scan’, that handles scan operations for arbitrary input array sizes. We hope that the practices discussed in this paper will provide useful guidance to users to effectively migrate their sequential/CPU-parallel codes to GPGPU architectures and achieve optimal performance.
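To make the scan operation mentioned above concrete, the following is a minimal sketch in C of a blocked inclusive prefix sum that works for any input length. It is not the proposed ‘scan’ clause, whose syntax the abstract does not give. The function name `blocked_scan`, the block size `BLOCK`, and the placement and data clauses of the OpenACC directives are assumptions made for illustration. A compiler without OpenACC support ignores the directives and runs the code sequentially.

```c
#include <stdlib.h>

#define BLOCK 4  /* hypothetical block size, kept small for illustration */

/* Inclusive prefix sum of in[0..n-1] into out[0..n-1]; n may be any size. */
void blocked_scan(const int *in, int *out, int n)
{
    if (n <= 0) return;
    int nblocks = (n + BLOCK - 1) / BLOCK;  /* the last block may be partial */
    int *offset = malloc((size_t)nblocks * sizeof *offset);
    if (!offset) return;

    /* Phase 1: scan each block on its own; blocks are independent. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n], offset[0:nblocks])
    for (int b = 0; b < nblocks; b++) {
        int start = b * BLOCK;
        int end = (start + BLOCK < n) ? start + BLOCK : n;
        int sum = 0;
        for (int i = start; i < end; i++) {
            sum += in[i];
            out[i] = sum;
        }
        offset[b] = sum;  /* total of this block */
    }

    /* Phase 2: exclusive scan of the block totals (short, sequential). */
    int running = 0;
    for (int b = 0; b < nblocks; b++) {
        int total = offset[b];
        offset[b] = running;
        running += total;
    }

    /* Phase 3: shift every element by the sum of all preceding blocks. */
    #pragma acc parallel loop copy(out[0:n]) copyin(offset[0:nblocks])
    for (int i = 0; i < n; i++)
        out[i] += offset[i / BLOCK];

    free(offset);
}
```

The loop-carried dependence of a plain running sum is confined to the short second phase, so the first and third phases can be distributed across accelerator threads. Handling a partial last block is what lets the sketch accept an input length that is not a multiple of the block size.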
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Xu, R., Tian, X., Chandrasekaran, S., Yan, Y., Chapman, B. (2015). NAS Parallel Benchmarks for GPGPUs Using a Directive-Based Programming Model. In: Brodman, J., Tu, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2014. Lecture Notes in Computer Science, vol 8967. Springer, Cham. https://doi.org/10.1007/978-3-319-17473-0_5
DOI: https://doi.org/10.1007/978-3-319-17473-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17472-3
Online ISBN: 978-3-319-17473-0
eBook Packages: Computer Science (R0)