Abstract
Programming many-core systems with accelerators (e.g., GPUs) remains a challenging task, even for expert programmers. In the current low-level approaches, OpenCL and CUDA, two distinct programming models are employed: the host code for the CPU is written in C/C++ with a restricted memory model, while the device code for the accelerator is written in the device-dependent dialect of CUDA or OpenCL. The programmer is responsible for explicitly specifying parallelism, memory transfers, and synchronization, as well as for configuring the program and optimizing its performance for a particular many-core system. This leads to long, poorly structured, and error-prone code, often with suboptimal performance. We present PACXX, an alternative, unified programming approach for accelerators. In PACXX, both host and device programs are written in the same programming language: the C++14 standard with the Standard Template Library (STL), including all modern features such as type inference (auto), variadic templates, generic lambda expressions, and the newly proposed parallel extensions of the STL. PACXX includes an easy-to-use, type-safe API for multi-stage programming which allows for aggressive runtime compiler optimizations. We implement PACXX by developing a custom compiler (based on the Clang and LLVM frameworks) and a runtime system that together perform memory management and synchronization automatically and transparently for the programmer. We evaluate our approach by comparing it to OpenCL with respect to program size and runtime performance.
Acknowledgements
We would like to thank Michel Steuwer for many fruitful discussions and Nvidia Corp. for their generous hardware donation.
Cite this article
Haidl, M., Gorlatch, S. High-Level Programming for Many-Cores Using C++14 and the STL. Int J Parallel Prog 46, 23–41 (2018). https://doi.org/10.1007/s10766-017-0497-y