Abstract
Parallel programming has become ubiquitous; however, it remains a low-level and error-prone task, especially when accelerators such as GPUs are used. Algorithmic skeletons have therefore been proposed to provide well-defined programming patterns that assist programmers and shield them from low-level details. As the complexity of problems, and consequently the demand for computing capacity, grows, we have directed our research toward simultaneous CPU–GPU execution of data parallel skeletons in order to achieve a performance gain. GPUs are optimized for throughput and designed for massively parallel computations. Nevertheless, we analyze whether additionally utilizing the CPU for data parallel skeletons in the Muenster Skeleton Library yields speedups or instead reduces performance, given the smaller computational capacity of CPUs compared to GPUs. We present a C++ implementation based on a static distribution approach. To evaluate the implementation, four benchmarks have been conducted: matrix multiplication, N-body simulation, Frobenius norm, and ray tracing. The ratio of CPU to GPU execution has been varied manually to observe the effects of different distributions. The results show that a speedup can be achieved by distributing the execution among CPUs and GPUs; however, both the results and the optimal distribution depend strongly on the available hardware and the specific algorithm.
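To illustrate the static distribution idea described above, the following minimal CUDA/OpenMP sketch splits a map operation over a float array at a fixed ratio. All names here (e.g., mapSquareStatic, gpuRatio) are hypothetical and do not reflect the Muenster Skeleton Library's actual interface; the sketch merely shows how a fraction of the elements can be handed to the device while the host processes the remainder concurrently.

```cpp
// Hypothetical sketch (not the library's API): a map with a static CPU-GPU
// split. A fraction gpuRatio of the elements is processed by a CUDA kernel;
// the remainder runs concurrently on the CPU via OpenMP.
// Compile with, e.g.: nvcc -Xcompiler -fopenmp static_map.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void squareKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

void mapSquareStatic(const float* in, float* out, int n, float gpuRatio) {
    int nGpu = static_cast<int>(n * gpuRatio);  // elements assigned to the device

    float *dIn = nullptr, *dOut = nullptr;
    if (nGpu > 0) {
        cudaMalloc(&dIn, nGpu * sizeof(float));
        cudaMalloc(&dOut, nGpu * sizeof(float));
        cudaMemcpy(dIn, in, nGpu * sizeof(float), cudaMemcpyHostToDevice);
        int threads = 256, blocks = (nGpu + threads - 1) / threads;
        squareKernel<<<blocks, threads>>>(dIn, dOut, nGpu);  // asynchronous launch
    }

    // The CPU portion overlaps with the device kernel, since the launch
    // above returns immediately.
    #pragma omp parallel for
    for (int i = nGpu; i < n; ++i) out[i] = in[i] * in[i];

    if (nGpu > 0) {
        // This copy runs on the default stream and thus waits for the kernel.
        cudaMemcpy(out, dOut, nGpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dIn);
        cudaFree(dOut);
    }
}

int main() {
    const int n = 1 << 20;
    float *in = new float[n], *out = new float[n];
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);
    mapSquareStatic(in, out, n, 0.75f);  // e.g., 75% of elements on the GPU
    std::printf("out[3] = %.1f\n", out[3]);
    delete[] in;
    delete[] out;
    return 0;
}
```

Varying gpuRatio between 0 (CPU only) and 1 (GPU only) corresponds to the manual variation of the CPU-to-GPU execution ratio evaluated in the benchmarks.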
Notes
In accordance with CUDA terminology, we also refer to GPUs as devices.
Cite this article
Wrede, F., Ernsting, S. Simultaneous CPU–GPU Execution of Data Parallel Algorithmic Skeletons. Int J Parallel Prog 46, 42–61 (2018). https://doi.org/10.1007/s10766-016-0483-9