ABSTRACT
Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least a multi-core CPU and a GPU in the same system. However, exploiting the performance and energy efficiency of these diverse processing elements does not come for free from a software point of view: programmers need to a) code each activity through the specific approaches, libraries, and frameworks suited to its target architecture (e.g., CPUs and GPUs), along with orchestrating such heterogeneous execution, and b) decide how to distribute sequential and parallel activities across the different parallel hardware resources available.
Current frameworks typically provide interfaces that are either low-level and target-specific or generic but not high-performance, which complicates the exploration of different assignments of tasks, linked by DAG (Directed Acyclic Graph) precedence relationships, to the available heterogeneous resources. To enable such exploration, tasks would typically need to be coded once for each target architecture, due to the profound differences in their programming models.
In this work, we add support for tasks and DAGs of data-parallel tasks to the single-source PHAST library, which currently supports both multi-core CPUs and NVIDIA GPUs, so that tasks are coded in a target-agnostic fashion and their mapping to multi-core or GPU architectures is automatic and efficient. Integrating this coding approach with tasks helps postpone the choice of the execution platform for each task until the testing phase, or even until runtime.
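As an illustration of the coding model described above, the following minimal sketch (hypothetical names and structures, not the actual PHAST API) shows how a task could be written once as a device-agnostic callable and organized in a small dependency DAG, with the CPU/GPU choice for each task deferred to run time:

```cpp
// Illustrative sketch only: hypothetical types and task names, not the PHAST API.
// Concept shown: each task is coded once, the device it runs on is chosen late.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

enum class Device { CPU, GPU };   // target chosen per task at run time

struct Task {
    std::string name;
    std::vector<int> deps;                // indices of predecessor tasks
    std::function<void(Device)> body;     // single source, device-agnostic
};

int main() {
    // A tiny 3-node DAG: blur precedes both sharpen and threshold (hypothetical stages).
    std::vector<Task> dag = {
        {"blur",      {},  [](Device d){ std::printf("blur on %s\n",      d == Device::GPU ? "GPU" : "CPU"); }},
        {"sharpen",   {0}, [](Device d){ std::printf("sharpen on %s\n",   d == Device::GPU ? "GPU" : "CPU"); }},
        {"threshold", {0}, [](Device d){ std::printf("threshold on %s\n", d == Device::GPU ? "GPU" : "CPU"); }},
    };

    // Per-task mapping decided late (e.g., by a scheduler, a config file, or a search).
    std::vector<Device> mapping = {Device::GPU, Device::CPU, Device::GPU};

    // Execute in topological order (tasks are listed so that every
    // dependency precedes its dependents in this small example).
    for (std::size_t i = 0; i < dag.size(); ++i)
        dag[i].body(mapping[i]);
}
```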
Finally, we demonstrate the effects of this approach on a sample image-pipeline benchmark from the computer vision domain. We compare our implementation with a SYCL implementation from a productivity point of view. We also show that various task assignments can be explored seamlessly by implementing both the PEFT (Predict Earliest Finish Time) mapping technique and an exhaustive search of the mapping space.
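To give a concrete feel for what exploring the mapping space can look like, the sketch below exhaustively enumerates CPU/GPU assignments for a small task DAG and keeps the one with the smallest simulated makespan. The cost numbers, task structure, and simplified scheduling model (one task at a time per device, no data-transfer costs) are illustrative assumptions; this is not PEFT itself and not a measurement from the paper's benchmark:

```cpp
// Minimal sketch of exhaustive mapping-space exploration under assumed costs.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node {
    std::vector<int> deps;   // predecessor indices (nodes are topologically ordered)
    double cost[2];          // assumed execution time on {CPU, GPU}
};

// Simulate the DAG under a given mapping (bit i: 0 = CPU, 1 = GPU) and return the makespan.
static double makespan(const std::vector<Node>& dag, unsigned mapping) {
    std::vector<double> finish(dag.size(), 0.0);
    double device_free[2] = {0.0, 0.0};           // next free time per device
    for (std::size_t i = 0; i < dag.size(); ++i) {
        int dev = (mapping >> i) & 1u;
        double ready = device_free[dev];
        for (int d : dag[i].deps)                 // wait for all predecessors
            ready = std::max(ready, finish[d]);
        finish[i] = ready + dag[i].cost[dev];
        device_free[dev] = finish[i];
    }
    return *std::max_element(finish.begin(), finish.end());
}

int main() {
    // Hypothetical 4-stage pipeline: load, then blur and sharpen in parallel, then merge.
    std::vector<Node> dag = {
        {{},     {2.0, 3.0}},   // load: faster on CPU
        {{0},    {8.0, 2.0}},   // blur: faster on GPU
        {{0},    {6.0, 2.5}},   // sharpen: faster on GPU
        {{1, 2}, {1.5, 2.0}},   // merge: faster on CPU
    };

    unsigned best = 0;
    double best_time = makespan(dag, 0);
    for (unsigned m = 1; m < (1u << dag.size()); ++m) {
        double t = makespan(dag, m);
        if (t < best_time) { best_time = t; best = m; }
    }
    std::printf("best makespan %.2f with mapping 0x%X (bit i = device of task i)\n",
                best_time, best);
}
```

The same driver could swap a heuristic such as PEFT in place of the exhaustive loop when the number of tasks makes full enumeration impractical.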