ABSTRACT
Nowadays, the majority of desktop, mobile, and embedded devices in the consumer and industrial markets are heterogeneous, as they contain at least a multi-core CPU and a GPU in the same system. However, exploiting the performance and energy efficiency of these diverse processing elements does not come for free from a software point of view: programmers need to a) code each activity through the specific approaches, libraries, and frameworks suited to its target architecture (e.g., CPUs and GPUs), along with orchestrating such heterogeneous execution, and b) decide how to distribute sequential and parallel activities across the different parallel hardware resources available.
Current frameworks typically provide interfaces that are either low-level and target-specific or generic but not high-performance, which complicates the exploration of different assignments of tasks, linked by DAG (Directed Acyclic Graph) precedence relationships, to the available heterogeneous resources. To enable such exploration, tasks would typically need to be coded once for each target architecture, due to the profound differences in their programming models.
In this work, we add support for tasks and DAGs of data-parallel tasks to the single-source PHAST library, which currently supports both multi-core CPUs and NVIDIA GPUs, so that tasks are coded in a target-agnostic fashion and their mapping to multi-core or GPU architectures is automatic and efficient. Integrating this coding approach with tasks helps postpone the choice of the execution platform for each task until the testing phase, or even until runtime.
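As an illustration of the coding model described above, the following minimal sketch (hypothetical names and structures, not the actual PHAST API) shows how a task could be written once as a device-agnostic callable and organized in a small dependency DAG, with the CPU/GPU choice for each task deferred to run time:

```cpp
// Illustrative sketch only: hypothetical types and task names, not the PHAST API.
// Concept shown: each task is coded once, the device it runs on is chosen late.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

enum class Device { CPU, GPU };   // target chosen per task at run time

struct Task {
    std::string name;
    std::vector<int> deps;                // indices of predecessor tasks
    std::function<void(Device)> body;     // single source, device-agnostic
};

int main() {
    // A tiny 3-node DAG: blur precedes both sharpen and threshold (hypothetical stages).
    std::vector<Task> dag = {
        {"blur",      {},  [](Device d){ std::printf("blur on %s\n",      d == Device::GPU ? "GPU" : "CPU"); }},
        {"sharpen",   {0}, [](Device d){ std::printf("sharpen on %s\n",   d == Device::GPU ? "GPU" : "CPU"); }},
        {"threshold", {0}, [](Device d){ std::printf("threshold on %s\n", d == Device::GPU ? "GPU" : "CPU"); }},
    };

    // Per-task mapping decided late (e.g., by a scheduler, a config file, or a search).
    std::vector<Device> mapping = {Device::GPU, Device::CPU, Device::GPU};

    // Execute in topological order (tasks are listed so that every
    // dependency precedes its dependents in this small example).
    for (std::size_t i = 0; i < dag.size(); ++i)
        dag[i].body(mapping[i]);
}
```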
Finally, we demonstrate the effects of this approach on a sample image-pipeline benchmark from the computer vision domain. We compare our implementation with a SYCL implementation from a productivity point of view. We also show that various task assignments can be explored seamlessly by implementing both the PEFT (Predict Earliest Finish Time) mapping technique and an exhaustive search of the mapping space.
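To give a concrete feel for what exploring the mapping space can look like, the sketch below exhaustively enumerates CPU/GPU assignments for a small task DAG and keeps the one with the smallest simulated makespan. The cost numbers, task structure, and simplified scheduling model (one task at a time per device, no data-transfer costs) are illustrative assumptions; this is not PEFT itself and not a measurement from the paper's benchmark:

```cpp
// Minimal sketch of exhaustive mapping-space exploration under assumed costs.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node {
    std::vector<int> deps;   // predecessor indices (nodes are topologically ordered)
    double cost[2];          // assumed execution time on {CPU, GPU}
};

// Simulate the DAG under a given mapping (bit i: 0 = CPU, 1 = GPU) and return the makespan.
static double makespan(const std::vector<Node>& dag, unsigned mapping) {
    std::vector<double> finish(dag.size(), 0.0);
    double device_free[2] = {0.0, 0.0};           // next free time per device
    for (std::size_t i = 0; i < dag.size(); ++i) {
        int dev = (mapping >> i) & 1u;
        double ready = device_free[dev];
        for (int d : dag[i].deps)                 // wait for all predecessors
            ready = std::max(ready, finish[d]);
        finish[i] = ready + dag[i].cost[dev];
        device_free[dev] = finish[i];
    }
    return *std::max_element(finish.begin(), finish.end());
}

int main() {
    // Hypothetical 4-stage pipeline: load, then blur and sharpen in parallel, then merge.
    std::vector<Node> dag = {
        {{},     {2.0, 3.0}},   // load: faster on CPU
        {{0},    {8.0, 2.0}},   // blur: faster on GPU
        {{0},    {6.0, 2.5}},   // sharpen: faster on GPU
        {{1, 2}, {1.5, 2.0}},   // merge: faster on CPU
    };

    unsigned best = 0;
    double best_time = makespan(dag, 0);
    for (unsigned m = 1; m < (1u << dag.size()); ++m) {
        double t = makespan(dag, m);
        if (t < best_time) { best_time = t; best = m; }
    }
    std::printf("best makespan %.2f with mapping 0x%X (bit i = device of task i)\n",
                best_time, best);
}
```

The same driver could swap a heuristic such as PEFT in place of the exhaustive loop when the number of tasks makes full enumeration impractical.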