research-article

Open access

Hardware support for balanced co-execution in heterogeneous processors

Authors:

Jose Luis Bosque

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

Pages 106 - 114

https://doi.org/10.1145/3649153.3649208

Published: 02 July 2024

Abstract

Heterogeneous systems are the go-to solution in computing, ranging from HPC to mobile, due to their excellent performance and energy efficiency. However, using these systems adequately poses challenges. Namely, each of the devices that comprise the system is often treated as an independent entity that must be managed, and dispatched work, manually. This places a significant burden on programmers and often results in a fastest-device-only approach, in which compute-intensive regions are offloaded to the fastest device available while the rest of the system idles. This idling wastes computing capabilities that could be leveraged if the workload were co-executed. Software solutions have been proposed to provide transparent co-execution, but they always trade off abstraction and ease of use against performance. In general, a higher level of abstraction, which improves programmability, generates overheads. This paper presents HCoD (Hardware co-execution Dispatcher), a design for a hardware dispatcher that enables transparent co-execution without these overheads in integrated heterogeneous SoCs. The dispatcher distributes the work associated with a single kernel among CPU cores and GPU compute units at runtime, while monitoring co-execution to balance the load and prevent a slow device from delaying computation. HCoD achieves an excellent balance among all the compute elements and improves performance by an average of 14% by transparently leveraging the computing capabilities already available in the hardware.
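The co-execution idea in the abstract can be illustrated with a small software simulation. The sketch below is not HCoD's hardware design, which this page does not detail; it only contrasts a fastest-device-only policy with a generic demand-driven scheme in which whichever device becomes idle takes the next chunk of the kernel's work. The device names, throughput values, chunk size, and function names are all assumptions made for illustration.

```python
import heapq


def dynamic_dispatch(total_items, chunk, rates):
    """Simulate demand-driven dispatch of fixed-size chunks to devices.

    rates maps a hypothetical device name to its throughput in work
    items per time unit. Returns (makespan, items_done_per_device).
    """
    # Min-heap of (time the device becomes free, device name).
    # All devices start idle at time 0; ties are broken by name.
    free_at = [(0.0, name) for name in sorted(rates)]
    heapq.heapify(free_at)
    done = {name: 0 for name in rates}
    remaining = total_items
    makespan = 0.0
    while remaining > 0:
        # The device that becomes idle first takes the next chunk.
        now, name = heapq.heappop(free_at)
        size = min(chunk, remaining)
        remaining -= size
        finish = now + size / rates[name]
        done[name] += size
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, name))
    return makespan, done


def fastest_only(total_items, rates):
    """Baseline: offload everything to the fastest device."""
    return total_items / max(rates.values())


if __name__ == "__main__":
    # Made-up throughputs: the GPU is 4x faster than the CPU.
    rates = {"cpu": 1.0, "gpu": 4.0}
    makespan, done = dynamic_dispatch(1000, 10, rates)
    print("co-execution makespan:", makespan, "split:", done)
    print("fastest-device-only makespan:", fastest_only(1000, rates))
```

In this toy model the slower device still contributes a share of the work proportional to its speed, so the simulated co-execution finishes earlier than the fastest-device-only baseline. The numbers it prints describe only the made-up throughputs above and say nothing about the results reported in the paper.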

Index Terms

Hardware support for balanced co-execution in heterogeneous processors
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

May 2024

345 pages

ISBN: 9798400705977

DOI: 10.1145/3649153

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Spanish Science and Technology Commission

Conference

CF '24

Sponsor:

SIGMICRO

CF '24: 21st ACM International Conference on Computing Frontiers

May 7 - 9, 2024

Ischia, Italy

Acceptance Rates

CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;

Overall Acceptance Rate 273 of 785 submissions, 35%
