DOI: 10.1145/3649153.3649208
research-article
Open access

Hardware support for balanced co-execution in heterogeneous processors

Published: 02 July 2024

Abstract

Heterogeneous systems are the go-to solution in computing, from HPC to mobile, due to their excellent performance and energy efficiency. However, using these systems adequately poses challenges. Namely, each device in the system is often treated as an independent entity that must be managed, and given work, manually. This places a significant burden on the programmer and often results in a fastest-device-only approach, in which compute-intensive regions are offloaded to the fastest device available while the rest of the system idles. This idling wastes computing capabilities that could be leveraged if the workload were co-executed. Software solutions have been proposed to provide transparent co-execution, but they always trade performance for abstraction and ease of use: in general, a higher level of abstraction, which improves programmability, generates overheads. This paper presents HCoD (Hardware Co-execution Dispatcher), a design for a hardware dispatcher that enables transparent co-execution without these overheads in integrated heterogeneous SoCs. The dispatcher distributes the work associated with a single kernel among CPU cores and GPU compute units at runtime, while monitoring co-execution to balance the load and prevent a slow device from delaying computation. HCoD achieves an excellent balance among all the compute elements and improves performance by an average of 14% by transparently leveraging the computing capabilities already available in the hardware.
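The contrast between a fastest-device-only approach and balanced co-execution can be illustrated with a small software model. The sketch below is not HCoD's mechanism, which the abstract does not detail: it assumes a generic greedy scheme in which whichever device becomes idle first receives the next fixed-size chunk of work items. The device names, throughputs, chunk size, and the function names `fastest_only` and `co_execute` are all hypothetical, chosen only for illustration.

```python
import heapq


def fastest_only(total_items, rates):
    """Time to run every work item on the single fastest device."""
    return total_items / max(rates.values())


def co_execute(total_items, rates, chunk):
    """Greedy dynamic dispatch: the first device to become idle gets the next chunk.

    Returns (makespan, work items completed per device).
    """
    # Priority queue of (time at which the device becomes idle, device name).
    idle = [(0.0, name) for name in sorted(rates)]
    heapq.heapify(idle)
    done = {name: 0 for name in rates}
    finish = {name: 0.0 for name in rates}
    remaining = total_items
    while remaining > 0:
        t, name = heapq.heappop(idle)
        n = min(chunk, remaining)
        remaining -= n
        t_end = t + n / rates[name]  # time this device needs for the chunk
        done[name] += n
        finish[name] = t_end
        heapq.heappush(idle, (t_end, name))
    return max(finish.values()), done


# Hypothetical throughputs in work items per time unit (not taken from the paper).
RATES = {"gpu": 8.0, "cpu0": 1.0, "cpu1": 1.0}

baseline = fastest_only(8000, RATES)
makespan, done = co_execute(8000, RATES, chunk=10)
print("fastest-device-only time:", baseline)
print("co-execution time:", makespan)
print("items per device:", done)
```

In this model the slow devices take work only when they are idle, so the finishing time cannot exceed the ideal balanced time by more than the duration of one chunk on the slowest device. Smaller chunks tighten that bound but mean more dispatch operations, which is the kind of overhead a software runtime has to pay.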


    Published In

    CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers
    May 2024
    345 pages
    ISBN:9798400705977
    DOI:10.1145/3649153
    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Co-Execution
    2. GPUs
    3. Hardware Support
    4. Heterogeneous Systems
    5. Load Balancing

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Spanish Science and Technology Commission

    Conference

    CF '24
    Acceptance Rates

    CF '24 Paper Acceptance Rate 33 of 105 submissions, 31%;
    Overall Acceptance Rate 273 of 785 submissions, 35%
