Abstract:
To meet the rapidly growing computation requirements of big data and artificial intelligence, CPU-GPU heterogeneous clusters provide more computing capacity than CPU-only clusters. The highly parallel computing capabilities of GPUs greatly accelerate computation-intensive applications. Moreover, the number of GPUs on a single computing node is scalable, which greatly improves the computing capacity of a cluster of limited size. However, effective load-balancing scheduling models for multi-GPU hardware environments are lacking. This article proposes AEML, an acceleration engine for multi-GPU load balancing in distributed heterogeneous environments. AEML effectively integrates GPUs into a distributed processing framework and achieves good load balance among multiple heterogeneous GPUs. We propose a heterogeneous task execution model based on multiple GPUs and multiple streams (MGMS), which effectively balances the workload across multiple GPUs. The MGMS model uses four core techniques: a fine-grained task mapping mechanism, a unified device resource management scheme, a novel resource-aware GPU task scheduling strategy, and a feedback-based stream adjustment scheme. The AEML system is implemented on Spark 3.0.0 and NVIDIA CUDA 10.0. We comprehensively evaluate the performance of AEML with multiple typical benchmarks. Experimental results show that AEML can fully exploit the computing power of GPUs and achieve good load balance among multiple heterogeneous GPUs.
Published in: IEEE Transactions on Computers (Volume: 71, Issue: 6, 01 June 2022)