Abstract:
A significant portion of research in the past decade has been devoted to developing resource allocation and task scheduling solutions for large-scale data processing platforms. Such algorithms are designed to facilitate the deployment of data analytic applications across either conventional cluster computing systems or modern virtualized data centers. The main reason for this large research effort is that even a slight improvement in the performance of such platforms can bring considerable monetary savings for vendors, especially for modern data processing engines designed solely to perform high-throughput and/or low-latency computations over massive-scale batch or streaming data. A challenging question yet to be answered in this context is how to design an effective resource allocation solution that can prevent low resource utilization while meeting an enforced performance level (such as the 99th latency percentile) in circumstances where contention among applications for the capacity of shared resources is a non-negligible performance-limiting factor. This paper proposes a resource controller system, called QSpark, to cope with the problems of (i) low performance (i.e., low resource utilization in the batch mode and high p-99 response time in the streaming mode) and (ii) shared-resource interference among collocated applications in a multi-tenant modern Spark platform. The proposed solution leverages a set of controlling mechanisms for dynamic partitioning of computing resource allocations, so that it can fulfill the QoS requirements of latency-critical data processing applications while enhancing the throughput of all worker nodes without reaching their saturation points.
Through extensive experiments on our in-house Spark cluster, we compared the performance of the proposed solution against the default Spark resource allocation policy for a variety of Machine Learning (ML), Artificial Intelligence (AI), and De...
Date of Conference: 23-26 November 2021
Date Added to IEEE Xplore: 31 January 2022