Big Data Research

Volume 29, 28 August 2022, 100329

XDataExplorer: A Three-Stage Comprehensive Self-Tuning Tool for Big Data Platforms

https://doi.org/10.1016/j.bdr.2022.100329

Abstract

To meet the challenges of massive data, many big data platforms have been put into practice. These data processing platforms expose many correlated parameters that affect processing performance, so it is challenging for users in different roles to configure them properly. This paper proposes XDataExplorer, a new comprehensive self-tuning tool for big data platforms. It is based on a three-stage optimization approach that improves performance successively at the system level, at the application level, and through fine-grain tuning. System-level optimization is guided by expert knowledge that is used to update the system variables. Application-level optimization is achieved with rule-based heuristic methods that operate on metrics computed from recent application history. The final fine-grain tuning stage uses a hill-climbing algorithm to iteratively examine combinations of system-level and application-level parameters. Several performance optimization best practices, expert insights, and heuristic rules are also summarized in this paper. Through these stages of gradual tuning, the proposed tool can rapidly improve the productivity of a big data platform and make applications run more efficiently. System evaluations show that, with the suggested configurations, performance improves by 15% to 60% across different workloads compared to the default configurations.

Introduction

In recent years, information technology and data science have developed rapidly and been adopted widely. In many domains, large amounts of data are accumulating and growing quickly, driving us into the era of big data [1]. Big data analytics has become a focus of academia, industry, and governments around the world [2]. Compared to traditional data, big data poses challenges that include its sheer volume, diverse data modalities, complex data sources, timely response requirements, and inherent uncertainty [3]. To manage these difficulties and challenges, major data holders such as Google, Facebook, and Cloudera have made considerable efforts and launched various types of big data processing systems [4]. Many technologies and data processing systems are now used in practice, including distributed systems such as HDFS, MapReduce, Hive, HBase, and Spark [5]. These systems simplify large-scale parallel data analytics and provide value-added services to users.

However, different data processing architectures are typically designed to meet different application requirements, so big data platform architectures have become complex. If a platform is configured with proper parameters, the performance of jobs running on it improves; otherwise, inappropriate configurations can lead to severe performance loss. Most scientists focus on domain problems and on designing intelligent modules, so configuring a big data platform is difficult for them. Conversely, system engineers are familiar with platform configuration but often do not understand the context of applications or the goals of programs; therefore, optimizing the performance of big data applications remains nontrivial. Moreover, big data platforms have many correlated parameters, and even a minor misconfiguration can lead to significant performance loss [6]. Users may care about running time, response time, throughput, and/or energy efficiency. Traditional relational databases provide optimizers or automatic tuning tools to manage these complexities; in big data platforms, the problem is more complex, and more advanced tools are required.

Most existing tuning tools are designed to collect system utilization information and provide monitoring metrics for jobs running on the platform. They simplify cluster management, but few of them help users optimize job flows. Previous studies have investigated performance tuning and optimization for specific big data platforms or applications. In [6], Yeh et al. used different machine learning algorithms on collected and preprocessed data to select the best configuration; however, selecting suitable algorithms is nontrivial, and training machine learning models often takes a long time. In [7], Gounaris et al. proposed a tuning methodology for Spark that uses a graph-based algorithm to build complex candidate configurations and then chooses the one with the best performance; however, it targets only Spark, and the parameters studied are limited to shuffling, compression, and serialization. In [8], a big data application performance tuning tool for the high-performance computing cluster (HPCC) Systems platform was developed to provide performance monitoring and analysis that help users detect application hotspots. Although its fine-grained analysis functions are useful, they are not fully automated and require user interaction to achieve the expected performance.

To overcome the aforementioned problems, we propose XDataExplorer, an automatic framework to provide analysis and tuning capabilities for a big data system. XDataExplorer has been used as a module in a commercial software product called XData, and we summarize several best practices and useful parameter tuning suggestions in this paper. XDataExplorer can automatically collect all relevant metrics, analyze application performance, and provide recommendations about how to tune relevant parameters to improve the system efficiency. The contributions of this paper include the following:

  • We propose a new comprehensive self-tuning framework for big data platforms based on a three-stage optimization approach. The framework optimizes performance successively at the system level, at the application level, and through fine-grain tuning. Through these stages of gradual tuning, it can rapidly improve the productivity of a big data platform and make applications run more efficiently. The framework uses a pluggable architecture that can be extended to support different software systems and to add new tuning rules.

  • We use expert knowledge to update the platform variables for system-level optimization. Application-level optimization is achieved with rule-based heuristic methods that operate on metrics computed from recent application history. Several performance optimization best practices, expert insights, and heuristic rules are summarized and added to a database, so that parameters can be quickly recommended based on the cluster deployment or the characteristics of a specific application (an illustrative sketch of such rule matching appears after this list).

  • We propose a hill-climbing algorithm for fine-grain tuning that considers both system-level and application-level variables. The algorithm runs workloads directly with candidate configurations on test data sets and then uses the results of iterative experiments to select the best combination of parameters (a sketch of this procedure also follows the list). Ultimately, the productivity of a given system is improved, and applications run more efficiently.
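
To make the application-level, rule-based optimization in the second point above more concrete, the following Python sketch shows one way a rule store could be matched against metrics collected from recent application history. The Rule class, metric names, thresholds, and suggested values are hypothetical illustrations rather than XDataExplorer's actual rule base; the configuration keys mapreduce.task.io.sort.mb and spark.executor.memory are standard Hadoop/Spark parameters used here only as examples.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    applies: Callable[[Dict[str, float]], bool]            # predicate over collected metrics
    suggest: Callable[[Dict[str, float]], Dict[str, str]]  # parameter suggestions

# Illustrative heuristic rules; the thresholds and recommended values are hypothetical.
RULES: List[Rule] = [
    Rule(
        name="spill-heavy map tasks",
        applies=lambda m: m.get("spilled_records", 0) > 2 * m.get("map_output_records", 1),
        suggest=lambda m: {"mapreduce.task.io.sort.mb": "512"},
    ),
    Rule(
        name="GC-bound executors",
        applies=lambda m: m.get("gc_time_ratio", 0.0) > 0.15,
        suggest=lambda m: {"spark.executor.memory": "8g"},
    ),
]

def recommend(metrics: Dict[str, float]) -> Dict[str, str]:
    """Merge the suggestions of every rule whose predicate matches the metrics."""
    config: Dict[str, str] = {}
    for rule in RULES:
        if rule.applies(metrics):
            config.update(rule.suggest(metrics))
    return config

# Example: metrics computed from recent application history.
print(recommend({"spilled_records": 3e6, "map_output_records": 1e6, "gc_time_ratio": 0.2}))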

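The fine-grain tuning stage in the last point can likewise be sketched as a simple greedy hill climb over candidate configurations. The sketch below assumes a run_workload callback that executes the workload on a test data set with a candidate configuration and returns its running time; the parameter grids, starting point, and stand-in cost function are hypothetical, and the paper's actual search procedure may differ in its neighbourhood definition and stopping criteria.

from typing import Callable, Dict, List

def hill_climb(
    grids: Dict[str, List[str]],                      # candidate values per parameter
    run_workload: Callable[[Dict[str, str]], float],  # runs the workload, returns runtime
    start: Dict[str, str],
    max_iters: int = 20,
) -> Dict[str, str]:
    """Greedy hill climbing: repeatedly move to the best single-parameter change."""
    best, best_time = dict(start), run_workload(start)
    for _ in range(max_iters):
        improved = False
        for param, values in grids.items():
            for value in values:
                if value == best[param]:
                    continue
                candidate = {**best, param: value}     # neighbour differing in one parameter
                runtime = run_workload(candidate)
                if runtime < best_time:
                    best, best_time, improved = candidate, runtime, True
        if not improved:                               # local optimum reached
            break
    return best

# Example usage with hypothetical parameter grids and a placeholder cost function.
grids = {"spark.sql.shuffle.partitions": ["100", "200", "400"],
         "spark.executor.memory": ["4g", "8g", "16g"]}

def fake_cost(config: Dict[str, str]) -> float:
    # Placeholder for a real benchmark run on a test data set.
    return float(len("".join(config.values())))

print(hill_climb(grids, fake_cost, {"spark.sql.shuffle.partitions": "200",
                                    "spark.executor.memory": "4g"}))
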
The remainder of this paper is organized as follows. The next section provides an overview of relevant background information and related studies. Section 3 presents the architecture of the proposed tuning tool and explains its implementation in detail. Section 4 presents experiments that demonstrate the performance optimization achieved by the proposed tool, and Section 5 concludes the paper.

Background and related work

In this section, we present relevant background information and related studies.

System design and implementation

In this section, we describe the design details of XDataExplorer. We present an overview of its architecture and then discuss its implementation in detail.

System evaluation

In this section, we present experimental evaluations of XDataExplorer. We use a 4-node cluster, where each node is equipped with two 16-core processors running at 2.4 GHz, 256 GB DDR4 memory, two 600 GB SAS disks, twelve 6 TB SATA disks, and dual 10-gigabit Ethernet cards. The nodes are connected to each other through a 10-gigabit switch.

XDataExplorer now supports performance tuning for HDFS, MapReduce, Spark, Hive and HBase. Users typically use industry benchmarks to evaluate the performance of

Conclusion

In this paper, we propose a three-stage comprehensive self-tuning tool for big data platforms. Through the successive stages of gradual tuning, the number of candidate parameters and their search ranges decrease progressively; thus, the proposed tool can quickly improve the productivity of a big data platform and make it perform more efficiently. In the first step of system-level tuning, we use knowledge rules that are stored in an expert system to promptly adjust important parameters according to cluster deployment

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  • L. Xu et al., ECL-Watch: a big data application performance tuning tool in the HPCC Systems platform.

  • Tom White, Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale (2014).

  • Eric Sammer, Hadoop Operations (2012).

  • The Apache Software Foundation, HBase.
