XDataExplorer: A Three-Stage Comprehensive Self-Tuning Tool for Big Data Platforms
Introduction
In recent years, information technology and data science have developed rapidly and been widely adopted. In many domains, large amounts of data are accumulating and growing quickly, driving us into the era of big data [1]. Big data analytics has become a focus of academia, industry, and governments around the world [2]. Compared to traditional data, the challenges posed by big data include its sheer volume, diverse modalities, complex sources, timely response requirements, and uncertainties [3]. To meet these challenges, major big data holders such as Google, Facebook, and Cloudera have launched various types of big data processing systems [4]. Many such technologies and systems are now used in practice, including distributed systems such as HDFS, MapReduce, Hive, HBase, and Spark [5]. These systems simplify large-scale parallel data analytics and provide value-added services to users.
However, different data processing architectures are typically designed to meet different application requirements, so big data platform architectures have become complex. If a platform is configured with proper parameters, jobs running on it perform well; inappropriate configurations, by contrast, can cause severe performance loss. Most scientists focus on domain problems and on designing intelligent modules, so configuring a big data platform is difficult for them. Conversely, system engineers are familiar with platform configuration but often do not understand the context and goals of applications; optimizing the performance of big data applications therefore remains nontrivial. Moreover, big data platforms have many correlated parameters, and even a minor misconfiguration can lead to significant performance loss [6]. Users may care about running time, response time, throughput, and/or energy efficiency. Traditional relational databases offer optimizers and automatic tuning tools to manage such complexity; in big data platforms, the problem is harder, and more advanced tools are required.
Most existing tuning tools collect system utilization information and provide monitoring metrics for jobs running on the platform. They simplify cluster management, but few of them help users optimize job flows. Previous studies have investigated performance tuning and optimization for specific big data platforms or applications. In [6], Yeh et al. applied different machine learning algorithms to collected and preprocessed data to select the best configuration; however, choosing suitable algorithms is complex, and training the models often takes a long time. In [7], Gounaris et al. proposed a tuning methodology for Spark that uses a graph-based algorithm to build complex candidate configurations and then chooses the one with the best performance; it applies only to Spark, however, and the parameters studied are limited to shuffling, compression, and serialization. In [8], a performance tuning tool for big data applications on a high-performance computing cluster (HPCC) system was developed to provide monitoring and analysis tools that help users detect application hotspots. Although its many fine-grained analysis functions are useful, they are not fully automated and require user interaction to achieve the expected performance.
To overcome the aforementioned problems, we propose XDataExplorer, an automatic framework to provide analysis and tuning capabilities for a big data system. XDataExplorer has been used as a module in a commercial software product called XData, and we summarize several best practices and useful parameter tuning suggestions in this paper. XDataExplorer can automatically collect all relevant metrics, analyze application performance, and provide recommendations about how to tune relevant parameters to improve the system efficiency. The contributions of this paper include the following:
- We propose a new comprehensive self-tuning framework for big data platforms based on a three-stage optimization approach. The framework optimizes performance successively at the system level, at the application level, and through fine-grained tuning. Through these stages of gradual tuning, it can rapidly improve the productivity of a big data platform and make applications run more efficiently. Its pluggable architecture can be extended to support additional software systems and new tuning rules.
- We use expert knowledge to update platform variables for system-level optimization. Application-level optimization is achieved with rule-based heuristic methods applied to metrics computed from recent application history. Best practices, expertise, and heuristic rules for performance optimization are summarized and stored in a database, from which parameters can be quickly recommended based on the cluster deployment or specific application characteristics.
- We propose a hill-climbing algorithm for fine-grained tuning that considers both system-level and application-level variables. The algorithm runs workloads directly with candidate configurations on test data sets and uses the results of iterative experiments to find the best combination of parameters. Ultimately, system productivity improves and applications run more efficiently.
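The rule-based application-level stage can be pictured as a small rule engine: each rule pairs a condition over collected job metrics with a parameter recommendation, and all firing rules contribute suggestions. The sketch below is illustrative only; the rule names, metric keys, and suggested values are assumptions for the example, not XDataExplorer's actual rule database (the Hadoop and Spark parameter names are real, but the thresholds are invented).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """A heuristic tuning rule: a condition over metrics plus a suggestion."""
    name: str
    condition: Callable[[Dict[str, float]], bool]  # fires on collected metrics
    suggestion: Dict[str, str]                     # parameter -> recommended value

# Hypothetical rules; thresholds and values are illustrative assumptions.
RULES: List[Rule] = [
    Rule("spill-heavy map tasks",
         lambda m: m.get("spilled_records", 0) > 2 * m.get("map_output_records", 1),
         {"mapreduce.task.io.sort.mb": "512"}),
    Rule("GC-bound executors",
         lambda m: m.get("gc_time_ratio", 0) > 0.15,
         {"spark.executor.memory": "8g"}),
]

def recommend(metrics: Dict[str, float]) -> Dict[str, str]:
    """Merge the suggestions of every rule whose condition fires."""
    out: Dict[str, str] = {}
    for rule in RULES:
        if rule.condition(metrics):
            out.update(rule.suggestion)
    return out
```

For example, a job history showing heavy record spilling and a high garbage-collection ratio would trigger both rules and yield two parameter recommendations at once; extending the tool to a new platform amounts to adding rules, which matches the pluggable design described above.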
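The fine-grained stage described in the last contribution can be sketched as a greedy hill climb: starting from the configuration produced by the earlier stages, vary one parameter at a time over its candidate values, keep a change only if a benchmark run improves, and stop when no single change helps. This is a minimal sketch of the general technique, assuming a `run_benchmark` callback that executes the workload and returns its running time; it is not XDataExplorer's actual interface.

```python
from typing import Callable, Dict, List

def hill_climb(config: Dict[str, int],
               candidates: Dict[str, List[int]],
               run_benchmark: Callable[[Dict[str, int]], float]) -> Dict[str, int]:
    """Greedy search minimizing benchmark time over per-parameter candidates."""
    best = dict(config)
    best_time = run_benchmark(best)
    improved = True
    while improved:
        improved = False
        for param, values in candidates.items():
            for v in values:
                if v == best[param]:
                    continue
                trial = dict(best, **{param: v})  # change one parameter
                t = run_benchmark(trial)
                if t < best_time:                 # keep strict improvements only
                    best, best_time = trial, t
                    improved = True
    return best
```

Because each iteration touches one parameter, the search converges quickly on the reduced parameter set left after the first two stages, at the cost of one benchmark run per trial; it can stop at a local optimum, which is the usual trade-off of hill climbing.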
The remainder of this paper is organized as follows. The next section provides an overview of relevant background information and related studies. Section 3 presents the architecture of the proposed tuning tool and explains its implementation in detail. Section 4 presents experiments that demonstrate the performance optimization achieved by the proposed tool, and Section 5 concludes.
Background and related work
In this section, we present relevant background information and related studies.
System design and implementation
In this section, we describe the design details of XDataExplorer. We present an overview of its architecture and then discuss its implementation in detail.
System evaluation
In this section, we present experimental evaluations of XDataExplorer. We use a 4-node cluster, where each node is equipped with two 16-core processors running at 2.4 GHz, 256 GB DDR4 memory, two 600 GB SAS disks, twelve 6 TB SATA disks, and dual 10-gigabit Ethernet cards. The nodes are connected to each other through a 10-gigabit switch.
XDataExplorer currently supports performance tuning for HDFS, MapReduce, Spark, Hive, and HBase. Users typically use industry benchmarks to evaluate the performance of
Conclusion
In this paper, we propose a three-stage comprehensive self-tuning tool for big data platforms. Through successive stages of gradual tuning, the number of parameters and their search ranges decrease progressively; thus, the proposed tool can quickly improve the productivity of a big data platform and make it perform more efficiently. In the first stage, system-level tuning, we use knowledge rules stored in an expert system to promptly adjust important parameters according to cluster deployment
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (32)
- et al., Significance and challenges of big data research, Big Data Res. (2015)
- et al., A methodology for spark parameter tuning, Big Data Res. (2018)
- et al., mrMoulder: a recommendation-based adaptive parameter tuning approach for big data processing platform, Future Gener. Comput. Syst. (2019)
- et al., A hybrid recommender system for recommending relevant movies using an expert system, Expert Syst. Appl. (2020)
- et al., Developing two heuristic algorithms with metaheuristic algorithms to improve solutions of optimization problems with soft and hard constraints: an application to nurse rostering problems, Appl. Soft Comput. (2020)
- et al., Big data and its technical challenges, Commun. ACM (2014)
- Privacy and security of big data: current challenges and future research perspectives
- et al., Survey on big data system and analytic technology, J. Softw. (2014)
- et al., MR-advisor: a comprehensive tuning, profiling, and prediction tool for MapReduce execution frameworks on HPC clusters, J. Parallel Distrib. Comput. (2018)
- et al., Big data platform configuration using machine learning, J. Inf. Sci. Eng. (2020)