A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms

doi:10.1016/j.bdr.2021.100206

Big Data Research

Volume 25, 15 July 2021, 100206

https://doi.org/10.1016/j.bdr.2021.100206

Abstract

Many research works address big data platforms with a view to data science and analytics. These are complex, usually distributed environments composed of several systems and tools. As expected, their performance issues deserve a closer look.

In this work, we review performance tuning strategies in the big data environment. We focus on data-driven tuning techniques, discussing the use of database-inspired approaches. For big data and NoSQL stores, performance tuning issues are quite different from those of so-called conventional systems. Many existing solutions are ad-hoc activities that do not fit multiple situations. However, some categories of data-driven solutions can be taken as guidelines and incorporated into general-purpose auto-tuning modules for big data systems.

We examine typical performance tuning actions, discussing available solutions that support some of the primary activities of the tuning process. We also discuss recent implementations of data-driven performance tuning solutions for big data platforms. We propose an initial classification based on the state of the art in the domain and present selected tuning actions for large-scale data processing systems. Finally, we organize existing works towards self-tuning big data systems according to this classification and present general and system-specific tuning recommendations. We found that most works in the literature evaluate tuning actions from the physical design perspective, and that there is a lack of self-tuning, machine-learning-based solutions for big data systems.

Introduction

A wide range of devices, including mobile phones, GPS devices, social networks, sensors, and IoT devices [1], [2], [3], [4] is generating a large volume of distributed and heterogeneous data. Transforming such a massive amount of data into valuable information while revealing its underlying meaning is a crucial function of big data analytics [5], [6].

New requirements in terms of analytics (e.g., real-time data management) contribute to a big data analytics environment composed of a myriad of solutions and distributed systems, commonly known as the big data stack [5]. Many of these solutions involve distributed processing of large volumes of data, replicated or moved over networks on the fly. Failures are fairly common, so runtime auto-recovery is required; it is commonly achieved by re-processing tasks and by materializing intermediate data [7], [8]. In such an environment, I/O and data movement may impose significant costs and overheads. Capacity planning may become a problem and lead to increasing costs, especially in applications that must meet a Service Level Agreement (SLA) specified in terms of performance metrics [5]. Also, there is no optimizer to build the most efficient execution plan for each user's job, which increases the importance of carefully tuning system performance.

Automatic tuning and real-time analytics tuning are among the main open challenges for big data systems [9]. In big data analytics environments, the main tuning opportunities include the reorganization of data layouts (e.g., applying partitioning) and the efficient use of materialization, mainly to persist intermediate results [10]. These opportunities are related to some of the most common performance tuning actions taken in traditional database systems, which are data-driven decisions, mostly oriented to reducing I/O operations and improving memory consumption. Some of the aspects that affect the performance of traditional database systems also affect large-scale big data processing systems. For instance, query (job) size, data skew, and temporal locality in data access affect the performance of both database systems and MapReduce-based systems [11].
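As an illustration of why data skew matters for a partitioned layout, the following minimal Python sketch counts how many records fall into each partition under simple hash partitioning and reports the ratio between the largest partition and the mean. The function names, the key distribution, and the use of CRC32 as the hash are assumptions made for this example only; they are not taken from any of the surveyed systems.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count the records assigned to each partition by hash partitioning."""
    sizes = Counter()
    for key in keys:
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, so the assignment is reproducible.
        sizes[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical workload in which a single "hot" key dominates the data set.
keys = ["user_hot"] * 900 + [f"user_{i}" for i in range(100)]
sizes = partition_sizes(keys, 4)
print(sizes, round(skew_ratio(sizes), 2))
```

Because every record with the same key lands in the same partition, the partition holding the hot key receives at least 900 of the 1000 records, and the task processing it is likely to dominate the job's completion time regardless of how many partitions are added.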

In this paper, we review the use of performance tuning techniques in the big data stack. We focus on data-driven performance tuning, discussing works related to I/O optimization, data partitioning and materialization, file placement and data transfer, caching, and memory consumption. We describe traditional tuning activities and how they have evolved in NoSQL solutions. We describe proposals on data-driven performance tuning for big data systems and warehouses, mostly in solutions based on the MapReduce paradigm and distributed systems. We present selected tuning actions that would benefit data analysts. We also present big data benchmarking tools, self-tuning solutions, and performance monitoring tools and advisors, and discuss some future research opportunities.

To the best of our knowledge, no prior work presents an organized review of tuning techniques and works in the context of big data systems and analytics from such a data-driven perspective.

In the following section, we describe some background on big data systems. We also discuss related big data benchmarks and performance comparison studies. Section 3 describes the main performance tuning actions, discussing available solutions that support some of the main activities of the tuning process in relational and NoSQL stores, based on our data-driven performance tuning classification. Section 4 reviews data-driven performance (self-)tuning in large-scale processing systems. In Section 5, we classify existing works, extract general and system-specific tuning recommendations, and discuss research opportunities. Finally, Section 6 concludes the paper and presents future work.

Section snippets

Big Data and analytics: concepts, uses and benchmarks

In this section, we review some concepts and background on big data and big data analytics, including common systems and workload types, application areas, benchmark tools and performance comparison studies.

Database systems tuning overview

Performance is crucial for database systems. It is the main optimization goal of the tuning activity, and it is affected by several aspects, e.g., the amount of data or replication and consistency requirements [91].

Database performance tuning is the process of making changes and adjustments at the database level to improve the performance of database applications.

The process of database tuning can be done continuously in a feedback loop style (Fig. 5). It is an activity where we need to:

• monitor the

Tuning in large-scale data processing systems

Large-scale data processing systems have hundreds of configurable parameters, and several of them may affect performance [9]. Having a reasonable set of parameters that can help tune the application execution to a particular context may seem desirable, but configuring too many settings to achieve the best configuration in terms of throughput often turns out to be a challenging and time-consuming task [114].
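To make the size of the problem concrete, the sketch below computes how many configurations a small parameter space contains and runs a plain random search over it. Everything here is assumed for illustration: the parameter names and values are hypothetical rather than settings of a real system, `synthetic_cost` is a stand-in for actually running a job and measuring it, and random search is used only as a simple baseline, not as a technique proposed in the surveyed works.

```python
import math
import random

# Hypothetical parameter space; names and values are illustrative only.
SPACE = {
    "num_reducers": [8, 16, 32, 64, 128],
    "sort_buffer_mb": [64, 128, 256, 512],
    "compress_map_output": [False, True],
    "block_size_mb": [64, 128, 256],
}


def space_size(space):
    """Number of distinct configurations in a discrete parameter space."""
    return math.prod(len(values) for values in space.values())


def synthetic_cost(cfg):
    """Stand-in for executing a job and measuring its runtime (assumption)."""
    cost = 1000.0 / cfg["num_reducers"] + 2.0 * cfg["num_reducers"]
    cost += 5000.0 / cfg["sort_buffer_mb"]
    cost += 0.0 if cfg["compress_map_output"] else 20.0
    cost += abs(cfg["block_size_mb"] - 128) / 8.0
    return cost


def random_search(space, cost_fn, trials, seed=0):
    """Sample configurations uniformly and keep the cheapest one seen."""
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        cost = cost_fn(cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


print("configurations:", space_size(SPACE))
print("best found:", random_search(SPACE, synthetic_cost, trials=30))
```

Even these four parameters yield 120 combinations, and the count multiplies with every parameter added. When each evaluation requires running a real job, exhaustive search quickly becomes impractical, which is one reason manual configuration is so time-consuming.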

Herodotou et al. [10] define three categories of optimization opportunities for big data

Recommendations and opportunities

Performance tuning is a crucial activity, especially in large data-bound systems, where I/O operations and data movement impose relevant costs.

In Section 4, we reviewed the state-of-the-art on data-driven performance tuning in large-scale processing systems. Such works can be categorized in the same way we did in Section 3 to classify tuning actions in traditional database systems, as represented in Fig. 7.

Most of the works in the literature evaluate the use of tuning actions at the physical

Conclusions

Big data has radically transformed the traditional analytics environment. The big data analytics environment is composed of several heterogeneous systems and commonly distributed solutions. Mixed workloads, produced by batch and stream processing, and near-real-time analytics, have emerged as a real necessity in many contexts.

Such complex environments usually deal with a considerable volume of data. As failures are somewhat common, replication and materialization are typically used to allow

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partially funded by National Funds through the FCT (Foundation for Science and Technology) in the context of the projects UIDB/04524/2020 and UIDB/00127/2020, and by Fundo Europeu de Desenvolvimento Regional (FEDER), Programa Operacional Competitividade e Internacionalização in the context of the projects POCI-01-0145-FEDER-032636 and Produtech II SIF – POCI-01-0247-FEDER-024541. Some of the authors are partially supported by grants from CNPq and CAPES, Brazilian public funding

References (140)

U. Sivarajah et al., Critical analysis of Big Data challenges and analytical methods, J. Bus. Res. (2017)

A.N. Navaz et al., Towards an efficient and energy-aware mobile big health data architecture, Comput. Methods Programs Biomed. (2018)

X. Jin et al., Significance and challenges of Big Data research, Big Data Res. (2015)

A. Corbellini et al., Persisting big-data: the NoSQL landscape, Inf. Syst. (2017)

A.T. Kabakus et al., A performance evaluation of in-memory databases, J. King Saud Univ, Comput. Inf. Sci. (2017)

C. Li et al., Flutedb: an efficient and scalable in-memory time series database for sensor-cloud, J. Parallel Distrib. Comput. (2018)

M. Aldinucci et al., Data stream processing in HPC systems: new frameworks and architectures for high-frequency streaming, Parallel Comput. (2020)

C. Barba-González et al., On the design of a framework integrating an optimization engine with streaming technologies, Future Gener. Comput. Syst. (2020)

S. Bergamaschi et al., Bigbench workload executed by using apache flink, Proc. Manuf. (2017)

V. Persico et al., Benchmarking big data architectures for social networks data processing using public cloud platforms, Future Gener. Comput. Syst. (2018)

D. Abadi et al., The Seattle report on database research, SIGMOD Rec. (2020)

J. Moorthy et al., Big Data: prospects and challenges, Vikalpa (2015)

Y. Wang et al., Modeling and building iot data platforms with actor-oriented databases

A. Arvanitis et al., Automated performance management for the big data stack

A. Rasmussen et al., Themis: an I/O-efficient MapReduce

H. Zhang et al., Riffle: optimized Shuffle service for large-scale data

J. Lu et al., Speedup your analytics: automatic parameter tuning for databases and big data systems, Proc. VLDB Endow. (2018)

H. Herodotou et al., Starfish: a self-tuning system for Big Data analytics

Y. Chen et al., Interactive analytical processing in big data systems, Proc. VLDB Endow. (2012)

T. Shah et al., Investigating an ontology-based approach for Big Data analysis of inter-dependent medical and oral health conditions, Clust. Comput. (2015)

Y. Riahi et al., Big Data and Big Data analytics: concepts, types and technologies, Int. J. Res. Eng. (2018)

E.G. Ularu et al., Perspectives on Big Data and Big Data analytics, Database Syst. J. (2012)

F. Özcan et al., Hybrid transactional/analytical processing: a survey

D. Abadi et al., Beckman report on database research, Commun. ACM (2016)

The Apache Software Foundation

A. Thusoo et al., Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. (2009)

M. Kornacker et al., Impala: a modern, open-source SQL engine for Hadoop

M. Armbrust et al., Spark SQL: relational data processing in spark

R. Cattell, Scalable SQL and NoSQL data stores, SIGMOD Rec. (2010)

B.G. Tudorica et al., A comparison between several NoSQL databases with comments and notes

R. Hecht et al., Nosql evaluation: a use case oriented survey
