Elsevier

Big Data Research

Volume 25, 15 July 2021, 100206

A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms

https://doi.org/10.1016/j.bdr.2021.100206

Abstract

Many research works address big data platforms built to support data science and analytics. These are complex, usually distributed environments composed of several systems and tools. As expected, their performance issues deserve a closer look.

In this work, we review performance tuning strategies in the big data environment. We focus on data-driven tuning techniques and discuss the use of database-inspired approaches. For big data and NoSQL stores, performance tuning issues differ considerably from those of so-called conventional systems. Many existing solutions are ad hoc and do not generalize to multiple situations. However, some categories of data-driven solutions can be taken as guidelines and incorporated into general-purpose auto-tuning modules for big data systems.

We examine typical performance tuning actions and discuss available solutions that support some of the primary activities of the tuning process. We also discuss recent implementations of data-driven performance tuning solutions for big data platforms. We propose an initial classification based on the state of the art in the domain and present selected tuning actions for large-scale data processing systems. Finally, we organize existing works towards self-tuning big data systems based on this classification and present general and system-specific tuning recommendations. We found that most works in the literature evaluate tuning actions from the physical design perspective, and that there is a lack of self-tuning machine-learning-based solutions for big data systems.

Introduction

A wide range of sources, including mobile phones, GPS devices, social networks, sensors, and IoT devices [1], [2], [3], [4], is generating a large volume of distributed and heterogeneous data. Transforming such a massive amount of data into valuable information while revealing its underlying meaning is a crucial function of big data analytics [5], [6].

New requirements in terms of analytics (e.g., real-time data management) contribute to a big data analytics environment composed of a myriad of solutions and distributed systems, commonly known as the big data stack [5]. Many of these solutions involve distributed processing of large volumes of data, replicated or moved over networks on the fly. Failures are somewhat common, so runtime auto-recovery is required; it is commonly achieved by re-processing tasks and materializing intermediate data [7], [8]. In such an environment, I/O and data movement may impose significant costs and overheads. Capacity planning may become a problem and lead to increasing costs, especially in applications that must meet some type of Service Level Agreement (SLA) specified in terms of performance metrics [5]. Also, there is no optimizer to build the most efficient execution plan for each user's job, which increases the importance of carefully tuning system performance.

Automatic tuning and real-time analytics tuning are among the main open challenges for big data systems [9]. In big data analytics environments, the main tuning opportunities include the reorganization of data layouts (e.g., applying partitioning) and the efficient use of materialization, mainly to persist intermediate results [10]. These opportunities are related to some of the most common performance tuning actions taken in traditional database systems, which are data-driven decisions mostly aimed at reducing I/O operations and improving memory usage. Some of the aspects that affect the performance of traditional database systems also affect large-scale big data processing systems. For instance, query (job) size, data skew, and temporal locality in data access affect the performance of both database systems and MapReduce-based systems [11].
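As a minimal illustration of why data skew matters when choosing a partitioning scheme, the following Python sketch measures how unevenly records are spread across partitions. It is not taken from any surveyed system: the modulo partitioner, the function names `partition_sizes` and `skew_ratio`, and the synthetic key sets are all assumptions made for this example.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count the records assigned to each partition by key modulo num_partitions."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition size divided by the mean size (1.0 means perfectly balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical workloads: 1000 distinct integer keys, then the same keys
# plus one "hot" key (7) that appears 1000 additional times.
uniform_keys = list(range(1000))
skewed_keys = uniform_keys + [7] * 1000

uniform_sizes = partition_sizes(uniform_keys, 4)  # [250, 250, 250, 250]
skewed_sizes = partition_sizes(skewed_keys, 4)    # [250, 250, 250, 1250]

print(skew_ratio(uniform_sizes))  # 1.0
print(skew_ratio(skewed_sizes))   # 2.5
```

In this toy setting, the task that processes the overloaded partition handles 2.5 times the average load, which is the kind of imbalance that a data-driven repartitioning action would try to remove.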

In this paper, we review the use of performance tuning techniques in the big data stack. We focus on data-driven performance tuning, discussing works related to I/O optimization, data partitioning and materialization, file placement and data transfer, caching, and memory consumption. We describe traditional tuning activities and show how they have evolved in NoSQL solutions. We describe proposals on data-driven performance tuning for big data systems and warehouses, mostly in solutions based on the MapReduce paradigm and distributed systems. We present selected tuning actions that would benefit data analysts. We also present big data benchmarking tools, self-tuning solutions, and performance monitoring tools and advisors, and discuss some future research opportunities.

To the best of our knowledge, no prior work presents an organized review of tuning techniques and works in the context of big data systems and analytics from such a data-driven perspective.

In the following section, we give some background on big data systems. We also discuss related big data benchmarks and performance comparison studies. Section 3 describes the main performance tuning actions and discusses available solutions that support some of the main activities of the tuning process in relational and NoSQL stores, based on our data-driven performance tuning classification. Section 4 reviews data-driven performance (self-)tuning in large-scale processing systems. In Section 5, we classify existing works, extract general and system-specific tuning recommendations, and discuss research opportunities. Finally, Section 6 concludes the paper and presents future work.

Section snippets

Big Data and analytics: concepts, uses and benchmarks

In this section, we review some concepts and background on big data and big data analytics, including common systems and workload types, application areas, benchmark tools and performance comparison studies.

Database systems tuning overview

Performance is crucial for database systems. It is the main optimization goal of the tuning activity, and it is affected by several aspects, e.g., the amount of data or replication and consistency requirements [91].

Database performance tuning is the process of making changes and adjustments at the database level to improve the performance of database applications.

The process of database tuning can be done continuously in a feedback loop style (Fig. 5). It is an activity where we need to:

  • monitor the

Tuning in large-scale data processing systems

Large-scale data processing systems have hundreds of configurable parameters, and several of them may affect performance [9]. Having a reasonable set of parameters that can help tune the application execution to a particular context may seem desirable, but configuring too many settings to achieve the best configuration in terms of throughput often turns out to be a challenging and time-consuming task [114].
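To make the size of the configuration problem concrete, the sketch below compares exhaustive and random search over a deliberately tiny parameter space. Everything in it is hypothetical: the parameter names and candidate values are not those of any real system, and `synthetic_cost` is a stand-in for an actual job runtime measurement, which a real tuner would obtain by executing the workload.

```python
import itertools
import math
import random

# Hypothetical parameter space; names and values are illustrative only.
PARAM_SPACE = {
    "num_reducers": [8, 16, 32, 64],
    "sort_buffer_mb": [64, 128, 256],
    "compress_intermediate": [False, True],
}


def space_size(space):
    """Number of distinct configurations (product of the candidate counts)."""
    return math.prod(len(values) for values in space.values())


def synthetic_cost(config):
    """Stand-in for a measured runtime; lower is better."""
    cost = 100.0 / config["num_reducers"] + 0.5 * config["num_reducers"]
    cost += 32.0 / config["sort_buffer_mb"]
    cost += 0.0 if config["compress_intermediate"] else 2.0
    return cost


def exhaustive_search(space, cost_fn):
    """Evaluate every configuration; feasible only for very small spaces."""
    names = list(space)
    configs = (dict(zip(names, combo)) for combo in itertools.product(*space.values()))
    best = min(configs, key=cost_fn)
    return best, cost_fn(best)


def random_search(space, cost_fn, trials, seed=0):
    """Evaluate a fixed budget of randomly sampled configurations."""
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        cost = cost_fn(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


print(space_size(PARAM_SPACE))  # 24
print(exhaustive_search(PARAM_SPACE, synthetic_cost))
print(random_search(PARAM_SPACE, synthetic_cost, trials=8))
```

Three parameters already yield 24 configurations. Since the space grows multiplicatively with each added parameter, exhaustive evaluation quickly becomes impractical for systems with hundreds of parameters, which is one reason budgeted or model-based search is attractive.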

Herodotou et al. [10] define three categories of optimization opportunities for big data

Recommendations and opportunities

Performance tuning is a crucial activity, especially in large data-bound systems, where I/O operations and data movement impose relevant costs.

In Section 4, we reviewed the state-of-the-art on data-driven performance tuning in large-scale processing systems. Such works can be categorized in the same way we did in Section 3 to classify tuning actions in traditional database systems, as represented in Fig. 7.

Most of the works in the literature evaluate the use of tuning actions at the physical

Conclusions

Big data has radically transformed the traditional analytics environment. The big data analytics environment is composed of several heterogeneous systems and commonly distributed solutions. Mixed workloads, produced by batch and stream processing, and near-real-time analytics, have emerged as a real necessity in many contexts.

Such complex environments usually deal with a considerable volume of data. As failures are somewhat common, replication and materialization are typically used to allow

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partially funded by National Funds through the FCT (Foundation for Science and Technology) in the context of the projects UIDB/04524/2020 and UIDB/00127/2020, and by Fundo Europeu de Desenvolvimento Regional (FEDER), Programa Operacional Competitividade e Internacionalização in the context of the projects POCI-01-0145-FEDER-032636 and Produtech II SIFPOCI-01-0247-FEDER-024541. Some of the authors are partially supported by grants from CNPq and CAPES, Brazilian public funding

References (140)

  • E. Psomakelis et al., Context agnostic trajectory prediction based on λ-architecture, Future Gener. Comput. Syst. (2020)
  • V. Persico et al., Benchmarking big data architectures for social networks data processing using public cloud platforms, Future Gener. Comput. Syst. (2018)
  • S. Wolfert et al., Big Data in smart farming – a review, Agric. Syst. (2017)
  • K.P. Subbu et al., Big Data for context aware computing – perspectives and challenges, Big Data Res. (2017)
  • S. Wang et al., An integrated GIS platform architecture for spatiotemporal big data, Future Gener. Comput. Syst. (2019)
  • F. Ullah et al., Architectural tactics for Big Data cybersecurity analytics systems: a review, J. Syst. Softw. (2019)
  • W. Li et al., PIM-WEAVER: a high energy-efficient, general-purpose acceleration architecture for string operations in Big Data processing, Sustain. Comput. Inf. Sci. (2019)
  • M. Lnenicka et al., Developing a government enterprise architecture framework to support the requirements of big and open linked data with the use of cloud computing, Int. J. Inf. Manag. (2019)
  • Y. Zhang et al., A big data analytics architecture for cleaner manufacturing and maintenance processes of complex products, J. Clean. Prod. (2017)
  • M. Fahmideh et al., Big data analytics architecture design—an application in manufacturing systems, Comput. Ind. Eng. (2019)
  • D.U. Pfeiffer et al., Spatial and temporal epidemiological analysis in the Big Data era, Prev. Vet. Med. (2015)
  • N. Spangenberg et al., A Big Data architecture for intra-surgical remaining time predictions, Proc. Comput. Sci. (2017)
  • G. Manogaran et al., A new architecture of Internet of things and big data ecosystem for secured smart healthcare monitoring and alerting system, Future Gener. Comput. Syst. (2018)
  • S. Sakr et al., Towards a comprehensive data analytics framework for smart healthcare services, Big Data Res. (2016)
  • N.A. Ghani et al., Social media big data analytics: a survey, Comput. Hum. Behav. (2019)
  • A. Neilson et al., Systematic review of the literature on Big Data in the transportation domain: concepts and applications, Big Data Res. (2019)
  • M. Balduini et al., Models and practices in Urban data science at scale, Big Data Res. (2019)
  • B.N. Silva et al., Integration of Big Data analytics embedded smart city architecture with RESTful web of things for efficient service provision and energy management, Future Gener. Comput. Syst. (2020)
  • D. Abadi et al., The Seattle report on database research, SIGMOD Rec. (2020)
  • J. Moorthy et al., Big Data: prospects and challenges, Vikalpa (2015)
  • Y. Wang et al., Modeling and building IoT data platforms with actor-oriented databases
  • A. Arvanitis et al., Automated performance management for the big data stack
  • A. Rasmussen et al., Themis: an I/O-efficient MapReduce
  • H. Zhang et al., Riffle: optimized Shuffle service for large-scale data
  • J. Lu et al., Speedup your analytics: automatic parameter tuning for databases and big data systems, Proc. VLDB Endow. (2018)
  • H. Herodotou et al., Starfish: a self-tuning system for Big Data analytics
  • Y. Chen et al., Interactive analytical processing in big data systems, Proc. VLDB Endow. (2012)
  • T. Shah et al., Investigating an ontology-based approach for Big Data analysis of inter-dependent medical and oral health conditions, Clust. Comput. (2015)
  • Y. Riahi et al., Big Data and Big Data analytics: concepts, types and technologies, Int. J. Res. Eng. (2018)
  • E.G. Ularu et al., Perspectives on Big Data and Big Data analytics, Database Syst. J. (2012)
  • F. Özcan et al., Hybrid transactional/analytical processing: a survey
  • D. Abadi et al., Beckman report on database research, Commun. ACM (2016)
  • The Apache Software Foundation
  • A. Thusoo et al., Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. (2009)
  • M. Kornacker et al., Impala: a modern, open-source SQL engine for Hadoop
  • M. Armbrust et al., Spark SQL: relational data processing in Spark
  • R. Cattell, Scalable SQL and NoSQL data stores, SIGMOD Rec. (2010)
  • B.G. Tudorica et al., A comparison between several NoSQL databases with comments and notes
  • R. Hecht et al., NoSQL evaluation: a use case oriented survey
