Elsevier

Future Generation Computer Systems

Volume 86, September 2018, Pages 1351-1367

Energy-efficient Hadoop for big data analytics and computing: A systematic review and research insights

https://doi.org/10.1016/j.future.2017.11.010

Highlights

  • Presents new viewpoints and insights on improving the energy efficiency of Hadoop.

  • Presents valuable and feasible solutions for improving the energy efficiency of Hadoop.

  • Proposes five categories of approaches for optimizing the energy efficiency of Hadoop.

  • Presents instructive research insights into future research directions for energy-efficient Hadoop.

Abstract

As the demand for big data analytics keeps growing rapidly in scientific applications and online services, MapReduce and its open-source implementation Hadoop have gained popularity in both academia and industry. Hadoop provides a highly feasible solution for building big data analytics platforms. However, Hadoop also exhibits shortcomings in many aspects, including data management, resource management and scheduling policies. These issues usually cause high energy consumption when running MapReduce jobs in Hadoop clusters. In this paper, we review the studies on improving the energy efficiency of Hadoop clusters and summarize them in five categories: energy-aware cluster node management, energy-aware data management, energy-aware resource allocation, energy-aware task scheduling and other energy-saving schemes. For each category, we briefly illustrate its rationale and comparatively analyze the relevant works with regard to their advantages and limitations. Moreover, we present our insights and identify possible research directions, including energy-efficient cluster partitioning, data-oriented resource classification and provisioning, resource provisioning based on optimal utilization, energy-efficiency (EE) and locality-aware task scheduling, optimizing job profiling with machine learning, elastic power-saving Hadoop with containerization and efficient big data analytics on Hadoop. On the one hand, the summary of studies on energy-efficient Hadoop presented in this paper provides useful guidance for developers and users to better utilize Hadoop. On the other hand, the insights and research trends discussed in this work may inspire further research on improving the energy efficiency of Hadoop in big data analytics.

Introduction

MapReduce [1], a parallel computing paradigm proposed by Google, has gained wide adoption due to its features including fault tolerance, high scalability and simplicity in programming. With the ever-growing demand for analyzing large-scale datasets, MapReduce has become a mainstream programming model for big data analytics applications [2], [3]. The most popular open-source implementation of MapReduce is Hadoop [4], which was originally developed by Yahoo! and has now been widely deployed on production clusters for commercial purposes [5]. Hadoop hides the complexity of the underlying hardware systems and provides high-level programming interfaces, which are designed for processing large-scale datasets following the MapReduce paradigm [6]. In addition, it transparently provides applications with scalability and reliable data storage on a cluster. Due to these features, Hadoop has gained wide adoption in many fields of research such as bioinformatics, social networks, healthcare and business intelligence.

No paradigm or framework is perfect. Hadoop provides a highly feasible solution for distributed computing, but it also exposes a number of shortcomings in performance and energy efficiency [7], [8], [9]. For example, Zaharia et al. [7] pointed out that the default scheduler of Hadoop makes the unrealistic assumptions that different worker nodes have the same performance and throughput and that launching speculative tasks in idle slots will not induce extra time consumption. Generally, these assumptions do not hold in heterogeneous data centers, which may cause the scheduler’s decisions to be sub-optimal in terms of job completion time and energy consumption. Since the adoption of Hadoop is extending from batch jobs (e.g., offline log analysis) to streaming data processing and ad hoc data queries, people are paying more attention to the defects of Hadoop. These defects are exposed especially when the system has to run a large proportion of short jobs. Chen et al. [8] developed a MapReduce-oriented benchmark suite and carried out experiments with a one-day-long Hadoop workload synthesized from Facebook traces. From the benchmarking results they found that the FIFO scheduler could cause many jobs to fail when long jobs were submitted constantly. In fact, the scheduler is not the only implementation that needs to be improved in Hadoop. For example, the default task scheduling implementation of Hadoop ignores the performance and workload of servers, which may vary widely in a heterogeneous cluster. A variety of issues have been raised, and most of them are associated with the inefficiency of Hadoop in node management, data management, resource management and task scheduling.

Previous studies mainly focus on improving the performance of Hadoop clusters by shortening job execution time [10], [11], [12], balancing task execution progress [13], [14], reducing the impact of task failures [15], [16], [17], etc. In recent years, however, the large amount of energy consumed by data centers has emerged as a prominent issue [18], [19]. This has made energy saving a topic of interest for MapReduce applications [9]. For example, Yang et al. [20] built a high-performance computing/storage platform using Hadoop for big data processing. They proposed collecting real-time power data of servers via wireless power sensors. The resulting fine-grained monitoring system can help control the cluster’s power consumption and can be integrated with warning and prediction modules. Due to the popularity and open-source nature of Hadoop, a large number of works address reducing the energy consumption of Hadoop clusters. The study of [21] briefly divides them into five categories. However, it does not illustrate their methodologies and implementations in detail. Rao and Reddy [22] analyze different types of Hadoop schedulers, including the built-in FIFO scheduler, Fair scheduler and Capacity scheduler. Improved schemes such as the Delay scheduler, Dynamic Priority scheduler and Resource Aware scheduler are also reviewed in their work. However, these schedulers are not compared directly in terms of advantages and shortcomings, and energy-aware schedulers are not included. The study of [23] categorizes the research on scheduling into two groups: cluster-based scheduling and resource-based scheduling. The authors also compare several energy-aware schedulers (e.g., [24], [25], [26], [27]) in the paper, but their research is limited to task scheduling. In fact, a variety of methods and techniques are available for improving Hadoop’s energy efficiency. Hameed et al. [28] present a practical taxonomy of energy-saving techniques in cloud environments and compare them from the perspectives of resource adaption strategy, target function, allocation and migration policy. Their result provides theoretical guidance for reducing energy consumption in generic cloud systems but not in a specific framework such as Hadoop. In this paper, we summarize mainstream energy-saving schemes and strategies specifically focusing on Hadoop/MapReduce. In each category, we review the state-of-the-art studies and compare them in terms of their applicable situations, advantages and limitations.

In this paper, the studies on optimizing Hadoop energy efficiency are divided into five categories:

Energy-efficient worker node management. This category surveys the studies on saving energy by dynamically scaling the cluster size (number of workers) and the CPU frequency of the servers.

Energy-efficient data management. This category focuses on data distribution on HDFS. For example, the cost of data transfer can be reduced (i.e., better data locality can be achieved) through well-designed placement strategies for data replicas and migration schemes for data blocks between Datanodes.

Energy-efficient resource allocation. The scheduler determines the resource share of every job and dynamically reorders the job queue in order to achieve energy saving at the system level.

Energy-efficient task scheduling. A plan of task scheduling is made after comprehensively considering the factors such as data locality, server performance and Service Level Agreement (SLA).

Other energy-saving schemes. Apart from the categories mentioned above, we also introduce some other energy-saving schemes such as data sampling, file merging and using renewable energy.

For each category, we first explain its basic rationale and introduce every relevant work in detail. Most of these studies present valuable and feasible solutions for improving the energy efficiency of Hadoop. Moreover, we compare them and list their pros and cons, which is useful when trying to apply them in real environments. More importantly, we present instructive research insights in the discussion of future research directions, including energy-efficient cluster partitioning, data-oriented resource classification and provisioning, resource provisioning based on optimal utilization, energy-efficiency (EE) and locality-aware task scheduling, optimizing job profiling with machine learning, elastic power-saving Hadoop with containerization and efficient big data analytics on Hadoop. More and more scientific and service applications are directly or indirectly deployed on Hadoop because of its great potential in big data analytics. Meanwhile, optimizing energy consumption has become a major trend and a topic of interest. In this paper, we systematically review the studies on improving Hadoop’s energy efficiency, which offers useful guidance to users and developers for better utilization of Hadoop. We further discuss some possible improvements and research directions in order to provide instructive insights for research on developing energy-aware Hadoop systems.

The rest of the paper is organized as follows. Section 2 is about the background knowledge of Hadoop. Section 3 briefly introduces the application of Hadoop for big data analytics in different research fields. In Section 4, we review the relevant studies on optimizing the energy efficiency of Hadoop by organizing them into five categories. In Section 5, we discuss the directions for future research. Finally, we conclude the paper in Section 6.

Section snippets

MapReduce

MapReduce [1] is a parallel programming framework proposed by Google and designed for data processing in distributed environments. A MapReduce job is mainly composed of two phases: Map and Reduce. Initially, the input dataset of the job is split into several blocks, each of which corresponds to a single Map task. Typically, each Map task processes a data block and produces a set of intermediate key/value pairs. The Map phase is followed by Shuffle, in which intermediate outputs are
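The Map, Shuffle and Reduce steps named in this snippet can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) and the word-count workload are assumptions chosen for illustration, and a real cluster would run the Map and Reduce tasks in parallel on different nodes.

```python
from collections import defaultdict


def map_task(block):
    """Map: emit one intermediate (word, 1) pair per word in an input block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(intermediate_pairs):
    """Shuffle: group all intermediate values that share the same key."""
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: aggregate the grouped values of one key into a final result."""
    return key, sum(values)


def run_job(blocks):
    """Run one Map task per block, then shuffle and reduce sequentially."""
    intermediate = []
    for block in blocks:  # one hypothetical Map task per input block
        intermediate.extend(map_task(block))
    grouped = shuffle(intermediate)
    return dict(reduce_task(key, values) for key, values in grouped.items())


# Two hypothetical input blocks standing in for blocks of a larger dataset.
blocks = ["hadoop runs mapreduce", "mapreduce runs on hadoop clusters"]
result = run_job(blocks)
print(result)
```

The sketch keeps each phase as a separate function so that the data flow from blocks to intermediate pairs to grouped keys stays visible.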

Big data analytics on Hadoop

As big data offers opportunities for scientists to generate, store, access and analyze massive amounts of experimental data in a fast and low-cost manner, the adoption of big data techniques is experiencing unprecedented growth in a variety of research fields. More and more scientific applications are developed following the MapReduce programming model and taking advantage of the powerful computing and storage capability of Hadoop. For example, bioinformatics systems usually need to provide

Energy-efficient worker node management

Dynamic node management is a common measure to control cluster power consumption in both homogeneous and heterogeneous environments. To dynamically manage worker nodes, a set of policies work at the hardware level, including shutting down servers, turning servers into low-power states and adjusting CPU performance dynamically (e.g., DFS and DVFS). Wirtz and Ge [24] studied through experiments how the number of active workers and dynamic CPU frequency scaling technique impact on the execution
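To make the trade-off behind frequency scaling concrete, the following sketch evaluates a textbook-style energy model for a CPU-bound task. It is not the model used by Wirtz and Ge [24]. It assumes that dynamic power grows with the cube of the frequency (which in turn assumes supply voltage scales roughly linearly with frequency), that static power is constant, and that run time is inversely proportional to frequency. All constants (`cycles_g`, `static_w`, `k`) are hypothetical.

```python
def task_energy(freq_ghz, cycles_g=100.0, static_w=40.0, k=5.0):
    """Energy in joules of a CPU-bound task run at a fixed frequency.

    Assumed model (hypothetical constants):
      runtime = cycles / frequency
      power   = static power + k * frequency**3
    """
    runtime_s = cycles_g / freq_ghz           # giga-cycles / GHz = seconds
    power_w = static_w + k * freq_ghz ** 3    # static + dynamic power in watts
    return power_w * runtime_s


# Hypothetical frequency steps offered by the processor.
freqs = [1.0, 1.5, 2.0, 2.5, 3.0]
energies = {f: task_energy(f) for f in freqs}
best_freq = min(energies, key=energies.get)

for f in freqs:
    print(f"{f:.1f} GHz -> {energies[f]:.1f} J")
print("lowest-energy step under this model:", best_freq, "GHz")
```

Under these assumptions, neither the lowest nor the highest frequency minimizes energy: running slowly stretches the time during which static power is drawn, while running fast inflates dynamic power. Real servers and workloads, especially I/O-bound MapReduce jobs, may behave differently.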

Research insights and future directions

As the ever-growing consumption of energy gradually raises people’s concern, how to improve the energy efficiency of Hadoop has become a topic of interest. Although much work has been done on optimizing Hadoop and the MapReduce framework, we still face great challenges as the heterogeneity of data centers becomes prevalent and the diversity of big data workloads keeps increasing. Thus, it is necessary to overcome the limitations of previous studies and find more effective

Conclusions

As the most popular open-source implementation of the MapReduce parallel framework, Hadoop provides a generic platform for processing large-scale datasets in a distributed environment. With the ever-growing demand for big data analytics, Hadoop has gained wide adoption in many fields of research and practice including bioinformatics, social networks and business intelligence. However, little consideration was given to energy efficiency in its original design, which usually causes overconsumption of

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Grant Nos. 61402183 and 61772205), National Science and Technology Ministry (Grant No. 2015BAK36B06), Guangdong Provincial Scientific and Technological Projects (Grant Nos. 2017B010126002, 2017A010101008, 2017A010101014, 2017B090901061, 2016A010101007, 2016B090918021 and 2014B010117001), Guangzhou Science and Technology Projects (Grant Nos. 201607010048 and 201604010040).

WenTai Wu is a master’s student in Computer Science at South China University of Technology. He received his bachelor’s degree in Computer Science from South China University of Technology in 2015. His research interests include distributed computing and cloud computing.

References (86)

  • Hashem I.A.T. et al., The role of big data in smart city, Int. J. Inf. Manag. (2016)
  • Larson D. et al., A review and future direction of agile, business intelligence, analytics and data science, Int. J. Inf. Manag. (2016)
  • Kim J. et al., iPACS: Power-aware covering sets for energy proportionality and performance in data parallel computing clusters, J. Parallel Distrib. Comput. (2014)
  • Maheshwari N. et al., Dynamic energy efficient data placement and cluster reconfiguration algorithm for MapReduce framework, Future Gener. Comput. Syst. (2012)
  • Feller E. et al., Performance and energy efficiency of big data applications in cloud environments: A Hadoop case study, J. Parallel Distrib. Comput. (2015)
  • Nghiem P.P. et al., Towards efficient resource provisioning in mapreduce, J. Parallel Distrib. Comput. (2016)
  • Wen Y.F., Energy-aware dynamical hosts and tasks assignment for cloud computing, J. Syst. Softw. (2016)
  • Chen Y.W. et al., Virtual hadoop: Mapreduce over docker containers with an auto-scaling mechanism for heterogeneous environments
  • Dean J. et al., MapReduce: Simplified data processing on large clusters, Commun. ACM (2008)
  • Ghazal A. et al., Bigbench: Towards an industry standard benchmark for big data analytics
  • Chang V., An overview, examples, and impacts offered by emerging services and analytics in cloud computing virtual reality, Neural Comput. Appl. (2015)
  • The Apache Hadoop Project....
  • Powered by Hadoop....
  • Jin H. et al., The mapreduce programming model and implementations, Cloud Comput.: Princ. Paradigms (2011)
  • Zaharia M. et al., Improving mapreduce performance in heterogeneous environments
  • Chen Y. et al., The case for evaluating mapreduce performance using workload suites
  • Zhu N. et al., Taming power peaks in mapreduce clusters
  • Y. Wang, J. Tan, W. Yu, L. Zhang, X. Meng, X. Li, Preemptive reduce task scheduling for fair and fast job completion,...
  • Ananthanarayanan G. et al., GRASS: Trimming stragglers in approximation analytics
  • Li Y. et al., A new speculative execution algorithm based on c4.5 decision tree for hadoop
  • Quiané-Ruiz J.A. et al., RAFTing mapreduce: Fast recovery on the raft
  • Dinu F. et al., Rcmp: Enabling efficient recomputation based failure resilience for big data analytics
  • Lee Y.C. et al., Energy efficient utilization of resources in cloud computing systems, J. Supercomput. (2012)
  • Yang C.T. et al., iGEMS: A Cloud Green Energy Management System in Data Center
  • Rao B.T. et al., Survey on improved scheduling in hadoop mapreduce in cloud environments, Int. J. Comput. Appl. (2012)
  • S. D’Souza, K. Chandrasekaran, Analysis of MapReduce scheduling and its improvements in cloud environment, in: IEEE...
  • Wirtz T. et al., Improving mapreduce energy efficiency for computation intensive workloads
  • Chen Y. et al., Energy efficiency for large-scale mapreduce workloads with significant interactive analysis
  • N. Yigitbasi, K. Datta, N. Jain, T. Willke, Energy efficient scheduling of mapreduce workloads on heterogeneous...
  • Hameed A. et al., A survey and taxonomy on energy efficient resource allocation techniques for cloud computing systems, Computing (2016)
  • Borthakur D., The hadoop distributed file system: Architecture and design, Hadoop Proj. Website (2007)
  • Chang V. et al., Cloud storage and bioinformatics in a private cloud deployment: Lessons for data intensive research
  • Nguyen T. et al., Cloudaligner: A fast and full-featured mapreduce based tool for sequence mapping, BMC Res. Notes (2011)
    WeiWei Lin is currently an associate professor in the School of Computer Science and Engineering, South China University of Technology. His research interests include distributed systems, cloud computing and big data.

    Ching-Hsien Hsu is a professor in the Department of Computer Science and Information Engineering at Chung Hua University, Taiwan. His research includes high performance computing, cloud computing, big data intelligence, parallel and distributed systems, ubiquitous/pervasive computing and intelligence. Dr. Hsu is an IEEE senior member.

    LiGang He is an associate professor in the Department of Computer Science, University of Warwick. His research mainly includes High Performance Computing, Cloud Computing and Parallel Processing.
