SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

https://doi.org/10.1016/j.jpdc.2013.10.003

Highlights

  • Analyzed and identified two critical limitations of the MapReduce execution mechanism.

  • Achieved first optimization by implementing new job setup/cleanup tasks.

  • Replaced heartbeat with an instant messaging mechanism to speed up task scheduling.

  • Conducted comprehensive benchmarks to evaluate stable performance improvements.

  • Passed a production test and integrated our work into Intel Distributed Hadoop.

Abstract

As a widely-used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high data throughput than on low job-execution latency. However, more and more big data applications developed with MapReduce now require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great practical significance and has attracted increasing attention from both academia and industry. Many efforts have been made to improve the performance of Hadoop at the job-scheduling or job-parameter-optimization level. In this paper, we explore an approach to improve the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism of the MapReduce framework, we reveal two critical limitations on job execution performance. We then propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost of the initialization and termination stages of the job; second, instead of relying on the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared to the standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort to optimize the execution mechanism inside the map/reduce tasks of a job. The advantage is that it can complement job-scheduling optimizations to further improve job execution performance.

Introduction

The MapReduce parallel computing framework  [7], proposed by Google in 2004, has become an effective and attractive solution for big data processing problems. Through simple programming interfaces with two functions, map and reduce, MapReduce significantly simplifies the design and implementation of many data-intensive applications in the real world. Moreover, MapReduce offers other benefits, including load balancing, elastic scalability, and fault tolerance, which makes it a widely adopted parallel computing framework. Hadoop  [1], an open-source implementation of MapReduce, has been widely used in industry and researched in academia.
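
To make the programming interface concrete, below is a minimal word-count example written against the standard Hadoop MapReduce Java API. It is purely illustrative (not code from the paper): the map function emits a (word, 1) pair for each word in its input line, and the reduce function sums the counts per word.

    // Minimal word-count sketch using the standard Hadoop MapReduce Java API.
    // Illustrative only; not code from the paper.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // map: emit (word, 1) for every word in an input line
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // reduce: sum the counts emitted for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }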

Both Google's and Hadoop's MapReduce frameworks have been widely recognized for their high throughput, elastic scalability, and fault tolerance. They focus more on these features than on job execution efficiency, which results in relatively poor performance when using Hadoop MapReduce to execute jobs, especially short jobs. The term 'short job' has already been used in related work  [30], [31]. There is currently no quantitative definition of a short job; the term usually refers to MapReduce jobs whose execution time ranges from seconds to a few minutes, as opposed to long MapReduce jobs that take hours. Facebook calls such jobs 'small jobs' in Corona  [24], its recently released optimized version of Hadoop.

Some studies show that short jobs make up a large portion of MapReduce jobs  [7], [5]. For example, the average execution time of MapReduce jobs at Google in September 2007 was 395 s  [7]. Response time matters most for short jobs in scenarios where users need answers quickly, such as queries or analyses of log data for debugging, monitoring, and business intelligence  [30]. In a pay-by-the-time environment like EC2, improving MapReduce performance directly saves monetary cost. Optimizing MapReduce execution time also prevents jobs from occupying system resources for too long, which benefits a cluster's overall health  [31].

Today there are a number of high-level query and data-analysis systems that provide services on top of MapReduce, such as Google's Sawzall  [17], Facebook's Hive  [22], and Yahoo!'s Pig  [16]. These systems execute users' requests by converting SQL-like queries into a series of MapReduce jobs that are usually short. Such high-level declarative languages can greatly simplify application development without hand-coded MapReduce programs  [21]. Thus, in practice, these systems play a more important role than hand-coded MapReduce programs. For example, more than 95% of Hadoop jobs at Facebook are not hand-coded but generated by Hive, and more than 90% of MapReduce jobs at Yahoo! are generated by Pig  [12]. These systems are very sensitive to the execution time of the underlying short MapReduce jobs, so reducing the execution time of MapReduce jobs is very important to these widely-used systems.

For the above reasons, in this paper we concentrate on improving the execution performance of short MapReduce jobs. Having studied the Hadoop MapReduce framework in great detail, we focus on the internal execution mechanisms of an individual job and the tasks inside it. Through in-depth analysis, we reveal two critical issues that limit the performance of MapReduce jobs. To address these issues, we design and implement SHadoop, an optimized version of Hadoop that is fully compatible with the standard Hadoop. Unlike work that improves performance at the job-scheduling or job-parameter-optimization level, we optimize the underlying execution mechanism of the tasks inside a job. In our implementation, first, we optimize the setup and cleanup tasks, two special tasks executed for every MapReduce job, to reduce the time cost of the initialization and termination stages of the job; second, we add an instant messaging communication mechanism to the standard Hadoop for fast delivery of performance-sensitive task scheduling and execution messages between the JobTracker and TaskTrackers. This way, the tasks of a job can be scheduled and executed instantly, without heartbeat delay; the job execution process becomes more compact, and the utilization of slots on the TaskTrackers is much improved. Experimental results show that SHadoop outperforms the standard Hadoop, achieving stable performance improvements of around 25% on average on comprehensive benchmarks. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop  [11]. To the best of our knowledge, this work is the first effort to optimize the execution mechanism inside map/reduce tasks. The advantage is that it can complement job-scheduling optimization work to further improve job execution performance.
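
To illustrate the difference between the two communication styles, consider the deliberately simplified, hypothetical Java sketch below (none of the class or method names come from SHadoop). Under the heartbeat mechanism, a pending task assignment is only picked up at the next periodic poll, while the instant-messaging mechanism lets the TaskTracker react the moment the JobTracker pushes the message.

    // Hypothetical sketch of heartbeat-based vs. instant task-assignment
    // delivery; all names are invented for illustration.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class InstantMessagingSketch {
      /** A task-assignment message from the JobTracker to a TaskTracker. */
      static final class TaskAssignment {
        final String taskId;
        TaskAssignment(String taskId) { this.taskId = taskId; }
      }

      /** Stand-in for one TaskTracker's inbox of scheduling messages. */
      static final class TaskTrackerInbox {
        private final BlockingQueue<TaskAssignment> queue =
            new LinkedBlockingQueue<>();

        // Standard Hadoop: assignments are only picked up when the next
        // periodic heartbeat (typically every few seconds) comes around.
        TaskAssignment pollOnNextHeartbeat(long heartbeatIntervalMillis)
            throws InterruptedException {
          Thread.sleep(heartbeatIntervalMillis); // wait out the interval
          return queue.poll();
        }

        // SHadoop-style idea: the JobTracker pushes the assignment and the
        // TaskTracker unblocks as soon as it arrives, with no fixed delay.
        void pushInstantly(TaskAssignment assignment) {
          queue.offer(assignment);
        }

        TaskAssignment awaitInstantly() throws InterruptedException {
          return queue.take(); // returns immediately once a message is pushed
        }
      }
    }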

The rest of this paper is organized as follows: Section  2 introduces related work on MapReduce performance optimization and compares it with SHadoop. Section  3 analyzes the job/task execution mechanism of the standard Hadoop MapReduce. Based on this analysis, Section  4 describes our optimization methods for improving job execution efficiency in the standard Hadoop MapReduce. Section  5 discusses experiments and performance evaluations of our optimization work. Finally, we conclude the paper in Section  6.

Section snippets

Related work analysis

Many studies have been done to improve the performance of the Hadoop MapReduce framework at different levels or from different aspects. They fall into several categories. The first focuses on designing scheduling algorithms that optimize the execution order of jobs or tasks more intelligently  [30], [29], [15], [8], [27], [9], [14], [25]. The second explores how to improve the efficiency of MapReduce with the aid of special hardware or supporting software  [31], [3], [28]. The third conducts specialized…

In-depth analysis of MapReduce job execution process

In this section, we first give a brief introduction to the Hadoop MapReduce framework. Then we focus on performing an in-depth analysis of the underlying execution mechanism and process of a MapReduce job and its tasks in Hadoop.

The Hadoop MapReduce framework, which is deployed on top of HDFS, consists of a JobTracker running on the master node and many TaskTrackers running on slave nodes. "Job" and "Task" are two important concepts in the MapReduce architecture. Usually, a MapReduce job contains a set of map tasks and reduce tasks…
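
For concreteness, a standard Hadoop 1.x job driver looks roughly like the sketch below, reusing the mapper and reducer from the word-count sketch in the introduction. This is ordinary public Hadoop API usage, not code from the paper; the comments mark where the job and task concepts appear.

    // A minimal Hadoop 1.x job driver; illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");               // one MapReduce job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class); // map tasks
        job.setReducerClass(WordCount.IntSumReducer.class);  // reduce tasks
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // The JobTracker splits the job into tasks and schedules them onto
        // TaskTracker slots across the slave nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }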

Optimization of MapReduce job and task execution mechanisms

Based on the above in-depth analysis of the execution mechanisms of a MapReduce job and its tasks, in this section we reveal two critical limitations on job execution performance in the standard Hadoop MapReduce framework. We then present our optimization work addressing these issues in more detail. The optimizations made in SHadoop aim at reducing the internal execution time cost of individual MapReduce jobs, especially short jobs, by optimizing the job and task execution mechanisms to improve slot utilization…
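
As a rough illustration of the first optimization's intent, the hypothetical sketch below (all names invented; the actual SHadoop mechanism may differ in detail) contrasts running setup/cleanup as ordinary scheduled tasks, which adds scheduling latency at both ends of every job, with performing the equivalent lightweight work inline.

    // Hypothetical sketch of the setup/cleanup optimization's general idea.
    import java.util.List;

    public class JobLifecycleSketch {
      interface Task { void run(); }

      // Stands in for dispatching a task to a TaskTracker slot, which in
      // standard Hadoop involves heartbeat round-trips and slot occupancy.
      static void schedule(Task t) { t.run(); }

      // Standard Hadoop: setup and cleanup are ordinary tasks, so every job
      // pays scheduling latency during initialization and termination.
      static void runJobStandard(Task setup, List<Task> mapReduceTasks,
                                 Task cleanup) {
        schedule(setup);
        mapReduceTasks.forEach(JobLifecycleSketch::schedule);
        schedule(cleanup);
      }

      // SHadoop-style idea: do the lightweight setup/cleanup work inline so
      // only the map and reduce tasks go through the scheduler.
      static void runJobOptimized(Runnable setupInline,
                                  List<Task> mapReduceTasks,
                                  Runnable cleanupInline) {
        setupInline.run();
        mapReduceTasks.forEach(JobLifecycleSketch::schedule);
        cleanupInline.run();
      }
    }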

Evaluation

In order to verify the effects of our optimizations, we conducted a series of experiments to evaluate and compare the performance of SHadoop with the standard Hadoop. First, we performed a number of experiments to evaluate the effect of each optimization measure separately. Second, in order to evaluate how much our optimizations benefit MapReduce jobs with different workloads, we adopted several Hadoop MapReduce benchmark suites to further evaluate SHadoop. Third, the widely-used big…

Conclusion and future work

MapReduce is a popular programming model and framework for processing large datasets. It has been widely used and recognized for its simple programming interfaces, fault tolerance, and elastic scalability. However, the job execution performance of MapReduce has received relatively little attention. In this paper, we explore an approach to optimize the job and task execution mechanism and present an optimized version of Hadoop, named SHadoop, to improve the execution performance of MapReduce jobs. SHadoop makes two major optimizations to the job and task execution mechanisms…

Acknowledgments

This work is funded in part by China NSF Grants (61223003) and the National High Technology Research and Development Program of China (863 Program) (2011AA01A202).


References (31)

  • Apache Hadoop....
  • S. Babu, Towards automatic optimization of MapReduce programs, in: Proceedings of the 1st ACM Symposium on Cloud...
  • Y. Becerra Fontal, V. Beltran Querol, D. Carrera, et al., Speeding up distributed MapReduce applications using...
  • Benchmarking and Stress Testing an Hadoop Cluster With TeraSort, TestDFSIO & Co....
  • Y. Chen, A. Ganapathi, R. Griffith, R. Katz, The case for evaluating MapReduce performance using workload suites, in:...
  • Danga Interactive, memcached,...
  • J. Dean, et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
  • M. Hammoud, M. Sak, Locality-aware reduce task scheduling for MapReduce, in: 3rd IEEE International Conference on Cloud...
  • C. He, Y. Lu, D. Swanson, Matchmaking: a new MapReduce scheduling technique, in: 3rd International Conference on Cloud...
  • S. Huang, J. Huang, J. Dai, T. Xie, B. Huang, The HiBench benchmark suite: characterization of the MapReduce-based data...
  • Intel Distributed Hadoop....
  • R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, X. Zhang, YSmart: yet another SQL-to-MapReduce translator, in: 31st...
  • B. Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy, A platform for scalable one-pass analytics using MapReduce, in:...
  • H. Mao, S. Hu, Z. Zhang, L. Xiao, L. Ruan, A load-driven task scheduler with adaptive DSC for MapReduce, in: 2011...
  • R. Nanduri, N. Maheshwari, A. Reddyraja, V. Varma, Job aware scheduling algorithm for MapReduce framework, in: 3rd IEEE...

    Rong Gu, received the B.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011. He is currently a Ph.D. candidate in computer science at Nanjing University, Nanjing, China. His research interests include parallel and distributed computing, cloud computing, and big data parallel processing.

    Xiaoliang Yang, received the B.S. degree in computer science from YanShan University, China, in 2008 and the Master's degree in computer science from Nanjing University, Nanjing, China, in 2012. He currently works at Baidu. His research interests include parallel and distributed computing and bioinformatics.

    Jinshuang Yan, received the B.S. degree in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. He is currently a Master's student in computer science at Nanjing University, Nanjing, China. His research interests include parallel computing and large-scale data analysis.

    Yuanhao Sun joined Intel in 2003. He was managing the big data product team in the Datacenter Software Division at Intel Asia-Pacific R&D Ltd., leading the efforts for Intel's distribution of Hadoop and related solutions and services. Yuanhao received his bachelor's and master's degrees from Nanjing University, both in computer science.

    Bin Wang received the B.S. degree in software engineering from Nanjing University. He is currently a Master's degree candidate in software engineering at Nanjing University and an intern at the Intel Asia-Pacific R&D Center. His research interests include distributed computing, large-scale data analysis, and data mining.

    Chunfeng Yuan is currently a professor in the Computer Science Department of Nanjing University, China. She received her bachelor's and master's degrees from Nanjing University, both in computer science. Her main research interests include computer system architecture, big data parallel processing, and Web information mining.

    Yihua Huang is currently a professor in the Computer Science Department of Nanjing University, China. He received his bachelor's, master's, and Ph.D. degrees from Nanjing University, all in computer science. His main research interests include parallel and distributed computing, big data parallel processing, and Web information mining.
