SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters
Introduction
The MapReduce parallel computing framework [7], proposed by Google in 2004, has become an effective and attractive solution for big data processing problems. Through simple programming interfaces with two functions, map and reduce, MapReduce significantly simplifies the design and implementation of many real-world data-intensive applications. Moreover, MapReduce offers other benefits, including load balancing, elastic scalability, and fault tolerance, which make it a widely adopted parallel computing framework. Hadoop [1], an open-source implementation of MapReduce, has been widely used in industry and studied in academia.
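The two-function model can be illustrated with a toy, single-process word count. This is a sketch of the semantics only: Hadoop itself is written in Java, and the helper names below (`run_mapreduce` and friends) are illustrative, not Hadoop APIs.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: sum the counts collected for one word
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # "shuffle": group intermediate pairs by key, as the framework would
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))
    output = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        for k, v in reduce_fn(key, (pair[1] for pair in group)):
            output[k] = v
    return output

counts = run_mapreduce(enumerate(["big data", "big clusters"]), map_fn, reduce_fn)
# counts == {"big": 2, "clusters": 1, "data": 1}
```

In a real Hadoop job, the sort-and-group shuffle step between the two functions is performed by the framework, not by user code.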
Both Google’s and Hadoop’s MapReduce frameworks have been widely recognized for their high throughput, elastic scalability, and fault tolerance. They focus more on these features than on job execution efficiency, which results in relatively poor performance when using Hadoop MapReduce to execute jobs, especially short jobs. The term ‘short job’ has already been used in some related work [30], [31], though there is currently no quantitative definition. Usually it refers to MapReduce jobs whose execution time ranges from seconds to a few minutes, as opposed to long MapReduce jobs that take hours. Facebook calls this type of job a ‘small job’ in Corona [24], its recently released optimized version of Hadoop.
Some studies show that short jobs compose a large portion of MapReduce jobs [7], [5]. For example, the average execution time of MapReduce jobs at Google in September 2007 was 395 s [7]. Response time matters most for short jobs in scenarios where users need answers quickly, such as queries or analysis over log data for debugging, monitoring, and business intelligence [30]. In a pay-by-the-time environment like EC2, improving MapReduce performance also saves monetary cost. Optimizing MapReduce’s execution time further prevents jobs from occupying system resources for too long, which is good for a cluster’s health [31].
Today a number of high-level query and data-analysis systems provide services on top of MapReduce, such as Google’s Sawzall [17], Facebook’s Hive [22] and Yahoo!’s Pig [16]. These systems execute users’ requests by converting SQL-like queries into a series of MapReduce jobs that are usually short. Such high-level declarative languages greatly simplify the development of MapReduce applications without hand-coded MapReduce programs [21]. Thus, in practice, these systems play a more important role than hand-coded MapReduce programs: more than 95% of Hadoop jobs at Facebook are generated by Hive rather than hand-coded, and more than 90% of MapReduce jobs at Yahoo! are generated by Pig [12]. Because these systems are very sensitive to the execution time of the underlying short MapReduce jobs, reducing that execution time is very important to these widely-used systems.
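As a rough sketch of this translation, the following toy “compiler” shows how a single SQL-like aggregation such as `SELECT status, COUNT(*) FROM log GROUP BY status` naturally becomes one short MapReduce job. All names here are hypothetical illustrations, not Hive or Pig internals.

```python
from collections import defaultdict

def compile_group_by_count(column):
    # A GROUP BY + COUNT(*) query maps directly onto one map/reduce pair.
    def map_fn(row):
        yield row[column], 1        # map: emit (group key, 1)
    def reduce_fn(key, values):
        return key, sum(values)     # reduce: COUNT(*) per group
    return map_fn, reduce_fn

def execute(rows, map_fn, reduce_fn):
    # Single-process stand-in for one short MapReduce job.
    groups = defaultdict(list)
    for row in rows:                # map + shuffle
        for k, v in map_fn(row):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

log = [{"status": 200}, {"status": 404}, {"status": 200}]
m, r = compile_group_by_count("status")
result = execute(log, m, r)
# result == {200: 2, 404: 1}
```

A realistic query plan chains several such jobs (e.g., a join followed by an aggregation followed by a sort), which is why the end-to-end latency of these systems is dominated by the per-job overhead of many short jobs.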
For the above reasons, in this paper we concentrate on improving the execution performance of short MapReduce jobs. Having studied the Hadoop MapReduce framework in great detail, we focus on the internal execution mechanisms of an individual job and the tasks inside it. Through in-depth analysis, we reveal two critical issues that limit the performance of MapReduce jobs. To address them, we design and implement SHadoop, an optimized version of Hadoop that is fully compatible with the standard Hadoop. Rather than improving performance at the job scheduling or job parameter optimization level, we optimize the underlying execution mechanism of each task inside a job. In implementation, first, we optimize the setup and cleanup tasks, two special tasks executed when running a MapReduce job, to reduce the time cost of the job’s initialization and termination stages; second, we add an instant messaging communication mechanism to the standard Hadoop for fast delivery of performance-sensitive task scheduling and execution messages between the JobTracker and TaskTrackers. This way, the tasks of a job can be scheduled and executed instantly, without heartbeat delay. As a consequence, the job execution process becomes more compact and the utilization of slots on the TaskTrackers is much improved. Experimental results show that SHadoop outperforms the standard Hadoop, achieving stable performance improvements of around 25% on average on comprehensive benchmarks. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop [11]. To the best of our knowledge, this work is the first effort to optimize the execution mechanism inside map/reduce tasks. Its advantage is that it complements job scheduling optimization work to further improve job execution performance.
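The heartbeat issue can be sized with a back-of-envelope model: under the standard polling mechanism, a ready task waits until the TaskTracker’s next periodic heartbeat before it can be assigned, whereas a pushed message costs only one RPC round trip. The numbers below (3 s heartbeat interval, 10 ms RPC latency, the wave timings) are illustrative assumptions, not measurements from the paper.

```python
import math

HEARTBEAT_INTERVAL = 3.0   # seconds; assumed default polling interval

def heartbeat_wait(task_ready_at, interval=HEARTBEAT_INTERVAL):
    """Idle time until the next periodic heartbeat picks the task up."""
    next_beat = math.ceil(task_ready_at / interval) * interval
    return next_beat - task_ready_at

def push_wait(task_ready_at, rpc_latency=0.01):
    """With an instant-messaging push, only the RPC latency remains."""
    return rpc_latency

# A job with 4 waves of tasks becoming ready at these times (seconds):
# with polling, each wave can add up to one full interval of idle slot
# time; with a push, the added delay is negligible.
waves = [0.2, 3.4, 6.9, 10.1]
poll_delay = sum(heartbeat_wait(t) for t in waves)   # ≈ 9.4 s total
push_delay = sum(push_wait(t) for t in waves)        # ≈ 0.04 s total
```

For a job whose total runtime is tens of seconds, several seconds of accumulated polling delay is a large fraction of the runtime, which is why this overhead matters far more for short jobs than for hour-long ones.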
The rest of this paper is organized as follows: Section 2 introduces related work on MapReduce performance optimization and compares it with SHadoop. Section 3 analyzes the job and task execution mechanism in standard Hadoop MapReduce. Based on this, Section 4 describes our optimization methods for improving job execution efficiency in standard Hadoop MapReduce. Section 5 discusses experiments and performance evaluations of our optimization work. Finally, we conclude this paper in Section 6.
Related work analysis
Many studies have been done to improve the performance of the Hadoop MapReduce framework at different levels. They fall into several categories. The first focuses on designing scheduling algorithms that optimize the execution order of jobs or tasks more intelligently [30], [29], [15], [8], [27], [9], [14], [25]. The second explores how to improve the efficiency of MapReduce with the aid of special hardware or supporting software [31], [3], [28]. The third conducts specialized
In-depth analysis of MapReduce job execution process
In this section, we first give a brief introduction to the Hadoop MapReduce framework. Then we focus on performing an in-depth analysis of the underlying execution mechanism and process of a MapReduce job and its tasks in Hadoop.
The Hadoop MapReduce framework, deployed on top of HDFS, consists of a JobTracker running on the master node and many TaskTrackers running on slave nodes. “Job” and “Task” are two important concepts in the MapReduce architecture. Usually, a MapReduce job contains a set
Optimization of MapReduce job and task execution mechanisms
Based on the above in-depth analysis of execution mechanisms of a MapReduce job and its tasks, in this section we reveal two critical limitations to job execution performance in the standard Hadoop MapReduce framework. Then we present our optimization work to address these issues in more detail. The optimizations made in SHadoop aim at reducing the internal execution time cost of individual MapReduce jobs, especially for short jobs, by optimizing the job and task execution mechanisms to improve
Evaluation
In order to verify the effects of our optimizations, we conducted a series of experiments to evaluate and compare the performance of SHadoop with the standard Hadoop. First, we performed a number of experiments to separately evaluate the effect of each optimization measure. Second, in order to evaluate how much our optimization can benefit MapReduce jobs with different workloads, we adopted several Hadoop MapReduce benchmark suites to further evaluate SHadoop. Third, the widely-used big
Conclusion and future work
MapReduce is a popular programming model and framework for processing large datasets. It has been widely used and recognized for its simple programming interfaces, fault tolerance and elastic scalability. However, the job execution performance of MapReduce has received relatively little attention. In this paper, we explore an approach to optimizing the job and task execution mechanism and present an optimized version of Hadoop, named SHadoop, to improve the execution performance of MapReduce jobs. SHadoop makes
Acknowledgments
This work is funded in part by China NSF Grants (61223003) and the National High Technology Research and Development Program of China (863 Program) (2011AA01A202).
References (31)
- Apache Hadoop....
- S. Babu, Towards automatic optimization of mapreduce programs, in: Proceedings of the 1st ACM symposium on Cloud...
- Y. Becerra Fontal, V. Beltran Querol, D. Carrera, et al., Speeding up distributed MapReduce applications using...
- Benchmarking and Stress Testing an Hadoop Cluster With TeraSort, TestDFSIO & Co....
- Y. Chen, A. Ganapathi, R. Griffith, R. Katz, The case for evaluating mapreduce performance using workload suites, in:...
- Danga Interactive, memcached,...
- J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
- M. Hammoud, M. Sakr, Locality-aware reduce task scheduling for MapReduce, in: 3rd IEEE International Conference on Cloud...
- C. He, Y. Lu, D. Swanson, Matchmaking: a new MapReduce scheduling technique, in: 3rd International Conference on Cloud...
- S. Huang, J. Huang, J. Dai, T. Xie, B. Huang, The HiBench benchmark suite: characterization of the MapReduce-based data...
Rong Gu received the B.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011. He is currently a Ph.D. candidate in computer science at Nanjing University, Nanjing, China. His research interests include parallel and distributed computing, cloud computing, and big data parallel processing.
Xiaoliang Yang received the B.S. degree in computer science from YanShan University, China, in 2008 and the Master’s degree in computer science from Nanjing University, Nanjing, China, in 2012. He currently works at Baidu. His research interests include parallel and distributed computing and bioinformatics.
Jinshuang Yan received the B.S. degree in computer science from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2010. He is currently a Master’s student in computer science at Nanjing University, Nanjing, China. His research interests include parallel computing and large-scale data analysis.
Yuanhao Sun joined Intel in 2003. He was managing the big data product team in the Datacenter Software Division at Intel Asia-Pacific R&D Ltd., leading the efforts for Intel’s distribution of Hadoop and related solutions and services. Yuanhao received his bachelor’s and master’s degrees from Nanjing University, both in computer science.
Bin Wang received the B.S. degree in software engineering from Nanjing University. He is currently a Master’s degree candidate in software engineering at Nanjing University and is an intern at the Intel Asia-Pacific R&D Center. His research interests include distributed computing, large-scale data analysis and data mining.
Chunfeng Yuan is currently a professor in the computer science department of Nanjing University, China. She received her bachelor’s and master’s degrees from Nanjing University, both in computer science. Her main research interests include computer system architecture, big data parallel processing and Web information mining.
Yihua Huang is currently a professor in the computer science department of Nanjing University, China. He received his bachelor’s, master’s and Ph.D. degrees from Nanjing University, all in computer science. His main research interests include parallel and distributed computing, big data parallel processing and Web information mining.