Locality and loading aware virtual machine mapping techniques for optimizing communications in MapReduce applications

https://doi.org/10.1016/j.future.2015.04.006

Highlights

  • Improving performance of MapReduce programs in heterogeneous environments and hybrid clouds.

  • Enhancing data locality through a virtual machine mapping technique.

  • Optimizing shuffle performance and reducing communication overheads in distributed systems.

  • Balancing the workload of reducers at run-time through a loading aware technique.

Abstract

Big data refers to data so large that it exceeds the processing capabilities of traditional systems. Such data can be awkward to work with, and its storage, processing and analysis can be problematic. MapReduce is a recent programming model that can handle big data. It achieves this by distributing the storage and processing of data amongst a large number of computers (nodes). However, this means the time required to process a MapReduce job depends on whichever node is last to complete a task. Heterogeneous environments exacerbate this problem.

In this paper we propose a method to improve MapReduce execution in heterogeneous environments. This is done by dynamically partitioning data before the Map phase and by using virtual machine mapping in the Reduce phase in order to maximize resource utilization. Simulation and experimental results show an improvement in MapReduce performance, including data locality and total completion time with different optimization approaches.

Introduction

Big Data is a relative term that refers to datasets that have grown so large that conventional software tools cannot capture, manage and process them within a tolerable period of time [1]. The sources of this data are wide and varied. Typical examples include RFID tags, GPS-enabled smart phones, social media, phone records, web logs, sensor networks, online browsing, eCommerce, and various fields of scientific research such as astronomy, medicine and weather [2], [3], [4]. By mining this data researchers can discover trends such as user behavior. Such knowledge can have an impact on business, government, and scientific endeavors.

MapReduce is a programming model to create distributed applications that can process big data using a large number of commodity computers. Originally developed by Google, MapReduce enjoys wide use by both industry and academia [5] via Hadoop [6]. Hadoop is an open source implementation of MapReduce developed by Yahoo and is based on Google’s MapReduce [7] and Google File System [8] papers.

The advantage of the MapReduce framework is that it allows users to execute analytical tasks over big data without worrying about the myriad of details inherent in distributed programming [5], [9]. Being both scalable and fault tolerant, MapReduce frameworks can potentially reduce the time it takes to complete a job by an amount proportional to the number of nodes available. The efficacy of MapReduce can be undermined, however, by its implementation. For instance, Hadoop, the most popular open source MapReduce framework [9], assumes all the nodes in the network are homogeneous. Consequently, Hadoop’s performance is not optimal in a heterogeneous environment.

In the MapReduce model the time it takes to complete a job depends on when each node completes its workload. Therefore, if the workload is distributed evenly, the slowest node determines when a job completes. To compensate for this, the workload on slower nodes needs to be smaller than that on faster nodes. This can be achieved by dividing the workload amongst individual nodes in proportion to the processing efficiency of each node.
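As a concrete illustration, the share of input assigned to each node can be weighted by its relative processing speed. The sketch below is a minimal example of such a proportional split (the function name, node speeds and rounding rule are our own illustrative choices, not the paper’s implementation):

    # Minimal sketch: split `total_size` units of input across nodes in
    # proportion to a per-node performance score, so faster nodes receive
    # proportionally larger partitions.
    def partition_sizes(total_size, node_speeds):
        """Return per-node amounts proportional to each node's speed."""
        total_speed = sum(node_speeds)
        sizes = [total_size * s // total_speed for s in node_speeds]
        sizes[-1] += total_size - sum(sizes)  # give any rounding remainder to the last node
        return sizes

    # Example: three nodes with a 1:3:5 performance ratio sharing 900 MB.
    print(partition_sizes(900, [1, 3, 5]))  # -> [100, 300, 500]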

In this paper we focus on the Hadoop framework. We look in particular at how MapReduce handles map input and reduce task assignment in a heterogeneous environment. This is an important area of research, since there is ample opportunity for MapReduce to be deployed in such environments. For instance, as technology advances, new machines on the network are likely to be dissimilar to old ones. Alternatively, MapReduce may be deployed in a hybrid cloud environment, where computing resources tend to be heterogeneous. Therefore, this paper proposes a method to improve the execution of MapReduce jobs in a heterogeneous environment.

In summary, this paper presents the following contributions:

  • A method to improve mapper performance in a heterogeneous environment by repartitioning data at each node.

  • A method to improve virtual machine mapping for reducers.

  • A method to improve reducer selection on heterogeneous systems.

The rest of this paper is organized as follows. In Section 2, we present some background on MapReduce. In Section 3, we present our proposed dynamic data partitioning and virtual machine mapping methods. In Section 4, we evaluate our work, present our experimental results and discuss our findings. In Section 5, we discuss related work. Finally, in Section 6, we present our conclusion and give a brief discussion of future work.


Background

The term cloud computing appears to have been coined by Google’s CEO, Eric Schmidt, at a conference in 2006, and is likely inspired by the use of a cloud to represent the Internet in pictures and diagrams [10]. Amongst the literature [10], [11], [12], [13] it appears that there is no standard definition of what cloud computing is. Therefore, this paper uses the NIST definition [14] of cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”.

Proposed techniques and implementation

Hadoop has garnered much popularity in both academia and industry. The popularity it enjoys is a result of both its speed and economic efficacy. Furthermore, since Hadoop is open source, academics have been provided a useful platform on which to base their research. However, one of the drawbacks of the Hadoop implementation is that it assumes the computing nodes in the network are homogeneous [22].

Consequently, Hadoop exhibits several inefficient behaviors when employed in a heterogeneous environment.
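One way to picture the locality-aware virtual machine mapping proposed in this section is as a placement problem over shuffle traffic: if the scheduler knows how much of each partition’s intermediate data already resides on each virtual machine, it can place the corresponding reduce task where most of that data is local. The sketch below captures this greedy placement rule; the data structures and names are our own illustration, not the paper’s code:

    # Hedged sketch of locality-aware reducer placement: partition_bytes[p][h]
    # is the number of bytes of partition p's intermediate data on host h.
    # Each reduce task is placed on the host already holding most of its
    # input, reducing the data moved across the network during the shuffle.
    def map_reducers_to_hosts(partition_bytes):
        """Return {partition: host}, choosing per partition the host with the most local data."""
        return {part: max(by_host, key=by_host.get)
                for part, by_host in partition_bytes.items()}

    # Hypothetical shuffle volumes for two partitions over three virtual machines.
    volumes = {
        "p0": {"vm-a": 120, "vm-b": 40, "vm-c": 10},
        "p1": {"vm-a": 15, "vm-b": 90, "vm-c": 60},
    }
    print(map_reducers_to_hosts(volumes))  # -> {'p0': 'vm-a', 'p1': 'vm-b'}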

Evaluation

To evaluate the performance of the proposed techniques, we have implemented the dynamic data partitioning and virtual machine mapping algorithms and tested these methodologies with a 900 MB randomly generated input data file in a simulated MapReduce environment. The performance analysis was reported with synthesized datasets, where the number of key–value pairs was set equal to the number of reducers. A 1:3 performance ratio (low heterogeneity) and a 1:5 performance ratio (high heterogeneity) were used to model node heterogeneity.
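To make the heterogeneity ratios concrete, a back-of-the-envelope comparison (ours, not output from the paper’s simulator) shows why proportional partitioning matters at a 1:5 ratio. Taking each node’s time as its assigned data divided by its speed, an even split is dominated by the slow node, while a speed-proportional split lets all nodes finish together:

    # Illustration: map-phase makespan for an even split versus a split
    # proportional to node speed, under a 1:5 performance ratio.
    def makespan(data_split, speeds):
        return max(d / s for d, s in zip(data_split, speeds))

    speeds = [1, 5]                          # two nodes, 1:5 heterogeneity
    print(makespan([450, 450], speeds))      # even 900 MB split -> 450.0 (slow node dominates)
    print(makespan([150, 750], speeds))      # 1:5 split of 900 MB -> 150.0 (nodes finish together)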

Related work

There have been a variety of studies by researchers on MapReduce and heterogeneous environments. We begin by summarizing some of the recent works here.

MapReduce frameworks such as Hadoop make an assumption that they will be deployed in a homogeneous environment. However, there are times where that assumption is untrue. Khalil, Salem, Nassar, and Saad [23] identified three reasons why this assumption may be broken. Firstly, it is often impossible or even undesirable to have only one type of machine in a cluster.

Conclusion

This paper is based on MapReduce and the Hadoop framework. Its purpose is to improve the performance of distributed MapReduce applications executing in a heterogeneous environment. It does so by dynamically partitioning the input data read by map tasks and by judiciously assigning reduce tasks based on data locality using a Virtual Machine Mapper. Furthermore, this paper presents an optimization of this method called the Load Aware Virtual Machine Mapper.

Simulation and experimental results show an improvement in MapReduce performance, including data locality and total completion time.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. This work was funded by the Ministry of Science and Technology, Taiwan, under grant number NSC-101-2918-I-216-001.


References (34)

  • A.B. Patel, M. Birla, U. Nair, Addressing big data problem using Hadoop and Map Reduce, in: 2012 Nirma University...
  • B. Feldman, E.M. Martin, T. Skotnes, Big data in healthcare hype and hope,...
  • R. Smolan et al., The Human Face of Big Data (2012)
  • T. White, Hadoop: The Definitive Guide (2012)
  • K.-H. Lee et al., Parallel data processing with MapReduce: a survey, ACM SIGMOD Rec. (2012)
  • Apache Hadoop. Available at: http://hadoop.apache.org [August 12,...
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
  • S. Ghemawat et al., The Google file system, ACM SIGOPS Oper. Syst. Rev. (2003)
  • J. Dittrich et al., Efficient big data processing in Hadoop MapReduce, Proc. VLDB Endow. (2012)
  • N. Sultan et al., Organisational culture and cloud computing: coping with a disruptive innovation, Technol. Anal. Strateg. Manag. (2012)
  • J. Lin et al., Data-intensive text processing with MapReduce
  • L. Wang et al., Cloud computing: a perspective study, New Gener. Comput. (2010)
  • M. Böhm, S. Leimeister, C. Riedl, H. Krcmar, Cloud computing and computing evolution, Technische Universität München,...
  • P. Mell et al., The NIST definition of cloud computing (draft), NIST Spec. Publ. (2011)
  • Y. Xing et al., Virtualization and cloud computing
  • B. Furht, Cloud computing fundamentals
  • Xen. Available at: http://www.xenproject.org [August 21,...

Ching-Hsien Hsu received B.S. and Ph.D. degrees in Computer Science from Tung Hai University and Feng Chia University, Taiwan, in 1995 and 1999, respectively. From 2001 to 2002, he was an assistant professor in the Department of Electrical Engineering at Nan Kai College. He joined the Department of Computer Science and Information Engineering, Chung Hua University, in 2002, and became an associate professor in August 2005. He has published more than 100 academic papers in journals, books and conference proceedings. His research interests include parallel and distributed processing, concurrent programming, parallelizing compilers, grid and pervasive computing. He is a senior member of the IEEE Computer Society.

Kenn D. Slagter received an NZCE in Electronics and Computer Technology from the Eastern Institute of Technology in 1996, a B.S. degree in Computer Science from the University of Waikato in 2000 and a Master of Computer Studies from the University of New England in 2007. In 2008 he joined the Department of Computer Science at National Tsing Hua University as a Ph.D. candidate. He also has over 8 years of work experience in the private sector as a software engineer. His research interests include high performance computing, cloud computing and parallel and distributed systems. He is a student member of the IEEE Computer Society.

Yeh-Ching Chung received a B.S. degree in Information Engineering from Chung Yuan Christian University in 1983, and the M.S. and Ph.D. degrees in Computer and Information Science from Syracuse University in 1988 and 1992, respectively. He joined the Department of Information Engineering at Feng Chia University as an associate professor in 1992 and became a full professor in 1999. From 1998 to 2001, he was the chairman of the department. In 2002, he joined the Department of Computer Science at National Tsing Hua University as a full professor. His research interests include parallel and distributed processing, cloud computing, and embedded systems. He is a senior member of the IEEE Computer Society.
