research-article

HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects

Authors:
Md Wasi-ur-Rahman, The Ohio State University, Columbus, OH, USA
Xiaoyi Lu, The Ohio State University, Columbus, OH, USA
Nusrat Sharmin Islam, The Ohio State University, Columbus, OH, USA
Dhabaleswar K. (DK) Panda, The Ohio State University, Columbus, OH, USA

ICS '14: Proceedings of the 28th ACM international conference on Supercomputing, June 2014, Pages 33–42. https://doi.org/10.1145/2597652.2597684

Published: 10 June 2014

ABSTRACT

Hadoop MapReduce is the most popular open-source parallel programming model and is used extensively in Big Data analytics. Although fault tolerance and platform independence make Hadoop MapReduce the preferred choice for many users, it still has substantial room for performance improvement. Recently, an RDMA-based design of Hadoop MapReduce has alleviated major performance bottlenecks by implementing several novel design features, such as in-memory merge, prefetching and caching of map outputs, and overlapping of the merge and reduce phases. Although these features reduce the overall execution time of MapReduce jobs compared to the default framework, further improvement is possible if the shuffle and merge phases can also be overlapped with the map phase during job execution. In this paper, we propose HOMR (a Hybrid approach to exploit maximum Overlapping in MapReduce), which not only incorporates the features implemented in the RDMA-based design but also exploits the maximum possible overlapping among all phases, compared to the current best approaches. Our solution introduces two key concepts, Greedy Shuffle Algorithm and On-demand Shuffle Adjustment, both of which are essential to achieving significant performance benefits over the default MapReduce framework. The architecture of HOMR is general enough to provide performance efficiency both over different Sockets interfaces and over previous RDMA-based designs over InfiniBand. Performance evaluations show that HOMR with RDMA over InfiniBand can achieve performance benefits of 54% and 56% compared to default Hadoop over IPoIB (IP over InfiniBand) and 10GigE, respectively. Compared to the previous best RDMA-based designs, this benefit is 29%. HOMR over Sockets also achieves a maximum benefit of 38-40% compared to default Hadoop over the Sockets interface. We also evaluate our design with real-world workloads such as SWIM and PUMA, and observe benefits of up to 16% and 18%, respectively, over the previous best-case RDMA-based design. To the best of our knowledge, this is the first approach to achieve maximum possible overlapping for the MapReduce framework.

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Design

Published in
ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
June 2014
378 pages
ISBN: 9781450326421
DOI: 10.1145/2597652
General Chairs:
Arndt Bode, Technische Universität München and Leibniz Rechenzentrum, Germany
Michael Gerndt, Technische Universität München, Germany
Program Chairs:
Per Stenström, Chalmers University of Technology, Sweden
Lawrence Rauchwerger, Texas A&M University, USA
Barton Miller, University of Wisconsin, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
high performance interconnects
mapreduce
overlapping execution
shuffle algorithm
Acceptance Rates
ICS '14 Paper Acceptance Rate: 34 of 160 submissions, 21%
Overall Acceptance Rate: 584 of 2,055 submissions, 28%