ABSTRACT
Hadoop MapReduce is the most popular open-source parallel programming model and is used extensively in Big Data analytics. Although fault tolerance and platform independence make Hadoop MapReduce the preferred choice for many users, it still has substantial room for performance improvement. Recently, an RDMA-based design of Hadoop MapReduce has alleviated major performance bottlenecks by implementing several novel design features, such as in-memory merge, prefetching and caching of map outputs, and overlapping of the merge and reduce phases. Although these features reduce the overall execution time of MapReduce jobs compared to the default framework, further improvement is possible if the shuffle and merge phases can also be overlapped with the map phase during job execution. In this paper, we propose HOMR (a Hybrid approach to exploit maximum Overlapping in MapReduce), which not only incorporates the features implemented in the RDMA-based design but also exploits the maximum possible overlapping among all phases compared to the current best approaches. Our solution introduces two key concepts, the Greedy Shuffle Algorithm and On-demand Shuffle Adjustment, both of which are essential to achieving significant performance benefits over the default MapReduce framework. The architecture of HOMR is general enough to provide performance efficiency both over different Sockets interfaces and over previous RDMA-based designs over InfiniBand. Performance evaluations show that HOMR with RDMA over InfiniBand achieves performance benefits of 54% and 56% compared to default Hadoop over IPoIB (IP over InfiniBand) and 10GigE, respectively. Compared to the previous best RDMA-based designs, this benefit is 29%. HOMR over Sockets also achieves a maximum benefit of 38-40% compared to default Hadoop over the Sockets interface. We also evaluate our design with real-world workloads such as SWIM and PUMA, and observe benefits of up to 16% and 18%, respectively, over the previous best-case RDMA-based design.
To the best of our knowledge, this is the first approach to achieve the maximum possible overlapping for the MapReduce framework.
Index Terms
- HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects