DOI: 10.1145/2597652.2597684
research-article

HOMR: a hybrid approach to exploit maximum overlapping in MapReduce over high performance interconnects

Published: 10 June 2014

ABSTRACT

Hadoop MapReduce is the most popular open-source parallel programming model, extensively used in Big Data analytics. Although fault tolerance and platform independence make Hadoop MapReduce the most popular choice for many users, it still has significant room for performance improvement. Recently, an RDMA-based design of Hadoop MapReduce has alleviated major performance bottlenecks through novel design features such as in-memory merge, prefetching and caching of map outputs, and overlapping of the merge and reduce phases. Although these features reduce the overall execution time of MapReduce jobs compared to the default framework, further improvement is possible if the shuffle and merge phases can also be overlapped with the map phase during job execution. In this paper, we propose HOMR (a Hybrid approach to exploit maximum Overlapping in MapReduce), which not only incorporates the features implemented in the RDMA-based design, but also exploits the maximum possible overlapping among all phases compared to current best approaches. Our solution introduces two key concepts: a Greedy Shuffle Algorithm and On-demand Shuffle Adjustment, both of which are essential to achieving significant performance benefits over the default MapReduce framework. The architecture of HOMR is general enough to deliver performance efficiency both over different Sockets interfaces and over previous RDMA-based designs on InfiniBand. Performance evaluations show that HOMR with RDMA over InfiniBand achieves performance benefits of 54% and 56% compared to default Hadoop over IPoIB (IP over InfiniBand) and 10GigE, respectively. Compared to the previous best RDMA-based design, this benefit is 29%. HOMR over Sockets also achieves a maximum of 38-40% benefit compared to default Hadoop over the Sockets interface. We also evaluate our design with real-world workloads such as SWIM and PUMA, and observe benefits of up to 16% and 18%, respectively, over the previous best-case RDMA-based design.
To the best of our knowledge, this is the first approach to achieve maximum possible overlapping for MapReduce framework.
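The central idea of overlapping the shuffle with the map phase can be illustrated with a toy sketch. This is not the paper's implementation (HOMR is built inside Hadoop with RDMA transports); it is a hypothetical, thread-based word-count in which a shuffle worker greedily pulls each map output as soon as it is produced, rather than waiting for all map tasks to finish as the default framework does. All names (`run_job`, `map_task`, `shuffle`) are illustrative.

```python
import threading
import queue


def run_job(records, num_maps=4):
    """Toy word-count showing map/shuffle overlap.

    A shuffle thread runs concurrently with the map tasks and
    greedily drains each map output the moment it is published,
    so data movement overlaps with the remaining map work.
    """
    ready = queue.Queue()   # completed map outputs, published eagerly
    shuffled = []           # pairs received on the "reduce side"
    maps_done = threading.Event()

    def map_task(chunk):
        # Word-count style map: emit (word, 1) pairs, then
        # publish the output immediately instead of waiting
        # for the whole map phase to end.
        ready.put([(word, 1) for word in chunk])

    def shuffle():
        # Greedy shuffle: keep pulling while any map may still
        # produce output, then drain whatever remains.
        while not (maps_done.is_set() and ready.empty()):
            try:
                shuffled.extend(ready.get(timeout=0.05))
            except queue.Empty:
                continue

    shuffler = threading.Thread(target=shuffle)
    shuffler.start()

    chunks = [records[i::num_maps] for i in range(num_maps)]
    mappers = [threading.Thread(target=map_task, args=(c,)) for c in chunks]
    for m in mappers:
        m.start()
    for m in mappers:
        m.join()
    maps_done.set()
    shuffler.join()

    # Reduce: sum the counts per key.
    counts = {}
    for key, value in shuffled:
        counts[key] = counts.get(key, 0) + value
    return counts
```

In real Hadoop the analogous decision is harder, because shuffling a partition before its map task commits risks moving data that may be re-executed on failure; that tension is what mechanisms like the paper's On-demand Shuffle Adjustment have to manage.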

References

  1. 2011 IDC Digital Universe Study. http://www.emc.com/leadership/programs/digital-universe.htm.
  2. 2012 DataNami Study. http://www.datanami.com/datanami/2012-07--16/top_5_challenges_for_hadoop_mapreduce_in_the_enterprise.html.
  3. J. Appavoo, A. Waterland, D. Da Silva, V. Uhlig, B. Rosenburg, E. Van Hensbergen, J. Stoess, R. Wisniewski, and U. Steinberg. Providing A Cloud Network Infrastructure on A Supercomputer. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC, New York, NY, USA, 2010.
  4. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th Symposium on Operating System Design and Implementation, OSDI, Seattle, WA, USA, 2006.
  5. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation, OSDI, San Francisco, CA, USA, 2004.
  6. Greenplum Analytics Workbench. http://www.greenplum.com/news/greenplum-analytics-workbench.
  7. Hadoop MapReduce. The Apache Hadoop Project. http://hadoop.apache.org/mapreduce/.
  8. J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, C. Murthy, and D. K. Panda. High-Performance Design of HBase with RDMA over InfiniBand. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IPDPS, Shanghai, China, 2012.
  9. InfiniBand Trade Association. http://www.infinibandta.org.
  10. N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda. Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? In Proceedings of the IEEE 21st Annual Symposium on High-Performance Interconnects, HOTI, San Jose, CA, USA, 2013.
  11. N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda. High Performance RDMA-based Design of HDFS over InfiniBand. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC, Salt Lake City, UT, USA, 2012.
  12. D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. In Proceedings of the VLDB Endowment, 2010.
  13. X. Lu, N. S. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-Performance Design of Hadoop RPC with RDMA over InfiniBand. In Proceedings of the IEEE 42nd International Conference on Parallel Processing, ICPP, Lyon, France, 2013.
  14. X. Lu, B. Wang, L. Zha, and Z. Xu. Can MPI Benefit Hadoop and MapReduce Applications? In Proceedings of the IEEE 40th International Conference on Parallel Processing Workshops, ICPPW, 2011.
  15. Mellanox Technologies. Unstructured Data Accelerator. http://www.mellanox.com/page/products_dyn?product_family=144.
  16. Purdue MapReduce Benchmarks Suite (PUMA). http://web.ics.purdue.edu/.
  17. M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda. High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand. In International Workshop on High Performance Data Intensive Computing, in conjunction with IPDPS, HPDIC, Boston, MA, USA, 2013.
  18. RDMA for Apache Hadoop: High-Performance Design of Hadoop over RDMA-enabled Interconnects. http://hadoop-rdma.cse.ohio-state.edu/.
  19. S. Seo, I. Jang, K. Woo, I. Kim, J.-S. Kim, and S. Maeng. HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, CLUSTER, 2009.
  20. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies, MSST, 2010.
  21. Stampede at Texas Advanced Computing Center. http://www.tacc.utexas.edu/resources/hpc/stampede.
  22. Statistical Workload Injector for MapReduce (SWIM). https://github.com/SWIMProjectUCB/SWIM/wiki.
  23. S. Sur, H. Wang, J. Huang, X. Ouyang, and D. K. Panda. Can High Performance Interconnects Benefit Hadoop Distributed File System? In Workshop on Micro Architectural Support for Virtualization, Data Center Computing, and Clouds, in conjunction with MICRO, Atlanta, GA, USA, 2010.
  24. The Apache Software Foundation. The Apache Hadoop Project. http://hadoop.apache.org/.
  25. Y. Wang, X. Que, W. Yu, D. Goldenberg, and D. Sehgal. Hadoop Acceleration through Network Levitated Merge. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC, Seattle, WA, USA, 2011.
  26. Y. Wang, C. Xu, X. Li, and W. Yu. JVM-Bypass for Efficient Hadoop Shuffling. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing, IPDPS, Boston, MA, USA, 2013.

    • Published in

      ICS '14: Proceedings of the 28th ACM international conference on Supercomputing
      June 2014
      378 pages
      ISBN: 9781450326421
      DOI: 10.1145/2597652

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      ICS '14 Paper Acceptance Rate: 34 of 160 submissions, 21%. Overall Acceptance Rate: 584 of 2,055 submissions, 28%.
