Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

The Journal of Supercomputing

Abstract

With the emergence of high-performance data analytics, the Hadoop platform is increasingly used to process data stored on high-performance computing (HPC) clusters. These modern clusters are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and with storage systems such as SSDs and Lustre. While this leaves immense scope for improving the performance of Hadoop MapReduce, including its network-intensive shuffle phase, it is essential to study the MapReduce component in isolation. In this paper, we study popular MapReduce workloads, drawn from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine the environmental and workload-specific factors that affect the performance of a MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite for evaluating the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with the proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB outperforms 10 GigE by about 13–15%, and that the RDMA-enhanced hybrid MapReduce design achieves up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared-storage architectures.
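
To make the notion of a shuffle data distribution pattern concrete, the following is a minimal sketch of how intermediate records might be spread across reducers under different patterns. It is not the benchmark suite described in the abstract: the pattern names ("even", "random", "skewed"), the 50% skew toward reducer 0, and the helper names `assign_reducer` and `shuffle_load` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def assign_reducer(index, num_reducers, pattern, rng):
    """Pick a reducer for the index-th intermediate record.

    The three patterns are illustrative assumptions, not taken from the paper.
    """
    if pattern == "even":
        # Round-robin: every reducer receives the same number of records.
        return index % num_reducers
    if pattern == "random":
        # Uniformly random: roughly balanced, with some variance.
        return rng.randrange(num_reducers)
    if pattern == "skewed":
        # Assumed skew: half of all records are forced onto reducer 0,
        # the remainder are spread uniformly over all reducers.
        if rng.random() < 0.5:
            return 0
        return rng.randrange(num_reducers)
    raise ValueError(f"unknown pattern: {pattern!r}")


def shuffle_load(num_records, num_reducers, pattern, seed=0):
    """Return the number of records each reducer would receive."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    counts = Counter(
        assign_reducer(i, num_reducers, pattern, rng) for i in range(num_records)
    )
    return [counts[r] for r in range(num_reducers)]


if __name__ == "__main__":
    for pattern in ("even", "random", "skewed"):
        print(pattern, shuffle_load(1000, 4, pattern))
```

Under the skewed pattern, one reducer receives a disproportionate share of the intermediate data, which is the kind of workload-specific factor that can dominate shuffle time regardless of the interconnect.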

Author information

Corresponding author

Correspondence to Dipti Shankar.

Additional information

This research is supported in part by National Science Foundation Grants #CNS-1419123, #IIS-1447804, and #ACI-1450440. It used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant #OCI-1053575.

About this article

Cite this article

Shankar, D., Lu, X., Wasi-ur-Rahman, M. et al. Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters. J Supercomput 72, 4573–4600 (2016). https://doi.org/10.1007/s11227-016-1760-5

  • DOI: https://doi.org/10.1007/s11227-016-1760-5
