Abstract
With the emergence of high-performance data analytics, the Hadoop platform is increasingly used to process data stored on high-performance computing clusters. There is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) on these modern clusters, which are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE and with storage systems such as SSDs and Lustre. To realize these improvements, it is essential to study the MapReduce component in isolation. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine the environmental and workload-specific factors that affect the performance of a MapReduce job. Based on these characterization studies, we propose a micro-benchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and we demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than over 10 GigE by about 13–15%, and that the RDMA-enhanced hybrid MapReduce design can achieve up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared-storage architectures.
Additional information
This research is supported in part by National Science Foundation Grants #CNS-1419123, #IIS-1447804, and #ACI-1450440. It used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation Grant number #OCI-1053575.
Cite this article
Shankar, D., Lu, X., Wasi-ur-Rahman, M. et al. Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters. J Supercomput 72, 4573–4600 (2016). https://doi.org/10.1007/s11227-016-1760-5
DOI: https://doi.org/10.1007/s11227-016-1760-5