Assessing the Performance Impact of High-Speed Interconnects on MapReduce

Wang, Yandong; Jiao, Yizheng; Xu, Cong; Li, Xiaobing; Wang, Teng; Que, Xinyu; Cira, Cristi; Wang, Bin; Liu, Zhuo; Bailey, Bliss; Yu, Weikuan

doi:10.1007/978-3-642-53974-9_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8163)

Abstract

Hadoop is a successful open-source implementation of the MapReduce programming model and has been widely adopted by leading companies for big data analytics. However, its intermediate data shuffling is a time-consuming operation that lengthens the total execution time of MapReduce programs. A growing number of organizations are interested in addressing this issue by leveraging high-performance interconnects, such as InfiniBand and 10 Gigabit Ethernet, which have long been popular in the High-Performance Computing (HPC) community. Yet a comprehensive examination of the performance impact of these interconnects on MapReduce programs is still lacking.

In this work, we systematically evaluate the performance impact of two popular high-speed interconnects, 10 Gigabit Ethernet and InfiniBand, using both the original Apache Hadoop and our extended Hadoop Acceleration framework. Our analysis shows that, under Apache Hadoop, fast networks can effectively accelerate jobs with small intermediate data sizes but cannot maintain that advantage for jobs with large intermediate data. In contrast, Hadoop Acceleration provides better performance for jobs across a wide range of data sizes. In addition, both implementations exhibit good scalability under the different networks, and Hadoop Acceleration significantly reduces the CPU utilization and I/O wait time of MapReduce programs.

This research is supported in part by NSF grant #CNS-1059376 and by a grant from Lawrence Livermore National Laboratory.

Author information

Authors and Affiliations

Department of Computer Science, Auburn University, USA
Yandong Wang, Yizheng Jiao, Cong Xu, Xiaobing Li, Teng Wang, Xinyu Que, Cristi Cira, Bin Wang, Zhuo Liu, Bliss Bailey & Weikuan Yu

Editor information

Editors and Affiliations

Department of Electric and Computer Science, University of Toronto, 10 King’s College Road, SFB 540, M5S 3G4, Toronto, ON, Canada
Tilmann Rabl & Hans-Arno Jacobsen
Server Technologies, Oracle Corporation, 500 Oracle Parkway, 94065, Redwood Shores, CA, USA
Meikel Poess
Supercomputer Center, University of California San Diego, 9500 Gilman Drive, 92093-0505, La Jolla, CA, USA
Chaitanya Baru

About this paper

Cite this paper

Wang, Y. et al. (2014). Assessing the Performance Impact of High-Speed Interconnects on MapReduce. In: Rabl, T., Poess, M., Baru, C., Jacobsen, HA. (eds) Specifying Big Data Benchmarks. WBDB 2012. Lecture Notes in Computer Science, vol 8163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53974-9_13

DOI: https://doi.org/10.1007/978-3-642-53974-9_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53973-2
Online ISBN: 978-3-642-53974-9
eBook Packages: Computer Science, Computer Science (R0)
