A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

Bhat, Adithya; Islam, Nusrat Sharmin; Lu, Xiaoyi; Wasi-ur-Rahman, Md.; Shankar, Dipti; Panda, Dhabaleswar K. (DK)

doi:10.1007/978-3-319-29006-5_10


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9495)

Included in the following conference series:

BPOE


Abstract

The Hadoop Distributed File System (HDFS) is widely used as the underlying storage engine by many Big Data processing frameworks, such as Hadoop MapReduce, HBase, Hive, and Spark. This makes the performance of HDFS a primary concern in the Big Data community. Recent studies have shown that HDFS cannot fully exploit the performance benefits of RDMA-enabled high-performance interconnects like InfiniBand. To address this, RDMA-enabled HDFS designs have been proposed in the literature that show better performance on RDMA-enabled networks. However, these designs are tightly integrated with specific versions of the Apache Hadoop distribution and cannot easily be used with other Hadoop distributions. In this paper, we propose an efficient RDMA-based plugin for HDFS that can be easily integrated with various Hadoop distributions and versions, such as Apache Hadoop 2.5 and 2.6, Hortonworks HDP, and Cloudera CDH. Performance evaluations show that our plugin brings the expected performance of the hybrid RDMA-enhanced design, up to 3.7x improvement in TestDFSIO write, to all these distributions. We also demonstrate that our RDMA-based plugin achieves up to 4.6x improvement over the Mellanox R4H (RDMA for HDFS) plugin.

This research is supported in part by National Science Foundation grants #CNS-1419123 and #IIS-1447804.



Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
Adithya Bhat, Nusrat Sharmin Islam, Xiaoyi Lu, Md. Wasi-ur-Rahman, Dipti Shankar & Dhabaleswar K. (DK) Panda


Corresponding author

Correspondence to Adithya Bhat.

Editor information

Editors and Affiliations

Institute of Computing, Chinese Academy of Sciences, Beijing, China
Jianfeng Zhan
ICT, Chinese Academy of Sciences, Beijing, China
Rui Han
FB12 - DBIS (5. Stock), Goethe Universität Frankfurt, Frankfurt, Hessen, Germany
Roberto V. Zicari


About this paper

Cite this paper

Bhat, A., Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Shankar, D., Panda, D.K. (DK) (2016). A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS. In: Zhan, J., Han, R., Zicari, R. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2015. Lecture Notes in Computer Science, vol 9495. Springer, Cham. https://doi.org/10.1007/978-3-319-29006-5_10


DOI: https://doi.org/10.1007/978-3-319-29006-5_10
Published: 09 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29005-8
Online ISBN: 978-3-319-29006-5
eBook Packages: Computer Science, Computer Science (R0)
