A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS

  • Conference paper
Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9495)

Abstract

Hadoop Distributed File System (HDFS) is widely used as the underlying storage engine by many Big Data processing frameworks, such as Hadoop MapReduce, HBase, Hive, and Spark. This makes the performance of HDFS a primary concern in the Big Data community. Recent studies have shown that HDFS cannot fully exploit the performance benefits of RDMA-enabled high-performance interconnects such as InfiniBand. To address this, RDMA-enabled HDFS designs have been proposed in the literature that show better performance on RDMA-enabled networks. However, these designs are tightly integrated with specific versions of the Apache Hadoop distribution and cannot easily be used with other Hadoop distributions. In this paper, we propose an efficient RDMA-based plugin for HDFS that can be easily integrated with various Hadoop distributions and versions, such as Apache Hadoop 2.5 and 2.6, Hortonworks HDP, and Cloudera CDH. Performance evaluations show that our plugin brings the expected performance of the hybrid RDMA-enhanced design, up to 3.7x improvement in TestDFSIO write, to all these distributions. We also demonstrate that our RDMA-based plugin can achieve up to 4.6x improvement over the Mellanox R4H (RDMA for HDFS) plugin.

This research is supported in part by National Science Foundation grants #CNS-1419123 and #IIS-1447804.

Corresponding author

Correspondence to Adithya Bhat.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Bhat, A., Islam, N.S., Lu, X., Wasi-ur-Rahman, M., Shankar, D., Panda, D.K. (2016). A Plugin-Based Approach to Exploit RDMA Benefits for Apache and Enterprise HDFS. In: Zhan, J., Han, R., Zicari, R. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2015. Lecture Notes in Computer Science, vol 9495. Springer, Cham. https://doi.org/10.1007/978-3-319-29006-5_10

  • DOI: https://doi.org/10.1007/978-3-319-29006-5_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29005-8

  • Online ISBN: 978-3-319-29006-5

  • eBook Packages: Computer Science, Computer Science (R0)
