The research and analysis of efficiency of hardware usage base on HDFS

Cluster Computing

Abstract

HDFS (Hadoop Distributed File System), the data storage layer of the Hadoop ecosystem, provides read and write interfaces for many upper-level applications. The read/write performance of HDFS is affected by hardware such as disk and network, and even CPU and memory. When the underlying storage system and transmission network of HDFS use high-performance devices, read/write performance improves to a certain extent. However, because of the complex software stack, the improvement ratio cannot match the performance gain of the devices themselves. HDFS can store petabytes of data on cheap machines, so equipping it with ultra-high-performance hardware to improve performance increases cost and wastes resources. In this paper, we analyze the read/write process of HDFS and determine the proportions of software and hardware processing. Under the test environment and methods used in this paper, we find that the storage system accounts for 19.7% of the impact on HDFS and the network accounts for 62.5%. We test the basic performance of various hardware devices and their performance when applied to HDFS. Combined with hardware utilization analysis, we find that using popular storage systems and networks can improve the write performance of HDFS by 257% and 207%, respectively.
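The claim that the end-to-end improvement cannot reach the device's own improvement can be illustrated with a simple Amdahl-style estimate. The sketch below is not the paper's model: the function name `end_to_end_speedup` and the 50% hardware share are hypothetical values chosen only for illustration, and the code assumes that the software and hardware portions of a request run one after the other.

```python
def end_to_end_speedup(hardware_fraction: float, device_speedup: float) -> float:
    """Amdahl-style estimate of overall speedup when only the hardware part gets faster.

    hardware_fraction: share of request time spent in the device (0.0 to 1.0).
    device_speedup: how many times faster the new device is than the old one.
    Both inputs are illustrative assumptions, not measurements from the paper.
    """
    if not 0.0 <= hardware_fraction <= 1.0:
        raise ValueError("hardware_fraction must be between 0 and 1")
    if device_speedup <= 0:
        raise ValueError("device_speedup must be positive")
    # Time in the software stack is unchanged by the faster device.
    software_fraction = 1.0 - hardware_fraction
    return 1.0 / (software_fraction + hardware_fraction / device_speedup)


if __name__ == "__main__":
    # Hypothetical case: half of each request is spent in the device.
    for k in (2, 10, 100):
        print(f"device {k}x faster -> overall {end_to_end_speedup(0.5, k):.2f}x")
```

Under these assumptions the overall gain stays below 2x however fast the device becomes, because the unchanged software stack dominates the remaining time.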

Funding

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004401), the Beijing Municipal Natural Science Foundation-Haidian Original Innovation Joint Fund (No. L192027), the Shaanxi Province Key Industrial Projects (Nos. 2021ZDLGY03-02, 2021ZDLGY03-08), and the National Natural Science Foundation of China Major Program (No. 92152301).

Author information

Contributions

All authors contributed to the study’s conception and design. Material preparation, data collection, and analysis were performed by Yun Liu, Xiao Zhang, and Binbin Liu. The first draft of the manuscript was written by Yun Liu and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yun Liu.

Ethics declarations

Conflict of interest

No conflicts of interest.

Data availability

The data in this study come from test results. The data that support the findings of this study are available in the supplementary material of this article.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (RAR 233 KB)

About this article

Cite this article

Liu, Y., Zhang, X., Liu, B. et al. The research and analysis of efficiency of hardware usage base on HDFS. Cluster Comput 25, 3719–3732 (2022). https://doi.org/10.1007/s10586-022-03597-0

  • DOI: https://doi.org/10.1007/s10586-022-03597-0
