An efficient distributed caching for accessing small files in HDFS

Cluster Computing

Abstract

In this paper, we propose a distributed caching scheme for efficiently accessing small files in the Hadoop distributed file system (HDFS). The proposed scheme reduces the volume of metadata that the NameNode must manage by combining multiple small files and storing them in a single block. In addition, it reduces unnecessary accesses by keeping information on requested files in a client cache and a DataNode cache, and by synchronizing the metadata held in the client cache. The client cache holds the small files requested by users together with their metadata, and each DataNode cache holds the small files that users request frequently. Performance evaluation shows that the proposed distributed cache management scheme significantly outperforms existing schemes in small file access costs.
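The two ideas in the abstract, packing small files into one block and checking a client cache before a DataNode cache, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `CombinedBlock`, `LRUCache`, and `read_small_file` are hypothetical, the LRU eviction policy is an assumption standing in for the paper's "frequently requested" criterion, and everything runs in memory rather than against HDFS.

```python
from collections import OrderedDict


class CombinedBlock:
    """Hypothetical container: packs many small files into one byte buffer.

    A single index maps each file name to (offset, length), so the metadata
    tracked per block stays small no matter how many files it holds.
    """

    def __init__(self):
        self.data = bytearray()
        self.index = {}  # file name -> (offset, length)

    def add(self, name, payload):
        self.index[name] = (len(self.data), len(payload))
        self.data.extend(payload)

    def read(self, name):
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


class LRUCache:
    """Fixed-capacity cache; LRU eviction is an assumption of this sketch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used


def read_small_file(name, client_cache, datanode_cache, block, stats):
    """Look up a file in the client cache, then the DataNode cache, then the block."""
    data = client_cache.get(name)
    if data is not None:
        stats["client_hits"] += 1
        return data
    data = datanode_cache.get(name)
    if data is not None:
        stats["datanode_hits"] += 1
    else:
        stats["block_reads"] += 1
        data = block.read(name)
        datanode_cache.put(name, data)
    client_cache.put(name, data)
    return data


block = CombinedBlock()
block.add("a.txt", b"alpha")
block.add("b.txt", b"beta")
stats = {"client_hits": 0, "datanode_hits": 0, "block_reads": 0}
client_cache, datanode_cache = LRUCache(1), LRUCache(2)
for name in ["a.txt", "a.txt", "b.txt", "a.txt"]:
    read_small_file(name, client_cache, datanode_cache, block, stats)
print(stats)
```

In this toy run only two of the four requests reach the combined block; the others are served by one of the caches, which is the kind of access reduction the abstract describes.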

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2013-0-00881, IITP-2017-2013-0-00680) supervised by the IITP (Institute for Information & communications Technology Promotion), by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B3007527), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. NRF-2017R1A2B1003678).

Author information

Corresponding author

Correspondence to Jaesoo Yoo.

Cite this article

Bok, K., Oh, H., Lim, J. et al. An efficient distributed caching for accessing small files in HDFS. Cluster Comput 20, 3579–3592 (2017). https://doi.org/10.1007/s10586-017-1147-2

  • DOI: https://doi.org/10.1007/s10586-017-1147-2
