An efficient distributed caching for accessing small files in HDFS

Bok, Kyoungsoo; Oh, Hyunkyo; Lim, Jongtae; Pae, Yosop; Choi, Hyoungrak; Lee, Byoungyup; Yoo, Jaesoo

doi:10.1007/s10586-017-1147-2

Published: 04 September 2017

Cluster Computing, volume 20, pages 3579–3592 (2017)

Kyoungsoo Bok¹, Hyunkyo Oh¹, Jongtae Lim¹, Yosop Pae¹, Hyoungrak Choi¹, Byoungyup Lee² & Jaesoo Yoo¹ (ORCID: orcid.org/0000-0001-9926-9947)

Abstract

In this paper, we propose a distributed caching scheme for efficiently accessing small files in the Hadoop Distributed File System (HDFS). The proposed scheme reduces the volume of metadata that the NameNode must manage by combining multiple small files and storing them in a single block. In addition, it reduces unnecessary accesses by maintaining information on requested files in a client cache and a DataNode cache, and by synchronizing the metadata held in the client cache. The client cache maintains the small files requested by users together with their metadata, and each DataNode cache maintains the small files that users request frequently. Performance evaluation shows that the proposed distributed cache management scheme significantly outperforms existing schemes in terms of small-file access cost.
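The two ideas in the abstract, packing several small files into one block and answering reads from a client cache before a DataNode cache, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class and function names (`LRUCache`, `pack_small_files`, `DataNode`, `Client`), the LRU eviction policy, and the in-memory `(offset, length)` index are all assumptions made for illustration, and the metadata synchronization step mentioned in the abstract is not modeled.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache. The eviction policy is an assumption, not taken from the paper."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def pack_small_files(files):
    """Concatenate small files into one block and index them as name -> (offset, length)."""
    block = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(block), len(data))
        block.extend(data)
    return bytes(block), index


class DataNode:
    """Hypothetical DataNode holding one merged block plus a cache of small files."""

    def __init__(self, block, index, cache_size=2):
        self.block = block
        self.index = index
        self.cache = LRUCache(cache_size)
        self.block_reads = 0  # how often the merged block itself was read

    def read(self, name):
        data = self.cache.get(name)
        if data is None:
            offset, length = self.index[name]
            data = self.block[offset:offset + length]
            self.block_reads += 1
            self.cache.put(name, data)
        return data


class Client:
    """Hypothetical client that checks its own cache before contacting the DataNode."""

    def __init__(self, datanode, cache_size=2):
        self.datanode = datanode
        self.cache = LRUCache(cache_size)
        self.remote_requests = 0  # how often the DataNode had to be contacted

    def read(self, name):
        data = self.cache.get(name)
        if data is None:
            self.remote_requests += 1
            data = self.datanode.read(name)
            self.cache.put(name, data)
        return data


if __name__ == "__main__":
    block, index = pack_small_files({"a.txt": b"alpha", "b.txt": b"bravo!"})
    client = Client(DataNode(block, index))
    client.read("b.txt")
    client.read("b.txt")  # second read is served from the client cache
    print(client.remote_requests, client.datanode.block_reads)
```

In this sketch one index entry per small file replaces one block entry per small file, which is the sense in which merging reduces per-file metadata. A repeated read by the same client never leaves the client, and a first read by a different client can still be served from the DataNode cache without touching the merged block.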

Acknowledgements

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2013-0-00881, IITP-2017-2013-0-00680) supervised by the IITP (Institute for Information & communications Technology Promotion), by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B3007527), and by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. NRF-2017R1A2B1003678).

Author information

Authors and Affiliations

School of Information and Communication Engineering, Chungbuk National University, Chungdae-ro 1, Seowon-Gu, Cheongju, Chungbuk, 28644, Korea
Kyoungsoo Bok, Hyunkyo Oh, Jongtae Lim, Yosop Pae, Hyoungrak Choi & Jaesoo Yoo
Department of Cyber Security, Paichai University, Baejae-ro, Seo-gu, Daejeon, 35345, Korea
Byoungyup Lee

Corresponding author

Correspondence to Jaesoo Yoo.

About this article

Cite this article

Bok, K., Oh, H., Lim, J. et al. An efficient distributed caching for accessing small files in HDFS. Cluster Comput 20, 3579–3592 (2017). https://doi.org/10.1007/s10586-017-1147-2

Received: 24 June 2017
Revised: 09 August 2017
Accepted: 23 August 2017
Published: 04 September 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s10586-017-1147-2
