Abstract
In this paper, we propose a distributed caching scheme for efficiently accessing small files in the Hadoop distributed file system (HDFS). The proposed scheme reduces the volume of metadata that the NameNode must manage by combining multiple small files and storing them in a single block. In addition, it reduces unnecessary accesses by maintaining information on requested files in a client cache and a DataNode cache, and by synchronizing the metadata held in the client cache. The client cache holds the small files requested by users together with their metadata, and each DataNode cache holds the small files that users request frequently. Performance evaluation shows that the proposed distributed cache management scheme significantly outperforms existing schemes in terms of small file access cost.
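The two ideas in the abstract, packing several small files into one block and looking a file up in a client cache before a DataNode cache, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the names `pack_small_files`, `SmallFileReader`, and `LRUCache` are hypothetical, the LRU eviction policy is an assumption (the abstract only says the DataNode cache keeps frequently requested files), and metadata synchronization between caches is not modeled.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache. The eviction policy is an assumption, not from the paper."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry


def pack_small_files(files):
    """Concatenate small files into one block; record (offset, length) per file."""
    block = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(block), len(data))
        block.extend(data)
    return bytes(block), index


class SmallFileReader:
    """Hypothetical read path: client cache, then DataNode cache, then the block."""

    def __init__(self, block, index, client_capacity=2, datanode_capacity=4):
        self.block = block
        self.index = index  # stands in for the per-file metadata
        self.client_cache = LRUCache(client_capacity)
        self.datanode_cache = LRUCache(datanode_capacity)
        self.block_reads = 0  # counts reads that had to touch the merged block

    def read(self, name):
        data = self.client_cache.get(name)
        if data is not None:
            return data  # served locally, no remote access
        data = self.datanode_cache.get(name)
        if data is None:
            offset, length = self.index[name]
            data = self.block[offset:offset + length]
            self.block_reads += 1
            self.datanode_cache.put(name, data)
        self.client_cache.put(name, data)
        return data
```

In this sketch a single index entry per small file replaces a separate block entry per file, and `block_reads` only grows when both caches miss, which is the kind of access the scheme aims to avoid.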
Acknowledgements
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2017-2013-0-00881, IITP-2017-2013-0-00680) supervised by the IITP (Institute for Information & communications Technology Promotion), by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1A2B3007527), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. NRF-2017R1A2B1003678).
Cite this article
Bok, K., Oh, H., Lim, J. et al. An efficient distributed caching for accessing small files in HDFS. Cluster Comput 20, 3579–3592 (2017). https://doi.org/10.1007/s10586-017-1147-2
DOI: https://doi.org/10.1007/s10586-017-1147-2