Abstract
Internet of Things (IoT) devices generate an enormous number of files, which fall into two categories: (1) large files and (2) small files. The Hadoop Distributed File System (HDFS) processes datasets using its default compression technique, Hadoop Archives (HAR), to build data chunks of 64, 128 and 256 MB. This technique works for normal batch processing; however, when a streaming chunk of an IoT dataset is considered, it raises issues that have not been addressed before: (1) improper file wrapping, (2) random access latency, (3) a slower Namenode and (4) wastage of block volume. This paper proposes a novel technique, the pseudo-cache-based small files management framework (PSFMF), which bypasses the default HAR with its novel logical file association mechanism and avoids the need for huge memory to build HDFS blocks. The evaluation shows that PSFMF reduces memory consumption, increases MapReduce performance and reduces the task workload over the HDFS cluster.
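To make the Namenode pressure mentioned above concrete, the following back-of-the-envelope sketch compares the metadata footprint of many small files stored individually with the same data packed into one logical container. It is not the PSFMF algorithm. The figure of roughly 150 bytes of Namenode heap per namespace object is a commonly quoted rule of thumb and is treated here as an assumption, as are the function names, the file count and the 100 KB file size.

```python
import math

# Assumption: rule-of-thumb heap cost per Namenode namespace object (file or block).
BYTES_PER_OBJECT = 150


def namenode_bytes_unpacked(num_files: int) -> int:
    """Each small file (smaller than one block) costs one file object plus one block object."""
    return num_files * 2 * BYTES_PER_OBJECT


def namenode_bytes_packed(num_files: int, file_size: int, block_size: int) -> int:
    """All files merged into one hypothetical container: one file object plus its blocks."""
    total_bytes = num_files * file_size
    blocks = math.ceil(total_bytes / block_size)
    return (1 + blocks) * BYTES_PER_OBJECT


MB = 1024 * 1024
num_files = 1_000_000          # hypothetical number of IoT small files
file_size = 100 * 1024         # hypothetical 100 KB per file

unpacked = namenode_bytes_unpacked(num_files)
packed = namenode_bytes_packed(num_files, file_size, 128 * MB)

print(f"unpacked: {unpacked / MB:.1f} MB of Namenode heap")
print(f"packed:   {packed / 1024:.1f} KB of Namenode heap")
```

The estimate ignores the index that any packing scheme needs in order to locate an individual file inside the container, so the real saving is smaller than these numbers suggest.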
Cite this article
Siddiqui, I.F., Qureshi, N.M.F., Chowdhry, B.S. et al. Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster. Wireless Pers Commun 113, 1495–1522 (2020). https://doi.org/10.1007/s11277-020-07312-3
DOI: https://doi.org/10.1007/s11277-020-07312-3