Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster

Wireless Personal Communications

Abstract

Internet of Things (IoT) devices generate an enormous number of files, which fall into two categories: (1) large files and (2) small files. The Hadoop Distributed File System (HDFS) processes datasets using a default compression technique, Hadoop Archives (HAR), to build data chunks of 64, 128 and 256 MB. This technique works for normal batch processing. However, when a streaming chunk of an IoT dataset is considered, it raises issues not addressed before: (1) improper file wrapping, (2) random access latency, (3) a slower Namenode and (4) wastage of block volume. This paper proposes a novel technique, the pseudo-cache-based small files management framework (PSFMF), which bypasses the default HAR with its novel logical file association mechanism and avoids using huge amounts of memory to build HDFS blocks. The evaluation shows that PSFMF reduces memory consumption, increases MapReduce performance and reduces the task workload over the HDFS cluster.
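
To make the Namenode and block-volume issues concrete, the following is a minimal back-of-the-envelope sketch, not the PSFMF method itself. It assumes the commonly quoted rule of thumb that each file or block object costs the Namenode roughly 150 bytes of heap. The constant, the function names, and the workload of one million 100 KB files are illustrative assumptions, not figures from this article.

```python
import math

# Assumption (rule of thumb, not from the article): each file or block
# object tracked by the Namenode costs roughly 150 bytes of heap.
BYTES_PER_OBJECT = 150
MB = 1024 * 1024


def namenode_objects(file_count, file_size_bytes, block_size_bytes):
    """Count the file objects plus block objects for a set of equal-size files."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size_bytes))
    return file_count * (1 + blocks_per_file)


def namenode_heap_bytes(file_count, file_size_bytes, block_size_bytes):
    """Estimate Namenode heap usage under the 150-bytes-per-object assumption."""
    objects = namenode_objects(file_count, file_size_bytes, block_size_bytes)
    return objects * BYTES_PER_OBJECT


block_size = 128 * MB

# Hypothetical workload: one million 100 KB files stored individually.
small_file_count = 1_000_000
small_file_size = 100 * 1024
small_heap = namenode_heap_bytes(small_file_count, small_file_size, block_size)

# The same bytes packed into files of one block each.
total_bytes = small_file_count * small_file_size
merged_file_count = math.ceil(total_bytes / block_size)
merged_heap = namenode_heap_bytes(merged_file_count, block_size, block_size)

print(f"individual small files: {small_heap / MB:.1f} MB of Namenode heap")
print(f"merged into {merged_file_count} files: {merged_heap / 1024:.1f} KB of Namenode heap")
```

Under these assumptions, the metadata cost scales with the number of files rather than with the volume of data, which is why any scheme that associates many small files with fewer HDFS blocks relieves the Namenode.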

Author information

Corresponding author

Correspondence to Nawab Muhammad Faseeh Qureshi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Siddiqui, I.F., Qureshi, N.M.F., Chowdhry, B.S. et al. Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster. Wireless Pers Commun 113, 1495–1522 (2020). https://doi.org/10.1007/s11277-020-07312-3

  • DOI: https://doi.org/10.1007/s11277-020-07312-3
