Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster

Siddiqui, Isma Farah; Qureshi, Nawab Muhammad Faseeh; Chowdhry, Bhawani Shankar; Uqaili, Muhammad Aslam

doi:10.1007/s11277-020-07312-3

Published: 02 May 2020

Volume 113, pages 1495–1522, (2020)

Wireless Personal Communications

Isma Farah Siddiqui¹,
Nawab Muhammad Faseeh Qureshi²,
Bhawani Shankar Chowdhry³ &
Muhammad Aslam Uqaili⁴

380 Accesses
24 Citations

Abstract

Internet of Things (IoT) devices generate an enormous number of files, which fall into two categories: (1) large files and (2) small files. The Hadoop Distributed File System (HDFS) processes datasets using a default compression technique, Hadoop Archives (HAR), to build data chunks of 64, 128 and 256 MB. This technique works in normal batch processing; however, when a streaming chunk of an IoT dataset is considered, it raises issues that have not been addressed before: (1) improper file wrapping, (2) random access latency, (3) a slower Namenode and (4) wasted block volume. This paper proposes a novel technique, the pseudo-cache-based small files management framework (PSFMF), which bypasses the default HAR with a novel logical file association mechanism and avoids the large memory footprint needed to build HDFS blocks. The evaluation shows that PSFMF reduces memory consumption, increases MapReduce performance and reduces task workload over the HDFS cluster.
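
To make the Namenode and block-volume issues concrete, the following is a minimal Python sketch. It is not the PSFMF mechanism, which this preview does not describe in detail. It assumes a 128 MB block, an assumed cost of roughly 150 bytes of Namenode heap per namespace object (a commonly quoted rule of thumb, not a figure from the paper), and a simple first-fit grouping that stands in for any logical association of small files. All function names are illustrative.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # one of the chunk sizes named in the abstract
BYTES_PER_OBJECT = 150           # assumption: rule-of-thumb heap cost per object


def namenode_objects_unpacked(file_sizes):
    """Count namespace objects when every file is stored on its own:
    one file object plus one object per block the file occupies."""
    return sum(1 + math.ceil(size / BLOCK_SIZE) for size in file_sizes)


def pack_first_fit(file_sizes, capacity=BLOCK_SIZE):
    """Group files into block-sized containers with greedy first-fit.
    Assumes every file is no larger than `capacity`.
    Returns the number of containers used."""
    free_space = []  # remaining bytes in each open container
    for size in file_sizes:
        for i, free in enumerate(free_space):
            if size <= free:
                free_space[i] = free - size
                break
        else:
            # No open container has room, so start a new one.
            free_space.append(capacity - size)
    return len(free_space)


def namenode_objects_packed(file_sizes):
    """Each container is modelled as one file object plus one block object.
    The per-file index a real scheme would need is deliberately ignored."""
    return 2 * pack_first_fit(file_sizes)


def estimated_heap_bytes(object_count):
    """Rough Namenode heap estimate under the assumed per-object cost."""
    return object_count * BYTES_PER_OBJECT
```

Under these assumptions, 1,000 files of 1 MB each need 2,000 namespace objects when stored individually, but only 8 containers (16 objects) when grouped. The sketch ignores the index that any grouping scheme needs in order to locate a small file inside its container, which is where random access latency typically arises.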

Author information

Authors and Affiliations

Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
Isma Farah Siddiqui
Department of Computer Education, Sungkyunkwan University, Seoul, South Korea
Nawab Muhammad Faseeh Qureshi
Faculty of Electrical, Electronics, and Computer Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
Bhawani Shankar Chowdhry
Department of Electrical Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan
Muhammad Aslam Uqaili

Corresponding author

Correspondence to Nawab Muhammad Faseeh Qureshi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Siddiqui, I.F., Qureshi, N.M.F., Chowdhry, B.S. et al. Pseudo-Cache-Based IoT Small Files Management Framework in HDFS Cluster. Wireless Pers Commun 113, 1495–1522 (2020). https://doi.org/10.1007/s11277-020-07312-3

Published: 02 May 2020
Issue Date: August 2020
DOI: https://doi.org/10.1007/s11277-020-07312-3
