Enhancing HDFS with a full-text search system for massive small files

Xu, Wentao; Zhao, Xin; Lao, Bin; Nong, Ge

doi:10.1007/s11227-020-03526-1

Published: 04 January 2021

Volume 77, pages 7149–7170, (2021)

The Journal of Supercomputing

Wentao Xu¹, Xin Zhao¹, Bin Lao³ & Ge Nong¹,²

Abstract

HDFS is a popular open-source system for scalable and reliable file management, designed as a general-purpose solution for distributed file storage. While it works well for medium-sized and large files, its performance degrades heavily when it must store a large number of small files. To overcome this drawback, we propose a system that enhances HDFS with SAES, a distributed true full-text search system with 100% recall and precision. By indexing the metadata of each file, e.g., name, size, date and description, files can be located quickly through efficient searches over the metadata. Moreover, many small files are merged into one large file, which is stored with better space and I/O efficiency and avoids the performance penalties of storing each small file individually. An experimental study is conducted to test function and performance on both realistic and artificial data. The experimental results show that the system works well for file operations such as uploading, downloading and deleting. Moreover, the RAM consumption for managing massive small files is dramatically reduced, which is critical for good system performance. The proposed system is a potential storage solution for massive small files.
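The two ideas in the abstract, merging small files into one large file and finding them again through a metadata search, can be illustrated with a minimal sketch. This is not the authors' implementation and not SAES: the class `PackedStore`, the record `Entry`, and all method names are hypothetical, the "large file" is an in-memory buffer rather than an HDFS file, and the search is a plain linear substring scan standing in for a real full-text index. Exact substring matching is used only because it returns every true match and no false ones.

```python
import io
from dataclasses import dataclass


@dataclass
class Entry:
    """Metadata record for one small file packed into the container."""
    name: str
    description: str
    offset: int  # byte position of the file inside the container
    length: int  # file size in bytes


class PackedStore:
    """Hypothetical stand-in for one large merged file plus its metadata index."""

    def __init__(self):
        self._blob = io.BytesIO()  # plays the role of the single large file
        self._entries = {}         # name -> Entry

    def upload(self, name, data, description=""):
        # Append the small file to the end of the container and record where it landed.
        offset = self._blob.seek(0, io.SEEK_END)
        self._blob.write(data)
        self._entries[name] = Entry(name, description, offset, len(data))

    def download(self, name):
        # One seek plus one read retrieves the file from the container.
        entry = self._entries[name]
        self._blob.seek(entry.offset)
        return self._blob.read(entry.length)

    def delete(self, name):
        # Only the index entry is dropped; reclaiming the bytes would need compaction.
        del self._entries[name]

    def search(self, pattern):
        """Exact substring match over name and description (linear scan)."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if pattern in entry.name or pattern in entry.description
        )
```

In this sketch the storage layer sees a single object no matter how many small files are uploaded, which is the intuition behind the reduced per-file bookkeeping the abstract reports. A call such as `store.search("cat")` returns the names of all files whose name or description contains that string, and `store.download(name)` then reads the bytes back by offset and length.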

Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant No. 61872391), the Guangzhou Science and Technology Program (Grant No. 201802010011), and the Foundation for Young Talents in Higher Education of Guangdong, China (Grant No. 2019KQNCX031).

Author information

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
Wentao Xu, Xin Zhao & Ge Nong
Guangdong Province Key Laboratory of Information Security Technology, Guangzhou, China
Ge Nong
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
Bin Lao

Corresponding author

Correspondence to Ge Nong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of X. Zhao was done during his Master's program at Sun Yat-sen University.

About this article

Cite this article

Xu, W., Zhao, X., Lao, B. et al. Enhancing HDFS with a full-text search system for massive small files. J Supercomput 77, 7149–7170 (2021). https://doi.org/10.1007/s11227-020-03526-1

Accepted: 17 November 2020
Published: 04 January 2021
Issue Date: July 2021
DOI: https://doi.org/10.1007/s11227-020-03526-1
