Enhancing HDFS with a full-text search system for massive small files

The Journal of Supercomputing

Abstract

HDFS is a popular open-source system for scalable and reliable file management, designed as a general-purpose solution for distributed file storage. While it works well for medium and large files, its performance degrades heavily when it must store large numbers of small files. To overcome this drawback, we propose a system that enhances HDFS with SAES, a distributed true full-text search system with 100% recall and precision. By indexing the metadata of each file (e.g., name, size, date and description), files can be located quickly through efficient searches over that metadata. Moreover, many small files are merged into one large file, which is stored with better space and I/O efficiency, so the performance penalties of storing each small file individually are avoided. An experimental study tests function and performance on both realistic and artificial data. The results show that the system works well for file operations such as uploading, downloading and deleting. Moreover, the RAM consumed in managing massive small files is dramatically reduced, which is critical for good system performance. The proposed system is a potential storage solution for massive small files.
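The two ideas in the abstract, merging small files into one large file and finding them through indexed metadata, can be illustrated with a minimal sketch. This is not the authors' implementation. The class and method names (`MergedStore`, `FileMeta`, `upload`, `download`, `delete`, `search`) are hypothetical, an in-memory buffer stands in for a large file on HDFS, and a plain substring scan stands in for the SAES full-text index.

```python
import io
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Metadata kept per small file (fields follow the abstract's examples)."""
    name: str
    size: int
    date: str
    description: str
    offset: int  # byte position of the file inside the merged container


class MergedStore:
    """Toy stand-in: one merged container plus a searchable metadata index."""

    def __init__(self):
        # Assumption: a BytesIO buffer plays the role of one large HDFS file.
        self.container = io.BytesIO()
        self.index = {}

    def upload(self, name, data, date, description=""):
        # Append the small file to the end of the container and record where it is.
        offset = self.container.seek(0, io.SEEK_END)
        self.container.write(data)
        self.index[name] = FileMeta(name, len(data), date, description, offset)

    def download(self, name):
        # Read back exactly the byte range recorded in the metadata.
        meta = self.index[name]
        self.container.seek(meta.offset)
        return self.container.read(meta.size)

    def delete(self, name):
        # Only the metadata entry is dropped here; reclaiming the bytes
        # would need a separate compaction step, which this sketch omits.
        del self.index[name]

    def search(self, text):
        # Assumption: a linear substring scan replaces the real full-text index.
        needle = text.lower()
        return sorted(
            m.name
            for m in self.index.values()
            if needle in m.name.lower() or needle in m.description.lower()
        )


store = MergedStore()
store.upload("cat_001.jpg", b"\x01\x02\x03", "2020-01-01", "tabby cat photo")
store.upload("dog_001.jpg", b"\x04\x05", "2020-01-02", "dog in a park")
matches = store.search("cat")
payload = store.download("dog_001.jpg")
```

In this sketch the storage layer sees a single container regardless of how many small files are uploaded, which is the property the abstract credits for the reduced RAM consumption. The per-file bookkeeping moves into the metadata index instead.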

Acknowledgements

This work was funded by the National Natural Science Foundation of China (Grant No. 61872391), the Guangzhou Science and Technology Program (Grant No. 201802010011), and the Foundation for Young Talents in Higher Education of Guangdong, China (Grant No. 2019KQNCX031).

Author information

Corresponding author

Correspondence to Ge Nong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The work of X. Zhao was done during his Master's program at Sun Yat-sen University.

About this article

Cite this article

Xu, W., Zhao, X., Lao, B. et al. Enhancing HDFS with a full-text search system for massive small files. J Supercomput 77, 7149–7170 (2021). https://doi.org/10.1007/s11227-020-03526-1

  • DOI: https://doi.org/10.1007/s11227-020-03526-1
