MBFS: a parallel metadata search method based on Bloomfilters using MapReduce for large-scale file systems

The Journal of Supercomputing

Abstract

Metadata search is an important way to access and manage file systems, and many solutions have been proposed to improve its performance. However, existing solutions either build a separate metadata index inside or outside the file system using dedicated data structures or databases, use semantics and event notification to construct the index, or use sampling to search the namespace directly. These approaches suffer from high I/O overhead for keeping the metadata index consistent with the metadata, large space overhead for storing the index, and low result accuracy. To address these problems, this paper presents MBFS, a fast, accurate and lightweight metadata search method based on multi-dimensional Bloomfilters. We create a multi-dimensional Bloomfilter structure on the basis of the directory entry that can prune sub-trees to narrow the search scope of the namespace. MBFS produces fast and accurate answers to a class of complex searches over a file system using only a small number of disk accesses. Because MBFS resides in the file system, it needs no additional I/O to maintain consistency. MBFS consists of Bloomfilters, which are composed of bits, so it is a lightweight method with marginal space overhead. Moreover, MBFS employs MapReduce to speed up search in environments with multiple metadata servers. Extensive experiments demonstrate the effectiveness of MBFS: the results show that it achieves low search latency and high result accuracy with low space and time overhead.
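The pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no details of the MBFS data layout, so every name below (`BloomFilter`, `Directory`, `build_filters`, `search`, `parallel_search`, the two attributes `ext` and `owner`, the filter size and hash count) is a hypothetical choice made for illustration. The sketch assumes one Bloom filter per attribute per directory, each summarizing the whole sub-tree, and it uses a thread pool as a stand-in for the MapReduce step across metadata servers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from itertools import chain

ATTRIBUTES = ("ext", "owner")  # hypothetical searchable dimensions


class BloomFilter:
    """Plain Bloom filter stored in a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


@dataclass
class Directory:
    name: str
    files: list = field(default_factory=list)    # each file is a dict
    subdirs: list = field(default_factory=list)
    filters: dict = field(default_factory=dict)  # attribute -> BloomFilter


def build_filters(directory):
    """Give each directory one filter per attribute covering its sub-tree."""
    directory.filters = {attr: BloomFilter() for attr in ATTRIBUTES}
    for f in directory.files:
        for attr in ATTRIBUTES:
            directory.filters[attr].add(f[attr])
    for sub in directory.subdirs:
        build_filters(sub)
        for attr in ATTRIBUTES:
            # OR-ing the bit arrays yields the filter of the union.
            directory.filters[attr].bits |= sub.filters[attr].bits


def search(directory, query, prefix=""):
    """Return paths of files matching every attribute=value pair in query."""
    # Prune the whole sub-tree if any dimension says "definitely absent".
    if not all(directory.filters[a].might_contain(v) for a, v in query.items()):
        return []
    path = f"{prefix}/{directory.name}" if prefix else directory.name
    # The exact check on each file removes Bloom filter false positives.
    hits = [f"{path}/{f['name']}" for f in directory.files
            if all(f[a] == v for a, v in query.items())]
    for sub in directory.subdirs:
        hits.extend(search(sub, query, path))
    return hits


def parallel_search(server_roots, query):
    """MapReduce-style stand-in: search each server's tree, then merge."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda root: search(root, query), server_roots)
        return sorted(chain.from_iterable(partial))


# Tiny example namespace.
src = Directory("src", files=[{"name": "a.py", "ext": "py", "owner": "alice"}])
docs = Directory("docs", files=[{"name": "b.txt", "ext": "txt", "owner": "bob"}])
root = Directory("root", subdirs=[src, docs])
build_filters(root)
hits = search(root, {"ext": "py", "owner": "alice"})
```

In this sketch a sub-tree is skipped as soon as one attribute's filter reports a definite miss, which is what keeps the number of visited directories small. Because a Bloom filter can return false positives but never false negatives, pruning never drops a real match, and the per-file comparison keeps the final result exact.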

Acknowledgments

This version has benefited greatly from the many detailed comments and suggestions of the anonymous reviewers, which the authors gratefully acknowledge. The work described in this paper was supported by the National Natural Science Foundation of China under Grant Nos. 61370059 and 61232009, the Beijing Natural Science Foundation under Grant No. 4152030, the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2014ZX-05, the Open Research Fund of The Academy of Satellite Application under Grant No. 2014_CXJJDSJ_04, the Fundamental Research Funds for the Central Universities under Grant Nos. YWF-14-JSJXY-14 and YWF-15-GJSYS-085, and the Open Project Program of National Engineering Research Center for Science & Technology Resources Sharing Service (Beihang University).

Author information

Corresponding author

Correspondence to Zhisheng Huo.

About this article

Cite this article

Huo, Z., Xiao, L., Zhong, Q. et al. MBFS: a parallel metadata search method based on Bloomfilters using MapReduce for large-scale file systems. J Supercomput 72, 3006–3032 (2016). https://doi.org/10.1007/s11227-015-1464-2

  • DOI: https://doi.org/10.1007/s11227-015-1464-2
