Abstract
Metadata search is an important way to access and manage file systems, and many solutions have been proposed to improve its performance. Existing solutions take one of two approaches. Some build a separate metadata index inside or outside the file system, using dedicated data structures or a database and relying on semantics or event notification to construct the index. Others use sampling-based methods to search metadata directly on the namespace. These approaches suffer from high I/O overhead for keeping the metadata index consistent with the metadata, large space overhead for storing the index, and low accuracy of results. To address these problems, this paper presents MBFS, a fast, accurate, and lightweight metadata search method based on multi-dimensional Bloom filters. We build a multi-dimensional Bloom filter structure on top of the directory entry, which allows sub-trees to be pruned and so narrows the search scope within the namespace. MBFS produces fast and accurate answers for a class of complex searches over a file system using a small number of disk accesses. Because MBFS resides in the file system, it needs no additional I/O to maintain consistency. Because it consists of Bloom filters, which are arrays of bits, it is lightweight and adds only marginal space overhead. Moreover, MBFS employs MapReduce to speed up search in environments with multiple metadata servers. Extensive experiments are conducted to evaluate the effectiveness of MBFS. The experimental results show that MBFS achieves low search latency and high result accuracy with low space and time overhead.
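The pruning idea in the abstract can be illustrated with a minimal Python sketch. It is not the MBFS implementation: the class and function names (`BloomFilter`, `Directory`, `search`), the two attributes `ext` and `owner`, the filter size, and the SHA-256-based hashing are all assumptions made for illustration. The sketch keeps one Bloom filter per attribute in each directory, treats a negative filter answer as permission to skip the whole sub-tree, and covers only equality queries. The MapReduce stage for multiple metadata servers is not shown.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Directory:
    """Hypothetical directory node with one filter per metadata attribute."""

    def __init__(self, name, parent=None, attributes=("ext", "owner")):
        self.name = name
        self.parent = parent
        self.files = []      # list of (file name, metadata dict)
        self.subdirs = []
        self.filters = {attr: BloomFilter() for attr in attributes}
        if parent is not None:
            parent.subdirs.append(self)

    def add_file(self, name, **metadata):
        self.files.append((name, metadata))
        # Record each value in this directory and in every ancestor,
        # so each filter summarizes its whole sub-tree.
        node = self
        while node is not None:
            for attr, value in metadata.items():
                node.filters[attr].add(value)
            node = node.parent


def search(directory, query):
    """Return names of files whose metadata equals every query value."""
    # Prune: if any filter rules out a queried value, skip the sub-tree.
    for attr, value in query.items():
        if not directory.filters[attr].might_contain(value):
            return []
    # Verify candidates exactly, so false positives never reach the result.
    hits = [name for name, meta in directory.files
            if all(meta.get(a) == v for a, v in query.items())]
    for sub in directory.subdirs:
        hits.extend(search(sub, query))
    return hits


root = Directory("/")
docs = Directory("docs", parent=root)
logs = Directory("logs", parent=root)
docs.add_file("paper.pdf", ext="pdf", owner="alice")
logs.add_file("sys.log", ext="log", owner="bob")

results = search(root, {"ext": "pdf", "owner": "alice"})
```

Because a Bloom filter never reports a stored value as absent, pruning in this sketch cannot drop a matching file. A false positive only costs an extra directory visit, which the exact comparison on the file metadata then filters out.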
Acknowledgments
This version has benefited greatly from the detailed comments and suggestions of the anonymous reviewers, which the authors gratefully acknowledge. The work described in this paper was supported by the National Natural Science Foundation of China under Grant Nos. 61370059 and 61232009, the Beijing Natural Science Foundation under Grant No. 4152030, the fund of the State Key Laboratory of Software Development Environment under Grant No. SKLSDE-2014ZX-05, the Open Research Fund of The Academy of Satellite Application under Grant No. 2014_CXJJDSJ_04, the Fundamental Research Funds for the Central Universities under Grant Nos. YWF-14-JSJXY-14 and YWF-15-GJSYS-085, and the Open Project Program of the National Engineering Research Center for Science & Technology Resources Sharing Service (Beihang University).
Cite this article
Huo, Z., Xiao, L., Zhong, Q. et al. MBFS: a parallel metadata search method based on Bloomfilters using MapReduce for large-scale file systems. J Supercomput 72, 3006–3032 (2016). https://doi.org/10.1007/s11227-015-1464-2
DOI: https://doi.org/10.1007/s11227-015-1464-2