Abstract
This article presents Fast Hash-based File Existence Checking (FHFEC), a new method for archiving systems. During archiving, many submitted files are unchanged and do not need to be re-archived. Instead of comparing entire files, the system compares only their digests, for which strong cryptographic hash functions with a low probability of collision can be used. We propose a fast algorithm for checking whether a given hash, and therefore the corresponding file, is already stored in the system. The algorithm divides the whole domain of hashes into equally sized regions and maintains a pointer array with exactly one pointer per region. Each pointer gives the location of the first stored hash from its region, or is null if no hash from that region is stored. The entire structure can be kept in random access memory or, alternatively, on a dedicated hard disk. A statistical performance analysis shows that in certain cases FHFEC performs nearly optimally, and extensive simulations confirm these analytical results. We compare the performance of FHFEC with that of binary search (BIS) and the B+tree, which are commonly used for table indices in file systems and databases. The results show that FHFEC significantly outperforms both.
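The region-and-pointer lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `RegionIndex`, the helper names, the 32-bit truncated SHA-256 digest, the choice of 256 regions, the sorted in-memory list, and the static build step are all assumptions made to keep the example small. The abstract does not specify how hashes are laid out within a region or how insertions are handled.

```python
import hashlib

HASH_BITS = 32    # assumption: digests truncated to 32 bits for a small demo
REGION_BITS = 8   # assumption: 2**8 = 256 equally sized regions


def digest(data: bytes) -> int:
    """Return the top HASH_BITS bits of the SHA-256 digest as an integer."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - HASH_BITS)


def region_of(h: int) -> int:
    """Map a hash to its region: the top REGION_BITS bits of the hash."""
    return h >> (HASH_BITS - REGION_BITS)


class RegionIndex:
    """Hypothetical sketch: sorted hashes plus one pointer per region."""

    def __init__(self, hashes):
        # Stored hashes, kept sorted so each region is a contiguous run.
        self.hashes = sorted(set(hashes))
        # One pointer per region; None plays the role of the null pointer.
        self.pointers = [None] * (1 << REGION_BITS)
        for pos, h in enumerate(self.hashes):
            r = region_of(h)
            if self.pointers[r] is None:
                self.pointers[r] = pos  # first stored hash of this region

    def contains(self, h: int) -> bool:
        r = region_of(h)
        pos = self.pointers[r]
        if pos is None:
            return False  # no hash from this region is stored
        # Scan only the run of hashes that belong to region r.
        while pos < len(self.hashes) and region_of(self.hashes[pos]) == r:
            if self.hashes[pos] == h:
                return True
            if self.hashes[pos] > h:
                return False  # passed the place where h would be
            pos += 1
        return False


if __name__ == "__main__":
    archived = [b"report.txt contents", b"photo.jpg contents"]
    index = RegionIndex(digest(d) for d in archived)
    print(index.contains(digest(b"report.txt contents")))  # already archived
    print(index.contains(digest(b"a new file")))           # not yet archived
```

With many regions relative to the number of stored hashes, each region holds few entries, so a lookup touches one pointer and a short run of hashes. This is the intuition behind the comparison with binary search and the B+tree, although the sketch makes no performance claims of its own.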
Index Terms
- Fast file existence checking in archiving systems