Fast file existence checking in archiving systems

Authors:
Saso Tomazic, University of Ljubljana, Ljubljana, Slovenia
Vesna Pavlovic, University of Belgrade, Beograd, Serbia
Jasna Milovanovic, University of Belgrade, Beograd, Serbia
Jaka Sodnik, University of Ljubljana, Ljubljana, Slovenia
Anton Kos, University of Ljubljana, Ljubljana, Slovenia
Sara Stancin, University of Ljubljana, Ljubljana, Slovenia
Veljko Milutinovic, IPSI Belgrade

ACM Transactions on Storage, Volume 7, Issue 1, Article No. 2, pp. 1–21. https://doi.org/10.1145/1970343.1970345

Published: 27 June 2011

Abstract

This article presents a new Fast Hash-based File Existence Checking (FHFEC) method for archiving systems. During the archiving process, many submissions are unchanged files that do not need to be re-archived. In this system, only digests of the files are compared instead of the entire files. Strong cryptographic hash functions with a low probability of collision can be used as digests. We propose a fast algorithm to check whether a given hash, and hence the corresponding file, is already stored in the system. The algorithm divides the whole domain of hashes into equally sized regions and maintains a pointer array with exactly one pointer for each region. Each pointer points to the location of the first stored hash from the corresponding region, or has a null value if no hash from that region exists. The entire structure can be stored in random access memory or, alternatively, on a dedicated hard disk. A statistical performance analysis shows that in certain cases FHFEC performs nearly optimally, and extensive simulations have confirmed these analytical results. The performance of FHFEC has been compared to that of a binary search (BIS) and a B+tree, which are commonly used in file systems and databases for table indices. The results show that FHFEC significantly outperforms both of them.
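
The region-and-pointer idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how hashes are laid out or inserted, so the sketch assumes the stored digests sit in one sorted in-memory list, that the region of a digest is given by its leading bits, and that a lookup scans forward from the region's first entry. The names `RegionIndex`, `region_bits`, `hash_bits`, and `digest` are hypothetical, and SHA-256 is used only as an example of a strong hash function.

```python
import hashlib


def digest(data: bytes) -> int:
    """Return the SHA-256 digest of data as an integer (example hash choice)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class RegionIndex:
    """Illustrative region-pointer index over a sorted list of digests.

    Assumption: the hash domain [0, 2**hash_bits) is split into
    2**region_bits equally sized regions, identified by the leading bits.
    """

    def __init__(self, digests, region_bits=8, hash_bits=256):
        self.shift = hash_bits - region_bits
        self.hashes = sorted(set(digests))
        # One pointer per region: position of the first stored hash in that
        # region, or None (the "null value") if the region is empty.
        self.first = [None] * (1 << region_bits)
        for pos, h in enumerate(self.hashes):
            region = h >> self.shift
            if self.first[region] is None:
                self.first[region] = pos

    def contains(self, h: int) -> bool:
        """Check whether digest h is stored, scanning only its own region."""
        region = h >> self.shift
        pos = self.first[region]
        if pos is None:
            return False  # empty region: the file cannot be archived yet
        while pos < len(self.hashes) and self.hashes[pos] >> self.shift == region:
            if self.hashes[pos] == h:
                return True
            if self.hashes[pos] > h:
                return False  # sorted order: h cannot appear later
            pos += 1
        return False


if __name__ == "__main__":
    index = RegionIndex(digest(name) for name in (b"a.txt", b"b.txt", b"c.txt"))
    print(index.contains(digest(b"a.txt")))
    print(index.contains(digest(b"new.txt")))
```

With uniformly distributed digests and roughly as many regions as stored hashes, each region holds only a few entries on average, which is the intuition behind the short lookups; the sketch builds the index once and does not cover insertion or on-disk storage.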

Index Terms

Information systems → Information retrieval → Document representation

Published in

ACM Transactions on Storage Volume 7, Issue 1
June 2011
73 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/1970343

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 June 2011
- Accepted: 1 July 2010
- Revised: 1 April 2010
- Received: 1 December 2009
Author Tags
File systems management
archiving
files backup/recovery
files sorting/searching
hash-table
performance evaluation
Qualifiers
- research-article
- Research
- Refereed