Scalable high performance de-duplication backup via hash join

Journal of Zhejiang University SCIENCE C

Abstract

Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and high availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead incurred by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches rely mainly on duplicate locality to avoid the disk bottleneck, and thus suffer performance degradation under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve the capacity, throughput, and scalability of de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, thereby achieving high write throughput that is insensitive to workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
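To make the hash join idea concrete, the sketch below shows how batching fingerprint lookups can replace random index probes with sequential partition scans, in the spirit of a GRACE-style partitioned hash join. This is only an illustrative assumption of how such a scheme might look, not Chunkfarm's actual code; all names (NUM_PARTITIONS, hash_join_dedup, index_partitions) are hypothetical.

```python
import hashlib
from collections import defaultdict

# Partition count chosen so each index partition fits in memory;
# purely illustrative here.
NUM_PARTITIONS = 16

def partition_of(fingerprint: bytes) -> int:
    # Route a fingerprint to a partition by its leading byte.
    return fingerprint[0] % NUM_PARTITIONS

def hash_join_dedup(new_fps, index_partitions):
    """Split a batch of incoming chunk fingerprints into duplicates
    and new chunks by joining the batch against the fingerprint index.

    new_fps: fingerprints of chunks in the current backup batch.
    index_partitions: dict mapping partition id -> list of stored
        fingerprints; each partition would be read from disk with one
        large sequential I/O rather than one random I/O per lookup.
    """
    # Build phase: bucket the incoming batch by partition.
    buckets = defaultdict(list)
    for fp in new_fps:
        buckets[partition_of(fp)].append(fp)

    duplicates, uniques = [], []
    for pid, batch in buckets.items():
        # Probe phase: load one index partition, build an in-memory
        # hash table, and probe it with every fingerprint in the batch.
        table = set(index_partitions.get(pid, []))
        for fp in batch:
            (duplicates if fp in table else uniques).append(fp)
    return duplicates, uniques

if __name__ == "__main__":
    chunks = [b"chunk-a", b"chunk-b", b"chunk-a"]
    fps = [hashlib.sha1(c).digest() for c in chunks]
    # Pretend chunk-a was stored by an earlier backup.
    stored = {partition_of(fps[0]): [fps[0]]}
    dups, news = hash_join_dedup(fps, stored)
    print(len(dups), "duplicates,", len(news), "new")  # 2 duplicates, 1 new
```

Because the partitions are disjoint, each one can be looked up and updated independently, which suggests how the same join could be spread across a cluster of servers for parallel de-duplication.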

Author information

Correspondence to Dan Feng.

Additional information

Project supported by the National Basic Research Program (973) of China (No. 2004CB318201), the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402), and the National Natural Science Foundation of China (Nos. 60703046 and 60873028)

Cite this article

Yang, T.M., Feng, D., Niu, Z.Y., et al. Scalable high performance de-duplication backup via hash join. J. Zhejiang Univ. Sci. C 11, 315–327 (2010). https://doi.org/10.1631/jzus.C0910445

  • DOI: https://doi.org/10.1631/jzus.C0910445
