Abstract
Data deduplication is a common technique for reducing storage consumption in backup storage systems. Despite extensive research on improving deduplication efficiency, compressed data remains poorly compatible with deduplication: two files with substantial duplicate content can no longer be deduplicated once they are compressed. In this paper, we examine the internals of gzip and identify the primary cause of this issue: deflate's default compression-ratio-based heuristic blocking algorithm introduces a boundary-offset problem. We present Dzip, which integrates a content-defined chunking algorithm into gzip so that the redundancy of similar files is preserved after compression. Our dataset-driven evaluation shows that data compressed by Dzip achieves a deduplication ratio of up to 86.2% of that of uncompressed data, with a compression ratio largely unchanged from gzip, while sustaining a throughput of up to 96% of gzip's.
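To make the idea concrete, the sketch below shows generic content-defined chunking (CDC) in Python. This is not Dzip's actual algorithm; it illustrates the principle the abstract relies on, using a FastCDC-style "gear" rolling hash. The `GEAR` table, `MASK`, and the chunk-size constants are illustrative values chosen here, not parameters from the paper. Because a boundary is decided from local content rather than a fixed offset, an edit near the front of a file shifts only nearby boundaries, and the shared regions downstream realign into identical chunks that a deduplicator can match.

```python
# Minimal content-defined chunking (CDC) sketch. NOT Dzip's actual
# algorithm; it demonstrates the general technique: chunk boundaries are
# derived from a rolling hash of recent bytes, so boundaries move with
# content instead of with absolute file offsets.

import random

# Fixed pseudo-random byte -> 32-bit "gear" table; seeded so the chunking
# is deterministic across runs.
random.seed(1234)
GEAR = [random.getrandbits(32) for _ in range(256)]

MASK = (1 << 11) - 1   # cut when the low 11 hash bits are zero: ~2 KiB average
MIN_CHUNK = 256        # suppress degenerate tiny chunks
MAX_CHUNK = 8192       # force a cut if no natural boundary appears

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte offsets of content-defined chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        # Gear hash: the left shift ages out old bytes after ~32 steps,
        # giving an implicit sliding window over recent content only.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if size < MIN_CHUNK:
            continue
        if (h & MASK) == 0 or size >= MAX_CHUNK:
            yield (start, i + 1)
            start, h = i + 1, 0
    if start < len(data):
        yield (start, len(data))  # final partial chunk
```

The contrast with deflate's default behavior is the point: deflate decides block boundaries by a compression-ratio heuristic tied to stream position, so a one-byte insertion shifts every later boundary and destroys downstream duplicate detection, whereas content-derived boundaries resynchronize shortly after the edit.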
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xiao, H., Liu, Y. (2024). DZIP: A Data Deduplication-Compatible Enhanced Version of Gzip. In: Vaidya, J., Gabbouj, M., Li, J. (eds) Artificial Intelligence Security and Privacy. AIS&P 2023. Lecture Notes in Computer Science, vol 14509. Springer, Singapore. https://doi.org/10.1007/978-981-99-9785-5_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9784-8
Online ISBN: 978-981-99-9785-5
eBook Packages: Computer Science (R0)