Abstract
Data deduplication is a common technique for reducing storage consumption in backup storage systems. Despite extensive research on improving deduplication efficiency, compressed data remains poorly compatible with deduplication: two files with substantial duplicate content can no longer be deduplicated once they are compressed. In this paper, we examine the internals of gzip and identify the primary cause of this issue: deflate's default compression-ratio-based heuristic blocking algorithm introduces a boundary-offset problem. We present Dzip, which integrates a content-defined chunking algorithm into gzip so that the redundancy of similar files is preserved after compression. Our dataset-driven evaluation shows that data compressed by Dzip achieves a deduplication ratio of up to 86.2% of that of uncompressed data, with a compression ratio largely unchanged from gzip, while sustaining a throughput of up to 96% of gzip's.
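To make the idea concrete, the sketch below shows generic content-defined chunking (CDC) in Python. This is not Dzip's actual algorithm; it illustrates the principle the abstract relies on, using a FastCDC-style "gear" rolling hash. The `GEAR` table, `MASK`, and the chunk-size constants are illustrative values chosen here, not parameters from the paper. Because a boundary is decided from local content rather than a fixed offset, an edit near the front of a file shifts only nearby boundaries, and the shared regions downstream realign into identical chunks that a deduplicator can match.

```python
# Minimal content-defined chunking (CDC) sketch. NOT Dzip's actual
# algorithm; it demonstrates the general technique: chunk boundaries are
# derived from a rolling hash of recent bytes, so boundaries move with
# content instead of with absolute file offsets.

import random

# Fixed pseudo-random byte -> 32-bit "gear" table; seeded so the chunking
# is deterministic across runs.
random.seed(1234)
GEAR = [random.getrandbits(32) for _ in range(256)]

MASK = (1 << 11) - 1   # cut when the low 11 hash bits are zero: ~2 KiB average
MIN_CHUNK = 256        # suppress degenerate tiny chunks
MAX_CHUNK = 8192       # force a cut if no natural boundary appears

def chunk_boundaries(data: bytes):
    """Yield (start, end) byte offsets of content-defined chunks."""
    start, h = 0, 0
    for i, b in enumerate(data):
        # Gear hash: the left shift ages out old bytes after ~32 steps,
        # giving an implicit sliding window over recent content only.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i - start + 1
        if size < MIN_CHUNK:
            continue
        if (h & MASK) == 0 or size >= MAX_CHUNK:
            yield (start, i + 1)
            start, h = i + 1, 0
    if start < len(data):
        yield (start, len(data))  # final partial chunk
```

The contrast with deflate's default behavior is the point: deflate decides block boundaries by a compression-ratio heuristic tied to stream position, so a one-byte insertion shifts every later boundary and destroys downstream duplicate detection, whereas content-derived boundaries resynchronize shortly after the edit.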
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xiao, H., Liu, Y. (2024). DZIP: A Data Deduplication-Compatible Enhanced Version of Gzip. In: Vaidya, J., Gabbouj, M., Li, J. (eds) Artificial Intelligence Security and Privacy. AIS&P 2023. Lecture Notes in Computer Science, vol 14509. Springer, Singapore. https://doi.org/10.1007/978-981-99-9785-5_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9784-8
Online ISBN: 978-981-99-9785-5
eBook Packages: Computer Science (R0)