
DZIP: A Data Deduplication-Compatible Enhanced Version of Gzip

  • Conference paper
  • In: Artificial Intelligence Security and Privacy (AIS&P 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14509)


Abstract

Data deduplication is a common method for reducing storage space in backup storage systems. Despite extensive research aimed at improving its efficiency, compressed data remains poorly compatible with deduplication: two files with substantial duplicate content can no longer be deduplicated once they are compressed. In this paper, we examine the internals of gzip and identify the primary cause: deflate's default compression-ratio-based heuristic blocking introduces a boundary-offset problem. We introduce Dzip, which incorporates a content-defined chunking algorithm into gzip so that the redundancy of similar files is preserved after compression. Our dataset-driven evaluation shows that data compressed by Dzip achieves a deduplication ratio of up to 86.2% of that of uncompressed data, while its compression ratio remains largely unchanged relative to gzip and its throughput reaches up to 96% of gzip's.
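
The abstract describes replacing deflate's compression-ratio-based blocking with content-defined chunk boundaries, so that shared regions of similar files keep aligning after compression. The sketch below illustrates the general idea of content-defined chunking with a gear-style rolling hash; the gear-table seeding, mask, and size limits (MIN_CHUNK, AVG_MASK, MAX_CHUNK) are illustrative assumptions, not Dzip's actual parameters or implementation.

/*
 * A minimal sketch of content-defined chunking (CDC) in C, using a
 * gear-style rolling hash. Illustrative only: the gear-table seeding,
 * mask, and size limits below are assumptions for demonstration and
 * are NOT Dzip's actual parameters or implementation.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_CHUNK 2048    /* skip boundary checks below this size      */
#define AVG_MASK  0x1FFF  /* 13 mask bits -> roughly 8 KiB average cut */
#define MAX_CHUNK 65536   /* force a cut at this size                  */

static uint64_t gear[256];

/* Fill the gear table with pseudo-random values (fixed seed, deterministic). */
static void init_gear(void)
{
    uint64_t x = 0x9E3779B97F4A7C15ULL;
    for (int i = 0; i < 256; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  /* xorshift64 step */
        gear[i] = x;
    }
}

/* Return the length of the next chunk starting at data[0 .. len). */
static size_t next_chunk(const uint8_t *data, size_t len)
{
    uint64_t h = 0;
    if (len <= MIN_CHUNK)
        return len;
    for (size_t i = 0; i < len && i < MAX_CHUNK; i++) {
        h = (h << 1) + gear[data[i]];              /* gear rolling hash   */
        if (i >= MIN_CHUNK && (h & AVG_MASK) == 0) /* content-defined cut */
            return i + 1;
    }
    return (len < MAX_CHUNK) ? len : MAX_CHUNK;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    /* Read the whole file for simplicity; a real tool would stream. */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    uint8_t *buf = malloc(size ? (size_t)size : 1);
    if (!buf || fread(buf, 1, (size_t)size, f) != (size_t)size) {
        fprintf(stderr, "read error\n");
        return 1;
    }
    fclose(f);

    init_gear();
    size_t off = 0;
    while (off < (size_t)size) {
        size_t n = next_chunk(buf + off, (size_t)size - off);
        printf("chunk at offset %zu, length %zu\n", off, n);
        /* In a Dzip-like scheme, each such chunk would be compressed so that
         * identical chunks in similar files yield identical compressed bytes. */
        off += n;
    }
    free(buf);
    return 0;
}

One plausible way to make this gzip-compatible (an assumption on our part, not necessarily Dzip's exact mechanism) is to end a deflate block at every chunk boundary, for example by calling zlib's deflate() with Z_FULL_FLUSH, which resets the match dictionary; chunks shared between two similar files then compress to identical byte sequences and remain deduplicable downstream.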


Author information

Correspondence to Hengying Xiao or Yangyang Liu.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Xiao, H., Liu, Y. (2024). DZIP: A Data Deduplication-Compatible Enhanced Version of Gzip. In: Vaidya, J., Gabbouj, M., Li, J. (eds) Artificial Intelligence Security and Privacy. AIS&P 2023. Lecture Notes in Computer Science, vol 14509. Springer, Singapore. https://doi.org/10.1007/978-981-99-9785-5_23

  • DOI: https://doi.org/10.1007/978-981-99-9785-5_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-9784-8

  • Online ISBN: 978-981-99-9785-5

  • eBook Packages: Computer Science, Computer Science (R0)
