DOI: 10.1145/3319647.3325834
research-article

SS-CDC: a two-stage parallel content-defined chunking for deduplicating backup storage

Published: 22 May 2019

ABSTRACT

Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts on accelerating it by segmenting a file and running the algorithm independently on each segment in parallel come at a cost of substantial degradation of deduplication ratio.
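To make the "inherently sequential" point concrete, here is a minimal sketch of single-threaded CDC. It assumes a Gear-style rolling hash over the last 32 bytes, a randomly generated `GEAR` table, and illustrative `min_size`, `max_size`, and `bits` parameters; none of these are taken from the paper. Each cut depends on where the previous chunk ended, which is why a file cannot simply be split and chunked independently without shifting boundaries.

```python
import random

# Hypothetical 256-entry table of random 32-bit values (illustrative only).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk_sequential(data, min_size=64, max_size=1024, bits=8):
    """Return chunk end offsets for `data` using sequential CDC.

    The 32-bit state is shifted left once per byte, so it depends only on
    the 32 most recent bytes. A position is a cut candidate when the top
    `bits` bits of the state are zero.
    """
    boundaries = []
    start = 0  # offset where the current chunk begins
    h = 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i + 1 - start
        if size < min_size:
            continue  # too small: ignore candidates inside the minimum
        if (h >> (32 - bits)) == 0 or size >= max_size:
            boundaries.append(i + 1)
            start = i + 1  # the next decision depends on this cut
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries
```

Because cut candidates are determined by content rather than by offset, inserting a few bytes near the start of a file typically disturbs only the nearby chunks, and later boundaries fall on the same content again.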

In this paper, we propose SS-CDC, a two-stage parallel CDC, that enables (almost) full parallelism on chunking of a file without compromising deduplication ratio. Further, SS-CDC exploits instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods which only achieve up to a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio.
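The abstract does not spell out what the two stages are. One plausible decomposition, offered here as an assumption and not as the authors' implementation, is to separate the expensive, position-independent work from the cheap, order-dependent work: stage 1 computes rolling-hash cut candidates for every segment in parallel, and stage 2 walks the merged candidates once to enforce minimum and maximum chunk sizes. The hash, the `GEAR` table, and all parameters below are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # hypothetical table
WINDOW = 32  # the 32-bit shifted state depends on the last 32 bytes
BITS = 8     # candidate when the top BITS bits of the state are zero


def find_candidates(data, lo, hi):
    """Stage 1 worker: candidate cut offsets in (lo, hi]."""
    h = 0
    # Warm up on the bytes just before the segment so the hash at `lo`
    # matches what a scan from the start of the file would produce.
    for b in data[max(0, lo - (WINDOW - 1)):lo]:
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
    out = []
    for i in range(lo, hi):
        h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFF
        if (h >> (32 - BITS)) == 0:
            out.append(i + 1)
    return out


def select_boundaries(candidates, length, min_size, max_size):
    """Stage 2: sequential pass applying the chunk size limits."""
    boundaries, start = [], 0
    for c in candidates:  # candidates arrive in increasing order
        while c - start > max_size:  # forced cuts before this candidate
            start += max_size
            boundaries.append(start)
        if c - start >= min_size:
            boundaries.append(c)
            start = c
    while length - start > max_size:
        start += max_size
        boundaries.append(start)
    if start < length:
        boundaries.append(length)
    return boundaries


def two_stage_chunk(data, segments=8, min_size=64, max_size=1024):
    n = len(data)
    step = max(1, -(-n // segments))  # ceiling division
    ranges = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=segments) as pool:
        parts = list(pool.map(lambda r: find_candidates(data, *r), ranges))
    candidates = [c for part in parts for c in part]
    return select_boundaries(candidates, n, min_size, max_size)
```

In this sketch the boundaries do not depend on the number of segments, which is the property that keeps the deduplication ratio intact. Python threads will not show a real speedup here because of the interpreter lock; the code only illustrates the structure. For scale, the reported 25.6X on 8 cores works out to 3.2X per core, which is what makes the speedup superlinear.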

Published in SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage, May 2019, 211 pages. ISBN: 9781450367493. DOI: 10.1145/3319647

Copyright © 2019 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 94 of 285 submissions (33%)
