ABSTRACT
Data deduplication is widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in deduplication systems because it can detect and remove duplicate data in modified files. However, the CDC algorithm is compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantially degraded deduplication ratio.
In this paper, we propose SS-CDC, a two-stage parallel CDC that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, whereas existing parallel CDC methods achieve at most a 7.7X speedup on an 8-core processor while degrading the deduplication ratio by up to 40%, SS-CDC achieves up to a 25.6X speedup with no loss of deduplication ratio.
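To make the two-stage idea concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not spell out what each stage does, so the split used here is an assumption: stage one scans file segments independently and only marks candidate cut points, and stage two walks those candidates sequentially to enforce the minimum and maximum chunk sizes. The gear-style hash, the random table `GEAR`, and the values of `WINDOW`, `MASK`, `MIN_CHUNK`, and `MAX_CHUNK` are likewise illustrative choices. The thread pool only shows the structure; it gives no real speedup in CPython, and SIMD is not modeled.

```python
import random
from concurrent.futures import ThreadPoolExecutor

WINDOW = 32                      # a 32-bit gear hash forgets bytes older than 32 positions
MASK = 0x7F << 25                # candidate when the top 7 bits are zero (about 1 in 128)
MIN_CHUNK, MAX_CHUNK = 64, 512   # MIN_CHUNK >= WINDOW keeps hashes independent of earlier cuts

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # illustrative random table (assumption)


def candidates(data, start, end):
    """Stage 1: offsets p in [start, end) whose window hash matches the mask."""
    h, found = 0, []
    # Warm up on the bytes just before the segment so the hash at `start`
    # covers the same window it would in a sequential scan.
    for p in range(max(0, start - WINDOW + 1), end):
        h = ((h << 1) + GEAR[data[p]]) & 0xFFFFFFFF
        if p >= start and (h & MASK) == 0:
            found.append(p)
    return found


def select_boundaries(cands, n):
    """Stage 2: sequentially pick cut points that honor MIN_CHUNK and MAX_CHUNK."""
    cuts, start, i = [], 0, 0
    while start < n:
        lo = start + MIN_CHUNK - 1            # earliest allowed last byte of the chunk
        hi = min(start + MAX_CHUNK, n) - 1    # latest allowed last byte of the chunk
        while i < len(cands) and cands[i] < lo:
            i += 1                            # candidates inside the minimum size are ignored
        cut = cands[i] if i < len(cands) and cands[i] <= hi else hi
        cuts.append(cut + 1)                  # the chunk is data[start:cut + 1]
        start = cut + 1
    return cuts


def two_stage_chunk(data, segments=4):
    """Run stage 1 per segment in a thread pool, then stage 2 once."""
    n = len(data)
    step = -(-n // segments) or 1             # ceiling division, at least 1
    ranges = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: candidates(data, *r), ranges)
    cands = [p for part in parts for p in part]   # segments are ordered, so this is sorted
    return select_boundaries(cands, n)


def sequential_chunk(data):
    """Conventional CDC for comparison: the hash restarts inside every chunk."""
    n, cuts, start = len(data), [], 0
    while start < n:
        hi = min(start + MAX_CHUNK, n)
        cut, h = hi, 0
        for p in range(start + MIN_CHUNK - WINDOW, hi):
            h = ((h << 1) + GEAR[data[p]]) & 0xFFFFFFFF
            if p >= start + MIN_CHUNK - 1 and (h & MASK) == 0:
                cut = p + 1
                break
        cuts.append(cut)
        start = cut
    return cuts
```

Under these assumptions, `two_stage_chunk(data, segments=8)` returns the same cut offsets as `sequential_chunk(data)` for any input, which mirrors the abstract's claim that parallelism need not cost deduplication ratio. The property relies on the hash at each position depending only on the preceding `WINDOW` bytes, not on where earlier chunks ended.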
Index Terms
- SS-CDC: a two-stage parallel content-defined chunking for deduplicating backup storage