research-article

Reducing impact of data fragmentation caused by in-line deduplication

Authors:

Michal Kaczmarczyk,

Marcin Barczynski,

Wojciech Kilian,

Cezary Dubnicki

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference

Article No.: 15, Pages 1 - 12

https://doi.org/10.1145/2367589.2367600

Published: 04 June 2012

Abstract

Deduplication inevitably results in data fragmentation, because logically continuous data is scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set, since such duplicates are very common: repeated full backups contain a lot of unchanged data. In systems with in-line deduplication, which detects duplicates during writing and avoids storing them, such fragmentation causes data from the latest backup to be scattered across older backups. As a result, the restore time for the latest backup can increase significantly, sometimes more than doubling.
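The effect described above can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes a log-structured disk (a Python list), one fingerprint per block, and uses the number of contiguous disk runs as a rough proxy for seeks during restore. The names `write_backup` and `count_runs` are hypothetical.

```python
def write_backup(stream, index, disk):
    """In-line dedup: append only blocks not seen before; return each block's disk location."""
    locations = []
    for block in stream:
        if block not in index:
            index[block] = len(disk)   # new data goes to the end of the log
            disk.append(block)
        locations.append(index[block])
    return locations


def count_runs(locations):
    """Count contiguous disk runs needed to read the locations in stream order."""
    if not locations:
        return 0
    runs = 1
    for prev, cur in zip(locations, locations[1:]):
        if cur != prev + 1:
            runs += 1
    return runs


index, disk = {}, []
backup1 = [f"a{i}" for i in range(8)]
backup2 = backup1[:3] + ["b0"] + backup1[4:]   # one block changed since backup1
backup3 = backup2[:6] + ["c0"] + backup2[7:]   # one more block changed since backup2

runs = [count_runs(write_backup(b, index, disk)) for b in (backup1, backup2, backup3)]
print(runs)  # [1, 3, 5]
```

In this toy setting the first backup is read in a single run, while each later backup, although almost unchanged, needs more runs because its few new blocks live far from the old ones.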

We propose an algorithm called context-based rewriting (CBR for short) that minimizes this drop in restore performance for the latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we can reduce the drop in restore bandwidth from 12–55% to only 4–7%, as shown by experiments driven by a set of backup traces. All of this is achieved with only a small increase in writing time, between 1% and 5%. Since we rewrite only a few duplicates, and old copies of rewritten data are removed in the background, the whole process introduces only a small and temporary space overhead.
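The abstract does not spell out the rewrite decision rule, so the following is only a minimal sketch under stated assumptions, not the authors' CBR algorithm. It assumes a hypothetical rule: a duplicate is rewritten when fewer than `min_overlap` of the blocks that follow its old copy on disk also appear among the next `window` blocks of the stream, and only while a rewrite budget (a fraction of the stream length) lasts. The function name and all parameters are illustrative.

```python
def write_backup_with_rewrites(stream, index, disk, window=4, min_overlap=0.5, budget=0.05):
    """Toy rewriting writer (assumed rule, not the paper's exact algorithm)."""
    locations = []
    rewrites_left = int(budget * len(stream))   # cap on rewritten duplicates
    for i, block in enumerate(stream):
        if block in index:
            old = index[block]
            stream_ctx = set(stream[i + 1 : i + 1 + window])   # what the stream wants next
            disk_ctx = disk[old + 1 : old + 1 + window]        # what sits next to the old copy
            if disk_ctx:
                overlap = sum(b in stream_ctx for b in disk_ctx) / len(disk_ctx)
            else:
                overlap = 1.0   # nothing follows on disk, so keeping the old copy is harmless
            if overlap >= min_overlap or rewrites_left == 0:
                locations.append(old)   # keep the reference to the old copy
                continue
            rewrites_left -= 1          # fall through: store a fresh copy
        index[block] = len(disk)        # newest copy wins; stale copies stay until cleanup
        disk.append(block)
        locations.append(index[block])
    return locations


index, disk = {}, []
old_backup = [f"a{i}" for i in range(40)]
write_backup_with_rewrites(old_backup, index, disk)

# Latest backup: mostly new data with one isolated duplicate in the middle.
latest = [f"n{i}" for i in range(20)] + ["a5"] + [f"n{i}" for i in range(20, 40)]
locations = write_backup_with_rewrites(latest, index, disk)
print(locations == list(range(40, 81)))  # True
```

Here the isolated duplicate `a5` is stored again next to its new neighbours, so the latest backup is laid out contiguously, at the cost of one extra copy that a background process would later have to reclaim.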

Index Terms

Reducing impact of data fragmentation caused by in-line deduplication
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
  2. Information storage systems
    1. Storage replication
      1. Storage recovery strategies
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Backup procedures

Published In

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference

June 2012

183 pages

ISBN:9781450314480

DOI:10.1145/2367589

General Chair: Michael Vinov (IBM Haifa)
Program Chairs: Dan Tsafrir (Technion), Erez Zadok (Stony Brook University)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

The Technion - Israel Institute of Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SYSTOR '12

Sponsor: The Technion - Israel Institute of Technology

SYSTOR '12: The 5th Annual International Systems and Storage Conference

June 4 - 6, 2012

Haifa, Israel

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%
