DOI:10.1145/2367589.2367600
research-article

Reducing impact of data fragmentation caused by in-line deduplication

Published: 04 June 2012

Abstract

Deduplication inevitably results in data fragmentation, because logically continuous data is scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set, since such duplicates are very common due to repeated full backups containing a lot of unchanged data. For systems with in-line dedup, which detects duplicates during writing and avoids storing them, such fragmentation causes data from the latest backup to be scattered across older backups. As a result, the time of restore from the latest backup can be significantly increased, sometimes more than doubled.
We propose an algorithm called context-based rewriting (CBR for short) that minimizes this drop in restore performance for the latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we can reduce the drop in restore bandwidth from 12–55% to only 4–7%, as shown by experiments driven by a set of backup traces. All of this is achieved with only a small increase in writing time, between 1% and 5%. Since we rewrite only a few duplicates and old copies of rewritten data are removed in the background, the whole process introduces only a small and temporary space overhead.
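
The effect described above can be illustrated with a toy model. The sketch below is not the paper's CBR algorithm: the abstract does not specify the decision rule, so the rule used here (rewrite a duplicate when none of the chunks that follow it on disk also follow it in the stream) is an assumption, as are all names and parameters (`write_backup`, `jumps`, `demo`, `window`, `min_utility`, `max_rewrites`). The disk is modelled as an append-only list of chunk fingerprints, and restore cost is approximated by the number of non-sequential reads.

```python
def write_backup(stream, disk, index, max_rewrites=0, window=2, min_utility=1.0):
    """Append one backup with in-line dedup; return its recipe (disk locations).

    Hypothetical toy model: `disk` is a list of fingerprints, `index` maps
    fingerprint -> latest disk location. A duplicate is rewritten (stored
    again at the end of the disk) when its on-disk neighbours are useless
    for reading this stream, up to `max_rewrites` rewrites per backup.
    """
    recipe = []
    rewrites_left = max_rewrites
    for i, fp in enumerate(stream):
        loc = index.get(fp)
        if loc is not None and rewrites_left > 0:
            # Chunks that follow this one in the stream vs. on disk.
            ahead = stream[i + 1 : i + 1 + window]
            on_disk = disk[loc + 1 : loc + 1 + len(ahead)]
            if ahead:
                useful = sum(1 for d in on_disk if d in ahead)
                if 1 - useful / len(ahead) >= min_utility:
                    loc = None  # treat as new data: rewrite the duplicate
                    rewrites_left -= 1
        if loc is None:
            disk.append(fp)
            loc = len(disk) - 1
            index[fp] = loc  # later backups dedup against the newest copy
        recipe.append(loc)
    return recipe


def jumps(recipe):
    """Count non-sequential reads needed to restore a backup from its recipe."""
    return sum(1 for a, b in zip(recipe, recipe[1:]) if b != a + 1)


def demo(max_rewrites):
    """Write an old backup, then a new one that reuses two old chunks."""
    disk, index = [], {}
    old = [f"c{i}" for i in range(12)]
    new = ["n0", "n1", "c3", "n2", "n3", "n4", "c8", "n5", "n6", "n7", "n8", "n9"]
    write_backup(old, disk, index)
    recipe = write_backup(new, disk, index, max_rewrites=max_rewrites)
    assert [disk[loc] for loc in recipe] == new  # restore yields the stream
    return jumps(recipe), len(disk)


if __name__ == "__main__":
    print("no rewriting  :", demo(0))
    print("with rewriting:", demo(2))
```

In this toy run, restoring the latest backup needs 4 non-sequential reads without rewriting and none with two rewrites, at the cost of two extra stored chunks. Those extra copies correspond to the temporary space overhead mentioned above, since the old copies would be reclaimed in the background. The old copies stay in place, so the older backup remains readable while the fragmentation moves to it.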


Published In

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference
June 2012
183 pages
ISBN:9781450314480
DOI:10.1145/2367589

Sponsors

  • The Technion - Israel Institute of Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CAS
  2. backup
  3. chunking
  4. deduplication
  5. fragmentation

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%

Cited By

  • (2024) From SSDs Back to HDDs: Optimizing VDO to Support Inline Deduplication and Compression for HDDs as Primary Storage Media. ACM Transactions on Storage 20:4 (1-28). DOI:10.1145/3678250. Online publication date: 23-Jul-2024
  • (2024) Applying Delta Compression to Packed Datasets for Efficient Data Reduction. IEEE Transactions on Computers 73:1 (73-85). DOI:10.1109/TC.2023.3318404. Online publication date: 1-Jan-2024
  • (2023) InftyDedup. Proceedings of the 21st USENIX Conference on File and Storage Technologies (33-48). DOI:10.5555/3585938.3585941. Online publication date: 21-Feb-2023
  • (2023) SnapStore. Proceedings of the 24th International Middleware Conference (261-274). DOI:10.1145/3590140.3629120. Online publication date: 27-Nov-2023
  • (2023) InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization. ACM Transactions on Storage 19:1 (1-27). DOI:10.1145/3568426. Online publication date: 11-Jan-2023
  • (2023) ObjDedup: High-Throughput Object Storage Layer for Backup Systems With Block-Level Deduplication. IEEE Transactions on Parallel and Distributed Systems 34:7 (2180-2197). DOI:10.1109/TPDS.2023.3250501. Online publication date: Jul-2023
  • (2023) Comparative Analysis of Image Augmentation and Data Deduplication Techniques. Smart Trends in Computing and Communications (271-281). DOI:10.1007/978-981-99-0769-4_26. Online publication date: 15-Jun-2023
  • (2022) The what, The from, and The to: The Migration Games in Deduplicated Systems. ACM Transactions on Storage 18:4 (1-29). DOI:10.1145/3565025. Online publication date: 15-Nov-2022
  • (2022) From Hyper-dimensional Structures to Linear Structures: Maintaining Deduplicated Data's Locality. ACM Transactions on Storage 18:3 (1-28). DOI:10.1145/3507921. Online publication date: 24-Aug-2022
  • (2022) Enhanced configurable snapshot. Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing (1166-1175). DOI:10.1145/3477314.3507061. Online publication date: 25-Apr-2022
