DOI:10.1145/2367589.2367606
research-article

Insights for data reduction in primary storage: a practical analysis

Published: 04 June 2012

Abstract

There has been increasing interest in deploying data reduction techniques in primary storage systems. This paper analyzes large datasets in four typical enterprise data environments to find patterns that can suggest good design choices for such systems. The overall data reduction opportunity is evaluated for deduplication and compression, separately and combined, then in-depth analysis is presented focusing on frequency, clustering and other patterns in the collected data. The results suggest ways to enhance performance and reduce resource requirements and system cost while maintaining data reduction effectiveness. These techniques include deciding which files to compress based on file type and size, using duplication affinity to guide deployment decisions, and optimizing the detection and mapping of duplicate content adaptively when large segments account for most of the opportunity.
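
To make the "separately and combined" comparison concrete, the following is a minimal sketch of how the reduction opportunity of a byte stream might be estimated. It is not the authors' tooling: the function name `reduction_estimates`, the fixed 4 KiB chunk size, SHA-256 fingerprints, and zlib as the compressor are all assumptions chosen for illustration.

```python
import hashlib
import zlib


def reduction_estimates(data: bytes, chunk_size: int = 4096) -> dict:
    """Estimate stored bytes under dedup only, compression only, and both.

    Assumptions (not from the paper): fixed-size chunking, SHA-256
    fingerprints, and per-chunk zlib compression at the default level.
    """
    # Split the input into fixed-size chunks (the last one may be shorter).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Keep one copy of each distinct chunk, keyed by its fingerprint.
    unique = {}
    for chunk in chunks:
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)

    return {
        "raw": len(data),
        # Deduplication only: store each distinct chunk once, uncompressed.
        "dedup": sum(len(c) for c in unique.values()),
        # Compression only: compress every chunk, duplicates included.
        "compress": sum(len(zlib.compress(c)) for c in chunks),
        # Combined: compress only the distinct chunks.
        "combined": sum(len(zlib.compress(c)) for c in unique.values()),
    }


if __name__ == "__main__":
    # Three identical chunks followed by one different, compressible chunk.
    sample = b"A" * 4096 * 3 + bytes(range(256)) * 16
    for name, size in reduction_estimates(sample).items():
        print(f"{name:>8}: {size} bytes")
```

Dividing `raw` by each of the other figures gives a rough reduction ratio per technique. A real study would also have to account for metadata overhead and the choice of chunking scheme, which this sketch ignores.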



Published In

SYSTOR '12: Proceedings of the 5th Annual International Systems and Storage Conference
June 2012
183 pages
ISBN:9781450314480
DOI:10.1145/2367589
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

  • The Technion - Israel Institute of Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compression
  2. deduplication
  3. locality
  4. partition
  5. primary storage

Qualifiers

  • Research-article

Conference

SYSTOR '12
Sponsor:
  • The Technion - Israel Institute of Technology

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 5
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 17 Feb 2025

Citations

Cited By

  • (2024) From SSDs Back to HDDs: Optimizing VDO to Support Inline Deduplication and Compression for HDDs as Primary Storage Media. ACM Transactions on Storage 20:4 (1-28). DOI:10.1145/3678250. Online publication date: 23-Jul-2024
  • (2024) Efficient Data Placement in Deduplication Enabled ZenFS via CRC-Based Prediction. IEEE Access 12 (197233-197246). DOI:10.1109/ACCESS.2024.3520184. Online publication date: 2024
  • (2022) Context-aware Resemblance Detection based Deduplication Ratio Prediction for Cloud Storage. 2022 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT) (21-29). DOI:10.1109/BDCAT56447.2022.00011. Online publication date: Dec-2022
  • (2021) Fast Variable-Grained Resemblance Data Deduplication For Cloud Storage. 2021 IEEE International Conference on Networking, Architecture and Storage (NAS) (1-8). DOI:10.1109/NAS51552.2021.9605398. Online publication date: Oct-2021
  • (2020) DupHunter. Proceedings of the 2020 USENIX Conference on Usenix Annual Technical Conference (769-783). DOI:10.5555/3489146.3489199. Online publication date: 15-Jul-2020
  • (2020) A Content Fingerprint-Based Cluster-Wide Inline Deduplication for Shared-Nothing Storage Systems. IEEE Access 8 (209163-209180). DOI:10.1109/ACCESS.2020.3039056. Online publication date: 2020
  • (2019) Data domain cloud tier. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference (647-660). DOI:10.5555/3358807.3358862. Online publication date: 10-Jul-2019
  • (2018) Cluster and Single-Node Analysis of Long-Term Deduplication Patterns. ACM Transactions on Storage 14:2 (1-27). DOI:10.1145/3183890. Online publication date: 11-May-2018
  • (2018) A Simulation Analysis of Redundancy and Reliability in Primary Storage Deduplication. IEEE Transactions on Computers 67:9 (1259-1272). DOI:10.1109/TC.2018.2808496. Online publication date: 1-Sep-2018
  • (2018) ThinDedup: An I/O Deduplication Scheme that Minimizes Efficiency Loss due to Metadata Writes. 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC) (1-8). DOI:10.1109/PCCC.2018.8710792. Online publication date: Nov-2018
