Sketching Volume Capacities in Deduplicated Storage

Authors:
Danny Harnik, IBM Research, Givatayim, Israel
Moshik Hershcovitch, IBM Research, Givatayim, Israel
Yosef Shatsky, IBM Systems, Givatayim, Israel
Amir Epstein, Citi Innovation Lab TLV, Israel
Ronen Kat, IBM Research, Givatayim, Israel

ACM Transactions on Storage, Volume 15, Issue 4, Article No. 24, pp. 1–23. https://doi.org/10.1145/3369737

Published: 18 December 2019

Abstract

The adoption of deduplication in storage systems has introduced significant new challenges for storage management. Specifically, the physical capacities associated with volumes are no longer readily available. In this work, we introduce a new approach to analyzing capacities in deduplicated storage environments. We provide sketch-based estimations of fundamental capacity measures required for managing a storage system: How much physical space would be reclaimed if a volume or group of volumes were to be removed from a system (the reclaimable capacity) and how much of the physical space should be attributed to each of the volumes in the system (the attributed capacity). Our methods also support capacity queries for volume groups across multiple storage systems, e.g., how much capacity would a volume group consume after being migrated to another storage system? We provide analytical accuracy guarantees for our estimations as well as empirical evaluations. Our technology is integrated into a prominent all-flash storage array and exhibits high performance even for very large systems. We also demonstrate how this method opens the door for performing placement decisions at the data-center level and obtaining insights on deduplication in the field.
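
The two capacity measures named in the abstract can be made concrete with a toy example. What follows is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a sketch is a content-based sample of chunk fingerprints, that a chunk is reclaimable when every reference to it comes from the volume group being removed, and that attributed capacity splits each chunk's physical size among volumes in proportion to their reference counts. All names (`in_sketch`, `build_sketch`, `SAMPLE_RATE`, and so on) are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict

SAMPLE_RATE = 4  # hypothetical: keep roughly 1 in 4 distinct fingerprints


def in_sketch(fingerprint: bytes, sample_rate: int = SAMPLE_RATE) -> bool:
    """Content-based sampling: the decision depends only on the fingerprint,
    so every volume (and every system) samples the same chunks."""
    digest = hashlib.sha256(fingerprint).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


def build_sketch(volumes, sample_rate: int = SAMPLE_RATE):
    """volumes: {name: [(fingerprint, chunk_size), ...]}.
    Returns {fingerprint: (chunk_size, {volume: reference_count})}
    for the sampled fingerprints only."""
    sketch = {}
    for vol, chunks in volumes.items():
        for fp, size in chunks:
            if not in_sketch(fp, sample_rate):
                continue
            entry = sketch.setdefault(fp, (size, defaultdict(int)))
            entry[1][vol] += 1
    return sketch


def estimate_reclaimable(sketch, group, sample_rate: int = SAMPLE_RATE):
    """Space freed if all volumes in `group` were removed (assumed definition:
    chunks referenced by no volume outside the group), scaled by the rate."""
    group = set(group)
    freed = sum(size for size, refs in sketch.values() if set(refs) <= group)
    return freed * sample_rate


def estimate_attributed(sketch, sample_rate: int = SAMPLE_RATE):
    """Per-volume share of physical space (assumed definition: each chunk's
    size is split in proportion to reference counts), scaled by the rate."""
    shares = defaultdict(float)
    for size, refs in sketch.values():
        total_refs = sum(refs.values())
        for vol, count in refs.items():
            shares[vol] += size * count / total_refs
    return {vol: share * sample_rate for vol, share in shares.items()}


if __name__ == "__main__":
    volumes = {
        "A": [(b"x", 4096), (b"y", 4096), (b"x", 4096)],
        "B": [(b"y", 4096), (b"z", 4096)],
    }
    # sample_rate=1 keeps every fingerprint, so the estimates are exact here.
    exact = build_sketch(volumes, sample_rate=1)
    print(estimate_reclaimable(exact, {"A"}, sample_rate=1))  # 4096
    print(estimate_attributed(exact, sample_rate=1))  # {'A': 6144.0, 'B': 6144.0}
```

With a sampling rate of 1 the numbers are exact: removing volume A frees only chunk `x`, because `y` is still referenced by B. With a larger rate the same functions return scaled estimates whose accuracy depends on how many fingerprints land in the sample; the accuracy guarantees mentioned in the abstract are not reproduced here.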

Index Terms

Sketching Volume Capacities in Deduplicated Storage
1. Information systems
  1. Data management systems
    1. Information integration
      1. Deduplication
  2. Information storage systems
    1. Record storage systems
      1. Relational storage
        1. Compression strategies
    2. Storage management
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Data compression
    2. Streaming, sublinear and near linear time algorithms
      1. Sketching and sampling

Published in
ACM Transactions on Storage, Volume 15, Issue 4
USENIX FAST 2019 Special Section and Regular Papers
November 2019
228 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3373756
Editor: Sam H. Noh, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 December 2019
- Revised: 1 October 2019
- Accepted: 1 October 2019
- Received: 1 June 2019
Author Tags
Deduplication
capacity management
estimation
Qualifiers
- research-article
- Research
- Refereed