Abstract
The adoption of deduplication in storage systems has introduced significant new challenges for storage management. Specifically, the physical capacity associated with each volume is no longer readily available. In this work, we introduce a new approach to analyzing capacities in deduplicated storage environments. We provide sketch-based estimations of the fundamental capacity measures required for managing a storage system: how much physical space would be reclaimed if a volume or group of volumes were removed from a system (the reclaimable capacity), and how much of the physical space should be attributed to each of the volumes in the system (the attributed capacity). Our methods also support capacity queries for volume groups across multiple storage systems, e.g., how much capacity a volume group would consume after being migrated to another storage system. We provide analytical accuracy guarantees for our estimations as well as empirical evaluations. Our technology is integrated into a prominent all-flash storage array and exhibits high performance even for very large systems. We also demonstrate how this method opens the door to making placement decisions at the data-center level and to obtaining insights on deduplication in the field.
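The two capacity measures named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes fixed-size chunks, a content-based sample that keeps only chunk hashes whose leading bits are zero, and an attribution policy that splits each shared chunk among volumes in proportion to their reference counts. All names (`build_sketch`, `estimate_capacities`, `sample_bits`, `chunk_size`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunk_hash(data: bytes) -> int:
    """64-bit content hash of a chunk (SHA-1 truncated; an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def build_sketch(chunks, sample_bits=4):
    """Per-volume sketch: sampled chunk hash -> reference count in that volume.

    A hash is sampled when its top `sample_bits` bits are zero, so roughly
    1 / 2**sample_bits of all distinct chunks are kept (assumed scheme).
    """
    sketch = defaultdict(int)
    for data in chunks:
        h = chunk_hash(data)
        if h >> (64 - sample_bits) == 0:
            sketch[h] += 1
    return dict(sketch)


def estimate_capacities(sketches, group, sample_bits=4, chunk_size=8192):
    """Estimate (reclaimable_bytes, attributed_bytes) for a group of volumes.

    sketches: {volume_name: sketch} for every volume in the system.
    group:    set of volume names whose capacities are queried.
    """
    total_refs = defaultdict(int)  # references across the whole system
    group_refs = defaultdict(int)  # references from volumes in the group
    for volume, sketch in sketches.items():
        for h, count in sketch.items():
            total_refs[h] += count
            if volume in group:
                group_refs[h] += count

    # A chunk is reclaimable only if every reference to it is inside the group.
    reclaimable = sum(1 for h, c in group_refs.items() if c == total_refs[h])
    # Assumed policy: a shared chunk is split by share of references.
    attributed = sum(c / total_refs[h] for h, c in group_refs.items())

    # Scale sampled chunk counts back up to an estimate in bytes.
    scale = (2 ** sample_bits) * chunk_size
    return reclaimable * scale, attributed * scale
```

With `sample_bits=0` every chunk is sampled and the estimates become exact, which is convenient for checking the logic on small inputs. Under this attribution policy the attributed capacities of all volumes sum to the estimated physical capacity of the system, whereas reclaimable capacities generally do not.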