research-article

Sketching Volume Capacities in Deduplicated Storage

Published: 18 December 2019

Abstract

The adoption of deduplication in storage systems has introduced significant new challenges for storage management. Specifically, the physical capacities associated with volumes are no longer readily available. In this work, we introduce a new approach to analyzing capacities in deduplicated storage environments. We provide sketch-based estimations of fundamental capacity measures required for managing a storage system: How much physical space would be reclaimed if a volume or group of volumes were to be removed from a system (the reclaimable capacity) and how much of the physical space should be attributed to each of the volumes in the system (the attributed capacity). Our methods also support capacity queries for volume groups across multiple storage systems, e.g., how much capacity would a volume group consume after being migrated to another storage system? We provide analytical accuracy guarantees for our estimations as well as empirical evaluations. Our technology is integrated into a prominent all-flash storage array and exhibits high performance even for very large systems. We also demonstrate how this method opens the door for performing placement decisions at the data-center level and obtaining insights on deduplication in the field.
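
To make the two capacity measures concrete, the following is a minimal Python sketch on toy data. It is not the authors' implementation: the abstract does not spell out the sketch construction, so the fixed chunk size, the fingerprint-sampling rule (`in_sketch`), the even split of shared chunks in `attributed`, and all function names here are assumptions made for illustration only. With `sample_bits=0` the functions compute exact values; with a larger value they keep roughly a `1 / 2**sample_bits` fraction of the fingerprints and scale the result back up.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 8192  # bytes per chunk; hypothetical fixed size for this toy model


def fingerprint(data: bytes) -> int:
    """64-bit content fingerprint taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def in_sketch(fp: int, sample_bits: int) -> bool:
    """Assumed sampling rule: keep a fingerprint only if its low bits are zero."""
    return fp & ((1 << sample_bits) - 1) == 0


def owners_by_chunk(volumes, sample_bits=0):
    """Map each sampled fingerprint to the set of volumes that reference it."""
    owners = defaultdict(set)
    for name, fps in volumes.items():
        for fp in fps:
            if in_sketch(fp, sample_bits):
                owners[fp].add(name)
    return owners


def reclaimable(volumes, group, sample_bits=0):
    """Estimated bytes freed if every volume in `group` were deleted."""
    group = set(group)
    owners = owners_by_chunk(volumes, sample_bits)
    # A chunk is freed only if no volume outside the group references it.
    exclusive = sum(1 for vols in owners.values() if vols <= group)
    return exclusive * CHUNK_SIZE * (1 << sample_bits)


def attributed(volumes, sample_bits=0):
    """Estimated bytes charged to each volume, splitting shared chunks evenly."""
    owners = owners_by_chunk(volumes, sample_bits)
    share = defaultdict(float)
    for vols in owners.values():
        for v in vols:
            share[v] += CHUNK_SIZE / len(vols)
    scale = 1 << sample_bits
    return {v: share[v] * scale for v in volumes}


if __name__ == "__main__":
    shared = [fingerprint(b"shared-%d" % i) for i in range(50)]
    volumes = {
        "vol_a": [fingerprint(b"a-%d" % i) for i in range(100)] + shared,
        "vol_b": [fingerprint(b"b-%d" % i) for i in range(100)] + shared,
    }
    print(reclaimable(volumes, ["vol_a"]))  # exact: 100 chunks * 8192 B = 819200
    print(attributed(volumes)["vol_a"])     # exact: 819200 + 50 * 4096 = 1024000.0
    print(reclaimable(volumes, ["vol_a"], sample_bits=3))  # estimate from about 1/8 of the fingerprints
```

In this toy model the attributed capacities always sum to the total physical capacity, while the reclaimable capacities of individual volumes generally do not, because chunks shared between volumes are freed only when all of their owners are removed together.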



            • Published in

              ACM Transactions on Storage, Volume 15, Issue 4
              Usenix Fast 2019 Special Section and Regular Papers
              November 2019
              228 pages
              ISSN: 1553-3077
              EISSN: 1553-3093
              DOI: 10.1145/3373756
              Editor: Sam H. Noh

              Copyright © 2019 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 18 December 2019
              • Revised: 1 October 2019
              • Accepted: 1 October 2019
              • Received: 1 June 2019


              Qualifiers

              • research-article
              • Research
              • Refereed
