GoSeed: Optimal Seeding Plan for Deduplicated Storage

Published: 16 August 2021

Abstract

Deduplication decreases the physical occupancy of files in a storage volume by removing duplicate copies of data chunks, but creates data-sharing dependencies that complicate standard storage management tasks. Specifically, data migration plans must consider the dependencies between files that are remapped to new volumes and files that are not. Thus far, only greedy approaches have been suggested for constructing such plans, and it is unclear how they compare to one another and how much they can be improved.
We set out to bridge this gap for seeding, that is, migration in which the target volume is initially empty. We prove that even this basic instance of data migration is NP-hard in the presence of deduplication. We then present GoSeed, a formulation of seeding as an integer linear programming (ILP) problem, and three acceleration methods for applying it to real-sized storage volumes. Our experimental evaluation shows that, while the greedy approaches perform well on “easy” problem instances, the cost of their solution can be significantly higher than that of GoSeed’s solution on “hard” instances, for which they are sometimes unable to find a solution at all.
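
To make the seeding problem concrete, the following is a minimal sketch on a toy volume. It is not GoSeed's ILP formulation, which this abstract does not spell out. It assumes a simplified model: a chunk is migrated if only remapped files reference it, and replicated if both remapped and remaining files reference it. The goal is to migrate a target amount of data while minimizing the replicated size. The file names, chunk sizes, the `slack` parameter, and both function names are hypothetical. The search is exhaustive, so it is practical only for a handful of files.

```python
from itertools import combinations

# Hypothetical toy volume: file name -> set of chunk IDs it references.
FILES = {
    "a": {1, 2},
    "b": {2, 3},
    "c": {4},
    "d": {4, 5},
}
# Hypothetical chunk sizes in arbitrary units.
CHUNK_SIZE = {1: 4, 2: 2, 3: 5, 4: 3, 5: 1}


def plan_sizes(remapped, files, size):
    """Return (migrated, replicated) sizes when `remapped` files move to the empty target."""
    moved = set().union(*(files[f] for f in remapped))
    kept = set().union(*(chunks for f, chunks in files.items() if f not in remapped))
    # Chunks used only by remapped files leave the source volume.
    migrated = sum(size[c] for c in moved - kept)
    # Chunks shared with remaining files must exist on both volumes.
    replicated = sum(size[c] for c in moved & kept)
    return migrated, replicated


def best_seeding_plan(files, size, target, slack=0):
    """Exhaustively find the plan that migrates target +/- slack with minimal replication.

    Returns (replicated, migrated, remapped_files), or None if no plan is feasible.
    """
    best = None
    names = sorted(files)
    for k in range(len(names) + 1):
        for remapped in combinations(names, k):
            migrated, replicated = plan_sizes(remapped, files, size)
            if abs(migrated - target) > slack:
                continue  # violates the migration-size constraint
            if best is None or replicated < best[0]:
                best = (replicated, migrated, set(remapped))
    return best


if __name__ == "__main__":
    print(best_seeding_plan(FILES, CHUNK_SIZE, target=4))
    print(best_seeding_plan(FILES, CHUNK_SIZE, target=5))
    print(best_seeding_plan(FILES, CHUNK_SIZE, target=100))
```

In this toy volume, a target of 4 can be met at zero cost by remapping files `c` and `d` together, since they share chunks only with each other. A target of 5 forces chunk 2 to be replicated, and a target of 100 has no feasible plan. The number of candidate plans doubles with every file, which is consistent with the hardness result stated above and suggests why an ILP solver, rather than enumeration, is the practical route at realistic scale.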

Published In

ACM Transactions on Storage  Volume 17, Issue 3
August 2021
227 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3477268
Editor: Sam H. Noh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2021
Accepted: 01 March 2021
Revised: 01 December 2020
Received: 01 September 2020
Published in TOS Volume 17, Issue 3

Author Tags

  1. Deduplication
  2. data migration
  3. capacity planning

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Israel Science Foundation
