research-article

Copy-on-Abundant-Write for Nimble File System Clones

Published: 29 January 2021

Abstract

Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem.
This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient or too fine-grained to preserve locality. A write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can instead decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is broken only when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write.
We demonstrate that the algorithmic work needed to batch and amortize the cost of BetrFS clone operations does not erode the performance advantages of baseline BetrFS; BetrFS performance even improves in a few cases. BetrFS cloning is efficient; for example, when using the clone operation for container creation, BetrFS outperforms a simple recursive copy by up to two orders of magnitude and outperforms file systems that have specialized Linux Containers (LXC) backends by 3–4×.
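
The copy-on-abundant-write policy described above can be illustrated with a toy model. The sketch below is not the BetrFS implementation and does not use a Bε-tree; it only shows the idea of buffering a clone's writes logically and physically copying the shared data once enough changes accumulate. The class names, the in-memory block tuple, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
class SharedExtent:
    """An immutable run of blocks that several clones may reference."""

    def __init__(self, blocks):
        self.blocks = tuple(blocks)


class Clone:
    """A logical copy: a shared extent plus a buffer of pending writes."""

    def __init__(self, extent, threshold=0.5):
        self.extent = extent        # shared with other clones until broken
        self.pending = {}           # block index -> buffered new contents
        self.threshold = threshold  # hypothetical "abundant" fraction
        self.copies_made = 0        # number of physical copies performed

    def clone(self):
        # Cheap creation: share the extent, copy only the small buffer.
        child = Clone(self.extent, self.threshold)
        child.pending = dict(self.pending)
        return child

    def write(self, index, data):
        if not 0 <= index < len(self.extent.blocks):
            raise IndexError(index)
        # Apply the update logically; nothing is physically copied yet.
        self.pending[index] = data
        # Break sharing only once the clone has changed "enough".
        if len(self.pending) >= self.threshold * len(self.extent.blocks):
            self._break_sharing()

    def _break_sharing(self):
        # One batched copy restores contiguity for this clone's data.
        blocks = list(self.extent.blocks)
        for index, data in self.pending.items():
            blocks[index] = data
        self.extent = SharedExtent(blocks)
        self.pending.clear()
        self.copies_made += 1

    def read(self, index):
        # Buffered writes shadow the shared extent.
        return self.pending.get(index, self.extent.blocks[index])


base = Clone(SharedExtent(["a", "b", "c", "d"]), threshold=0.5)
copy = base.clone()
copy.write(0, "x")   # one change: still sharing the base extent
copy.write(1, "y")   # half the blocks changed: a private copy is made
```

In this toy model a single small write costs no copy at all, while a heavily modified clone ends up with its own contiguous extent, which is the trade-off between space efficiency and locality that the abstract attributes to the policy.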

Published In

ACM Transactions on Storage  Volume 17, Issue 1
Special Section on USENIX FAST 2020
February 2021
165 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3446939
Editor: Sam H. Noh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 January 2021
Accepted: 01 September 2020
Received: 01 June 2020
Published in TOS Volume 17, Issue 1

Author Tags

  1. Bε-trees
  2. clone
  3. file system
  4. write optimization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF
