Writes Wrought Right, and Other Adventures in File System Optimization

Published: 16 March 2017

Abstract

File systems that employ write-optimized dictionaries (WODs) can perform random writes, metadata updates, and recursive directory traversals orders of magnitude faster than conventional file systems. However, previous WOD-based file systems have obtained these performance gains only by sacrificing performance on other operations, such as file deletion, file or directory renaming, or sequential writes.

Using three techniques, late-binding journaling, zoning, and range deletion, we show that there is no fundamental trade-off in write-optimization. These dramatic improvements can be retained while matching conventional file systems on all other operations.

BetrFS 0.2 delivers order-of-magnitude better performance than conventional file systems on directory scans and small random writes and matches the performance of conventional file systems on rename, delete, and sequential I/O. For example, BetrFS 0.2 performs directory scans 2.2× faster, and small random writes over two orders of magnitude faster, than the fastest conventional file system. But unlike BetrFS 0.1, it renames and deletes files at speeds commensurate with conventional file systems and performs large sequential I/O at nearly disk bandwidth. The performance benefits of these techniques extend to applications as well. BetrFS 0.2 continues to outperform conventional file systems on many applications, such as rsync, git-diff, and tar, but improves git-clone performance by 35% over BetrFS 0.1, yielding performance comparable to other file systems.

Published in

ACM Transactions on Storage, Volume 13, Issue 1
Special Issue on USENIX FAST 2016 and Regular Papers
February 2017
201 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3054178
Editor: Sam H. Noh

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 October 2016
• Accepted: 1 December 2016
• Published: 16 March 2017

Qualifiers

• research-article
• Research
• Refereed
