Everyone Loves File: Oracle File Storage Service

Published: 05 March 2020

Abstract

Oracle File Storage Service (FSS) is an elastic filesystem provided as a managed NFS service. A pipelined Paxos implementation underpins a scalable block store that provides linearizable multipage limited-size transactions. Above the block store, a scalable B-tree holds filesystem metadata and provides linearizable multikey limited-size transactions. Self-validating B-tree nodes and housekeeping operations performed as separate transactions allow each key in a B-tree transaction to require only one page in the underlying block transaction. The filesystem provides snapshots by using versioned key-value pairs. The system is programmed using a nonblocking lock-free programming style. Presentation servers maintain no persistent local state, making them scalable and easy to fail over. A non-scalable Paxos-replicated hash table holds configuration information required to bootstrap the system. An additional B-tree provides conversational multikey minitransactions for control-plane information. The system throughput can be predicted by comparing an estimate of the network bandwidth needed for replication to the network bandwidth provided by the hardware. Latency on an unloaded system is about 4 times higher than that of a Linux NFS server backed by NVMe, reflecting the cost of replication. FSS has been in production since January 2018 and holds tens of thousands of customer file systems comprising many petabytes of data.
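The abstract states only that snapshots are built from versioned key-value pairs and gives no further detail. As a rough illustration of that general idea, the following is a minimal in-memory sketch, not the FSS implementation: the class name `VersionedStore`, its methods, the integer version counter, and the tombstone convention are all hypothetical choices made for this example.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class VersionedStore:
    """Toy key-value map with snapshots (hypothetical, for illustration only).

    Each key maps to a list of (version, value) pairs in increasing
    version order. A value of None is a tombstone marking a deletion.
    """

    def __init__(self) -> None:
        # Version that new writes are tagged with (assumed integer counter).
        self._current = 0
        self._data: Dict[str, List[Tuple[int, Optional[bytes]]]] = {}

    def put(self, key: str, value: Optional[bytes]) -> None:
        versions = self._data.setdefault(key, [])
        if versions and versions[-1][0] == self._current:
            # Overwrite within the same version: no snapshot can see the old pair.
            versions[-1] = (self._current, value)
        else:
            # First write since the last snapshot: keep the old pair for readers.
            versions.append((self._current, value))

    def delete(self, key: str) -> None:
        self.put(key, None)

    def snapshot(self) -> int:
        """Freeze the current version and return it as a snapshot id."""
        snap = self._current
        self._current += 1
        return snap

    def get(self, key: str, snapshot: Optional[int] = None) -> Optional[bytes]:
        """Read the live value, or the value as of a given snapshot id."""
        at = self._current if snapshot is None else snapshot
        versions = self._data.get(key, [])
        # Pick the newest pair whose version does not exceed `at`.
        idx = bisect.bisect_right([v for v, _ in versions], at) - 1
        return versions[idx][1] if idx >= 0 else None
```

In this sketch, taking a snapshot copies no data: it only advances the version counter, and later writes add new pairs next to the old ones, so a read at a snapshot id keeps seeing the older pair.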

Cited By

  • (2022) An Improved Raft Protocol Combined with Cauchy Reed-Solomon Codes. 2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD), 563-568. DOI: 10.1109/ICAIBD55127.2022.9820425. Online publication date: 27-May-2022
  • (2022) An Optimized Raft Protocol Combined with Redundant Residue Number System. 2022 5th International Conference on Data Science and Information Technology (DSIT), 1-6. DOI: 10.1109/DSIT55514.2022.9943823. Online publication date: 22-Jul-2022

Published In

ACM Transactions on Storage, Volume 16, Issue 1
ATC 2019 Special Section and Regular Papers
February 2020
155 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3386184
Editor: Sam H. Noh
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 March 2020
Accepted: 01 January 2020
Received: 01 October 2019
Published in TOS Volume 16, Issue 1

Author Tags

  1. B-tree-based filesystem
  2. Distributed filesystem
  3. Paxos
  4. cloud filesystem
  5. two-phase commit

Qualifiers

  • Research-article
  • Research
  • Refereed
