ABSTRACT
Storage systems need to support high performance for special-purpose data processing applications that run on an evolving landscape of storage device technologies. This puts tremendous pressure on storage systems to support rapid change, both in their interfaces and in their performance. But adapting storage systems can be difficult, because unprincipled changes might jeopardize years of code-hardening and performance optimization efforts that were necessary for users to entrust their data to the storage system. We introduce the programmable storage approach, which exposes internal services and abstractions of the storage stack as building blocks for higher-level services. We also build a prototype to explore how existing abstractions of common storage system services can be leveraged to adapt to the needs of new data processing systems and the increasing variety of storage devices. We illustrate the advantages and challenges of this approach by composing existing internal abstractions into two new higher-level services: a file system metadata load balancer and a high-performance distributed shared log. The evaluation demonstrates that our services inherit desirable qualities of the back-end storage system, including the ability to balance load, efficiently propagate service metadata, recover from failure, and navigate trade-offs between latency and throughput using leases.
Malacology: A Programmable Storage System