skip to main content
research-article

Derecho: Fast State Machine Replication for Cloud Services

Published: 02 April 2019 Publication History

Abstract

Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard and throughput peaks at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly-tuned, standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.

References

[1]
{n. d.}. LibPaxos: Open-source Paxos. Retrieved from http://libpaxos.sourceforge.net/.
[2]
{n. d.}. RDMA-Paxos: Open-source Paxos. Retrieved from https://github.com/wangchenghku/RDMA-PAXOS.
[3]
2011. Vsync Reliable Multicast Library. Retrieved from http://vsync.codeplex.com/.
[4]
2012. Gbcast Protocol. Retrieved from https://en.wikipedia.org/wiki/Gbcast.
[5]
Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Trans. Comput. Syst. 31, 4, Article 10 (Dec. 2013), 24 pages.
[6]
Bela Ban. 2002. JGroups Reliable Multicast Library. Retrieved from http://jgroups.org/.
[7]
Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. 2009. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP’09). ACM, New York, NY, 29--44.
[8]
Behrens, Jonathan and Birman, Ken and Jha, Sagar and Tremel, Edward. 2018. RDMC: A reliable RDMA multicast for large objects. In Proceedings of the 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’18). IEEE Computer Society, Los Alamitos, CA, 1--12.
[9]
Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A protected dataplane operating system for high throughput and low latency. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, 49--65.
[10]
Kenneth Birman. 2012. Guide to Reliable Distributed Systems. Number XXII in Texts in Computer Science. Springer-Verlag, London.
[11]
Kenneth P. Birman. 1985. Replication and fault-tolerance in the isis system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (SOSP’85). ACM, New York, NY, 79--86.
[12]
Kenneth P. Birman. 2012. Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Springer Verlag Texts in Computer Science, New York, NY.
[13]
Kenneth P. Birman and Thomas A. Joseph. 1987. Exploiting virtual synchrony in distributed systems. In Proceedings of the 11th ACM Symposium on Operating Systems Principles (SOSP’87). ACM, New York, NY, 123--138.
[14]
Kenneth P. Birman and Thomas A. Joseph. 1987. Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5, 1 (Jan. 1987), 47--76.
[15]
Eric Brewer. 2010. A certain freedom: Thoughts on the CAP theorem. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC’10). ACM, New York, NY, 335--335.
[16]
Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. 2011. Windows azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP’11). ACM, New York, NY, 143--157.
[17]
Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos made live: An engineering perspective. In Proceedings of the 26th Annual ACM Symposium on Principles of Distributed Computing (PODC’07). ACM, New York, NY, 398--407.
[18]
Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable failure detectors for reliable distributed systems. J. ACM 43, 2 (Mar. 1996), 225--267.
[19]
K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: Determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63--75.
[20]
Jo-Mei Chang and N. F. Maxemchuk. 1984. Reliable broadcast protocols. ACM Trans. Comput. Syst. 2, 3 (Aug. 1984), 251--273.
[21]
Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier. 2012. Logic and lattices for distributed programming. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC’12). ACM, New York, NY.
[22]
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Berkeley, CA, 251--264. http://dl.acm.org/citation.cfm?id=2387880.2387905
[23]
Heming Cui, Rui Gu, Cheng Liu, Tianyu Chen, and Junfeng Yang. 2015. Paxos made transparent. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 105--120.
[24]
Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR’15). ACM, New York, NY, Article 5, 7 pages.
[25]
Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI’14). USENIX Association, Berkeley, CA, 401--414. http://dl.acm.org/citation.cfm?id=2616448.2616486
[26]
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985), 374--382.
[27]
Roy Friedman and Robbert van Renesse. 1997. Packing messages as a tool for boosting the performance of total ordering protocols. In Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing (HPDC’97).
[28]
Prasanna Ganesan and M. Seshadri. 2005. On cooperative content distribution and the price of barter. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS’05). 81--90.
[29]
Seth Gilbert and Nancy Lynch. 2002. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (Jun. 2002), 51--59.
[30]
Rachid Guerraoui, Ron R. Levy, Bastian Pochon, and Vivien Quéma. 2010. Throughput optimal total order broadcast for cluster environments. ACM Trans. Comput. Syst. 28, 2, Article 5 (Jul. 2010), 32 pages.
[31]
Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM’16). ACM, New York, NY, 202--215.
[32]
Joseph Y. Halpern and Yoram Moses. 1990. Knowledge and common knowledge in a distributed environment. J. ACM 37, 3 (Jul. 1990), 549--587.
[33]
Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM’14). ACM, New York, NY, 295--306.
[34]
Idit Keidar and Alexander Shraer. 2006. Timeliness, failure-detectors, and consensus performance. In Proceedings of the 25th Annual ACM Symposium on Principles of Distributed Computing (PODC’06). ACM, New York, NY, 169--178.
[35]
Leslie Lamport. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169.
[36]
Butler Lampson. 2001. The ABCD’s of Paxos. In Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing (PODC’01). ACM, New York, NY, 13--.
[37]
Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. 2016. Just say NO to Paxos overhead: Replacing consensus with network ordering. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16).
[38]
Parisa Jalili Marandi, Samuel Benz, Fernando Pedonea, and Kenneth P. Birman. 2014. The performance of paxos in the cloud. In Proceedings of the 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (SRDS’14). IEEE Computer Society, Los Alamitos, CA, 41--50.
[39]
D. Mazieres. 2007. Paxos made practical. Technical report. Retrieved from http://www.scs.stanford.edu/dm/home/papers.
[40]
Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using one-sided RDMA reads to build a fast, CPU-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference (USENIX ATC’13). USENIX Association, Berkeley, CA, 103--114.
[41]
Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Berkeley, CA, 305--320.
[42]
John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2011. The case for RAMCloud. Commun. ACM 54, 7 (Jul. 2011), 121--130.
[43]
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. 2014. Arrakis: The operating system is the control plane. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Berkeley, CA, 1--16.
[44]
Marius Poke and Torsten Hoefler. 2015. DARE: High-performance state machine replication on RDMA networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’15). ACM, New York, NY, 107--118.
[45]
Marius Poke, Torsten Hoefler, and Colin Glass. 2017. AllConcur: Leaderless concurrent atomic broadcast. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’17). ACM, New York, NY, 18.
[46]
Roberto De Prisco, Butler W. Lampson, and Nancy A. Lynch. 1997. Revisiting the paxos algorithm. In Proceedings of the 11th International Workshop on Distributed Algorithms (WDAG’97). Springer-Verlag, London, 111--125.
[47]
Dan Pritchett. 2008. BASE: An ACID alternative. Queue 6, 3 (May 2008), 48--55.
[48]
Fred B. Schneider. 1990. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319.
[49]
Steve Shankland. 2008. Google’s Jeff Dean spotlights data center inner workings. C|Net Reviews (May 2008). Retrieved from https://www.cnet.com/news/google-spotlights-data-center-inner-workings/.
[50]
Weijia Song, Theo Gkountouvas, Ken Birman, Qi Chen, and Zhen Xiao. 2016. The freeze-frame file system. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC’16). ACM, New York, NY, 307--320.
[51]
Robbert Van Renesse and Deniz Altinbuken. 2015. Paxos made moderately complex. ACM Comput. Surv. 47, 3, Article 42 (Feb. 2015), 36 pages.
[52]
Robbert van Renesse and Fred B. Schneider. 2004. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design 8 Implementation, Volume 6 (OSDI’04). USENIX Association, Berkeley, CA, USA, 7--7.
[53]
T. von Eicken, A. Basu, V. Buch, and W. Vogels. 1995. U-Net: A user-level network interface for parallel and distributed computing. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95). ACM, New York, NY, 40--53.
[54]
Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. 2017. APUS: Fast and scalable paxos on RDMA. In Proceedings of the 8th ACM Symposium on Cloud Computing (SoCC’17). ACM, Santa Clara, CA, 14.
[55]
Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. 2017. vCorfu: A cloud-scale object store on a shared log. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI’17). USENIX Association, 35--49.
[56]
Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast In-memory transaction processing using RDMA and HTM. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP’15). ACM, New York, NY, 87--104.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Computer Systems
ACM Transactions on Computer Systems  Volume 36, Issue 2
May 2018
112 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/3323874
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2019
Accepted: 01 December 2018
Revised: 01 July 2018
Received: 01 September 2017
Published in TOCS Volume 36, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Cloud computing
  2. RDMA
  3. consistency
  4. non-volatile memory
  5. replication

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • NSF-sponsored Institute for Computational Sustainability at Cornell University, NSF, DARPA, ARPA-e, NYSERDA, Huawei, and Mellanox

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)130
  • Downloads (Last 6 weeks)9
Reflects downloads up to 17 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Empowering Cloud Computing With Network Acceleration: A SurveyIEEE Communications Surveys & Tutorials10.1109/COMST.2024.337753126:4(2729-2768)Online publication date: 1-Oct-2024
  • (2024)Verifying a C Implementation of Derecho’s Coordination Mechanism Using VST and CoqNASA Formal Methods10.1007/978-3-031-60698-4_6(99-117)Online publication date: 4-Jun-2024
  • (2023)FUSEEProceedings of the 21st USENIX Conference on File and Storage Technologies10.5555/3585938.3585944(81-97)Online publication date: 21-Feb-2023
  • (2023)Specification and Runtime Checking of Derecho, A Protocol for Fast Replication for Cloud ServicesProceedings of the 5th workshop on Advanced tools, programming languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems10.1145/3584684.3597275(1-10)Online publication date: 19-Jun-2023
  • (2023)Dash: A Low Code Development Platform for AI Applications in Industry2023 IEEE 14th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON)10.1109/UEMCON59035.2023.10316092(0072-0081)Online publication date: 12-Oct-2023
  • (2023)Telepathy: A Lightweight Silent Data Access Protocol for NVRAM+RDMA Enabled Distributed StorageIEEE Transactions on Computers10.1109/TC.2022.318426972:3(839-852)Online publication date: 1-Mar-2023
  • (2023)Heron: Scalable State Machine Replication on Shared Memory2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)10.1109/DSN58367.2023.00025(138-150)Online publication date: Jun-2023
  • (2023)Invited Paper: Monotonicity and Opportunistically-Batched Actions in DerechoStabilization, Safety, and Security of Distributed Systems10.1007/978-3-031-44274-2_14(172-190)Online publication date: 2-Oct-2023
  • (2022)Acuerdo: Fast Atomic Broadcast over RDMAProceedings of the 51st International Conference on Parallel Processing10.1145/3545008.3545041(1-11)Online publication date: 29-Aug-2022
  • (2022)Hamband: RDMA replicated data typesProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523426(348-363)Online publication date: 9-Jun-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media