skip to main content
10.1145/3267809.3267837acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

Authors Info & Claims
Published:11 October 2018Publication History

ABSTRACT

Existing state machine replication protocols are confronting two major challenges in geo-replication: (1) limited performance caused by load imbalance, and (2) severe performance degradation in heterogeneous environments or under high-contention workloads. This paper presents a new semi-decentralized approach to addressing both the challenges at the same time. Our protocol, SDPaxos, divides the task of a replication protocol into two parts: durably replicating each command across replicas without global order, and ordering all commands to enforce the consistency guarantee. We decentralize the process of replicating commands, which accounts for the largest proportion of load, to provide high performance. In contrast, we centralize the process of ordering commands, which is lightweight but needs a global view, for better performance stability against heterogeneity or contention. The key novelty lies in that SDPaxos achieves the optimal one-round-trip latency under realistic configurations, despite the two separated steps, replicating and ordering, which are both based on Paxos. We also design a recovery protocol to do rapid failover under failures, and a series of optimizations to boost performance. We show via a prototype implementation the significant advantage of SDPaxos on both throughput and latency, facing different environments and workloads.

References

  1. 2013. EPaxos code base. (2013). https://github.com/efficient/epaxosGoogle ScholarGoogle Scholar
  2. 2018. Amazon Elastic Compute Cloud. (2018). https://aws.amazon.com/ec2/Google ScholarGoogle Scholar
  3. 2018. SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines (Extended Version). (2018). https://github.com/zhypku/SDPaxos/blob/master/sdpaxos.pdfGoogle ScholarGoogle Scholar
  4. 2018. Source code of SDPaxos. (2018). https://github.com/zhypku/SDPaxosGoogle ScholarGoogle Scholar
  5. Raja Appuswamy, Angelos C. Anadiotis, Danica Porobic, Mustafa K. Iman, and Anastasia Ailamaki. 2017. Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-contention Workloads. Proc. VLDB Endow. 11, 2 (Oct. 2017), 121--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload Analysis of a Large-scale Key-value Store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12). ACM, New York, NY, USA, 53--64. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh. 2011. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Proceedings of the Conference on Innovative Data system Research (CIDR). 223--234. http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdfGoogle ScholarGoogle Scholar
  8. Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. 2012. CORFU: A Shared Log Design for Flash Clusters. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). USENIX Association, Berkeley, CA, USA, 1--1. http://dl.acm.org/citation.cfm?id=2228298.2228300 Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. 2011. Towards Predictable Datacenter Networks. In Proceedings of the ACM SIGCOMM 2011 Conference (SIGCOMM '11). ACM, New York, NY, USA, 242--253. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Kenneth Birman, André Schiper, and Pat Stephenson. 1991. Lightweight Causal and Atomic Group Multicast. ACM Trans. Comput. Syst. 9, 3 (Aug. 1991), 272--314. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Breslau, Pei Cao, Li Fan, G. Phillips, and S. Shenker. 1999. Web caching and Zipf-like distributions: evidence and implications. In IEEE INFOCOM '99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No.99CH36320), Vol. 1. 126--134 vol. 1.Google ScholarGoogle Scholar
  12. Mike Burrows. 2006. The Chubby Lock Service for Loosely-coupled Distributed Systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06). USENIX Association, Berkeley, CA, USA, 335--350. http://dl.acm.org/citation.cfm?id=1298455.1298487 Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Tushar D. Chandra, Robert Griesemer, and Joshua Redstone. 2007. Paxos Made Live: An Engineering Perspective. In Proceedings of the Twenty-sixth Annual ACM Symposium on Principles of Distributed Computing (PODC '07). ACM, New York, NY, USA, 398--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, New York, NY, USA, 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally-distributed Database. (2012), 251--264. http://dl.acm.org/citation.cfm?id=2387880.2387905 Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Xavier Défago, André Schiper, and Péter Urbán. 2004. Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey. ACM Comput. Surv. 36, 4 (Dec. 2004), 372--421. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of Distributed Consensus with One Faulty Process. J. ACM 32, 2 (April 1985), 374--382. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: A Correctness Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst. 12, 3 (July 1990), 463--492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An Analysis of Facebook Photo Caching. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 167--181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference (USENIX-ATC'10). USENIX Association, Berkeley, CA, USA, 11--11. http://dl.acm.org/citation.cfm?id=1855840.1855851 Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jiawei Jiang, Bin Cui, Ce Zhang, and Lele Yu. 2017. Heterogeneity-aware Distributed Parameter Servers. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). ACM, New York, NY, USA, 463--478. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. F. Kaashoek and A. S. Tanenbaum. 1991. Group communication in the Amoeba distributed operating system. (May 1991), 222--230.Google ScholarGoogle Scholar
  24. David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. 1997. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing (STOC '97). ACM, New York, NY, USA, 654--663. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and Alan Fekete. 2013. MDCC: Multi-data Center Consistency. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13). ACM, New York, NY, USA, 113--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Leslie Lamport. 1998. The Part-time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133--169. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Leslie Lamport. 2001. Paxos made simple. ACM Sigact News 32, 4 (2001), 18--25.Google ScholarGoogle Scholar
  28. Leslie Lamport. 2005. Generalized Consensus and Paxos. Technical Report. 60 pages. https://www.microsoft.com/en-us/research/publication/generalized-consensus-and-paxos/Google ScholarGoogle Scholar
  29. Leslie Lamport. 2006. Fast Paxos. Distributed Computing 19 (October 2006), 79--103. https://www.microsoft.com/en-us/research/publication/fast-paxos/ Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ang Li, Xiaowei Yang, Srikanth Kandula, and Ming Zhang. 2010. CloudCmp: Comparing Public Cloud Providers. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10). ACM, New York, NY, USA, 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 467--483. http://dl.acm.org/citation.cfm?id=3026877.3026914 Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Barbara Liskov and James Cowling. 2012. Viewstamped Replication Revisited. Technical Report. Technical Report MIT-CSAIL-TR-2012-021, MIT.Google ScholarGoogle Scholar
  33. John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, and Lidong Zhou. 2004. Boxwood: Abstractions As the Foundation for Storage Infrastructure. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6 (OSDI'04). USENIX Association, Berkeley, CA, USA, 8--8. http://dl.acm.org/citation.cfm?id=1251254.1251262 Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Yanhua Mao, Flavio P. Junqueira, and Keith Marzullo. 2008. Mencius: Building Efficient Replicated State Machines for WANs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, USA, 369--384. http://dl.acm.org/citation.cfm?id=1855741.1855767 Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2013. There is More Consensus in Egalitarian Parliaments. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 358--372. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Iulian Moraru, David G. Andersen, and Michael Kaminsky. 2014. Paxos Quorum Leases: Fast Reads Without Sacrificing Writes. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 22, 13 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li. 2016. Consolidating Concurrency Control and Consensus for Commits Under Conflicts. (2016), 517--532. http://dl.acm.org/citation.cfm?id=3026877.3026917 Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Brian M. Oki and Barbara H. Liskov. 1988. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In Proceedings of the Seventh Annual ACM Symposium on Principles of Distributed Computing (PODC '88). ACM, New York, NY, USA, 8--17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Diego Ongaro and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC'14). USENIX Association, Berkeley, CA, USA, 305--320. http://dl.acm.org/citation.cfm?id=2643634.2643666 Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Dan R. K. Ports, Jialin Li, Vincent Liu, Naveen Kr. Sharma, and Arvind Krishnamurthy. 2015. Designing Distributed Systems Using Approximate Synchrony in Data Center Networks. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation (NSDI'15). USENIX Association, Berkeley, CA, USA, 43--57. http://dl.acm.org/citation.cfm?id=2789770.2789774 Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica, and Kannan Ramchandran. 2016. EC-cache: Load-balanced, Low-latency Cluster Caching with Online Erasure Coding. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 401--417. http://dl.acm.org/citation.cfm?id=3026877.3026909 Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12). ACM, New York, NY, USA, Article 7, 13 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Fred B. Schneider. 1990. Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv. 22, 4 (Dec. 1990), 299--319. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). USENIX Association, Berkeley, CA, USA, 29--42. http://dl.acm.org/citation.cfm?id=1855741.1855744 Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. 2015. Building Consistent Transactions with Inconsistent Replication. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 263--278. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
          October 2018
          546 pages
          ISBN:9781450360111
          DOI:10.1145/3267809

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 11 October 2018

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate169of722submissions,23%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader