DOI: 10.1145/3578358.3591329
research-article

Generic Checkpointing Support for Stream-based State-Machine Replication

Published: 08 May 2023

ABSTRACT

Stream-based replication facilitates the deployment and operation of state-machine replication protocols by running them as applications on top of data-stream processing frameworks. By taking advantage of platform-provided features, this approach makes it possible to significantly reduce implementation complexity at the protocol level. To further extend these benefits, in this paper we examine how the concept can be used to provide generic support for creating, storing, and applying checkpoints of replica states, both for catch-up and garbage collection and for recovering failed replicas. Specifically, we present three checkpointing-mechanism designs with different degrees of platform involvement and evaluate them in the context of Twitter's stream-processing engine Heron.


    • Published in

      PaPoC '23: Proceedings of the 10th Workshop on Principles and Practice of Consistency for Distributed Data
      May 2023
      89 pages
      ISBN: 9798400700866
      DOI: 10.1145/3578358

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall acceptance rate: 34 of 47 submissions, 72%