DOI:10.1145/3615366.3615417
research-article

A Time-Phased Partitioned Checkpoint Approach to Reduce State Snapshot Overhead

Published: 17 October 2023

Abstract

Replication and recovery are essential techniques for building fault-tolerant systems. Replication enhances availability by keeping the system operational in the presence of faults, while recovery improves resilience by replacing failed replicas or adding new ones at runtime. To support recovery, replicas must implement durability strategies such as logging, checkpointing, and state transfer. Although these strategies enhance overall availability and resilience, they impact system performance. Among them, checkpointing is especially expensive because of the synchronization needed to create a consistent snapshot of the replica’s state and the overhead of storing it persistently, which leads to reduced throughput, increased latency, and even momentary service interruptions. To mitigate the performance degradation caused by checkpointing during normal execution, this work proposes a new checkpoint strategy that divides the replica’s state into partitions and takes snapshots of only a few partitions at a time. During checkpointing, incoming requests are delayed only if they access a partition being saved; replicas continue executing requests directed to other partitions without interruption. The approach allows different partitions to be checkpointed at different moments while maintaining strong consistency. Applying this approach to Parallel State Machine Replication, we observe a reduction in snapshot duration proportional to the number of partitions and lower client-observed latency during checkpointing. Furthermore, the approach speeds up system recovery by implementing a collaborative state transfer.
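
The idea of snapshotting one partition at a time can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `PartitionedStore`, its methods, the byte-sum key mapping, and the per-partition locks are all assumptions made for illustration, and the sketch leaves out request ordering, persistence, and state transfer, which the paper handles within Parallel State Machine Replication.

```python
import copy
import threading


class PartitionedStore:
    """Key-value state split into partitions, each guarded by its own lock."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.state = [dict() for _ in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]
        self.applied = [0] * num_partitions  # last sequence number applied per partition
        self.snapshots = {}                  # partition id -> (sequence number, state copy)
        self.next_phase = 0                  # partition to save in the next phase

    def partition_of(self, key):
        # Hypothetical mapping: a byte sum keeps the example deterministic.
        return sum(key.encode()) % self.num_partitions

    def execute(self, seq, key, value):
        """Apply a write; it waits only if this key's partition is being saved."""
        p = self.partition_of(key)
        with self.locks[p]:
            self.state[p][key] = value
            self.applied[p] = seq
        return p

    def checkpoint_next(self):
        """Snapshot a single partition, then advance the round-robin phase."""
        p = self.next_phase
        with self.locks[p]:
            # Only partition p is locked; writes to other partitions proceed.
            self.snapshots[p] = (self.applied[p], copy.deepcopy(self.state[p]))
        self.next_phase = (p + 1) % self.num_partitions
        return p


store = PartitionedStore(num_partitions=4)
store.execute(1, "a", 10)   # key "a" maps to partition 1
store.execute(2, "b", 20)   # key "b" maps to partition 2
store.checkpoint_next()     # phase 0 saves partition 0
store.checkpoint_next()     # phase 1 saves partition 1
store.execute(3, "a", 11)   # later write does not alter the saved copy
```

Each snapshot records the last sequence number applied to its partition, so a recovering replica in this sketch would know from which point to replay logged requests for that partition. Because only one lock is held during a phase, the pause seen by a request is bounded by the time to copy a single partition rather than the whole state.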


Cited By

  • (2025) Beelog: Online Log Compaction for Dependable Systems. IEEE Transactions on Parallel and Distributed Systems 36:4 (689-700). DOI:10.1109/TPDS.2025.3541628. Online publication date: Apr-2025
  • (2024) Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint. Journal of Internet Services and Applications 15:1 (194-211). DOI:10.5753/jisa.2024.3891. Online publication date: 26-Jul-2024

Published In

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing
October 2023
242 pages
ISBN:9798400708442
DOI:10.1145/3615366
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Fault-tolerance
  2. checkpoint
  3. recovery
  4. state machine replication

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

LADC 2023
