A Time-Phased Partitioned Checkpoint Approach to Reduce State Snapshot Overhead

Authors:

Everaldo Gomes Junior,

Eduardo Alchieri,

Fernando Dotti,

Odorico Mendizabal

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing

Pages 100 - 109

https://doi.org/10.1145/3615366.3615417

Published: 17 October 2023

Abstract

Replication and recovery are essential techniques in developing fault-tolerant systems. Replication enhances availability by ensuring the system remains operational even in the presence of faults, while recovery improves resilience by replacing failed replicas or adding new ones during runtime. To achieve recovery, replicas must implement durability strategies such as logging, checkpointing, and state transfer. While these approaches enhance overall availability and resilience, they impact system performance. Among them, checkpointing is especially expensive due to the synchronization needed to create a consistent snapshot of the replica’s state and the overhead to persistently store it, leading to reduced throughput, increased latency, and even causing momentary service interruptions. To mitigate the performance degradation caused by checkpointing during normal execution, this work proposes a new checkpoint strategy that divides the replica’s state into partitions and takes snapshots of only a few partitions simultaneously. During checkpointing, incoming requests experience delays only if they access the partition being saved. Meanwhile, replicas can continue executing requests directed to other partitions without interruption. Our approach allows checkpointing different partitions at different moments while maintaining strong consistency. By employing this new approach using Parallel State Machine Replication, we can observe a reduction in the snapshot duration proportional to the number of partitions and lower latency observed by clients during checkpointing. Furthermore, the approach speeds up the system’s recovery by implementing a collaborative state transfer.
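
The core idea in the abstract, saving one partition at a time so that only requests touching that partition wait, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class name `PartitionedStore`, the modulo key-to-partition mapping, the round-robin checkpoint order, and the use of one lock per partition are all assumptions made for illustration. Persistent storage, request ordering, and the recovery protocol are left out.

```python
import copy
import threading


class PartitionedStore:
    """Key-value state split into partitions, each guarded by its own lock.

    Hypothetical illustration only; names and policies are assumptions.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.partitions = [{} for _ in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]
        # Latest in-memory snapshot of each partition (None until first saved).
        self.snapshots = [None] * num_partitions
        # Round-robin cursor: which partition the next phase will save.
        self.next_to_checkpoint = 0

    def partition_of(self, key):
        # Assumed mapping for integer keys; a real service would define its own.
        return key % self.num_partitions

    def execute_put(self, key, value):
        p = self.partition_of(key)
        # Blocks only while partition p itself is being snapshotted.
        with self.locks[p]:
            self.partitions[p][key] = value

    def execute_get(self, key):
        p = self.partition_of(key)
        with self.locks[p]:
            return self.partitions[p].get(key)

    def checkpoint_phase(self):
        """Snapshot a single partition, then advance the cursor."""
        p = self.next_to_checkpoint
        with self.locks[p]:
            self.snapshots[p] = copy.deepcopy(self.partitions[p])
        self.next_to_checkpoint = (p + 1) % self.num_partitions
        return p


store = PartitionedStore(4)
for k in range(8):
    store.execute_put(k, f"v{k}")
saved = store.checkpoint_phase()  # saves partition 0 only; 1..3 stay untouched
```

In this sketch a full checkpoint is spread over as many phases as there are partitions, and each phase holds only one partition's lock, so requests to the other partitions are never blocked by the snapshot.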

Cited By

L. Xavier, C. Meinhardt, and O. Mendizabal. 2025. Beelog: Online Log Compaction for Dependable Systems. IEEE Transactions on Parallel and Distributed Systems 36, 4 (Apr. 2025), 689–700.
https://doi.org/10.1109/TPDS.2025.3541628
E. Gomes Jr., E. Alchieri, F. Dotti, and O. Mendizabal. 2024. Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint. Journal of Internet Services and Applications 15, 1 (Jul. 2024), 194–211.
https://doi.org/10.5753/jisa.2024.3891

Index Terms

A Time-Phased Partitioned Checkpoint Approach to Reduce State Snapshot Overhead
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Information systems
  1. Information storage systems
    1. Storage replication
      1. Storage recovery strategies

Published In

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing

October 2023

242 pages

ISBN:9798400708442

DOI:10.1145/3615366

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

LADC 2023

LADC 2023: 12th Latin-American Symposium on Dependable and Secure Computing

October 16 - 18, 2023

La Paz, Bolivia
