Accurate and efficient follower log repair for Raft-replicated database systems

Guo, Jinwei; Cai, Peng; Qian, Weining; Zhou, Aoying

doi:10.1007/s11704-019-8349-0

Research Article
Published: 04 January 2021

Volume 15, article number 152605, (2021)

Frontiers of Computer Science



Abstract

State machine replication is widely used in modern cluster-based database systems. Most commonly deployed configurations adopt a Raft-like consensus protocol, in which a single strong leader replicates the log to the followers. Since followers can handle read requests and many real workloads are read-intensive, the recovery speed of a crashed follower may significantly affect throughput. Unlike traditional database recovery, a recovering follower must first repair its local log. The original Raft protocol takes many network round trips to compare the logs of the leader and the crashed follower. To reduce network round trips, one optimization is to truncate the follower’s uncertain log entries after the latest local commit point and then fetch all committed log entries directly from the leader in one round trip. However, if the commit point is not persisted, the recovering follower has to fetch the whole log from the leader. In this paper, we propose an accurate and efficient log repair (AELR) algorithm for follower recovery. AELR is more robust and resilient to follower failure, and it needs only one network round trip to fetch the minimum number of log entries for follower recovery. The approach is implemented in the open-source database system OceanBase. We show experimentally that a system adopting AELR performs well in terms of recovery time.
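
The two baseline repair strategies described in the abstract can be contrasted with a small sketch. This is an illustration under simplifying assumptions, not the AELR algorithm and not OceanBase code: the names `Entry`, `raft_backtrack_repair`, and `commit_point_repair` are hypothetical, logs are plain in-memory lists, every leader entry is treated as committed, and each consistency probe is counted as one round trip.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    term: int      # leader term in which the entry was created
    command: str   # opaque payload


def raft_backtrack_repair(leader: List[Entry],
                          follower: List[Entry]) -> Tuple[List[Entry], int]:
    """Original-Raft-style repair: probe backwards, one index per round trip,
    until the entry before next_index matches on both sides."""
    round_trips = 0
    next_index = len(leader)  # first index the leader would send
    while True:
        round_trips += 1
        prev = next_index - 1
        matches = prev < 0 or (
            prev < len(follower) and follower[prev].term == leader[prev].term
        )
        if matches:
            # Keep the matching prefix, replace everything after it.
            return follower[:next_index] + leader[next_index:], round_trips
        next_index -= 1


def commit_point_repair(leader: List[Entry], follower: List[Entry],
                        commit_point: int) -> Tuple[List[Entry], int, int]:
    """Optimization from the abstract: truncate entries after the local commit
    point, then fetch the rest from the leader in a single round trip.
    commit_point is the number of entries the follower knows are committed;
    0 models a commit point that was never persisted."""
    kept = follower[:commit_point]
    fetched = leader[commit_point:]
    return kept + fetched, 1, len(fetched)


# Follower diverges from the leader in its last two entries.
leader_log = [Entry(t, f"c{i}") for i, t in enumerate([1, 1, 2, 2, 3, 3])]
follower_log = leader_log[:4] + [Entry(2, "x4"), Entry(2, "x5")]

_, trips = raft_backtrack_repair(leader_log, follower_log)
_, _, fetched_with_cp = commit_point_repair(leader_log, follower_log, 3)
_, _, fetched_without_cp = commit_point_repair(leader_log, follower_log, 0)
print(trips, fetched_with_cp, fetched_without_cp)
```

In this toy case the backtracking repair needs three probes, while the commit-point variant needs one round trip but fetches three entries (or all six when the commit point was not persisted), although only two entries actually differ. That gap between entries fetched and entries that truly need replacing is the inefficiency the abstract says AELR targets.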



Acknowledgements

This research was supported in part by the National Key R&D Program of China (2018YFB1003303) and the National Natural Science Foundation of China (Grant Nos. 61432006, 61732014 and 61972149).

Author information

Authors and Affiliations

School of Data Science and Engineering, East China Normal University, Shanghai, 200062, China
Jinwei Guo, Peng Cai, Weining Qian & Aoying Zhou

Corresponding author

Correspondence to Peng Cai.

Additional information

Jinwei Guo is a PhD candidate in the School of Data Science and Engineering at East China Normal University (ECNU), China. He received his bachelor’s degree in Computer Science and Technology from Qufu Normal University, China in 2010, and his master’s degree from Guizhou University, China in 2014. His research interests include transaction processing in database management systems and high availability in distributed systems.

Peng Cai is an associate professor in the School of Data Science and Engineering at East China Normal University (ECNU), China. He received his PhD degree in Computer Science and Technology from ECNU in 2011. He joined ECNU in 2015, prior to which he worked for the IBM China Research Lab and Baidu. His work has been published in various leading conferences, such as ICDE, SIGIR and ACL. His main research interests include in-memory transaction processing and building adaptive systems using machine learning techniques.

Weining Qian is a professor and dean of the School of Data Science and Engineering, East China Normal University, China. He received his MS and PhD in computer science from Fudan University, China in 2001 and 2004, respectively. He is now serving as a standing committee member of the Database Technology Committee of the China Computer Federation, and a committee member of the ACM SIGMOD China Chapter. His research interests include scalable transaction processing, benchmarking big data systems, and management and analysis of massive datasets.

Aoying Zhou is Vice President of East China Normal University, Founding Dean of the School of Data Science and Engineering (DaSE), and a professor. He received his bachelor’s and master’s degrees in Computer Science from Sichuan University, China in 1985 and 1988, respectively, and his PhD from Fudan University, China in 1993. He is a winner of the National Science Fund for Distinguished Young Scholars supported by the National Natural Science Foundation of China (NSFC). He is a CCF (China Computer Federation) Fellow, the Vice Director of the Database Technology Committee of CCF, and Associate Editor-in-Chief of Chinese Journal of Computer. He served as General Chair of ER’2004, Vice PC Chair of ICDE’2009 and ICDE’2012, and PC Co-chair of VLDB’2014. His research interests include Web data management, data management for data-intensive computing, in-memory cluster computing, distributed transaction processing, benchmarking for big data and performance.

Electronic supplementary material