ABSTRACT
ZooKeeper masks crash failures of servers to provide a highly available, distributed coordination kernel; in production, however, not all failures are crash failures. Bugs in the underlying software and hardware can corrupt ZooKeeper replicas, leading to data loss. Since ZooKeeper serves as a 'source of truth' for mission-critical applications, it is essential to detect data inconsistencies caused by arbitrary faults in order to safeguard reliability. Byzantine Fault Tolerance (BFT) protocols promise to handle such faults, but they are expensive along important dimensions: development, deployment, complexity, and performance. ZooKeeper takes an alternative approach that focuses on detecting faulty behavior rather than tolerating it, and thus provides improved reliability without paying the full expense of BFT protocols.
This paper describes the techniques used to detect data inconsistencies in ZooKeeper. We also analyze the impact of these techniques on the reliability and performance of the overall system. Our evaluation shows that a real-time digest-based fault-detection technique can be deployed in production to provide improved reliability with a minimal performance penalty and no additional operational cost. We hope that our analysis and evaluation can help guide the design of next-generation primary-backup systems that aim to provide high reliability.
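The real-time digest-based detection mentioned above can be illustrated with a minimal sketch. This is not ZooKeeper's actual implementation: the names (`Replica`, `node_hash`, `detect_inconsistency`) and the 64-bit additive digest are assumptions made here for illustration. The idea is that each replica maintains a digest of its data tree that can be updated incrementally per transaction; the leader ships its digest alongside each transaction, and a follower whose locally computed digest disagrees has diverged.

```python
import hashlib

MOD = 2 ** 64  # digests are 64-bit modular sums in this sketch


def node_hash(path: str, value: bytes) -> int:
    """Hash one (path, value) pair to a 64-bit integer."""
    h = hashlib.sha256(path.encode() + b"\x00" + value).digest()
    return int.from_bytes(h[:8], "big")


class Replica:
    """A key-value tree plus an incrementally maintained additive digest."""

    def __init__(self):
        self.tree: dict = {}
        self.digest = 0

    def apply_set(self, path: str, value: bytes) -> int:
        """Apply a write and update the digest in O(1); return the new digest."""
        if path in self.tree:  # retract the old pair's contribution
            self.digest = (self.digest - node_hash(path, self.tree[path])) % MOD
        self.tree[path] = value
        self.digest = (self.digest + node_hash(path, value)) % MOD
        return self.digest


def detect_inconsistency(leader_digest: int, follower: Replica) -> bool:
    """A digest mismatch on the follower signals divergence (corruption,
    data loss, or a replication bug) for the same transaction history."""
    return follower.digest != leader_digest


leader, follower = Replica(), Replica()
d = leader.apply_set("/cfg", b"v1")
follower.apply_set("/cfg", b"v1")
print(detect_inconsistency(d, follower))  # False: replicas agree

# A silent corruption changes the follower's stored value...
follower.apply_set("/cfg", b"corrupted")
print(detect_inconsistency(d, follower))  # True: divergence detected
```

An additive (incremental) digest is the natural fit here because a write touches one node, so the digest update is O(1) regardless of tree size; recomputing a whole-tree hash per transaction would make real-time checking prohibitively expensive.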
Verify, And Then Trust: Data Inconsistency Detection in ZooKeeper