ABSTRACT
In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore usually designed to tolerate node crashes through crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in these mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences.
In this paper, we present CREB, the most comprehensive study on 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For each studied bug, we analyze its root cause, triggering conditions, impact, and fix. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.
ESEC/FSE '18, November 4–9, 2018, Lake Buena Vista, FL, USA. Yu Gao et al.
Index Terms
- An empirical study on crash recovery bugs in large-scale distributed systems