skip to main content
10.1145/3236024.3236030acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
research-article

An empirical study on crash recovery bugs in large-scale distributed systems

Authors Info & Claims
Published:26 October 2018Publication History

ABSTRACT

In large-scale distributed systems, node crashes are inevitable, and can happen at any time. As such, distributed systems are usually designed to be resilient to these node crashes via various crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoffs in Cassandra. However, faults in crash recovery mechanisms and their implementations can introduce intricate crash recovery bugs, and lead to severe consequences.

In this paper, we present CREB, the most comprehensive study on 103 Crash REcovery Bugs from four popular open-source distributed systems, including ZooKeeper, Hadoop MapReduce, Cassandra and HBase. For all the studied bugs, we analyze their root causes, triggering conditions, bug impacts and fixing. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.

References

  1. Ramnatthan Alagappan, Aishwarya Ganesan, Yuvraj Patel, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2016. Correlated Crash Vulnerabilities. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI). 151–167. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Peter Alvaro, Joshua Rosen, and Joseph M. Hellerstein. 2015. Lineage-driven Fault Injection. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 331–346. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Mike Burrows. 2006. The Chubby Lock Service for Loosely-Coupled Distributed Systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI). 335–350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2008. Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems 26, 2 (2008), 1–26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Haogang Chen, Daniel Ziegler, Tej Chajed, Adam Chlipala, M. Frans Kaashoek, and Nickolai Zeldovich. 2015. Using Crash Hoare logic for certifying the FSCQ file system. Proceedings of the 25th Symposium on Operating Systems Principles - SOSP ’15 (2015), 18–37. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC). 143–154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Ting Dai, Jingzhu He, Xiaohui Gu, and Shan Lu. 2018. Understanding Real-World Timeout Problems in Cloud Server Systems. In Proceeding of the IEEE International Conference on Cloud Engineering (IC2E). 1–11.Google ScholarGoogle ScholarCross RefCross Ref
  8. Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of 6th Symposium on Operating Systems Design and Implementation (OSDI). 137–149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s Highly Available Keyvalue Store. In Proceedings of the 21th ACM Symposium on Operating Systems Principles (SOSP). 205–220. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Cormac Flanagan and Patrice Godefroid. 2005. Dynamic Partial-Order Reduction for Model Checking Software. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 110–121. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Pedro Fonseca. 2017. An Empirical Study on the Correctness of Formally Verified Distributed Systems. In Proceedings of the 12th European Conference on Computer Systems (EuroSys). 328–343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C Arpaci-dusseau, and Remzi H Arpaci-dusseau. 2017. Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions. In Proceedings of the 15th Usenix Conference on File and Storage Technologies (FAST). 149–165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google File System. In Proceedings of the 9th ACM Symposium on Operating Systems Principles (SOSP). 29–43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. 2011. FATE and DESTINI: A Framework for Cloud Recovery Testing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI). 238–252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. 2014. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In Proceedings of the ACM Symposium on Cloud Computing (SOCC). 1–14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Haryadi S Gunawi, Abhishek Rajimwale, Andrea C Arpaci-dusseau, and Remzi H Arpaci-dusseau. 2008. SQCK : A Declarative File System Checker. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (OSDI). 131–146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Junfeng Yang, and Lintao Zhang. 2011. Practical Software Model Checking via Dynamic Interface Reduction. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP). 265–278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Zhenyu Guo, Sean Mcdirmid, Mao Yang, Li Zhuang, Pu Zhang, and Yingwei Luo. 2013. Failure Recovery: When the Cure is Worse Than the Disease. In Proceedings of 14th Workshop on Hot Topics in Operating Systems (HotOS). 1–6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Chris Hawblitzel, Jon Howell, Manos Kapritsos, Jacob R Lorch, Bryan Parno, Michael L Roberts, Srinath Setty, and Brian Zill. 2015. IronFleet : Proving Practical Distributed Systems Correct. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP). 1–17. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. N. Hayashibara, X. Defago, R. Yared, and T. Katayama. 2004. The ϕ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems. 66–78. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A Platform for Fine-grained Resource Sharing in the Data Center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI). 295–308. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems. In Proceedings of the USENIX Conference on USENIX Annual Technical Conference (USENIX ATC). 11–11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Pallavi Joshi, Malay Ganai, Gogul Balakrishnan, Aarti Gupta, and Nadia Papakonstantinou. 2013. SETSUDO: Perturbation-based Testing Framework for Scalable Distributed Systems Pallavi. In Conference on Timely Results in Operating Systems (TRIOS). 1–14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Pallavi Joshi, Haryadi S. Gunawi, and Koushik Sen. 2011. PREFAIL : A Programmable Tool for Multiple-Failure Injection. In Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications (OOPSLA). 171–188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Eric Koskinen and Junfeng Yang. 2016. Reducing Crash Recoverability to Reachability. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 97–108. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Vinod Kumar Vavilapalli et al. 2013. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (SOCC). 1–16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F Lukman, and Haryadi S Gunawi. 2014. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI). 399–414. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. 2016. TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 517–530. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Mohsen Lesani, Christian J Bell, and Adam Chlipala. 2016. Chapar: Certified Causally Consistent Distributed Key-Value Stores. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). 357–370. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Haopeng Liu, Guangpu Li, Jeffrey F. Lukman, Jiaxin Li, Shan Lu, Haryadi S. Gunawi, and Chen Tian. 2017. DCatch : Automatically Detecting Distributed Concurrency Bugs in Cloud Systems Cloud systems. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 677–691. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Haopeng Liu, Xu Wang, Guangpu Li, Shan Lu, Feng Ye, and Chen Tian. 2018. FCatch : Automatically Detecting Time-of-fault Bugs in Cloud Systems. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. 2008. D3S: Debugging Deployed Distributed Systems. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 423–437. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Jie Lu, Feng Li, Lian Li, and Xiaobing Feng. 2018. CloudRaid : Hunting Concurrency Bugs in the Cloud via Log-Mining. In Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). To appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Thanumalayan Sankaranarayana Pillaic, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI). 433–448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Colin Scott, Aurojit Panda, Arvind Krishnamurthy, Vjekoslav Brajkovic, George Necula, and Scott Shenker. 2016. Minimizing Faulty Executions of Distributed Systems. In Proceedings of 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 291–309. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Koushik Sen and Gul Agha. 2006. Automated Systematic Testing of Open Distributed Programs. In Proceedings of the 9th International Conference on Fundamental Approaches to Software Engineering (FASE). 339–356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Guosai Wang, Wei Xu, and Lifei Zhang. 2017. What Can We Learn from Four Years of Data Center Hardware Failures ? In Proceedings of 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 25–36.Google ScholarGoogle ScholarCross RefCross Ref
  38. James R. Wilcox, Doug Woos, Pavel Panchekha, Zachary Tatlock, Xi Wang, Michael D. Ernst, and Thomas Anderson. 2015. Verdi: A Framework for Implementing and Formally Verifying Distributed Systems. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 357–368. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Wei Xu, Armando Fox, David Patterson, and Michael I. Jordan. 2009. Detecting Large-Scale System Problems by Mining Console Logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP). 117–132. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Junfeng Yang, Tisheng Chen, Ming Wu, Zhilei Xu, Xuezheng Liu, Haoxiang Lin, Mao Yang, Fan Long, Lintao Zhang, and Lidong Zhou. 2009. MODIST: ESEC/FSE’18, November 4–9, 2018, Lake Buena Vista, FL, USA Yu Gao et al. Transparent Model Checking of Unmodified Distributed Systems. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation (NSDI). 213–228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. 2004. Using Model Checking to Find Serious File System Errors. In Proceedings ofthe Sixth Symposium on Operating Systems Design and Implementation (OSDI). 273– 288. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. 2014. Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI). 249–265. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure-Inducing Input. IEEE Transactions on Software Engineering (TSE) 8, 2 (2002), 183– 200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Mai Zheng, Joseph Tucek, Dachuan Huang, Feng Qin, Mark Lillibridge, Elizabeth S. Yang, Bill W. Zha, and Shashank Singh. 2014. Torturing Databases for Fun and Profit. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI). 449–464. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Apache Cassandra. Retrieved from http://cassandra.apache.org.Google ScholarGoogle Scholar
  46. Apache Flume Project. Retrieved from http://flume.apache.org.Google ScholarGoogle Scholar
  47. Apache Hadoop. Retrieved from http://hadoop.apache.org.Google ScholarGoogle Scholar
  48. Apache HBase. Retrieved from http://hadoop.apache.org/hbase.Google ScholarGoogle Scholar
  49. Apache ZooKeeper. Retrieved from http://zookeeper.apache.org.Google ScholarGoogle Scholar
  50. Chaos Monkey. Retrieved from https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey.Google ScholarGoogle Scholar
  51. Dafny is a verification-aware programming language. Retrieved from https://github.com/Microsoft/dafny.Google ScholarGoogle Scholar
  52. Fault Injection Framework and Development Guide. Retrieved from https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoophdfs/FaultInjectFramework.html.Google ScholarGoogle Scholar
  53. FIT: Failure Injection Testing. Retrieved from https://medium.com/netflixtechblog/fit-failure-injection-testing-35d8e2a9bb2.Google ScholarGoogle Scholar
  54. HDFS Architecture. Retrieved from http://hadoop.apache.org/%0Adocs/current/hadoop-project-dist/hadoop-hdfs/ HdfsDesign.html.Google ScholarGoogle Scholar
  55. HintedHandoff. Retrieved from https://wiki.apache.org/cassandra/HintedHandoff. 2016. The 10 Biggest Cloud Outages of 2016. Retrieved from http://www.crn.com/slide-shows/cloud/300083247/the-10-biggest-cloudoutages-of-2016.htm. 2017. The 10 Biggest Cloud Outages of 2017 (So Far). Retrieved from http://www.crn.com/slide-shows/cloud/300089786/the-10-biggest-cloudoutages-of-2017-so-far.htm.Google ScholarGoogle Scholar
  56. The Coq Proof Assistant. Retrieved from https://coq.inria.fr/.Google ScholarGoogle Scholar
  57. Write Ahead Log (WAL). Retrieved from http://hbase.apache.org/book.html#wal.Google ScholarGoogle Scholar

Index Terms

  1. An empirical study on crash recovery bugs in large-scale distributed systems

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
          October 2018
          987 pages
          ISBN:9781450355735
          DOI:10.1145/3236024

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 26 October 2018

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate112of543submissions,21%

          Upcoming Conference

          FSE '24

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader