research-article

An empirical study on crash recovery bugs in large-scale distributed systems

Authors:
Yu Gao, Institute of Software at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China
Wensheng Dou, Institute of Software at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China
Feng Qin, Ohio State University, USA
Chushu Gao, Institute of Software at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China
Dong Wang, Institute of Software at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China
Jun Wei, Institute of Software at Chinese Academy of Sciences, China / University of Chinese Academy of Sciences, China
Ruirui Huang, Alibaba Group, China
Li Zhou, Alibaba Group, China
Yongming Wu, Alibaba Group, China

ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, October 2018, Pages 539–550
https://doi.org/10.1145/3236024.3236030

Published: 26 October 2018

ABSTRACT

In large-scale distributed systems, node crashes are inevitable and can happen at any time. Distributed systems are therefore usually designed to tolerate node crashes through crash recovery mechanisms, such as write-ahead logging in HBase and hinted handoff in Cassandra. However, faults in these mechanisms and their implementations can introduce intricate crash recovery bugs and lead to severe consequences.

In this paper, we present CREB, the most comprehensive study of 103 Crash REcovery Bugs from four popular open-source distributed systems: ZooKeeper, Hadoop MapReduce, Cassandra, and HBase. For each studied bug, we analyze its root cause, triggering conditions, impact, and fix. Through this study, we obtain many interesting findings that can open up new research directions for combating crash recovery bugs.

Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software reliability
    2. Software system structures
      1. Distributed systems organizing principles
        Cloud computing

Published in
ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
October 2018
987 pages
ISBN: 9781450355735
DOI: 10.1145/3236024
General Chair: Gary T. Leavens, University of Central Florida, USA
Program Chairs: Alessandro Garcia, PUC-Rio, Brazil; Corina S. Păsăreanu, NASA Ames Research Center, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Distributed systems
crash recovery bugs
empirical study
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 112 of 543 submissions, 21%