From Missteps to Milestones: A Journey to Practical Fail-Slow Detection

Authors:
Ruiming Lu, Shanghai Jiao Tong University, China (ORCID: 0000-0002-4236-289X)
Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University, China (ORCID: 0000-0002-6654-4364)
Yiming Zhang, Xiamen University, China (ORCID: 0000-0001-6450-8485)
Fengyi Zhu, Alibaba Inc., China (ORCID: 0009-0003-3207-4418)
Zhaosheng Zhu, Alibaba Inc., China (ORCID: 0009-0005-8889-0507)
Mengtian Wang, Alibaba Inc., China (ORCID: 0009-0007-0877-0391)
Zongpeng Zhu, Alibaba Inc., China (ORCID: 0009-0006-2618-8314)
Guangtao Xue, Shanghai Jiao Tong University, China (ORCID: 0000-0002-1617-3593)
Jiwu Shu, Xiamen University, China (ORCID: 0000-0002-7362-2789)
Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University, China (ORCID: 0000-0003-1751-9418)
Jiesheng Wu, Alibaba Inc., China (ORCID: 0000-0001-7417-5469)

ACM Transactions on Storage, Volume 19, Issue 4, Article No. 33, pp. 1–28. https://doi.org/10.1145/3617690

Published: 01 November 2023

Abstract

Emerging “fail-slow” failures plague both software and hardware: the victim components still function, but with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (41K normal drives and 315 verified fail-slow drives) from our production traces, and use it to provide root cause analysis of fail-slow drives, covering ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
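
To make the idea of regression-based detection at drive granularity concrete, here is a minimal sketch. It is not the Perseus implementation, which the abstract does not detail. It assumes each drive reports (throughput, latency) samples, fits a simple least-squares line on a drive's peers, and flags the drive when most of its samples exceed an upper bound. The function names, the linear model, and the thresholds `k` and `min_fraction` are all hypothetical choices for illustration.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_slow_drives(samples, k=3.0, min_fraction=0.5):
    """Return sorted IDs of drives whose latency sits above the peer-fitted bound.

    samples maps a drive ID to a list of (throughput, latency) pairs.
    k and min_fraction are illustrative thresholds, not values from the article.
    """
    flagged = []
    for drive, points in samples.items():
        # Fit the model on the other drives only, so the drive under test
        # cannot pull the regression line toward itself.
        peers = [p for d, pts in samples.items() if d != drive for p in pts]
        xs, ys = zip(*peers)
        slope, intercept = fit_line(xs, ys)
        residuals = [y - (slope * x + intercept) for x, y in peers]
        spread = pstdev(residuals)
        # Count this drive's samples above the upper bound.
        above = sum(
            1 for x, y in points if y > slope * x + intercept + k * spread
        )
        if above / len(points) > min_fraction:
            flagged.append(drive)
    return sorted(flagged)


def demo_samples():
    """Synthetic data: four healthy drives and one with +15 latency units."""
    throughputs = [10, 20, 30, 40, 50]
    data = {}
    for i in range(4):
        data[f"d{i}"] = [
            (t, 0.5 * t + 10 + ((i + j) % 3 - 1) * 0.5)
            for j, t in enumerate(throughputs)
        ]
    data["slow"] = [(t, 0.5 * t + 10 + 15) for t in throughputs]
    return data


if __name__ == "__main__":
    print(flag_slow_drives(demo_samples()))
```

On the synthetic data, only the drive named `slow` is flagged. The sketch has a known weakness: when a healthy drive is tested, any slow peers inflate the residual spread, so several slow drives in one group could mask each other. A production detector would need outlier-robust fitting.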

Index Terms

1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Availability
    2. Reliability
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches

Published in
ACM Transactions on Storage Volume 19, Issue 4
November 2023
238 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3626486
Editor:
Erez Zadok
Stony Brook University, USA
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 November 2023
- Online AM: 11 September 2023
- Accepted: 13 August 2023
- Received: 9 May 2023
Author Tags
Fail-slow failures
machine learning
datasets
root cause reasoning
Qualifiers
- research-article