Abstract
The newly emerging “fail-slow” failures plague both software and hardware: the victim components still function, but with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives, covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
Index Terms
- From Missteps to Milestones: A Journey to Practical Fail-Slow Detection