From Missteps to Milestones: A Journey to Practical Fail-Slow Detection

Published: 01 November 2023

Abstract

The newly emerging “fail-slow” failures plague both software and hardware: the victim components still function, but with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus uses a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide a root cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
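
The abstract does not spell out the regression model, so the following is only a minimal sketch of the general idea, not the Perseus algorithm itself. It assumes that latency grows roughly linearly with throughput across the peer drives of one node, fits a single least-squares line over all of them, and flags a drive when most of its samples sit well above that line. The function names, the `k` and `min_fraction` parameters, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0  # flat line if throughput never varies
    return my - b * mx, b

def flag_slow_drives(samples, k=2.0, min_fraction=0.5):
    """samples maps drive id -> list of (throughput, latency) pairs.

    Assumption for illustration: a drive is suspicious when at least
    min_fraction of its samples exceed the fitted latency by more than
    k residual standard deviations.
    """
    points = [p for pts in samples.values() for p in pts]
    a, b = fit_line([x for x, _ in points], [y for _, y in points])
    sigma = pstdev([y - (a + b * x) for x, y in points])
    flagged = []
    for drive, pts in samples.items():
        over = sum(1 for x, y in pts if y > a + b * x + k * sigma)
        if over / len(pts) >= min_fraction:
            flagged.append(drive)
    return flagged

# Made-up data: seven healthy drives and one drive with inflated latency.
loads = [10, 20, 30, 40, 50]
samples = {f"d{i}": [(x, 1 + 0.1 * x) for x in loads] for i in range(7)}
samples["d7"] = [(x, 10 + 0.1 * x) for x in loads]
print(flag_slow_drives(samples))
```

With this made-up data only `d7` is flagged. Because the slow drive's own samples take part in the fit, a single outlier among very few peers can inflate the residual spread enough to hide itself; a production detector would need a more robust baseline than this sketch provides.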

Published in

ACM Transactions on Storage, Volume 19, Issue 4
November 2023
238 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3626486

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 1 November 2023
• Online AM: 11 September 2023
• Accepted: 13 August 2023
• Received: 9 May 2023
