DOI: 10.1145/2939672.2939699
research-article

Predicting Disk Replacement towards Reliable Data Centers

Published: 13 August 2016

Abstract

Disks are among the most frequently failing components in today's IT environments. Despite defense mechanisms such as RAID, disk failures still often severely impact system availability and reliability. In this paper, we present a highly accurate SMART-based analysis pipeline that can correctly predict the need for a disk replacement up to 10-15 days in advance. Our method has been built and evaluated on more than 30,000 disks from two major manufacturers, monitored over 17 months. Our approach employs statistical techniques to automatically detect which SMART parameters correlate with disk replacement and uses them to predict the replacement of a disk with up to 98% accuracy.

Supplementary Material

MP4 File (kdd2016_botezatu_disk_replacement_01-acm.mp4)

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. changepoint
  2. classification
  3. disk replacement
  4. time series

Qualifiers

  • Research-article

Conference

KDD '16

Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 61
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 07 Mar 2025

Cited By

  • (2024) Exploit both SMART attributes and NAND flash wear characteristics to effectively forecast SSD-based storage failures in clusters. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pages 1101-1117. DOI: 10.5555/3691992.3692059. Online publication date: 10-Jul-2024
  • (2024) Prediction of Disk Failure Based on Classification Intensity Resampling. Information, 15:6 (322). DOI: 10.3390/info15060322. Online publication date: 31-May-2024
  • (2024) On the Model Update Strategies for Supervised Learning in AIOps Solutions. ACM Transactions on Software Engineering and Methodology, 33:7 (1-38). DOI: 10.1145/3664599. Online publication date: 26-Aug-2024
  • (2024) Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, pages 222-233. DOI: 10.1145/3644815.3644961. Online publication date: 14-Apr-2024
  • (2024) MISP: A Multimodal-based Intelligent Server Failure Prediction Model for Cloud Computing Systems. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5509-5520. DOI: 10.1145/3637528.3671568. Online publication date: 25-Aug-2024
  • (2024) SOIL: Score Conditioned Diffusion Model for Imbalanced Cloud Failure Prediction. Companion Proceedings of the ACM Web Conference 2024, pages 65-72. DOI: 10.1145/3589335.3648303. Online publication date: 13-May-2024
  • (2024) SiaDFP: A Disk Failure Prediction Framework Based on Siamese Neural Network in Large-Scale Data Center. IEEE Transactions on Services Computing, 17:5 (2890-2903). DOI: 10.1109/TSC.2024.3394692. Online publication date: Sep-2024
  • (2024) Proactive Drive Failure Prediction for Cloud Storage System Through Semi-Supervised Learning. IEEE Transactions on Dependable and Secure Computing, 21:4 (1528-1543). DOI: 10.1109/TDSC.2023.3286093. Online publication date: Jul-2024
  • (2024) Vehicular Lead-Acid Battery Fault Prediction Method based on A-DeepFM. 2024 International Conference on New Trends in Computational Intelligence (NTCI), pages 111-116. DOI: 10.1109/NTCI64025.2024.10776118. Online publication date: 18-Oct-2024
  • (2024) Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction. 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 49-54. DOI: 10.1109/ISSREW63542.2024.00046. Online publication date: 28-Oct-2024
