DOI: 10.1145/3316781.3317918
research-article

System-level hardware failure prediction using deep learning

Published: 02 June 2019

Abstract

Disk and memory faults are the leading causes of server breakdown. A proactive solution is to predict such hardware failures at runtime, then isolate the at-risk hardware and back up the data. However, current model-based predictors cannot use discrete time-series data, such as the values of device attributes, which convey high-level information about device behavior. In this paper, we propose a novel deep-learning-based scheme for system-level hardware failure prediction. We normalize the distribution of samples' attributes across vendors so that diverse training sets can be used. We propose a temporal Convolutional Neural Network (CNN) based model that is insensitive to noise in the time dimension. Finally, we design a loss function that trains the model effectively on extremely imbalanced samples. Experimental results on an open S.M.A.R.T. data set and an industrial data set show the effectiveness of the proposed scheme.
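The abstract names three ingredients without giving formulas: per-vendor attribute normalization, a temporal convolutional model, and a loss suited to heavy class imbalance. The following is a minimal sketch of what each could look like, not the authors' implementation. It assumes that normalization is a per-vendor z-score, that tolerance to temporal noise comes from a convolution followed by global max pooling, and it swaps in the well-known binary focal loss as a stand-in for the paper's own loss. All function names, the `(vendor, values)` record layout, and the default `gamma` and `alpha` values are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_vendor(samples):
    """Z-score every attribute column within its vendor group.

    `samples` is a list of (vendor, [attribute values]) pairs; this layout
    is a hypothetical stand-in for S.M.A.R.T.-style records.
    """
    by_vendor = defaultdict(list)
    for vendor, values in samples:
        by_vendor[vendor].append(values)

    stats = {}
    for vendor, rows in by_vendor.items():
        columns = list(zip(*rows))
        # A constant column has zero deviation; fall back to 1.0 to avoid
        # dividing by zero.
        stats[vendor] = [(mean(c), pstdev(c) or 1.0) for c in columns]

    return [
        (vendor, [(v - m) / s for v, (m, s) in zip(values, stats[vendor])])
        for vendor, values in samples
    ]


def conv1d_max(series, kernel):
    """Slide `kernel` over `series` (no padding), then take the global max.

    The max over time does not depend on where the best match occurs,
    which is one common source of tolerance to shifts along the time axis.
    """
    k = len(kernel)
    responses = [
        sum(w * x for w, x in zip(kernel, series[i:i + k]))
        for i in range(len(series) - k + 1)
    ]
    return max(responses)


def focal_loss(p_fail, label, gamma=2.0, alpha=0.75):
    """Binary focal loss for one sample (assumed stand-in, not the paper's loss).

    `p_fail` is the predicted failure probability and `label` is 1 for a
    failed device, 0 for a healthy one. Well-classified samples are
    down-weighted by (1 - p_t) ** gamma.
    """
    p_t = p_fail if label == 1 else 1.0 - p_fail
    a_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # keep log() finite
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With this sketch, two vendors that report the same attribute on different raw scales end up on a common scale after `normalize_per_vendor`, and `conv1d_max` returns the same value for a pattern and for a time-shifted copy of it. In `focal_loss`, a confidently correct prediction contributes far less than a confidently wrong one, so the many easy healthy samples do not swamp the few failures.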

Information

Published In

DAC '19: Proceedings of the 56th Annual Design Automation Conference 2019
June 2019
1378 pages
ISBN:9781450367257
DOI:10.1145/3316781
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hardware failure
  2. temporal CNN
  3. transfer learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DAC '19
Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Upcoming Conference

DAC '25
62nd ACM/IEEE Design Automation Conference
June 22 - 26, 2025
San Francisco, CA, USA

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 100
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 07 Mar 2025

Citations

Cited By

  • (2024) Removing obstacles before breaking through the memory wall. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, 851-867. DOI: 10.5555/3691992.3692044. Online publication date: 10-Jul-2024
  • (2024) Development of accelerated test method for developing failure prediction technology for SoC-equipped boards and application to failure prediction (SoC搭載基板の故障予測技術開発に向けた加速試験方法の開発と故障予測への適用). Transactions of the JSME (in Japanese) 90:929, 23-00108. DOI: 10.1299/transjsme.23-00108. Online publication date: 2024
  • (2024) MISP: A Multimodal-based Intelligent Server Failure Prediction Model for Cloud Computing Systems. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 5509-5520. DOI: 10.1145/3637528.3671568. Online publication date: 25-Aug-2024
  • (2024) Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4906-4917. DOI: 10.1145/3637528.3671547. Online publication date: 25-Aug-2024
  • (2024) DRAM Errors and Cosmic Rays: Space Invaders or Science Fiction? 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 194-205. DOI: 10.1109/SBAC-PAD63648.2024.00025. Online publication date: 13-Nov-2024
  • (2024) Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction. 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), 49-54. DOI: 10.1109/ISSREW63542.2024.00046. Online publication date: 28-Oct-2024
  • (2024) Scaling Disk Failure Prediction via Multi-Source Stream Mining. 2024 IEEE International Conference on Data Mining (ICDM), 131-140. DOI: 10.1109/ICDM59182.2024.00020. Online publication date: 9-Dec-2024
  • (2024) Virtual Machine Proactive Fault Tolerance Using Log-Based Anomaly Detection. IEEE Access 12, 178951-178970. DOI: 10.1109/ACCESS.2024.3506833. Online publication date: 2024
  • (2024) Robust battery lifetime prediction with noisy measurements via total-least-squares regression. Integration, the VLSI Journal 96:C. DOI: 10.1016/j.vlsi.2023.102136. Online publication date: 1-May-2024
  • (2024) Leveraging survival analysis in cost-aware deepnet for efficient hard drive failure prediction. Neural Computing and Applications 37:2, 1089-1104. DOI: 10.1007/s00521-024-10479-6. Online publication date: 16-Oct-2024
