An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems

Authors:

Philip S. Yu

ACM Transactions on Storage (TOS), Volume 15, Issue 3

Article No.: 21, Pages 1 - 26

https://doi.org/10.1145/3340290

Published: 13 August 2019

Abstract

Data centers equipped with large-scale storage systems are critical infrastructure in the era of big data. The enormous number of hard drives in these systems magnifies the probability of failure, which may cause tremendous losses for both data service users and providers. Despite reactive fault-tolerance measures such as RAID, enhancing the reliability of large-scale storage systems remains difficult. Proactive prediction is an effective way to anticipate hard-drive failures before they occur. A series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, serious challenges remain unsolved, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing failure causes. Experiments on real-world datasets show that the proposed deep architecture assesses hard-drive status and predicts impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the cause of hard-drive failures.
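
The four-layer pipeline named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the internals of any layer, so everything below is an assumption made for illustration: the dense tanh projection, the plain tanh recurrence, the dot-product attention scoring, the logistic output, and all names (`forward`, `W_in`, `W_rec`, `v_att`, `w_out`, `b_out`) and sizes. The weights are random, so the outputs carry no meaning beyond showing the data flow.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward(smart_window, params):
    """Run one drive's window of SMART records through the four stages.

    smart_window: (T, F) array, T consecutive records with F attributes each.
    Returns (failure probability, per-time-step attention weights).
    """
    # Feature integration (assumed form): dense projection of each record.
    z = np.tanh(smart_window @ params["W_in"])             # (T, H)

    # Temporal dependency extraction (assumed form): a plain tanh recurrence
    # stands in for whatever recurrent unit the paper actually uses.
    h = np.zeros(params["W_rec"].shape[0])
    states = []
    for z_t in z:
        h = np.tanh(z_t + h @ params["W_rec"])
        states.append(h)
    states = np.stack(states)                               # (T, H)

    # Attention (assumed form): one score per time step, normalized to sum to 1.
    weights = softmax(states @ params["v_att"])             # (T,)
    context = weights @ states                              # (H,)

    # Classification (assumed form): logistic output read as failure probability.
    logit = context @ params["w_out"] + params["b_out"]
    p_fail = 1.0 / (1.0 + np.exp(-logit))
    return p_fail, weights

# Hypothetical sizes: 14 records, 5 SMART attributes, 8 hidden units.
rng = np.random.default_rng(0)
T, F, H = 14, 5, 8
params = {
    "W_in": rng.normal(scale=0.3, size=(F, H)),
    "W_rec": rng.normal(scale=0.3, size=(H, H)),
    "v_att": rng.normal(size=H),
    "w_out": rng.normal(size=H),
    "b_out": 0.0,
}
window = rng.normal(size=(T, F))
p_fail, weights = forward(window, params)
```

In a sketch like this, `weights` is the part relevant to explainability: it assigns each time step in the window a share of the final decision, which is one plausible way an attention layer could expose when a drive's degradation became visible.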

Published In

ACM Transactions on Storage Volume 15, Issue 3

August 2019

173 pages

ISSN:1553-3077

EISSN:1553-3093

DOI:10.1145/3336116

Editor:
Sam H. Noh
Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2019

Accepted: 01 June 2019

Revised: 01 March 2019

Received: 01 June 2018

Published in TOS Volume 15, Issue 3

Qualifiers

Research-article
Research
Refereed

Funding Sources

Science Fund for Distinguished Young Scholars in Hunan Province
National Natural Science Foundation of China
