Abstract
Data centers equipped with large-scale storage systems are critical infrastructure in the era of big data. The enormous number of hard drives in these storage systems magnifies the probability of failure, which may cause tremendous losses for both users and providers of data services. Despite reactive fault-tolerance measures such as RAID, enhancing the reliability of large-scale storage systems remains difficult. Proactive prediction is an effective way to address possible hard-drive failures before they occur. A series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, serious challenges remain unsolved, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing the causes of failure. Experiments on real-world datasets show that the proposed deep architecture assesses hard-drive status and predicts impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can automatically reveal the degradation progression of hard drives and assist administrators in tracing the causes of hard-drive failures.
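To make the four-layer structure named in the abstract more concrete, the following is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the layer internals, so a ReLU projection stands in for feature integration, a plain tanh recurrent cell stands in for temporal dependency extraction, and a simple dot-product scoring vector stands in for the attention mechanism. The class name `AttentionHealthModel`, the layer sizes, the three health classes, and the random untrained weights are all hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class AttentionHealthModel:
    """Hypothetical stand-in for the four layers named in the abstract."""

    def __init__(self, n_features, n_hidden, n_classes, seed=0):
        rng = random.Random(seed)
        self.W_in = rand_matrix(n_hidden, n_features, rng)  # feature integration
        self.W_x = rand_matrix(n_hidden, n_hidden, rng)     # recurrent: input weights
        self.W_h = rand_matrix(n_hidden, n_hidden, rng)     # recurrent: state weights
        self.v = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # attention scorer
        self.W_out = rand_matrix(n_classes, n_hidden, rng)  # classification

    def forward(self, window):
        """window: list of per-time-step SMART feature vectors."""
        h = [0.0] * len(self.W_h)
        states = []
        for x in window:
            # Feature integration: ReLU projection of the raw attributes.
            z = [max(0.0, a) for a in matvec(self.W_in, x)]
            # Temporal dependency extraction: plain tanh recurrent update.
            pre = [a + b for a, b in zip(matvec(self.W_x, z), matvec(self.W_h, h))]
            h = [math.tanh(p) for p in pre]
            states.append(h)
        # Attention: one weight per time step, normalized to sum to 1.
        scores = [sum(vi * hi for vi, hi in zip(self.v, s)) for s in states]
        alpha = softmax(scores)
        context = [sum(a * s[j] for a, s in zip(alpha, states)) for j in range(len(h))]
        # Classification: probabilities over the (assumed) health classes.
        probs = softmax(matvec(self.W_out, context))
        return probs, alpha


# Usage with synthetic data: 5 time steps, 4 made-up SMART attributes.
data_rng = random.Random(1)
window = [[data_rng.random() for _ in range(4)] for _ in range(5)]
model = AttentionHealthModel(n_features=4, n_hidden=8, n_classes=3)
probs, alpha = model.forward(window)
```

In a sketch like this, `alpha` holds one weight per time step. Inspecting which time steps receive the largest weights is one way an attention layer can support the kind of explainability the abstract describes, although the weights here are untrained and carry no diagnostic meaning.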
Index Terms
- An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems