research-article

An Attention-augmented Deep Architecture for Hard Drive Status Monitoring in Large-scale Storage Systems

Published: 13 August 2019

Abstract

Data centers equipped with large-scale storage systems are critical infrastructures in the era of big data. The enormous number of hard drives in such storage systems magnifies the failure probability, which may cause tremendous loss for both data service users and providers. Despite reactive fault-tolerance measures such as RAID, enhancing the reliability of large-scale storage systems remains difficult. Proactive prediction is an effective way to anticipate hard-drive failures before they occur, and a series of models based on SMART statistics have been proposed to predict impending hard-drive failures. Nonetheless, serious challenges remain unsolved, such as the lack of explainability of prediction results. To address these issues, we carefully analyze a dataset collected from a real-world large-scale storage system and then design an attention-augmented deep architecture for hard-drive health status assessment and failure prediction. The deep architecture, composed of a feature integration layer, a temporal dependency extraction layer, an attention layer, and a classification layer, can not only monitor the status of hard drives but also assist in diagnosing the causes of failures. Experiments on real-world datasets show that the proposed deep architecture is able to assess hard-drive status and predict impending failures accurately. In addition, the experimental results demonstrate that the attention-augmented deep architecture can reveal the degradation progression of hard drives automatically and assist administrators in tracing the cause of hard-drive failures.
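
The abstract names an attention layer but gives no equations, so the following is only a minimal sketch of one common form of attention (dot-product scoring followed by a softmax over time steps), not the authors' formulation. The function names, the `query` vector, and the toy hidden states are all hypothetical. It illustrates why attention weights can help with explainability: each weight indicates how much one time step of the drive's history contributes to the pooled representation that a classifier would consume.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Pool a sequence of hidden states into one context vector.

    hidden_states: list of T vectors, one per time step (assumed to come
                   from some temporal layer such as an RNN).
    query:         vector of the same dimension (a hypothetical learned
                   parameter; fixed by hand here).
    Returns (context, weights), where weights sum to 1 over time steps.
    """
    # Dot-product relevance score for each time step.
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query))
              for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of the hidden states.
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return context, weights


# Toy example (made-up values): the last time step is the most "abnormal".
hidden_states = [[0.1, 0.0], [0.2, 0.1], [1.5, 1.2]]
query = [1.0, 1.0]
context, weights = attention_pool(hidden_states, query)
```

In this toy input the last time step receives the largest weight, which is the kind of signal an administrator could inspect when tracing how a drive degraded over time.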



        • Published in

          ACM Transactions on Storage, Volume 15, Issue 3
          August 2019
          173 pages
          ISSN: 1553-3077
          EISSN: 1553-3093
          DOI: 10.1145/3336116
          • Editor: Sam H. Noh

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 August 2019
          • Accepted: 1 June 2019
          • Revised: 1 March 2019
          • Received: 1 June 2018


          Qualifiers

          • research-article
          • Research
          • Refereed
