research-article

A Disk Failure Prediction Method Based on Active Semi-supervised Learning

Published: 12 November 2022

Abstract

Disk failure has always been a major problem for data centers, leading to data loss. Current disk failure prediction approaches are mostly offline and assume that the disk labels required for training learning models are available and accurate. However, these offline methods are no longer suitable for disk failure prediction tasks in large-scale data centers. Faced with an explosive amount of data, most methods do not consider that label values may be hard to obtain during training or that the obtained label values may not be completely accurate. These problems further restrict the development of supervised learning and offline modeling in disk failure prediction. In this article, we propose Active Semi-supervised Learning Disk-failure Prediction (ASLDP), a novel disk failure prediction method that combines active learning and semi-supervised learning. According to the characteristics of data in the disk lifecycle, ASLDP applies active learning to clearly labeled samples, selecting the valuable samples with the greatest probability uncertainty and eliminating redundancy. For samples that are unclearly labeled or unlabeled, ASLDP uses semi-supervised learning to pre-label them by calculating the conditional values of the samples, and enhances generalization ability through active learning. Compared with several state-of-the-art offline and online learning approaches, the results on four real-world datasets from Backblaze and Baidu demonstrate that ASLDP achieves stable failure detection rates of 80–85% with low false alarm rates. In addition, we use a dataset from Alibaba to evaluate the generality of ASLDP. Furthermore, ASLDP can overcome the problems of missing sample labels and data redundancy in large data centers, which, to the best of our knowledge, existing offline learning methods for disk failure prediction neither consider nor address. Finally, ASLDP can predict disk failures 4.9 days in advance with lower overhead and latency.
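
The selection step described in the abstract (choosing the samples with the greatest probability uncertainty while eliminating redundancy) can be illustrated with a minimal sketch. This is not the ASLDP implementation: the abstract does not specify the uncertainty measure or the redundancy criterion, so the binary-entropy score, the Euclidean distance threshold `min_dist`, the function names, and the toy feature vectors below are all assumptions made for illustration.

```python
import math


def uncertainty(p_fail):
    """Binary entropy of a predicted failure probability (assumed measure).

    It peaks at p_fail = 0.5, where the model is least sure.
    """
    if p_fail <= 0.0 or p_fail >= 1.0:
        return 0.0
    return -(p_fail * math.log2(p_fail) + (1 - p_fail) * math.log2(1 - p_fail))


def select_queries(samples, probs, budget, min_dist):
    """Return indices of up to `budget` samples, most uncertain first.

    A candidate is skipped as redundant when it lies closer than `min_dist`
    (Euclidean distance, an assumed criterion) to an already chosen sample.
    """
    order = sorted(range(len(samples)),
                   key=lambda i: uncertainty(probs[i]),
                   reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == budget:
            break
        if all(math.dist(samples[i], samples[j]) >= min_dist for j in chosen):
            chosen.append(i)
    return chosen


# Hypothetical normalized feature vectors and model failure probabilities.
samples = [(0.10, 0.20), (0.11, 0.21), (0.90, 0.80), (0.50, 0.50)]
probs = [0.52, 0.51, 0.97, 0.30]

picked = select_queries(samples, probs, budget=2, min_dist=0.05)
print(picked)  # [1, 3]
```

In this toy run, sample 0 is nearly as uncertain as sample 1 but is dropped because the two are almost identical, so the second query goes to sample 3 instead. A real deployment would replace the toy probabilities with the output of whatever classifier is being trained.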



    • Published in

      ACM Transactions on Storage, Volume 18, Issue 4
      November 2022
      255 pages
      ISSN: 1553-3077
      EISSN: 1553-3093
      DOI: 10.1145/3570642

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 November 2022
      • Online AM: 27 September 2022
      • Accepted: 2 March 2022
      • Revised: 24 January 2022
      • Received: 7 July 2021

      Qualifiers

      • research-article
      • Refereed
