Abstract
Disk failure has always been a major problem for data centers, leading to data loss. Current disk failure prediction approaches are mostly offline and assume that the disk labels required for training learning models are available and accurate. However, these offline methods are no longer suitable for disk failure prediction tasks in large-scale data centers. As data volumes grow explosively, most methods do not consider that labels may be hard to obtain during training, or that the labels obtained may not be completely accurate. These problems further restrict the development of supervised learning and offline modeling in disk failure prediction. In this article, we propose Active Semi-supervised Learning Disk-failure Prediction (ASLDP), a novel disk failure prediction method that combines active learning and semi-supervised learning. According to the characteristics of data in the disk lifecycle, ASLDP applies active learning to clearly labeled samples, selecting the valuable samples with the highest probability uncertainty and eliminating redundant ones. For samples that are unclearly labeled or unlabeled, ASLDP uses semi-supervised learning to pre-label them by calculating their conditional values, and then improves generalization through active learning. Compared with several state-of-the-art offline and online learning approaches, the results on four realistic datasets from Backblaze and Baidu demonstrate that ASLDP achieves stable failure detection rates of 80–85% with low false alarm rates. In addition, we use a dataset from Alibaba to evaluate the generality of ASLDP. Furthermore, ASLDP can overcome the problems of missing sample labels and data redundancy in large data centers, which, to the best of our knowledge, no offline learning method for disk failure prediction considers or addresses. Finally, ASLDP can predict disk failure 4.9 days in advance with lower overhead and latency.
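The abstract does not spell out the selection and pre-labeling rules, so the following is only a minimal sketch of the two ideas it names, not the authors' algorithm. It assumes binary entropy as the uncertainty measure, a Euclidean distance threshold as the redundancy filter, and a plain confidence threshold as a stand-in for the "conditional values" used for pre-labeling. All function names, parameters, and thresholds are hypothetical.

```python
import math


def entropy(p_fail):
    """Binary entropy (in bits) of a predicted failure probability.

    0.0 means the model is certain; 1.0 means it is maximally uncertain.
    """
    if p_fail <= 0.0 or p_fail >= 1.0:
        return 0.0
    return -(p_fail * math.log2(p_fail) + (1 - p_fail) * math.log2(1 - p_fail))


def select_uncertain(samples, probs, budget, min_dist=0.1):
    """Active-learning step (assumed form): return indices of up to `budget`
    samples with the highest entropy, skipping any sample closer than
    `min_dist` to one already chosen (a simple redundancy filter)."""
    ranked = sorted(range(len(samples)), key=lambda i: entropy(probs[i]), reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) == budget:
            break
        if all(math.dist(samples[i], samples[j]) >= min_dist for j in chosen):
            chosen.append(i)
    return chosen


def pseudo_label(probs, threshold=0.9):
    """Semi-supervised step (assumed stand-in): give a provisional label only
    where the model is confident; leave the rest as None for later review."""
    labels = []
    for p in probs:
        if p >= threshold:
            labels.append(1)      # provisionally "will fail"
        elif p <= 1 - threshold:
            labels.append(0)      # provisionally "healthy"
        else:
            labels.append(None)   # too uncertain to pre-label
    return labels


# Toy feature vectors (e.g. two normalized SMART attributes) and model outputs.
samples = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (2.0, 2.0)]
probs = [0.5, 0.5, 0.6, 0.99]

picked = select_uncertain(samples, probs, budget=2)
provisional = pseudo_label([0.95, 0.03, 0.5])
```

In this toy run, the second sample is as uncertain as the first but lies almost on top of it, so the redundancy filter passes over it in favor of the next most uncertain sample. A real pipeline would replace the threshold rule with the conditional-value computation described in the article.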
Index Terms
- A Disk Failure Prediction Method Based on Active Semi-supervised Learning