ABSTRACT
With the rapid deployment of cloud platforms, high service reliability is of critical importance. An industrial cloud platform contains a huge number of disks, and disk failure is a common cause of service unreliability. In recent years, many machine learning based disk failure prediction approaches have been proposed; they predict disk failures from disk status data before the failures actually happen, so that proactive actions can be taken in advance to improve service reliability. However, existing approaches treat each disk individually and do not explore the influence of neighboring disks. In this paper, we propose the Neighborhood-Temporal Attention Model (NTAM), a novel deep learning based approach to disk failure prediction. When predicting whether a disk will fail in the near future, NTAM uses not only the disk's own status data but also its neighbors' status data. Moreover, NTAM includes a novel attention-based temporal component to capture the temporal nature of the disk status data. In addition, we propose a data enhancement method, called Temporal Progressive Sampling (TPS), to handle the extreme data imbalance issue. We evaluate NTAM on a public dataset as well as two industrial datasets collected from millions of disks in Microsoft Azure. Our experimental results show that NTAM significantly outperforms state-of-the-art competitors. Our empirical evaluations also indicate the effectiveness of the neighborhood-aware component and the temporal component underlying NTAM, as well as the effectiveness of TPS. More encouragingly, we have successfully applied NTAM and TPS to Microsoft cloud platforms (including Microsoft Azure and Microsoft 365) and obtained benefits in industrial practice.
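The abstract does not spell out NTAM's architecture, so the following is only a minimal sketch of the general idea it describes: summarizing a disk's status history with attention over time, then attending over the disk and its neighbors. It uses generic scaled dot-product attention as a stand-in; the function names, the choice of the most recent day as the query, and the concatenated output are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(query, vectors):
    """Generic scaled dot-product attention pooling.

    Scores each vector against the query, normalizes the scores with
    softmax, and returns the weighted sum together with the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def disk_representation(own_history, neighbor_histories):
    """Hypothetical helper: combine a disk's history with its neighbors'.

    own_history: list of per-day feature vectors, oldest first.
    neighbor_histories: list of such histories, one per neighboring disk.
    """
    # Temporal step (assumption): the most recent day acts as the query.
    own_summary, _ = attention_pool(own_history[-1], own_history)
    neighbor_summaries = [attention_pool(h[-1], h)[0] for h in neighbor_histories]
    # Neighborhood step (assumption): attend over the disk itself and its neighbors.
    candidates = [own_summary] + neighbor_summaries
    context, neighbor_weights = attention_pool(own_summary, candidates)
    # Concatenate the disk's own summary with the neighborhood context.
    return own_summary + context, neighbor_weights


if __name__ == "__main__":
    own = [[0.0, 1.0], [0.5, 1.0], [2.0, 1.0]]
    neighbors = [[[0.0, 1.0], [0.1, 1.0]], [[1.5, 1.0], [1.8, 1.0]]]
    representation, weights = disk_representation(own, neighbors)
    print(representation, weights)
```

In a real model the resulting vector would feed a trained classifier, and the queries would be learned parameters rather than raw feature vectors; both are omitted here to keep the sketch self-contained.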
NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms