NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms

Authors:
Chuan Luo, Microsoft Research, China
Pu Zhao, Microsoft Research, China
Bo Qiao, Microsoft Research, China
Youjiang Wu, Microsoft Azure, USA
Hongyu Zhang, The University of Newcastle, Australia
Wei Wu, Leibniz University Hannover, Germany
Weihai Lu, Microsoft Research, China
Yingnong Dang, Microsoft Azure, USA
Saravanakumar Rajmohan, Microsoft Office, USA
Qingwei Lin, Microsoft Research, China
Dongmei Zhang, Microsoft Research, China

WWW '21: Proceedings of the Web Conference 2021, April 2021, Pages 1181–1191. https://doi.org/10.1145/3442381.3449867

Published: 03 June 2021

ABSTRACT

With the rapid deployment of cloud platforms, high service reliability is of critical importance. An industrial cloud platform contains a huge number of disks, and disk failure is a common cause of service unreliability. In recent years, many machine-learning-based disk failure prediction approaches have been proposed; they predict disk failures from disk status data before the failures actually happen, so that proactive actions can be taken in advance to improve service reliability. However, existing approaches treat each disk individually and do not explore the influence of neighboring disks. In this paper, we propose the Neighborhood-Temporal Attention Model (NTAM), a novel deep-learning-based approach to disk failure prediction. When predicting whether a disk will fail in the near future, NTAM uses not only the disk’s own status data but also its neighbors’ status data. Moreover, NTAM includes a novel attention-based temporal component to capture the temporal nature of the disk status data. In addition, we propose a data enhancement method, called Temporal Progressive Sampling (TPS), to handle the extreme data imbalance issue. We evaluate NTAM on a public dataset as well as two industrial datasets collected from millions of disks in Microsoft Azure. Our experimental results show that NTAM significantly outperforms state-of-the-art competitors. Our empirical evaluations also indicate the effectiveness of the neighborhood-aware component and the temporal component underlying NTAM, as well as the effectiveness of TPS. More encouragingly, we have successfully applied NTAM and TPS to Microsoft cloud platforms (including Microsoft Azure and Microsoft 365) and obtained benefits in industrial practice.
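
As a rough illustration of the neighborhood idea in the abstract, the sketch below combines a disk's own feature vector with an attention-weighted summary of its neighbors' vectors. This is not the authors' NTAM architecture, which the abstract does not specify. The function names (`softmax`, `neighbor_attention`), the scaled dot-product scoring, and the toy feature values are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neighbor_attention(disk, neighbors):
    """Hypothetical helper: attend from one disk to its neighbors.

    disk      -- feature vector of the target disk (list of floats)
    neighbors -- feature vectors of neighboring disks (same length as disk)
    Returns (weights, context); context is the weighted sum of the neighbors.
    """
    d = len(disk)
    # Scaled dot-product score between the target disk and each neighbor
    # (an assumed scoring function, not taken from the paper).
    scores = [sum(q * k for q, k in zip(disk, nb)) / math.sqrt(d)
              for nb in neighbors]
    weights = softmax(scores)
    context = [sum(w * nb[i] for w, nb in zip(weights, neighbors))
               for i in range(d)]
    return weights, context

# Toy status vectors (made-up numbers, not real disk status data).
target = [1.0, 0.0]
nbrs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = neighbor_attention(target, nbrs)

# A downstream classifier could consume the disk's own features
# concatenated with the neighborhood summary.
combined = target + context
```

In this toy setup, neighbors whose status vectors resemble the target disk's vector receive larger weights, so the neighborhood summary is dominated by the most similar disks.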

Index Terms

Computing methodologies → Machine learning → Machine learning approaches

Published in
WWW '21: Proceedings of the Web Conference 2021
April 2021
4054 pages
ISBN: 9781450383127
DOI: 10.1145/3442381
Editors: Jure Leskovec (Stanford), Marko Grobelnik (Jožef Stefan Institute), Marc Najork (Google), Jie Tang (Tsinghua University), Leila Zia (Wikimedia Foundation)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cloud Platforms
Data Imbalance
Disk Failure Prediction
High Service Reliability
Neighborhood-Temporal Attention Model
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%