DOI: 10.1145/3442381.3449867
research-article

NTAM: Neighborhood-Temporal Attention Model for Disk Failure Prediction in Cloud Platforms

Published: 03 June 2021

ABSTRACT

With the rapid deployment of cloud platforms, high service reliability is of critical importance. An industrial cloud platform contains a huge number of disks, and disk failure is a common cause of service unreliability. In recent years, many machine-learning-based disk failure prediction approaches have been proposed; they predict disk failures from disk status data before the failures actually happen, so that proactive actions can be taken in advance to improve service reliability. However, existing approaches treat each disk individually and do not explore the influence of neighboring disks. In this paper, we propose the Neighborhood-Temporal Attention Model (NTAM), a novel deep-learning-based approach to disk failure prediction. When predicting whether a disk will fail in the near future, NTAM utilizes not only the disk’s own status data but also its neighbors’ status data. Moreover, NTAM includes an attention-based temporal component to capture the temporal nature of the disk status data. In addition, we propose a data enhancement method, called Temporal Progressive Sampling (TPS), to handle the extreme data imbalance issue. We evaluate NTAM on a public dataset as well as two industrial datasets collected from millions of disks in Microsoft Azure. Our experimental results show that NTAM significantly outperforms state-of-the-art competitors. Our empirical evaluations also indicate the effectiveness of the neighborhood-aware component and the temporal component underlying NTAM, as well as the effectiveness of TPS. More encouragingly, we have successfully applied NTAM and TPS to Microsoft cloud platforms (including Microsoft Azure and Microsoft 365) and obtained benefits in industrial practice.

    • Published in

      WWW '21: Proceedings of the Web Conference 2021
      April 2021
      4054 pages
      ISBN: 9781450383127
      DOI: 10.1145/3442381

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 June 2021

      Qualifiers

      • research-article
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
