DOI: 10.1145/3637528.3671547

Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction

Published: 24 August 2024

Abstract

Log-based failure prediction helps identify and mitigate system failures ahead of time, increasing the reliability of cloud elastic computing systems. However, most existing log-based failure prediction approaches focus only on semantic information and do not make full use of the information contained in the timestamps of log messages. This paper proposes the time-aware attention-based transformer (TAAT), a failure prediction approach that extracts semantic and temporal information simultaneously from log messages and their timestamps. TAAT first tokenizes raw log messages into specific exceptions, and then performs: 1) exception sequence embedding, which reorganizes the exceptions of each node as an ordered sequence and converts them to vectors; 2) time relation estimation, which computes time relation matrices from the timestamps; and 3) time-aware attention, which computes semantic correlation matrices from the exception sequences and then combines them with the time relation matrices. Experiments on Alibaba Cloud demonstrate that TAAT achieves an approximately 10% performance improvement over state-of-the-art approaches. TAAT is now used in the daily operation of Alibaba Cloud. This paper also releases the real-world cloud computing failure prediction dataset used in the study, which consists of about 2.7 billion syslogs from about 300,000 node controllers over a 4-month period. To our knowledge, this is the largest dataset of its kind, and it is expected to be very useful to the community.
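
The three stages named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the time relation matrices are computed or how they are combined with the semantic correlation matrices. The sketch assumes an exponential decay over absolute time gaps (the function `time_relation_matrix` and its `tau` parameter are hypothetical) and assumes the two matrices are simply added before the softmax. The embedding table and projection weights are random placeholders.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def time_relation_matrix(timestamps, tau=60.0):
    """ASSUMPTION: relation between events i and j decays as
    exp(-|t_i - t_j| / tau). The paper's actual estimator may differ."""
    t = np.asarray(timestamps, dtype=float)
    gaps = np.abs(t[:, None] - t[None, :])  # pairwise time gaps, shape (n, n)
    return np.exp(-gaps / tau)


def time_aware_attention(exception_ids, timestamps, embed, w_q, w_k, w_v, tau=60.0):
    """Single-head attention over one node's ordered exception sequence."""
    # 1) Exception sequence embedding: look up one vector per exception.
    x = embed[np.asarray(exception_ids)]  # shape (n, d)

    # 2) Time relation estimation from the timestamps.
    temporal = time_relation_matrix(timestamps, tau)  # shape (n, n)

    # 3) Semantic correlation (scaled dot product) combined with time.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    semantic = q @ k.T / np.sqrt(k.shape[-1])  # shape (n, n)
    # ASSUMPTION: additive combination before normalization.
    weights = softmax(semantic + temporal, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 4
    embed = rng.normal(size=(5, d))  # 5 hypothetical exception types
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
    ids = [0, 3, 3, 1]                 # exceptions in time order
    ts = [0.0, 5.0, 6.0, 600.0]        # seconds since first event
    out, weights = time_aware_attention(ids, ts, embed, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Under these assumptions, two exceptions logged a few seconds apart receive a larger additive bonus in their attention score than two exceptions logged minutes apart, which is one plausible way timestamps can influence attention alongside semantic similarity.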

Supplemental Material

MP4 File - Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction
Video presentation about the motivation and innovation of TAAT.

Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2024
6901 pages
ISBN:9798400704901
DOI:10.1145/3637528
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud elastic compute service
  2. log-based failure prediction
  3. time-aware attention
  4. transformer

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
