research-article

Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction

Authors:

Dongrui Wu

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Pages 4906 - 4917

https://doi.org/10.1145/3637528.3671547

Published: 24 August 2024

Abstract

Log-based failure prediction helps identify and mitigate system failures ahead of time, increasing the reliability of cloud elastic computing systems. However, most existing log-based failure prediction approaches focus only on semantic information and do not make full use of the information contained in the timestamps of log messages. This paper proposes the time-aware attention-based transformer (TAAT), a failure prediction approach that extracts semantic and temporal information simultaneously from log messages and their timestamps. TAAT first tokenizes raw log messages into specific exceptions, and then performs: 1) exception sequence embedding, which reorganizes the exceptions of each node as an ordered sequence and converts them to vectors; 2) time relation estimation, which computes time relation matrices from the timestamps; and 3) time-aware attention, which computes semantic correlation matrices from the exception sequences and then combines them with the time relation matrices. Experiments on Alibaba Cloud demonstrated that TAAT achieves an approximately 10% performance improvement over state-of-the-art approaches. TAAT is now used in the daily operation of Alibaba Cloud. This paper also releases the real-world cloud computing failure prediction dataset used in the study, which consists of about 2.7 billion syslogs from about 300,000 node controllers over a 4-month period. To our knowledge, this is the largest dataset of its kind, and it is expected to be very useful to the community.
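
The three steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the time relation matrix is computed or how it is combined with the semantic correlation matrix, so the sketch assumes a log-scaled pairwise time gap that is added to scaled dot-product scores before the softmax. All function names, the `scale` parameter, and the toy data are hypothetical.

```python
import numpy as np


def embed_exceptions(exception_ids, embedding_table):
    """Step 1: look up one vector per exception in a node's ordered sequence."""
    return embedding_table[np.asarray(exception_ids)]


def time_relation_matrix(timestamps, scale=60.0):
    """Step 2: pairwise time relation from timestamps (in seconds).

    ASSUMPTION: relation = -log(1 + |t_i - t_j| / scale), so events close in
    time get a value near 0 and distant events get a larger penalty.
    """
    t = np.asarray(timestamps, dtype=float)
    gaps = np.abs(t[:, None] - t[None, :])
    return -np.log1p(gaps / scale)


def softmax(x):
    """Numerically stable row-wise softmax."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def time_aware_attention(x, timestamps, w_q, w_k, w_v):
    """Step 3: semantic correlation plus time relation, then softmax.

    ASSUMPTION: the two matrices are combined by addition before the softmax.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    semantic = q @ k.T / np.sqrt(q.shape[-1])   # semantic correlation matrix
    temporal = time_relation_matrix(timestamps)  # time relation matrix
    weights = softmax(semantic + temporal)
    return weights @ v, weights


# Toy usage: 4 exceptions on one node, vocabulary of 5 exception types.
rng = np.random.default_rng(0)
dim = 8
table = rng.normal(size=(5, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

ids = [2, 0, 2, 4]            # hypothetical exception IDs
ts = [0.0, 5.0, 7.0, 3600.0]  # hypothetical timestamps in seconds
out, attn = time_aware_attention(embed_exceptions(ids, table), ts, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```

Under these assumptions, the exception logged an hour after the others receives a strongly negative temporal bias, so it attends to the earlier burst less than it would on semantics alone. Other combinations, such as multiplying the two matrices or learning the time relation, would fit the abstract's description equally well.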

Supplemental Material

MP4 File - Time-Aware Attention-Based Transformer (TAAT) for Cloud Computing System Failure Prediction

Video presentation about the motivation and innovation of TAAT.


Published In

KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 2024, 6901 pages

ISBN: 9798400704901

DOI: 10.1145/3637528

General Chairs: Ricardo Baeza-Yates (Northeastern University, USA), Francesco Bonchi (CENTAI / Eurecat, Italy)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD '24: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
