DOI: 10.1145/3308558.3313501
research-article

Outage Prediction and Diagnosis for Cloud Service Systems

Published: 13 May 2019

Abstract

With the rapid growth of cloud service systems and their increasing complexity, service failures have become unavoidable. Outages, which are critical service failures, can dramatically degrade system availability and harm the user experience. To minimize service downtime and ensure high system availability, we develop an intelligent outage management approach, called AirAlert, which forecasts outages before they happen and diagnoses their root causes after they occur. AirAlert works as a global watcher for the entire cloud system: it collects all alerting signals, detects dependencies among signals, and proactively predicts outages that may happen anywhere in the system. We analyze the relationships between outages and alerting signals using a Bayesian network and predict outages using a robust classification method based on gradient boosting trees. The approach is evaluated on an outage dataset collected from a Microsoft cloud system, and the results confirm its effectiveness.
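
The abstract does not give AirAlert's features or model details, so the following is only a minimal sketch of the general idea of predicting outages from alerting signals with gradient-boosted trees. Everything in it is an assumption made for illustration: the three signal columns, the toy data, the helper names (`fit_stump`, `fit_boosted_stumps`, `predict_proba`), and the simplification to depth-1 trees fitted to log-loss residuals in pure Python.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the split
    with the lowest squared error on the residuals, or None if no split exists."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, j, thr, lmean, rmean)
    return None if best is None else best[1:]

def fit_boosted_stumps(X, y, rounds=20, learning_rate=0.5):
    """Gradient boosting for binary labels: each round fits a stump to the
    negative gradient of the log-loss (label minus predicted probability).
    Assumes y contains both classes, so the initial log-odds is finite."""
    pos_rate = sum(y) / len(y)
    base = math.log(pos_rate / (1.0 - pos_rate))
    scores = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, thr, lval, rval = stump
        stumps.append(stump)
        scores = [s + learning_rate * (lval if row[j] <= thr else rval)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    """Predicted outage probability for one window of alert counts."""
    base, lr, stumps = model
    score = base + sum(lr * (lval if row[j] <= thr else rval)
                       for j, thr, lval, rval in stumps)
    return sigmoid(score)

# Hypothetical data: per-window counts of [cpu, network, disk] alerts,
# labelled 1 if an outage followed the window and 0 otherwise.
X = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0], [2, 0, 1],
     [5, 3, 0], [4, 4, 1], [6, 2, 0], [3, 5, 1]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

model = fit_boosted_stumps(X, y)
print(predict_proba(model, [5, 4, 0]))  # window with many alerts
print(predict_proba(model, [0, 1, 0]))  # quiet window
```

A production system would more plausibly rely on an established boosting library and on features derived from real alerting signals; the sketch only shows the shape of the classification step.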

Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN:9781450366748
DOI:10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Outage prediction
  2. cloud system
  3. outage diagnosis
  4. service availability
  5. system of systems

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19
WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 104
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 16 Feb 2025

Cited By

  • (2024) Empirical Studies on Failure Prediction for Distributed Systems Based on Feature Selection. Proceeding of the 2024 5th Asia Service Sciences and Software Engineering Conference, 43-52. DOI: 10.1145/3702138.3702146. Online publication date: 11-Sep-2024
  • (2024) The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 931-943. DOI: 10.1145/3691620.3695475. Online publication date: 27-Oct-2024
  • (2024) On the Model Update Strategies for Supervised Learning in AIOps Solutions. ACM Transactions on Software Engineering and Methodology 33:7, 1-38. DOI: 10.1145/3664599. Online publication date: 26-Aug-2024
  • (2024) Easy over Hard: A Simple Baseline for Test Failures Causes Prediction. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 306-317. DOI: 10.1145/3663529.3663850. Online publication date: 10-Jul-2024
  • (2024) Is Your Anomaly Detector Ready for Change? Adapting AIOps Solutions to the Real World. Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering - Software Engineering for AI, 222-233. DOI: 10.1145/3644815.3644961. Online publication date: 14-Apr-2024
  • (2024) Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data. IEEE Transactions on Services Computing, 1-14. DOI: 10.1109/TSC.2024.3478759. Online publication date: 2024
  • (2024) Detecting Cloud Anomaly via Broad Network-Based Contrastive Autoencoder. IEEE Transactions on Network and Service Management 21:3, 3249-3263. DOI: 10.1109/TNSM.2024.3353772. Online publication date: Jun-2024
  • (2024) Revamping the Resilience and High Availability of 5G Core for 6G Ready Network Slices. IEEE Transactions on Network and Service Management 21:2, 2287-2302. DOI: 10.1109/TNSM.2023.3348137. Online publication date: Apr-2024
  • (2024) A Review of Software Testing Process Log Parsing and Mining. 2024 IEEE International Conference on Software Services Engineering (SSE), 334-343. DOI: 10.1109/SSE62657.2024.00055. Online publication date: 7-Jul-2024
  • (2024) Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction. 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW), 49-54. DOI: 10.1109/ISSREW63542.2024.00046. Online publication date: 28-Oct-2024
