DOI: 10.1145/3551349.3559503
research-article

A fault injection platform for learning AIOps models

Published: 05 January 2023

Abstract

In today’s IT environment, with a growing number of costly outages, increasing system complexity, and the availability of massive operational data, there is growing demand to effectively leverage Artificial Intelligence and Machine Learning (AI/ML) for enhanced resiliency. In this paper, we present an automatic fault injection platform that enables and optimizes the generation of the data needed to build AI/ML models supporting modern IT operations. The merits of our platform include ease of use, the ability to orchestrate complex fault scenarios, and the ability to optimize data generation for the modeling task at hand. Specifically, we designed a fault injection service that (i) combines fault injection with data collection in a unified framework, (ii) supports hybrid and multi-cloud environments, and (iii) does not require programming skills to use. Our current implementation covers the most common fault types at both the application and infrastructure levels. The platform also includes some AI capabilities; in particular, we demonstrate the interventional causal learning capability currently available in it. We show how our system learns a model of error propagation in a micro-service application in a cloud environment (when the communication graph among micro-services is unknown and only logs are available) for use in subsequent applications such as fault localization.
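The idea of learning error propagation from interventions can be illustrated with a minimal sketch. It is not the platform's implementation: the service names, the `observed` mapping (which services logged errors after a fault was injected into each service), and both function names are hypothetical, and the sketch assumes deterministic, acyclic propagation. Under those assumptions, a direct propagation edge is one that is not already explained by a path through another affected service, and a fault can be localized by matching observed symptoms against each service's learned error footprint.

```python
from typing import Dict, List, Set, Tuple


def learn_propagation_graph(observed: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Infer direct error-propagation edges from interventional data.

    observed[s] is the set of services that logged errors after a fault
    was injected into s (hypothetical input format).
    """
    # reach[s]: services affected by a fault in s, excluding s itself.
    reach = {s: set(affected) - {s} for s, affected in observed.items()}
    direct: Dict[str, Set[str]] = {}
    for s, targets in reach.items():
        # Keep s -> t only if no other affected service m also reaches t
        # (transitive reduction; assumes the true graph is acyclic).
        direct[s] = {
            t for t in targets
            if not any(t in reach.get(m, set()) for m in targets if m != t)
        }
    return direct


def localize(observed: Dict[str, Set[str]], symptoms: Set[str]) -> List[Tuple[str, float]]:
    """Rank candidate root causes by Jaccard similarity between the
    observed symptoms and each service's learned error footprint."""
    scores = []
    for s, affected in observed.items():
        footprint = set(affected) | {s}
        union = footprint | symptoms
        score = len(footprint & symptoms) / len(union) if union else 0.0
        scores.append((s, score))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda kv: (-kv[1], kv[0]))


# Hypothetical three-service application: errors flow db -> orders -> frontend.
observed = {
    "db": {"db", "orders", "frontend"},
    "orders": {"orders", "frontend"},
    "frontend": {"frontend"},
}
graph = learn_propagation_graph(observed)
ranking = localize(observed, {"orders", "frontend"})
```

In this toy example, the edge from `db` to `frontend` is dropped because it is explained by the path through `orders`. A real direct edge that coincides with an indirect path would be dropped as well, so the reduction yields a minimal graph consistent with the interventions, not necessarily the true communication graph.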



Published In

ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
October 2022
2006 pages
ISBN: 9781450394758
DOI: 10.1145/3551349
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. AI supported operations
  2. Fault diagnosis
  3. Fault injection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ASE '22

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Article Metrics

  • Downloads (Last 12 months): 100
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) Diner: Interpretable Anomaly Detection for Seasonal Time Series in Web Services. IEEE Transactions on Services Computing 17:5 (2248-2260). DOI: 10.1109/TSC.2024.3422894. Online publication date: Sep-2024
  • (2024) MicroOps: Rapid Microservice Data Simulation and AIOps Model Development Platform. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (52-56). DOI: 10.1109/SANER60148.2024.00012. Online publication date: 12-Mar-2024
  • (2024) Fault Localization Using Interventional Causal Learning for Cloud-Native Applications. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S) (141-147). DOI: 10.1109/DSN-S60304.2024.00040. Online publication date: 24-Jun-2024
  • (2024) SAM: Subseries Augmentation-Based Meta-Learning for Generalizing AIOps Models in Multi-Cloud Migration. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (291-301). DOI: 10.1109/CLOUD62652.2024.00040. Online publication date: 7-Jul-2024
