Self-Adaptive Root Cause Diagnosis for Large-Scale Microservice Architecture


Abstract:

The emergence of microservice architecture in Cloud systems poses new challenges for reliable operation and maintenance. Because of the numerous services and diverse types of metrics, identifying the root cause of an anomaly in a large-scale microservice architecture is time-consuming and challenging. To solve this issue, this article presents a multi-metric and self-adaptive root cause diagnosis framework, named MS-Rank. MS-Rank decomposes the task into four phases: impact graph construction, random walk diagnosis, result precision evaluation, and metric weight update. Initially, we introduce the concept of implicit metrics and propose a composite impact graph construction algorithm that uses multiple types of metrics to discover causal relationships between services. Afterwards, we propose a diagnostic algorithm in which forward, selfward and backward transitions are designed to heuristically identify the root cause services. In addition, we establish a self-adaptive mechanism that dynamically updates the confidence of different metrics according to their diagnostic precision. Lastly, we develop a prototype system and integrate MS-Rank into a real production system, IBM Cloud. Experimental results show that MS-Rank achieves high diagnostic precision and outperforms several selected benchmarks. Through multiple rounds of diagnosis, MS-Rank can optimize itself effectively. MS-Rank can be rapidly deployed in various microservice-based systems and applications, requiring no predefined knowledge. MS-Rank also allows expert experience to be introduced into its framework to improve diagnostic efficiency and precision.
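
The abstract names the random walk's three transition types and the self-adaptive weight update but gives no formulas, so the following Python sketch is only an illustration of the general idea, not the MS-Rank algorithm itself. Everything in it is an assumption made for the example: the function names, the `rho` and `lr` parameters, the per-service anomaly `score`, the toy call graph, and the specific weighting rules (forward weight equal to the callee's score, backward weight damped by `rho`, selfward weight equal to how much the current service outscores its best callee).

```python
import random
from collections import Counter


def transition_weights(node, calls, score, rho=0.2):
    """Unnormalized weights for one step of the walk (all rules are assumed)."""
    callees = calls.get(node, [])
    callers = [u for u, vs in calls.items() if node in vs]
    weights = {}
    # Forward: move to a service this node calls, favoring anomalous ones.
    for v in callees:
        weights[v] = weights.get(v, 0.0) + score[v]
    # Backward: retreat to a caller, damped by rho so the walk can escape dead ends.
    for u in callers:
        weights[u] = weights.get(u, 0.0) + rho * score[u]
    # Selfward: stay when this node looks more anomalous than any of its callees.
    best_forward = max((score[v] for v in callees), default=0.0)
    weights[node] = weights.get(node, 0.0) + max(0.0, score[node] - best_forward)
    return weights


def random_walk_rank(calls, score, start, steps=10000, seed=0):
    """Rank services by how often the walker visits them."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        weights = transition_weights(node, calls, score)
        targets = list(weights)
        values = list(weights.values())
        if sum(values) > 0:  # otherwise stay where we are
            node = rng.choices(targets, weights=values)[0]
        visits[node] += 1
    return [service for service, _ in visits.most_common()]


def update_metric_weights(weights, precision, lr=0.5):
    """Blend each metric's weight with its observed precision, then renormalize."""
    blended = {m: (1 - lr) * w + lr * precision[m] for m, w in weights.items()}
    total = sum(blended.values())
    if total == 0:
        return dict(weights)
    return {m: v / total for m, v in blended.items()}


# Hypothetical call graph: each key calls the services in its list.
calls = {"frontend": ["cart", "catalog"], "cart": ["db"], "catalog": ["db"], "db": []}
# Hypothetical anomaly scores, e.g. correlation with the front-end anomaly.
score = {"frontend": 0.5, "cart": 0.3, "catalog": 0.2, "db": 0.9}

ranking = random_walk_rank(calls, score, start="frontend")
print(ranking)

metric_weights = update_metric_weights(
    {"latency": 0.5, "cpu": 0.5}, {"latency": 0.8, "cpu": 0.2}
)
print(metric_weights)
```

In this toy setup the walker spends most of its time on the service with the highest score that has no more anomalous callee, and a metric whose past diagnoses were more precise gains weight for the next round. How MS-Rank actually defines these quantities is specified in the full article, not in the abstract.
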
Published in: IEEE Transactions on Services Computing (Volume: 15, Issue: 3, 01 May-June 2022)
Page(s): 1399 - 1410
Date of Publication: 08 May 2020
