Abstract
With the growing market for cloud-native applications, microservices architectures are widely used for rapid and automated deployment, scaling, and management. However, behind the prosperity of microservices, diagnosing faults across numerous services has become highly complex for operators. To tackle this, we present a microservices troubleshooting framework called \(\text {MicroCBR}\), which uses historical faults from a knowledge base to construct a spatio-temporal knowledge graph offline and then troubleshoots online through case-based reasoning. Compared to existing frameworks, \(\text {MicroCBR}\) (1) takes advantage of heterogeneous data to fingerprint faults, (2) extracts a spatio-temporal knowledge graph from only one sample per fault, and (3) handles novel faults through hierarchical reasoning and incrementally adds them to the fault knowledge base, thanks to the case-based reasoning paradigm. Our framework is explainable: operators can easily locate root causes and refer to historical solutions. We conduct fault experiments on three different microservices architectures on the Grid’5000 testbed; the results show that \(\text {MicroCBR}\) achieves 91% top-1 accuracy and outperforms three state-of-the-art methods. We report success stories from a real cloud platform, and the code is open-sourced.
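The retrieve-and-retain loop of case-based reasoning described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the \(\text {MicroCBR}\) code: the `Case` structure, the signal names, the 0.5 threshold, and in particular the Jaccard similarity, which stands in for the framework's graph-based matching. The fallback to a coarser fault category only loosely mirrors the idea of hierarchical reasoning for novel faults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A hypothetical knowledge-base entry: one stored sample per fault."""
    name: str               # fault label
    category: str           # coarser fault type, used as a fallback
    fingerprint: frozenset  # anomalous signals seen when the fault occurred
    solution: str           # historical fix recorded by operators


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(observed, cases, threshold=0.5):
    """Return the closest stored case, or only its category if the match is weak."""
    if not cases:
        raise ValueError("knowledge base is empty")
    best = max(cases, key=lambda c: jaccard(observed, c.fingerprint))
    if jaccard(observed, best.fingerprint) >= threshold:
        return ("known", best.name, best.solution)
    # Weak match: treat as a novel fault and report only the coarse category.
    return ("novel", best.category, None)


def retain(cases, new_case):
    """Incrementally extend the knowledge base with a newly diagnosed fault."""
    return cases + [new_case]


knowledge_base = [
    Case("cpu-stress", "resource",
         frozenset({"metric:cpu_high", "log:throttling"}), "scale out"),
    Case("net-delay", "network",
         frozenset({"metric:latency_high", "trace:timeout"}), "check network policy"),
]

seen_before = frozenset({"metric:cpu_high", "log:throttling"})
unseen = frozenset({"metric:latency_high", "log:dns_error", "cmd:ping_fail"})

known_result = retrieve(seen_before, knowledge_base)
novel_result = retrieve(unseen, knowledge_base)

# After an operator diagnoses the novel fault, it is stored as a new case.
knowledge_base = retain(
    knowledge_base, Case("dns-failure", "network", unseen, "restart DNS service"))
relearned_result = retrieve(unseen, knowledge_base)
```

In this sketch each fault needs only one stored sample, and a fault that was novel on first sight is retrieved directly once it has been retained.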
Notes
- 1. \(\text {MicroCBR}\) repository: https://github.com/Fengrui-Liu/MicroCBR.
- 2. Online-Boutique: https://github.com/GoogleCloudPlatform/microservices-demo.
- 3.
- 4. Train-Ticket: https://github.com/FudanSELab/train-ticket.
- 5. Chaos-Mesh: https://chaos-mesh.org/.
Acknowledgement
This work was supported in part by Austrian-Chinese Cooperative RTD Projects: 171111KYSB20200001, and the National Natural Science Foundation of China No. U20A20180 and No. 61802366.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, F. et al. (2022). MicroCBR: Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting. In: Keane, M.T., Wiratunga, N. (eds) Case-Based Reasoning Research and Development. ICCBR 2022. Lecture Notes in Computer Science, vol 13405. Springer, Cham. https://doi.org/10.1007/978-3-031-14923-8_15
DOI: https://doi.org/10.1007/978-3-031-14923-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-14922-1
Online ISBN: 978-3-031-14923-8
eBook Packages: Computer Science, Computer Science (R0)