
MicroCBR: Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting

  • Conference paper
Case-Based Reasoning Research and Development (ICCBR 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13405)



Abstract

With the growing market of cloud-native applications, microservices architectures are widely used for rapid and automated deployment, scaling, and management. However, behind the prosperity of microservices, diagnosing faults across numerous services has become highly complex for operators. To tackle this, we present a microservices troubleshooting framework called MicroCBR, which uses historical faults from a knowledge base to construct a spatio-temporal knowledge graph offline, and then troubleshoots online through case-based reasoning. Compared to existing frameworks, MicroCBR (1) takes advantage of heterogeneous data to fingerprint the faults, (2) carefully extracts a spatio-temporal knowledge graph with only one sample for each fault, and (3) can handle novel faults through hierarchical reasoning and incrementally add them to the fault knowledge base thanks to the case-based reasoning paradigm. Our framework is explainable to operators, who can easily locate the root causes and refer to historical solutions. We also conduct fault experiments on three different microservices architectures on the Grid’5000 testbed; the results show that MicroCBR achieves 91% top-1 accuracy and outperforms three state-of-the-art methods. We report success stories in a real cloud platform, and the code is open-sourced.
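To make the case-based reasoning loop in the abstract concrete, the following is a minimal sketch of the generic retrieve, reuse, and retain cycle over a fault knowledge base. It is not the MicroCBR implementation: the `FaultCase` and `CaseBase` classes, the symptom strings, the Jaccard similarity, and the `novelty_threshold` value are all assumptions chosen for illustration, and the spatio-temporal knowledge graph and hierarchical reasoning described above are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class FaultCase:
    """One historical fault: a name, its observed symptoms, and the recorded fix."""
    name: str
    symptoms: frozenset  # hypothetical fingerprint items, e.g. "metric:cpu_high"
    solution: str


def jaccard(a, b):
    """Jaccard similarity between two symptom sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)
    novelty_threshold: float = 0.5  # illustrative cut-off, not taken from the paper

    def retrieve(self, symptoms, k=3):
        """Return the k most similar historical cases as (score, case) pairs."""
        scored = [(jaccard(symptoms, c.symptoms), c) for c in self.cases]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def diagnose(self, symptoms):
        """Reuse the best match, or return None when the fault looks novel."""
        ranked = self.retrieve(symptoms, k=1)
        if not ranked or ranked[0][0] < self.novelty_threshold:
            return None
        return ranked[0][1]

    def retain(self, case):
        """Incrementally add a (newly diagnosed) case to the knowledge base."""
        self.cases.append(case)
```

In this sketch, a returned case gives the operator both a candidate root cause (`name`) and a historical remedy (`solution`), while a `None` result marks the fault as novel so that it can be diagnosed manually and then stored with `retain`.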


Notes

  1. MicroCBR repository: https://github.com/Fengrui-Liu/MicroCBR

  2. Online-Boutique: https://github.com/GoogleCloudPlatform/microservices-demo

  3. Sock-Shop: https://github.com/microservices-demo/microservices-demo

  4. Train-Ticket: https://github.com/FudanSELab/train-ticket

  5. Chaos-Mesh: https://chaos-mesh.org/


Acknowledgement

This work was supported in part by Austrian-Chinese Cooperative RTD Projects: 171111KYSB20200001, and the National Natural Science Foundation of China No. U20A20180 and No. 61802366.

Author information

Corresponding author

Correspondence to Gaogang Xie.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, F. et al. (2022). MicroCBR: Case-Based Reasoning on Spatio-temporal Fault Knowledge Graph for Microservices Troubleshooting. In: Keane, M.T., Wiratunga, N. (eds) Case-Based Reasoning Research and Development. ICCBR 2022. Lecture Notes in Computer Science, vol 13405. Springer, Cham. https://doi.org/10.1007/978-3-031-14923-8_15


  • DOI: https://doi.org/10.1007/978-3-031-14923-8_15


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14922-1

  • Online ISBN: 978-3-031-14923-8

  • eBook Packages: Computer Science, Computer Science (R0)
