Abstract
In cloud computing and distributed systems, microservices architecture has become preeminent due to its scalability and flexibility. However, the distributed nature of microservices systems introduces significant challenges in maintaining operational reliability, especially in fault localization. Traditional fault localization methods fall short because they are time-intensive and error-prone. To address this gap, we present SpanGraph, a novel framework that employs graph convolutional networks (GCN) to achieve efficient span-level fault localization. SpanGraph constructs a directed graph from system traces to capture invocation relationships and execution times. It then uses a GCN for edge representation learning to detect anomalies. Experimental results demonstrate that SpanGraph outperforms all baseline approaches on both the Sockshop and TrainTicket datasets. We also conduct incremental experiments on SpanGraph using unseen traces to validate its generalizability and scalability. Furthermore, we perform an ablation study, sensitivity analysis, and complexity analysis to further verify its robustness, effectiveness, and flexibility. Finally, we validate SpanGraph’s effectiveness in anomaly detection and fault localization using real-world datasets.
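To make the graph-construction step concrete, the following is a minimal Python sketch of how a directed graph of invocation relationships and execution times might be derived from the spans of one trace. It is an illustration under stated assumptions, not the authors' implementation: the `Span` fields, the `build_span_graph` function, and the sample trace are hypothetical, and the GCN-based edge representation learning is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical span record; field names are assumptions for illustration."""
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str
    duration_ms: float


def build_span_graph(spans: List[Span]) -> Dict[Tuple[str, str], List[float]]:
    """Map each directed (caller, callee) service edge to the execution
    times of the callee spans observed on that edge."""
    by_id = {s.span_id: s for s in spans}
    edges: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for span in spans:
        parent = by_id.get(span.parent_id) if span.parent_id else None
        if parent is None:
            continue  # root span or unknown parent: no invocation edge
        edges[(parent.service, span.service)].append(span.duration_ms)
    return dict(edges)


# Illustrative trace; service names and timings are made up.
trace = [
    Span("a", None, "front-end", 120.0),
    Span("b", "a", "orders", 80.0),
    Span("c", "b", "payment", 30.0),
    Span("d", "b", "shipping", 25.0),
]
graph = build_span_graph(trace)
for (caller, callee), durations in graph.items():
    print(f"{caller} -> {callee}: {durations}")
```

In a sketch like this, each edge carries the raw execution times as its feature; a learned model such as a GCN would consume edge features of this kind to score individual invocations as normal or anomalous.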







Data and materials availability
Not applicable.
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB3103402.
Funding
This research was funded by the National Key Research and Development Program of China under Grant 2022YFB3103402.
Author information
Authors and Affiliations
Contributions
He Kong: Data analysis and Writing. Tong Li: Project administration. Jingguo Ge: Supervision. Lei Zhang: Validation. Liangxiong Li: Visualization. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kong, H., Li, T., Ge, J. et al. Enhancing fault localization in microservices systems through span-level using graph convolutional networks. Autom Softw Eng 31, 46 (2024). https://doi.org/10.1007/s10515-024-00445-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s10515-024-00445-w