DOI: 10.1145/3589334.3645665
research-article

GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications

Published: 13 May 2024

Abstract

Microservices architecture is quickly replacing monolithic and multi-tier architectures as the implementation choice for large-scale web applications because it allows independent development, scalability, and maintenance. However, even with careful node scheduling and scaling, microservices applications remain vulnerable to performance degradation caused by unexpected (dependent or independent) events such as anomalous node behavior, workload interference, or sudden spikes in requests or retries. These events can adversely affect the performance of one or more microservices (bottlenecks), degrading overall application performance. To ensure a good customer experience and avoid revenue loss, it is crucial to detect and mitigate all bottlenecks swiftly.
This work introduces GAMMA, a novel, explainable graph learning model that integrates a mixture of experts to detect multiple bottlenecks. We evaluated GAMMA using a popular open-source benchmarking application deployed on Kubernetes under various practical bottleneck scenarios. Our experimental results show that GAMMA achieves a 46% higher F1 score than existing works that employ deep learning, machine learning, and statistical techniques, demonstrating its ability to detect multiple bottlenecks by learning complex interactions in a microservices architecture.
The dataset is publicly available [49] for reproducibility and further research in the field.

References

[1]
[n. d.]. Building Microservices Driven by Performance - Roblox. https://medium.com/@acovarrubias_7488/building-microservices-drivenby-performance-b347ed1c48e3.
[2]
[n. d.]. CPU Load Generator. https://github.com/molguin92/CPULoadGenerator.
[3]
[n. d.]. Google-DeathStarBench. https://cloud.google.com/blog/products/management-tools/in-tests-cloud-profiler-adds-negligible-overhead.
[4]
[n. d.]. Jaeger. https://www.jaegertracing.io/.
[5]
[n. d.]. Microsoft-DeathStarBench. https://microsoft.github.io/VirtualClient/docs/workloads/deathstarbench/.
[6]
[n. d.]. Uber's production Jaeger data. https://github.com/jaegertracing/jaeger-ui/issues/680.
[7]
[n. d.]. wrk2 Workload Generator. https://github.com/giltene/wrk2.
[8]
Randy Abernethy. 2018. The Programmer's Guide to Apache Thrift.
[9]
Harold Aragon, Samuel Braganza, Edwin Boza, Jonathan Parrales, and Cristina Abad. 2019. Workload Characterization of a Software-as-a-Service Web Application Implemented with a Microservices Architecture. In Companion Proceedings of The 2019 World Wide Web Conference (San Francisco, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 746-750. https://doi.org/10.1145/3308560.3316466
[10]
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. 2013. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition. http://dx.doi.org/10.2200/S00516ED2V01Y201306CAC024
[11]
Daniel Beck, Gholamreza Haffari, and Trevor Cohn. 2018. Graph-to-Sequence Learning using Gated Graph Neural Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 273-283. https://doi.org/10.18653/v1/P18-1026
[12]
Shaked Brody, Uri Alon, and Eran Yahav. 2022. How Attentive are Graph Attention Networks? arXiv:2105.14491 [cs.LG]
[13]
Jian Chen, Fagui Liu, Jun Jiang, Guoxiang Zhong, Dishi Xu, Zhuanglun Tan, and Shangsong Shi. 2023. TraceGra: A trace-based anomaly detection for microservice using graph deep learning. Computer Communications 204 (2023), 109-117. https://doi.org/10.1016/j.comcom.2023.03.028
[14]
Koby Crammer and Yoram Singer. 2002. On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines. J. Mach. Learn. Res. 2 (Mar 2002), 265-292.
[15]
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language Modeling with Gated Convolutional Networks. CoRR abs/1612.08083 (2016). arXiv:1612.08083 http://arxiv.org/abs/1612.08083
[16]
Sinan Eski and Feza Buzluca. 2018. An Automatic Extraction Approach: Transition to Microservices Architecture from Monolithic Application. In Proceedings of the 19th International Conference on Agile Software Development: Companion (Porto, Portugal) (XP '18). Association for Computing Machinery, New York, NY, USA, Article 25, 6 pages. https://doi.org/10.1145/3234152.3234195
[17]
Huan Fu, Mingming Gong, Chaohui Wang, and Dacheng Tao. 2018. MoE-SPNet: A mixture-of-experts scene parsing network. Pattern Recognition 84 (2018), 226-236. https://doi.org/10.1016/j.patcog.2018.07.020
[18]
Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: Practical and Scalable ML-Driven Performance Debugging in Microservices (ASPLOS '21). Association for Computing Machinery, New York, NY, USA, 135-151. https://doi.org/10.1145/3445814.3446700
[19]
Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. 2019. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). Association for Computing Machinery, New York, NY, USA, 3-18. https://doi.org/10.1145/3297858.3304013
[20]
Yu Gan, Yanqi Zhang, Kelvin Hu, Dailun Cheng, Yuan He, Meghna Pancholi, and Christina Delimitrou. 2019. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Providence, RI, USA) (ASPLOS '19). Association for Computing Machinery, New York, NY, USA, 19-33. https://doi.org/10.1145/3297858.3304004
[21]
Jean-Philippe Gouigoux and Dalila Tamzalit. 2017. From Monolith to Microservices: Lessons Learned on an Industrial Migration to a Web Oriented Architecture. In 2017 IEEE International Conference on Software Architecture Workshops (ICSAW). 62-65. https://doi.org/10.1109/ICSAW.2017.35
[22]
Vipul Harsh, Wenxuan Zhou, Sachin Ashok, Radhika Niranjan Mysore, Brighten Godfrey, and Sujata Banerjee. 2023. Murphy: Performance Diagnosis of Distributed Cloud Applications. In Proceedings of the ACM SIGCOMM 2023 Conference (New York, NY, USA) (ACM SIGCOMM '23). Association for Computing Machinery, New York, NY, USA, 438-451. https://doi.org/10.1145/3603269.3604877
[23]
Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikrishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, and Aleksey Charapko. 2022. Metastable Failures in the Wild. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 73-90. https://www.usenix.org/conference/osdi22/presentation/huang-lexiang
[24]
Darby Huye, Yuri Shkuro, and Raja R. Sambasivan. 2023. Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 419-432. https://www.usenix.org/conference/atc23/presentation/huye
[25]
Xinrui Jiang, Yicheng Pan, Meng Ma, and Ping Wang. 2023. Look Deep into the Microservice System Anomaly through Very Sparse Logs. In Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW '23). Association for Computing Machinery, New York, NY, USA, 2970-2978. https://doi.org/10.1145/3543507.3583338
[26]
Hamid Reza Vaezi Joze, Amirreza Shaban, Michael L. Iuzzolino, and Kazuhito Koishida. 2020. MMTM: Multimodal Transfer Module for CNN Fusion. arXiv:1911.08670 [cs.CV]
[27]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 1725-1732. https://doi.org/10.1109/CVPR.2014.223
[28]
Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R. Lyu. 2023. Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-Source Data. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE '23). IEEE Press, 1750-1762. https://doi.org/10.1109/ICSE48619.2023.00150
[29]
Xinjie Li and Huijuan Xu. 2023. MEID: mixture-of-experts with internal distillation for long-tailed video recognition. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI'23/IAAI'23/EAAI'23). AAAI Press, Article 161, 9 pages. https://doi.org/10.1609/aaai.v37i2.25230
[30]
Bo Liu, Jingliu Xiong, Qiurong Ren, Shmuel Tyszberowicz, and Zheng Yang. 2022. Log2MS: a framework for automated refactoring monolith into microservices using execution logs. In 2022 IEEE International Conference on Web Services (ICWS). 391-396. https://doi.org/10.1109/ICWS55610.2022.00065
[31]
Ping Liu, Haowen Xu, Qianyu Ouyang, Rui Jiao, Zhekang Chen, Shenglin Zhang, Jiahai Yang, Linlin Mo, Jice Zeng, Wenman Xue, and Dan Pei. 2020. Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). 48-58. https://doi.org/10.1109/ISSRE5003.2020.00014
[32]
W. Liu, Wei-Long Zheng, and Bao-Liang Lu. 2016. Emotion Recognition Using Multimodal Deep Learning. In International Conference on Neural Information Processing. https://api.semanticscholar.org/CorpusID:7767769
[33]
Shutian Luo, Huanle Xu, Chengzhi Lu, Kejiang Ye, Guoyao Xu, Liping Zhang, Yu Ding, Jian He, and Chengzhong Xu. 2021. Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. In Proceedings of the ACM Symposium on Cloud Computing (Seattle, WA, USA) (SoCC '21). Association for Computing Machinery, New York, NY, USA, 412-426. https://doi.org/10.1145/3472883.3487003
[34]
Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, and Ping Wang. 2020. AutoMAP: Diagnose Your Microservice-Based Web Applications Automatically. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20). Association for Computing Machinery, New York, NY, USA, 246-258. https://doi.org/10.1145/3366423.3380111
[35]
Jonathan Mace. 2017. End-to-End Tracing: Adoption and Use Cases. Survey. Brown University.
[36]
Genc Mazlami, Jürgen Cito, and Philipp Leitner. 2017. Extraction of Microservices from Monolithic Software Architectures. In 2017 IEEE International Conference on Web Services (ICWS). 524-531. https://doi.org/10.1109/ICWS.2017.61
[37]
Franck Michel, Catherine Faron-Zucker, Olivier Corby, and Fabien Gandon. 2019. Enabling Automatic Discovery and Querying of Web APIs at Web Scale Using Linked Data Standards. In Companion Proceedings of The 2019 World Wide Web Conference (San Francisco, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 883-892. https://doi.org/10.1145/3308560.3317073
[38]
Micah M. Murray, Antonia Thelen, Silvio Ionta, and Mark T. Wallace. 2019. Contributions of Intraindividual and Interindividual Differences to Multisensory Processes. J. Cognitive Neuroscience 31, 3 (Mar 2019), 360-376. https://doi.org/10.1162/jocn_a_01246
[39]
Hoa Xuan Nguyen, Shaoshu Zhu, and Mingming Liu. 2022. A Survey on Graph Neural Networks for Microservice-Based Cloud Applications. Sensors 22, 23 (2022). https://doi.org/10.3390/s22239492
[40]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 8024-8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[41]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825-2830.
[42]
Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar Iyer. 2020. Pre-processed Tracing Data for Popular Microservice Benchmarks. https://databank.illinois.edu/datasets/IDB-6738796. Online.
[43]
Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 805-825. https://www.usenix.org/conference/osdi20/presentation/qiu
[44]
Bjorn Rabenstein and Julius Volz. 2015. Prometheus: A Next-Generation Monitoring System (Talk). USENIX Association, Dublin.
[45]
Elahe Rahimian, Golara Javadi, Frederick Tung, and Gabriel Oliveira. 2023. DynaShare: Task and Instance Conditioned Parameter Sharing for Multi-Task Learning. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 4535-4543. https://doi.org/10.1109/CVPRW59228.2023.00477
[46]
Carlos Ramos-Carreño and José L. Torrecilla. 2023. dcor: Distance correlation and energy statistics in Python. SoftwareX 22 (Feb 2023). https://doi.org/10.1016/j.softx.2023.101326
[47]
Huasong Shan, Yuan Chen, Haifeng Liu, Yunpeng Zhang, Xiao Xiao, Xiaofeng He, Min Li, and Wei Ding. 2019. ε-Diagnosis: Unsupervised and Real-Time Diagnosis of Small-Window Long-Tail Latency in Large-Scale Microservice Platforms. In The World Wide Web Conference (San Francisco, CA, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 3215-3222. https://doi.org/10.1145/3308558.3313653
[48]
Jacopo Soldani and Antonio Brogi. 2022. Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey. ACM Comput. Surv. 55, 3, Article 59 (Feb 2022), 39 pages. https://doi.org/10.1145/3501297
[49]
Gagan Somashekar, Anurag Dutt, Mainak Adak, Tania Lorido Botran, and Anshul Gandhi. 2024. Microservices Bottleneck Detection Dataset. https://doi.org/10.34740/KAGGLE/DSV/7638732
[50]
Gagan Somashekar, Anurag Dutt, Rohith Vaddavalli, Sai Bhargav Varanasi, and Anshul Gandhi. 2022. B-MEG: Bottlenecked-Microservices Extraction Using Graph Neural Networks. In Companion of the 2022 ACM/SPEC International Conference on Performance Engineering (Beijing, China) (ICPE '22). Association for Computing Machinery, New York, NY, USA, 7-11. https://doi.org/10.1145/3491204.3527494
[51]
G. Somashekar, A. Suresh, S. Tyagi, V. Dhyani, K. Donkada, A. Pradhan, and A. Gandhi. 2022. Reducing the Tail Latency of Microservices Applications via Optimal Configuration Tuning. In 2022 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS). IEEE Computer Society, Los Alamitos, CA, USA, 111-120. https://doi.org/10.1109/ACSOS55765.2022.00029
[52]
Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. 2020. Adashare: Learning what to share for efficient deep multi-task learning. Advances in Neural Information Processing Systems 33 (2020).
[53]
Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics 35, 6 (2007), 2769-2794. https://doi.org/10.1214/009053607000000505
[54]
Nicolas Viennot, Mathias Lécuyer, Jonathan Bell, Roxana Geambasu, and Jason Nieh. 2015. Synapse: A Microservices Architecture for Heterogeneous-Database Web Applications. In Proceedings of the Tenth European Conference on Computer Systems (Bordeaux, France) (EuroSys '15). Association for Computing Machinery, New York, NY, USA, Article 21, 16 pages. https://doi.org/10.1145/2741948.2741975
[55]
Hanzhang Wang, Zhengkai Wu, Huai Jiang, Yichao Huang, Jiamu Wang, Selcuk Kopru, and Tao Xie. 2022. Groot: An Event-Graph-Based Approach for Root Cause Analysis in Industrial Settings. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (Melbourne, Australia) (ASE '21). IEEE Press, 419-429. https://doi.org/10.1109/ASE51524.2021.9678708
[56]
Yingying Wen, Guanjie Cheng, Shuiguang Deng, and Jianwei Yin. 2022. Characterizing and synthesizing the workflow structure of microservices in ByteDance Cloud. Journal of Software: Evolution and Process 34, 8 (2022), e2467. https://doi.org/10.1002/smr.2467
[57]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2021. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2021), 4-24. https://doi.org/10.1109/TNNLS.2020.2978386
[58]
Zhe Xie, Haowen Xu, Wenxiao Chen, Wanxue Li, Huai Jiang, Liangfei Su, Hanzhang Wang, and Dan Pei. 2023. Unsupervised Anomaly Detection on Microservice Traces through Graph VAE. In Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW '23). Association for Computing Machinery, New York, NY, USA, 2874-2884. https://doi.org/10.1145/3543507.3583215
[59]
Chenxi Zhang, Xin Peng, Chaofeng Sha, Ke Zhang, Zhenqing Fu, Xiya Wu, Qingwei Lin, and Dongmei Zhang. 2022. DeepTraLog: Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 623-634. https://doi.org/10.1145/3510003.3510180
[60]
Zhizhou Zhang, Murali Krishna Ramanathan, Prithvi Raj, Abhishek Parwal, Timothy Sherwood, and Milind Chabbi. 2022. CRISP: Critical Path Analysis of Large-Scale Microservice Architectures. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 655-672. https://www.usenix.org/conference/atc22/presentation/zhang-zhizhou

Cited By

  • (2024) Building AI Agents for Autonomous Clouds: Challenges and Design Principles. Proceedings of the 2024 ACM Symposium on Cloud Computing, 99-110. DOI: 10.1145/3698038.3698525. Online publication date: 20-Nov-2024
  • (2024) Power Microservices Troubleshooting by Pretrained Language Model with Multi-source Data. 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), 1768-1775. DOI: 10.1109/ISPA63168.2024.00241. Online publication date: 30-Oct-2024

Published In

WWW '24: Proceedings of the ACM Web Conference 2024
May 2024
4826 pages
ISBN: 9798400701719
DOI: 10.1145/3589334
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anomalies
  2. bottlenecks
  3. dataset
  4. graph neural network
  5. microservices applications

Qualifiers

  • Research-article

Funding Sources

  • NSF (National Science Foundation)

Conference

WWW '24
WWW '24: The ACM Web Conference 2024
May 13 - 17, 2024
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 222
  • Downloads (Last 6 weeks): 23
Reflects downloads up to 05 Mar 2025
