Abstract
Effective maintenance is critical to the high availability of modern online systems. Today’s maintenance techniques for online systems rely heavily on metrics, i.e., time series data that describe the real-time state of a system from various perspectives. Typically, software engineers build dashboards from these metrics to aid software maintenance. Although several studies have applied metric analysis to automatic software maintenance, the primary step, i.e., dashboard generation, remains largely manual. In this paper, we develop a metric recommendation service that automates dashboard generation and greatly eases the burden of maintaining an online system. Specifically, we analyze the needs of two essential steps of online system maintenance, i.e., anomaly detection and fault diagnosis, and design a metric recommendation mechanism for each. Graph learning techniques are employed to automate metric recommendation. Our experiments demonstrate that the proposed approach achieves an F1-score of 0.912 in selecting metrics for anomaly detection and an accuracy of 0.859 in retrieving metrics for fault diagnosis, both of which significantly outperform the compared baselines.
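To make the reported F1-score concrete, the following is a minimal sketch of how a recommended metric set could be scored against a ground-truth dashboard. The function name, the metric names, and the set-based formulation are illustrative assumptions, not the evaluation protocol of the paper.

```python
def f1_score(recommended, relevant):
    """F1 of a recommended metric set against a ground-truth set.

    Both arguments are iterables of metric names (hypothetical names here).
    """
    recommended, relevant = set(recommended), set(relevant)
    true_positives = len(recommended & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recommended)  # share of recommendations that are correct
    recall = true_positives / len(relevant)        # share of ground-truth metrics recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: four recommended metrics, three on the engineer's dashboard.
recommended = ["cpu_usage", "memory_usage", "request_latency", "disk_io"]
relevant = ["cpu_usage", "request_latency", "error_rate"]
score = f1_score(recommended, relevant)
print(f"F1 = {score:.3f}")
```

Under this formulation, a recommender is penalized both for cluttering a dashboard with irrelevant metrics (low precision) and for omitting metrics that engineers need (low recall).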
Z. He and T. Huang are co-first authors of this work.
Acknowledgments
The research is supported by the National Natural Science Foundation of China (No. 62272495) and the Guangdong Basic and Applied Basic Research Foundation (No. 2023B1515020054), and sponsored by Tencent.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
He, Z., Huang, T., Chen, P., Li, R., Wang, R., Zheng, Z. (2025). DashChef: A Metric Recommendation Service for Online Systems Using Graph Learning. In: Bai, G., Ishikawa, F., Ait-Ameur, Y., Papadopoulos, G.A. (eds) Engineering of Complex Computer Systems. ICECCS 2024. Lecture Notes in Computer Science, vol 14784. Springer, Cham. https://doi.org/10.1007/978-3-031-66456-4_1
DOI: https://doi.org/10.1007/978-3-031-66456-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-66455-7
Online ISBN: 978-3-031-66456-4
eBook Packages: Computer Science, Computer Science (R0)