DOI: 10.1145/3326285.3329048
research-article

CoFlux: robustly correlating KPIs by fluctuations for service troubleshooting

Published: 24 June 2019

Abstract

Internet-based service companies monitor a large number of KPIs (Key Performance Indicators) to ensure service quality and reliability. Correlating KPIs by their fluctuations reveals how KPIs interact under anomalous conditions and can be extremely useful for service troubleshooting. However, such KPI flux-correlation has received little study so far in the domain of Internet service operations management. A major challenge is how to automatically and accurately separate fluctuations from normal variations across a large number of KPIs with different structural characteristics (such as seasonal, trend, and stationary). In this paper, we propose CoFlux, an unsupervised approach that automatically (without manual algorithm selection or parameter tuning) determines whether two KPIs are correlated by fluctuations, in what temporal order they fluctuate, and whether they fluctuate in the same direction. CoFlux's robust feature engineering and robust correlation score computation enable it to work well across diverse KPI characteristics. Our extensive experiments demonstrate that CoFlux achieves the best F1-scores of 0.84 (0.90), 0.92 (0.95), and 0.95 (0.99) in answering these three questions, respectively, on two real datasets from a top global Internet company. Moreover, we show that CoFlux is effective in assisting service troubleshooting through three applications: alert compression, recommending top-N causes, and constructing fluctuation propagation chains.
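The three questions in the abstract (correlated or not, temporal order, direction) can be illustrated with a deliberately simple sketch. This is not CoFlux's algorithm: the abstract does not describe its feature engineering or correlation score, so the sketch swaps in a trailing-median baseline and a lagged Pearson correlation as assumed stand-ins. All names and parameters (`fluctuations`, `correlation_at_lag`, `flux_correlate`, `window`, `max_lag`, `threshold`) are hypothetical.

```python
import math
from statistics import median


def fluctuations(series, window=5):
    """Residual of each point against the median of the previous `window` points.

    Assumed stand-in for separating fluctuations from normal variation.
    """
    residuals = []
    for i, value in enumerate(series):
        history = series[max(0, i - window):i] or [value]  # first point has no history
        residuals.append(value - median(history))
    return residuals


def correlation_at_lag(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag] over the overlapping range."""
    if lag >= 0:
        xs, ys = a[:len(a) - lag], b[lag:]
    else:
        xs, ys = a[-lag:], b[:len(b) + lag]
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat residual series carries no fluctuation to correlate
    return cov / math.sqrt(var_x * var_y)


def flux_correlate(kpi_a, kpi_b, max_lag=5, threshold=0.7):
    """Answer the three questions for two equally long KPI series (hypothetical API)."""
    if len(kpi_a) != len(kpi_b):
        raise ValueError("KPI series must have the same length")
    fa, fb = fluctuations(kpi_a), fluctuations(kpi_b)
    best_lag, best_score = 0, 0.0
    for lag in range(-max_lag, max_lag + 1):
        score = correlation_at_lag(fa, fb, lag)
        if abs(score) > abs(best_score):
            best_lag, best_score = lag, score
    correlated = abs(best_score) >= threshold
    direction = None
    if correlated:
        direction = "same" if best_score > 0 else "opposite"
    # A positive lag means kpi_b fluctuates `lag` steps after kpi_a.
    return {"correlated": correlated, "lag": best_lag,
            "direction": direction, "score": best_score}
```

For a series with a single spike at step 20 and a second series with a single dip at step 22, this sketch reports a correlation at lag 2 with opposite direction. A fixed threshold and a plain Pearson score are the weakest parts of the sketch; the abstract's emphasis on robustness across seasonal, trend, and stationary KPIs suggests the paper's method handles cases this one would not.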

Information

Published In

IWQoS '19: Proceedings of the International Symposium on Quality of Service
June 2019
420 pages
ISBN: 9781450367783
DOI: 10.1145/3326285
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fluctuation correlation
  2. key performance indicator
  3. service operation and management
  4. service troubleshooting
  5. time series

Qualifiers

  • Research-article

Funding Sources

  • Beijing National Research Center for Information Science and Technology (BNRist)
  • Okawa Research Grant

Conference

IWQoS '19

Article Metrics

  • Downloads (Last 12 months): 39
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 08 Mar 2025

Cited By

  • (2025) ISSD: Indicator Selection for Time Series State Detection. Proceedings of the ACM on Management of Data 3:1 (1-25). DOI: 10.1145/3709698. Online publication date: 11-Feb-2025
  • (2024) Fault Location Method Based on Dynamic Operation and Maintenance Map and Common Alarm Points Analysis. Algorithms 17:5 (217). DOI: 10.3390/a17050217. Online publication date: 16-May-2024
  • (2024) Detecting State Correlations between Heterogeneous Time Series. Proceedings of the 2024 2nd International Conference on Advances in Artificial Intelligence and Applications (131-137). DOI: 10.1145/3712623.3712645. Online publication date: 20-Dec-2024
  • (2024) A Service Anomaly Detection and Root Cause Location Method for Complex Power Systems. 2024 4th Power System and Green Energy Conference (PSGEC) (1031-1035). DOI: 10.1109/PSGEC62376.2024.10721013. Online publication date: 22-Aug-2024
  • (2024) KPIRoot: Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE) (403-414). DOI: 10.1109/ISSRE62328.2024.00046. Online publication date: 28-Oct-2024
  • (2024) Fast and Robust Localization of Multi-dimensional Root Causes in Online Service. 2024 6th International Conference on Natural Language Processing (ICNLP) (85-89). DOI: 10.1109/ICNLP60986.2024.10692632. Online publication date: 22-Mar-2024
  • (2023) Hi-RCA: A Hierarchy Anomaly Diagnosis Framework Based on Causality and Correlation Analysis. Applied Sciences 13:22 (12126). DOI: 10.3390/app132212126. Online publication date: 8-Nov-2023
  • (2023) GROUP: An End-to-end Multi-step-ahead Workload Prediction Approach Focusing on Workload Group Behavior. Proceedings of the ACM Web Conference 2023 (3098-3108). DOI: 10.1145/3543507.3583460. Online publication date: 30-Apr-2023
  • (2023) Efficient and Robust KPI Outlier Detection for Large-Scale Datacenters. IEEE Transactions on Computers 72:10 (2858-2871). DOI: 10.1109/TC.2023.3272288. Online publication date: Oct-2023
  • (2023) An Empirical Analysis of Anomaly Detection Methods for Multivariate Time Series. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) (57-68). DOI: 10.1109/ISSRE59848.2023.00014. Online publication date: 9-Oct-2023
