A Systematic Mapping Study in AIOps

Notaro, Paolo; Cardoso, Jorge; Gerndt, Michael

doi:10.1007/978-3-030-76352-7_15

Part of the book series: Lecture Notes in Computer Science ((LNPSE,volume 12632))

Included in the following conference series:

International Conference on Service-Oriented Computing

3041 Accesses
10 Citations

Abstract

IT systems of today are becoming larger and more complex, rendering their human supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to AI and Big Data. However, past AIOps contributions are scattered, unorganized and missing a common terminology convention, which renders their discovery and comparison impractical. In this work, we conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps in a unique reference index. We create an AIOps taxonomy to build a foundation for future contributions and allow an efficient comparison of AIOps papers treating similar problems. We investigate temporal trends and classify AIOps contributions based on the choice of algorithms, data sources and the target components. Our results show a recent and growing interest towards AIOps, specifically to those contributions treating failure-related tasks (62%), such as anomaly detection and root cause analysis.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Multi-source Distributed System Data for AI-Powered Analytics

TrueDetective 4.0: A Big Data Architecture for Real Time Anomaly Detection

LogRCA: Log-Based Root Cause Analysis for Distributed Services

References

Abreu, R., Zoeteweij, P., Gemund, A.J.V.: Spectrum-based multiple fault localization. In: IEEE/ACM International Conference on Automated Software Engineering, November 2009. https://doi.org/10.1109/ase.2009.25
Aguilera, M.K., Mogul, J.C., Wiener, J.L., Reynolds, P., Muthitacharoen, A.: Performance debugging for distributed systems of black boxes. ACM SIGOPS Oper. Syst. Rev. 37(5), 74–89 (2003). https://doi.org/10.1145/1165389.945454
Article Google Scholar
Lerner, A.: AIOps Platforms, August 2017. https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/
Attariyan, M., Chow, M., Flinn, J.: X-ray: automating root-cause diagnosis of performance anomalies in production software. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, pp. 307–320, October 2012. https://doi.org/10.5555/2387880.2387910
Bahl, P., Chandra, R., Greenberg, A., Kandula, S., Maltz, D.A., Zhang, M.: Towards highly reliable enterprise network services via inference of multi-level dependencies. In: Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM (2007). https://doi.org/10.1145/1282380.1282383
Barham, P., Isaacs, R., Mortier, R., Narayanan, D.: Magpie: online modelling and performance-aware systems. In: Proceedings of the 9th Conference on Hot Topics in Operating Systems, HOTOS 2003, Lihue, Hawaii, vol. 9, p. 15, May 2003. https://doi.org/10.5555/1251054.1251069
Bodik, P., Goldszmidt, M., Fox, A., Woodard, D.B., Andersen, H.: Fingerprinting the datacenter: automated classification of performance crises. In: Proceedings of the 5th European Conference on Computer Systems - EuroSys 2010 (2010). https://doi.org/10.1145/1755913.1755926
Chalermarrewong, T., Achalakul, T., See, S.C.W.: Failure prediction of data centers using time series and fault tree analysis. In: IEEE 18th International Conference on Parallel and Distributed Systems, December 2012. https://doi.org/10.1109/icpads.2012.129
Chen, M., Kiciman, E., Fratkin, E., Fox, A., Brewer, E.: Pinpoint: problem determination in large, dynamic Internet services. In: Proceedings of IEEE International Conference on Dependable Systems and Networks (2002). https://doi.org/10.1109/dsn.2002.1029005
Chow, M., Meisner, D., Flinn, J., Peek, D., Wenisch, T.F.: The mystery machine: end-to-end performance analysis of large-scale internet services. In: OSDI 2014: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pp. 217–231 (2014). https://doi.org/10.5555/2685048.2685066
Cohen, I., Goldszmidt, M., Kelly, T., Symons, J., Chase, J.S.: Correlating instrumentation data to system states: a building block for automated diagnosis and control. In: Proceedings of the 6th USENIX Conference on Symposium on Operating Systems Design & Implementation, OSDI 2004 (2004). https://doi.org/10.5555/1251254.1251270
Costa, C.H., Park, Y., Rosenburg, B.S., Cher, C.Y., Ryu, K.D.: A system software approach to proactive memory-error avoidance. In: SC 2014: International Conference for High Performance Computing, Networking, Storage and Analysis, November 2014. https://doi.org/10.1109/sc.2014.63
Dang, Y., Lin, Q., Huang, P.: AIOps: real-world challenges and research innovations. In: IEEE/ACM 41st International Conference on Software Engineering: Companion, May 2019. https://doi.org/10.1109/icse-companion.2019.00023
Davis, N.A., Rezgui, A., Soliman, H., Manzanares, S., Coates, M.: FailureSim: a system for predicting hardware failures in cloud data centers using neural networks. In: IEEE 10th International Conference on Cloud Computing (CLOUD), Jun 2017. https://doi.org/10.1109/cloud.2017.75
Du, M., Li, F., Zheng, G., Srikumar, V.: DeepLog: anomaly detection and diagnosis from system logs through deep learning. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security (2017). https://doi.org/10.1145/3133956.3134015
Garg, S., van Moorsel, A., Vaidyanathan, K., Trivedi, K.: A methodology for detection and estimation of software aging. In: Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257). IEEE Computer Society (1998). https://doi.org/10.1109/issre.1998.730892
Islam, T., Manivannan, D.: Predicting application failure in cloud: a machine learning approach. In: 2017 IEEE International Conference on Cognitive Computing (ICCC), Jun 2017. https://doi.org/10.1109/ieee.iccc.2017.11
Jalali, S., Wohlin, C.: Systematic literature studies: database searches vs. backward snowballing. In: Proceedings of the 2012 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 29–38, September 2012. https://doi.org/10.1145/2372251.2372257
Kandula, S., Katabi, D., Vasseur, J.P.: Shrink: a tool for failure diagnosis in IP networks. In: Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data - MineNet 2005 (2005). https://doi.org/10.1145/1080173.1080178
Kobbacy, K.A.H., Vadera, S., Rasmy, M.H.: AI and OR in management of operations: history and trends. J. Oper. Res. Soc. 58(1), 10–28 (2007). https://doi.org/10.1057/palgrave.jors.2602132
Article MATH Google Scholar
Lakhina, A., Crovella, M., Diot, C.: Diagnosing network-wide traffic anomalies. In: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM 2004. ACM Press (2004). https://doi.org/10.1145/1015467.1015492
Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using traffic feature distributions. In: Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM 2005. ACM Press (2005). https://doi.org/10.1145/1080091.1080118
Li, Y., et al.: Predicting node failures in an ultra-large-scale cloud computing platform: an AIOps solution. ACM Trans. Software Eng. Methodol. 29(2), 1–24 (2020). https://doi.org/10.1145/3385187
Article Google Scholar
Liang, Y., Zhang, Y., Xiong, H., Sahoo, R.: Failure prediction in IBM BlueGene/L event logs. In: Seventh IEEE International Conference on Data Mining (ICDM) (2007). https://doi.org/10.1109/icdm.2007.46
Lin, F., Beadon, M., Dixit, H.D., Vunnam, G., Desai, A., Sankar, S.: Hardware remediation at scale. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), June 2018. https://doi.org/10.1109/dsn-w.2018.00015
Lin, Q., Zhang, H., Lou, J.G., Zhang, Y., Chen, X.: Log clustering based problem identification for online service systems. In: Proceedings of the 38th ACM International Conference on Software Engineering Companion (ICSE) (2016). https://doi.org/10.1145/2889160.2889232
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007). https://doi.org/10.1109/TSE.2007.256941
Article Google Scholar
Chen, M.Y., Accardi, A., Kiciman, E., Lloyd, J., Patterson, D., Fox, A., Brewer, E.: Path-based failure and evolution management. In: Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, NSDI 2004, San Francisco, California, vol. 1, p. 23, March 2004. https://doi.org/10.5555/1251175.1251198
Moody, A., Bronevetsky, G., Mohror, K., Supinski, B.R.D.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2010. https://doi.org/10.1109/sc.2010.18
Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: Proceedings of the 2005 ACM International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS 2005 (2005). https://doi.org/10.1145/1064212.1064220
Mukwevho, M.A., Celik, T.: Toward a smart cloud: a review of fault-tolerance methods in cloud systems. IEEE Trans. Serv. Comput. 1 (2018). https://doi.org/10.1109/tsc.2018.2816644
Natella, R., Cotroneo, D., Duraes, J.A., Madeira, H.S.: On fault representativeness of software fault injection. IEEE Trans. Software Eng. 39(1), 80–96 (2013). https://doi.org/10.1109/tse.2011.124
Article Google Scholar
Nguyen, H., Shen, Z., Tan, Y., Gu, X.: FChain: toward black-box online fault localization for cloud systems. In: IEEE 33rd International Conference on Distributed Computing Systems, July 2013. https://doi.org/10.1109/icdcs.2013.26
Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic mapping studies in software engineering: an update. Inf. Softw. Technol. 64, 1–18 (2015). https://doi.org/10.1016/j.infsof.2015.03.007
Article Google Scholar
Pitakrat, T., Okanović, D., van Hoorn, A., Grunske, L.: Hora: architecture-aware online failure prediction. J. Syst. Softw. 137, 669–685 (2018). https://doi.org/10.1016/j.jss.2017.02.041
Article Google Scholar
Podgurski, A., et al.: Automated support for classifying software failure reports. In: Proceedings of IEEE 25th International Conference on Software Engineering (2003). https://doi.org/10.1109/icse.2003.1201224
Salfner, F., Malek, M.: Using hidden semi-Markov models for effective online failure prediction. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS), October 2007. https://doi.org/10.1109/srds.2007.35
Samir, A., Pahl, C.: A controller architecture for anomaly detection, root cause analysis and self-adaptation for cluster architectures. In: International Conference on Adaptive and Self-Adaptive Systems and Applications (2019). 10993/42062
Google Scholar
Shao, Q., Chen, Y., Tao, S., Yan, X., Anerousis, N.: Efficient ticket routing by resolution sequence mining. In: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD (2008). https://doi.org/10.1145/1401890.1401964
Sharma, A.B., Chen, H., Ding, M., Yoshihira, K., Jiang, G.: Fault detection and localization in distributed systems using invariant relationships. In: 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2013. https://doi.org/10.1109/dsn.2013.6575304
Vaidyanathan, K., Trivedi, K.: A comprehensive model for software rejuvenation. IEEE Trans. Dependable Secure Comput. 2(2), 124–137 (2005). https://doi.org/10.1109/tdsc.2005.15
Article Google Scholar
Xu, H., et al.: Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web (2018). https://doi.org/10.1145/3178876.3185996
Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles - SOSP 2009 (2009). https://doi.org/10.1145/1629575.1629587
Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., Pasupathy, S.: SherLog: error diagnosis by connecting clues from run-time logs. In: ACM SIGARCH Computer Architecture News, vol. 38, no. 1, pp. 143–154 (2010). https://doi.org/10.1145/1735970.1736038
Zhang, K., Xu, J., Min, M.R., Jiang, G., Pelechrinis, K., Zhang, H.: Automated IT system failure prediction: a deep learning approach. In: IEEE International Conference on Big Data (2016). https://doi.org/10.1109/bigdata.2016.7840733
Zhang, S., et al.: Syslog processing for switch failure diagnosis and prediction in datacenter networks. In: IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), June 2017. https://doi.org/10.1109/iwqos.2017.7969130
Zheng, S., Ristovski, K., Farahat, A., Gupta, C.: Long short-term memory network for remaining useful life estimation. In: IEEE International Conference on Prognostics and Health Management (ICPHM) (2017). https://doi.org/10.1109/icphm.2017.7998311
Zhou, W., Tang, L., Li, T., Shwartz, L., Grabarnik, G.Y.: Resolution recommendation for event tickets in service management. In: IFIP/IEEE International Symposium on Integrated Network Management (IM) (2015). https://doi.org/10.1109/inm.2015.7140303
Zhu, J., He, P., Fu, Q., Zhang, H., Lyu, M.R., Zhang, D.: Learning to log: helping developers make informed logging decisions. In: IEEE/ACM 37th IEEE International Conference on Software Engineering, May 2015. https://doi.org/10.1109/icse.2015.60

Download references

Author information

Authors and Affiliations

Chair of Computer Architecture and Parallel Systems, Technical University of Munich, Munich, Germany
Paolo Notaro & Michael Gerndt
Ultra-scale AIOps Lab, Huawei Munich Research Center, Munich, Germany
Paolo Notaro & Jorge Cardoso
Department of Informatics Engineering/CISUC, University of Coimbra, Coimbra, Portugal
Jorge Cardoso

Authors

Paolo Notaro
View author publications
You can also search for this author in PubMed Google Scholar
Jorge Cardoso
View author publications
You can also search for this author in PubMed Google Scholar
Michael Gerndt
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Paolo Notaro .

Editor information

Editors and Affiliations

Zayed University, Dubai, United Arab Emirates
Hakim Hacid
Zayed University, Dubai, United Arab Emirates
Fatma Outay
University of New South Wales, Sydney, NSW, Australia
Hye-young Paik
Huawei Paris Research Center, Paris, France
Amira Alloum
National Research Council C.N.R., Pisa, Italy
Marinella Petrocchi
Deakin University, Waurn Ponds, VIC, Australia
Mohamed Reda Bouadjenek
Macquarie University, Sydney, NSW, Australia
Amin Beheshti
Rochester Institute of Technology, Rochester, NY, USA
Xumin Liu
Université de Paris, Paris, France
Abderrahmane Maaradji

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Notaro, P., Cardoso, J., Gerndt, M. (2021). A Systematic Mapping Study in AIOps. In: Hacid, H., et al. Service-Oriented Computing – ICSOC 2020 Workshops. ICSOC 2020. Lecture Notes in Computer Science(), vol 12632. Springer, Cham. https://doi.org/10.1007/978-3-030-76352-7_15

Download citation

DOI: https://doi.org/10.1007/978-3-030-76352-7_15
Published: 30 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76351-0
Online ISBN: 978-3-030-76352-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Systematic Mapping Study in AIOps

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Multi-source Distributed System Data for AI-Powered Analytics

TrueDetective 4.0: A Big Data Architecture for Real Time Anomaly Detection

LogRCA: Log-Based Root Cause Analysis for Distributed Services

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

A Systematic Mapping Study in AIOps

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

Multi-source Distributed System Data for AI-Powered Analytics

TrueDetective 4.0: A Big Data Architecture for Real Time Anomaly Detection

LogRCA: Log-Based Root Cause Analysis for Distributed Services

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation