Abstract
IT systems of today are becoming larger and more complex, rendering their human supervision more difficult. Artificial Intelligence for IT Operations (AIOps) has been proposed to tackle modern IT administration challenges thanks to AI and Big Data. However, past AIOps contributions are scattered, unorganized and missing a common terminology convention, which renders their discovery and comparison impractical. In this work, we conduct an in-depth mapping study to collect and organize the numerous scattered contributions to AIOps in a unique reference index. We create an AIOps taxonomy to build a foundation for future contributions and allow an efficient comparison of AIOps papers treating similar problems. We investigate temporal trends and classify AIOps contributions based on the choice of algorithms, data sources and the target components. Our results show a recent and growing interest towards AIOps, specifically to those contributions treating failure-related tasks (62%), such as anomaly detection and root cause analysis.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Abreu, R., Zoeteweij, P., Gemund, A.J.V.: Spectrum-based multiple fault localization. In: IEEE/ACM International Conference on Automated Software Engineering, November 2009. https://doi.org/10.1109/ase.2009.25
Aguilera, M.K., Mogul, J.C., Wiener, J.L., Reynolds, P., Muthitacharoen, A.: Performance debugging for distributed systems of black boxes. ACM SIGOPS Oper. Syst. Rev. 37(5), 74–89 (2003). https://doi.org/10.1145/1165389.945454
Lerner, A.: AIOps Platforms, August 2017. https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/
Attariyan, M., Chow, M., Flinn, J.: X-ray: automating root-cause diagnosis of performance anomalies in production software. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, pp. 307–320, October 2012. https://doi.org/10.5555/2387880.2387910
Bahl, P., Chandra, R., Greenberg, A., Kandula, S., Maltz, D.A., Zhang, M.: Towards highly reliable enterprise network services via inference of multi-level dependencies. In: Proceedings of the 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM (2007). https://doi.org/10.1145/1282380.1282383
Barham, P., Isaacs, R., Mortier, R., Narayanan, D.: Magpie: online modelling and performance-aware systems. In: Proceedings of the 9th Conference on Hot Topics in Operating Systems, HOTOS 2003, Lihue, Hawaii, vol. 9, p. 15, May 2003. https://doi.org/10.5555/1251054.1251069
Bodik, P., Goldszmidt, M., Fox, A., Woodard, D.B., Andersen, H.: Fingerprinting the datacenter: automated classification of performance crises. In: Proceedings of the 5th European Conference on Computer Systems - EuroSys 2010 (2010). https://doi.org/10.1145/1755913.1755926
Chalermarrewong, T., Achalakul, T., See, S.C.W.: Failure prediction of data centers using time series and fault tree analysis. In: IEEE 18th International Conference on Parallel and Distributed Systems, December 2012. https://doi.org/10.1109/icpads.2012.129
Chen, M., Kiciman, E., Fratkin, E., Fox, A., Brewer, E.: Pinpoint: problem determination in large, dynamic Internet services. In: Proceedings of IEEE International Conference on Dependable Systems and Networks (2002). https://doi.org/10.1109/dsn.2002.1029005
Chow, M., Meisner, D., Flinn, J., Peek, D., Wenisch, T.F.: The mystery machine: end-to-end performance analysis of large-scale internet services. In: OSDI 2014: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, pp. 217–231 (2014). https://doi.org/10.5555/2685048.2685066
Cohen, I., Goldszmidt, M., Kelly, T., Symons, J., Chase, J.S.: Correlating instrumentation data to system states: a building block for automated diagnosis and control. In: Proceedings of the 6th USENIX Conference on Symposium on Operating Systems Design & Implementation, OSDI 2004 (2004). https://doi.org/10.5555/1251254.1251270
Costa, C.H., Park, Y., Rosenburg, B.S., Cher, C.Y., Ryu, K.D.: A system software approach to proactive memory-error avoidance. In: SC 2014: International Conference for High Performance Computing, Networking, Storage and Analysis, November 2014. https://doi.org/10.1109/sc.2014.63
Dang, Y., Lin, Q., Huang, P.: AIOps: real-world challenges and research innovations. In: IEEE/ACM 41st International Conference on Software Engineering: Companion, May 2019. https://doi.org/10.1109/icse-companion.2019.00023
Davis, N.A., Rezgui, A., Soliman, H., Manzanares, S., Coates, M.: FailureSim: a system for predicting hardware failures in cloud data centers using neural networks. In: IEEE 10th International Conference on Cloud Computing (CLOUD), Jun 2017. https://doi.org/10.1109/cloud.2017.75
Du, M., Li, F., Zheng, G., Srikumar, V.: DeepLog: anomaly detection and diagnosis from system logs through deep learning. In: Proceedings of ACM SIGSAC Conference on Computer and Communications Security (2017). https://doi.org/10.1145/3133956.3134015
Garg, S., van Moorsel, A., Vaidyanathan, K., Trivedi, K.: A methodology for detection and estimation of software aging. In: Proceedings Ninth International Symposium on Software Reliability Engineering (Cat. No.98TB100257). IEEE Computer Society (1998). https://doi.org/10.1109/issre.1998.730892
Islam, T., Manivannan, D.: Predicting application failure in cloud: a machine learning approach. In: 2017 IEEE International Conference on Cognitive Computing (ICCC), Jun 2017. https://doi.org/10.1109/ieee.iccc.2017.11
Jalali, S., Wohlin, C.: Systematic literature studies: database searches vs. backward snowballing. In: Proceedings of the 2012 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 29–38, September 2012. https://doi.org/10.1145/2372251.2372257
Kandula, S., Katabi, D., Vasseur, J.P.: Shrink: a tool for failure diagnosis in IP networks. In: Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data - MineNet 2005 (2005). https://doi.org/10.1145/1080173.1080178
Kobbacy, K.A.H., Vadera, S., Rasmy, M.H.: AI and OR in management of operations: history and trends. J. Oper. Res. Soc. 58(1), 10–28 (2007). https://doi.org/10.1057/palgrave.jors.2602132
Lakhina, A., Crovella, M., Diot, C.: Diagnosing network-wide traffic anomalies. In: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM 2004. ACM Press (2004). https://doi.org/10.1145/1015467.1015492
Lakhina, A., Crovella, M., Diot, C.: Mining anomalies using traffic feature distributions. In: Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications - SIGCOMM 2005. ACM Press (2005). https://doi.org/10.1145/1080091.1080118
Li, Y., et al.: Predicting node failures in an ultra-large-scale cloud computing platform: an AIOps solution. ACM Trans. Software Eng. Methodol. 29(2), 1–24 (2020). https://doi.org/10.1145/3385187
Liang, Y., Zhang, Y., Xiong, H., Sahoo, R.: Failure prediction in IBM BlueGene/L event logs. In: Seventh IEEE International Conference on Data Mining (ICDM) (2007). https://doi.org/10.1109/icdm.2007.46
Lin, F., Beadon, M., Dixit, H.D., Vunnam, G., Desai, A., Sankar, S.: Hardware remediation at scale. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), June 2018. https://doi.org/10.1109/dsn-w.2018.00015
Lin, Q., Zhang, H., Lou, J.G., Zhang, Y., Chen, X.: Log clustering based problem identification for online service systems. In: Proceedings of the 38th ACM International Conference on Software Engineering Companion (ICSE) (2016). https://doi.org/10.1145/2889160.2889232
Menzies, T., Greenwald, J., Frank, A.: Data mining static code attributes to learn defect predictors. IEEE Trans. Software Eng. 33(1), 2–13 (2007). https://doi.org/10.1109/TSE.2007.256941
Chen, M.Y., Accardi, A., Kiciman, E., Lloyd, J., Patterson, D., Fox, A., Brewer, E.: Path-based failure and evolution management. In: Proceedings of the 1st Conference on Symposium on Networked Systems Design and Implementation, NSDI 2004, San Francisco, California, vol. 1, p. 23, March 2004. https://doi.org/10.5555/1251175.1251198
Moody, A., Bronevetsky, G., Mohror, K., Supinski, B.R.D.: Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2010. https://doi.org/10.1109/sc.2010.18
Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: Proceedings of the 2005 ACM International Conference on Measurement and Modeling of Computer Systems - SIGMETRICS 2005 (2005). https://doi.org/10.1145/1064212.1064220
Mukwevho, M.A., Celik, T.: Toward a smart cloud: a review of fault-tolerance methods in cloud systems. IEEE Trans. Serv. Comput. 1 (2018). https://doi.org/10.1109/tsc.2018.2816644
Natella, R., Cotroneo, D., Duraes, J.A., Madeira, H.S.: On fault representativeness of software fault injection. IEEE Trans. Software Eng. 39(1), 80–96 (2013). https://doi.org/10.1109/tse.2011.124
Nguyen, H., Shen, Z., Tan, Y., Gu, X.: FChain: toward black-box online fault localization for cloud systems. In: IEEE 33rd International Conference on Distributed Computing Systems, July 2013. https://doi.org/10.1109/icdcs.2013.26
Petersen, K., Vakkalanka, S., Kuzniarz, L.: Guidelines for conducting systematic mapping studies in software engineering: an update. Inf. Softw. Technol. 64, 1–18 (2015). https://doi.org/10.1016/j.infsof.2015.03.007
Pitakrat, T., Okanović, D., van Hoorn, A., Grunske, L.: Hora: architecture-aware online failure prediction. J. Syst. Softw. 137, 669–685 (2018). https://doi.org/10.1016/j.jss.2017.02.041
Podgurski, A., et al.: Automated support for classifying software failure reports. In: Proceedings of IEEE 25th International Conference on Software Engineering (2003). https://doi.org/10.1109/icse.2003.1201224
Salfner, F., Malek, M.: Using hidden semi-Markov models for effective online failure prediction. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS), October 2007. https://doi.org/10.1109/srds.2007.35
Samir, A., Pahl, C.: A controller architecture for anomaly detection, root cause analysis and self-adaptation for cluster architectures. In: International Conference on Adaptive and Self-Adaptive Systems and Applications (2019). 10993/42062
Shao, Q., Chen, Y., Tao, S., Yan, X., Anerousis, N.: Efficient ticket routing by resolution sequence mining. In: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD (2008). https://doi.org/10.1145/1401890.1401964
Sharma, A.B., Chen, H., Ding, M., Yoshihira, K., Jiang, G.: Fault detection and localization in distributed systems using invariant relationships. In: 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2013. https://doi.org/10.1109/dsn.2013.6575304
Vaidyanathan, K., Trivedi, K.: A comprehensive model for software rejuvenation. IEEE Trans. Dependable Secure Comput. 2(2), 124–137 (2005). https://doi.org/10.1109/tdsc.2005.15
Xu, H., et al.: Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications. In: Proceedings of the 2018 World Wide Web Conference on World Wide Web (2018). https://doi.org/10.1145/3178876.3185996
Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles - SOSP 2009 (2009). https://doi.org/10.1145/1629575.1629587
Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., Pasupathy, S.: SherLog: error diagnosis by connecting clues from run-time logs. In: ACM SIGARCH Computer Architecture News, vol. 38, no. 1, pp. 143–154 (2010). https://doi.org/10.1145/1735970.1736038
Zhang, K., Xu, J., Min, M.R., Jiang, G., Pelechrinis, K., Zhang, H.: Automated IT system failure prediction: a deep learning approach. In: IEEE International Conference on Big Data (2016). https://doi.org/10.1109/bigdata.2016.7840733
Zhang, S., et al.: Syslog processing for switch failure diagnosis and prediction in datacenter networks. In: IEEE/ACM 25th International Symposium on Quality of Service (IWQoS), June 2017. https://doi.org/10.1109/iwqos.2017.7969130
Zheng, S., Ristovski, K., Farahat, A., Gupta, C.: Long short-term memory network for remaining useful life estimation. In: IEEE International Conference on Prognostics and Health Management (ICPHM) (2017). https://doi.org/10.1109/icphm.2017.7998311
Zhou, W., Tang, L., Li, T., Shwartz, L., Grabarnik, G.Y.: Resolution recommendation for event tickets in service management. In: IFIP/IEEE International Symposium on Integrated Network Management (IM) (2015). https://doi.org/10.1109/inm.2015.7140303
Zhu, J., He, P., Fu, Q., Zhang, H., Lyu, M.R., Zhang, D.: Learning to log: helping developers make informed logging decisions. In: IEEE/ACM 37th IEEE International Conference on Software Engineering, May 2015. https://doi.org/10.1109/icse.2015.60
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Notaro, P., Cardoso, J., Gerndt, M. (2021). A Systematic Mapping Study in AIOps. In: Hacid, H., et al. Service-Oriented Computing – ICSOC 2020 Workshops. ICSOC 2020. Lecture Notes in Computer Science(), vol 12632. Springer, Cham. https://doi.org/10.1007/978-3-030-76352-7_15
Download citation
DOI: https://doi.org/10.1007/978-3-030-76352-7_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-76351-0
Online ISBN: 978-3-030-76352-7
eBook Packages: Computer ScienceComputer Science (R0)