Abstract
Log anomalies are manifestations of software system errors or security threats. Detecting such unusual behaviour across logs in real time is the driving force behind large-scale autonomous monitoring technology that can rapidly raise alerts on zero-day attacks. Increasingly, AI methods are being used to process voluminous log datasets and reveal patterns of correlated anomalies. In this paper, we propose an enhanced approach to learning semantic-aware embeddings for logs, called the Subword Encoder Neural network (SEN). Addressing a key limitation of previous semantic log parsing work, the proposed method learns word vectors at subword-level granularity using an attention encoder strategy. The learnt embeddings reflect the contextual and lexical relationships at the word level. As a result, the learnt word representations precisely capture new log messages not previously seen by the model. Furthermore, we develop a novel feature distillation algorithm, termed the Naive Bayes Feature Selector (NBFS), to extract useful log events. This probabilistic technique examines the occurrence pattern of events and selects only the salient ones that can aid anomaly detection. To the best of our knowledge, this is the first attempt to associate affinity to log events based on the target task. Since the predictions can be traced to the log messages, the approach is inherently explainable too. The model outperforms state-of-the-art methods by a fair margin. It achieves a detection F1-score of 0.99 on the benchmark BGL, HDFS and OpenStack log datasets.
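The idea of selecting salient log events from their occurrence pattern can be illustrated with a minimal sketch. This is not the authors' NBFS algorithm, whose details are not given in the abstract; it is a generic Naive Bayes style filter that scores each event by the absolute log-ratio of its smoothed presence probability in anomalous versus normal sequences. The function name `select_salient_events`, the parameters `top_k` and `alpha`, and the event IDs `E1` to `E3` are all hypothetical.

```python
import math
from collections import Counter


def select_salient_events(sequences, labels, top_k=2, alpha=1.0):
    """Rank log events by how strongly their presence separates
    anomalous (label 1) from normal (label 0) sequences.

    Illustrative assumption: score = |log P(event | anomalous) / P(event | normal)|
    with Laplace smoothing `alpha`. This is a stand-in, not the published NBFS.
    """
    n_anom = sum(1 for y in labels if y == 1)
    n_norm = len(labels) - n_anom

    anom_counts, norm_counts = Counter(), Counter()
    for events, y in zip(sequences, labels):
        target = anom_counts if y == 1 else norm_counts
        target.update(set(events))  # count presence once per sequence

    scores = {}
    for event in set(anom_counts) | set(norm_counts):
        # Smoothed Bernoulli estimates of event presence per class.
        p_anom = (anom_counts[event] + alpha) / (n_anom + 2 * alpha)
        p_norm = (norm_counts[event] + alpha) / (n_norm + 2 * alpha)
        scores[event] = abs(math.log(p_anom / p_norm))

    # Highest score first; ties broken by event name for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:top_k], scores


# Hypothetical toy data: E1 occurs everywhere, E2 only in normal
# sequences, E3 only in anomalous ones.
sequences = [["E1", "E2"], ["E1", "E2"], ["E1", "E3"], ["E1", "E3"]]
labels = [0, 0, 1, 1]
selected, scores = select_salient_events(sequences, labels)
```

In this toy data, an event that appears equally often in both classes (`E1`) scores zero and is dropped, while events tied to one class are kept. Because each retained feature is a concrete log event, a prediction built on them can be traced back to specific log messages.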
Data availability
The syslog data that support the findings of this study are available in the LogHub public repository: https://doi.org/10.48550/arXiv.2008.06448
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Hariharan, M., Mishra, A., Ravi, S. et al. Detecting log anomaly using subword attention encoder and probabilistic feature selection. Appl Intell 53, 22297–22312 (2023). https://doi.org/10.1007/s10489-023-04674-6
DOI: https://doi.org/10.1007/s10489-023-04674-6