Detecting log anomaly using subword attention encoder and probabilistic feature selection

Applied Intelligence

Abstract

A log anomaly is a manifestation of a software system error or a security threat. Detecting such unusual behaviour across logs in real time is the driving force behind large-scale autonomous monitoring technology that can rapidly raise alerts on zero-day attacks. Increasingly, AI methods are being used to process voluminous log datasets and reveal patterns of correlated anomalies. In this paper, we propose an enhanced approach to learning semantic-aware embeddings for logs, called the Subword Encoder Neural network (SEN). Addressing a key limitation of previous semantic log parsing work, the proposed approach learns word vectors at subword-level granularity using an attention encoder strategy. The learnt embeddings reflect the contextual and lexical relationships at the word level. As a result, the learnt word representations can accurately represent new log messages not previously seen by the model. Furthermore, we develop a novel feature distillation algorithm, termed Naive Bayes Feature Selector (NBFS), to extract useful log events. This probabilistic technique examines the occurrence pattern of events and selects only the salient ones that can aid anomaly detection. To the best of our knowledge, this is the first attempt to assign affinity to log events based on the target task. Since the predictions can be traced back to the log messages, the model is also inherently explainable. The model outperforms state-of-the-art methods by a fair margin, achieving a detection F1-score of 0.99 on the benchmark BGL, HDFS and OpenStack log datasets.
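The abstract does not give the details of NBFS, so the following is only a minimal sketch of the general idea of selecting events by their occurrence pattern, not the authors' algorithm. It assumes labelled log sessions (0 for normal, 1 for anomalous) and scores each event by a Laplace-smoothed log-ratio of its presence probability in anomalous versus normal sessions. The function names, the smoothing parameter `alpha`, the `threshold`, and the toy event IDs are all hypothetical.

```python
import math
from collections import Counter


def event_scores(sessions, labels, alpha=1.0):
    """Score each event by log P(event | anomalous) / P(event | normal).

    Probabilities are estimated from per-session presence (Bernoulli
    naive Bayes style) with Laplace smoothing `alpha` (an assumption).
    """
    n_anom = sum(labels)
    n_norm = len(labels) - n_anom
    anom_counts, norm_counts = Counter(), Counter()
    for events, is_anomalous in zip(sessions, labels):
        target = anom_counts if is_anomalous else norm_counts
        target.update(set(events))  # count presence once per session
    scores = {}
    for event in set(anom_counts) | set(norm_counts):
        p_anom = (anom_counts[event] + alpha) / (n_anom + 2 * alpha)
        p_norm = (norm_counts[event] + alpha) / (n_norm + 2 * alpha)
        scores[event] = math.log(p_anom / p_norm)
    return scores


def select_events(scores, threshold=0.5):
    """Keep events whose score magnitude suggests they separate the classes."""
    return {event for event, score in scores.items() if abs(score) >= threshold}


# Toy data: E1 appears in every session, so it carries no signal.
sessions = [["E1", "E2"], ["E1", "E3"], ["E1", "E2"], ["E1", "E4"]]
labels = [0, 1, 0, 1]

scores = event_scores(sessions, labels)
selected = select_events(scores)
print(sorted(selected))
```

In this toy run the event that occurs equally often in both classes (E1) receives a score of zero and is dropped, while events that appear only in normal or only in anomalous sessions are kept. Because each retained event maps back to a log message, a selection of this kind stays traceable.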

Data availability

The syslog data that support the findings of this study are available in the LogHub public repository, with the identifier https://doi.org/10.48550/arXiv.2008.06448

Author information

Corresponding author

Correspondence to M. Hariharan.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hariharan, M., Mishra, A., Ravi, S. et al. Detecting log anomaly using subword attention encoder and probabilistic feature selection. Appl Intell 53, 22297–22312 (2023). https://doi.org/10.1007/s10489-023-04674-6

  • DOI: https://doi.org/10.1007/s10489-023-04674-6
