Abstract
Log parsing is the process of extracting logical units from logs generated by systems, devices, or applications. It is of central importance to log analytics and forensics. Many security analytics tools rely on logs to detect, prevent, and mitigate attacks, and it is critical for these tools to extract information from large volumes of logs produced by multiple, evolving sources. Log parsers typically require human intervention, since regular expressions or grammars must be supplied to extract knowledge. Teams of experts are needed to keep these rules up to date, a time-consuming and costly process that is prone to errors and fails when new log types are added. Strategies based on machine learning, by contrast, can automate the parsing of logs, thereby reducing time consumption and human labour. In this paper, we perform an extensive and systematic comparison of log parsing techniques and systems based on machine learning. These include baseline learners (Perceptron, Stochastic Gradient Descent, and Multinomial Naive Bayes), a graphical model (Conditional Random Fields), a pre-trained sequence-to-sequence model (NERLogParser), and a pre-trained language model (BERT). Moreover, we experiment with the Transformer neural network, modelling the named entity recognition task as a sequence-to-sequence generation task, an approach not previously tested in this domain. An extensive set of experiments is carried out on in-scope and out-of-scope datasets to estimate performance on log files from known and unknown log sources. We use multiple evaluation schemes in order to: (i) compare the different systems; and (ii) assess the quality of the extracted information, providing deeper insight into the advantages and disadvantages of each system. Overall, we find that sequence-to-sequence models tend to perform better on both in-scope and out-of-scope data.
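To make the task concrete, the sketch below frames log parsing as token-level entity tagging, the formulation the compared systems share. The label set (TIMESTAMP, HOST, SERVICE, PID, MESSAGE) and the rule-based tagger are illustrative assumptions, standing in for the learned models (Perceptron, CRF, BERT, seq2seq) that would predict such labels instead of matching hand-written patterns; the paper's actual entity schema may differ.

```python
import re

def tag_syslog_line(line):
    """Assign one hypothetical entity label per token of a BSD-syslog-style line.

    A hand-written stand-in for the learned taggers compared in the paper:
    a trained model would predict these labels rather than match a pattern.
    """
    m = re.match(
        r"(?P<ts>\w{3}\s+\d+\s[\d:]+)\s"             # e.g. "Jun 14 15:16:01"
        r"(?P<host>\S+)\s"                           # e.g. "combo"
        r"(?P<svc>[\w.-]+)(?:\[(?P<pid>\d+)\])?:\s"  # e.g. "sshd[19939]:"
        r"(?P<msg>.*)",                              # free-text remainder
        line,
    )
    if not m:
        # Fallback: treat every token as message content.
        return [(tok, "MESSAGE") for tok in line.split()]
    tagged = [(t, "TIMESTAMP") for t in m.group("ts").split()]
    tagged.append((m.group("host"), "HOST"))
    tagged.append((m.group("svc"), "SERVICE"))
    if m.group("pid"):
        tagged.append((m.group("pid"), "PID"))
    tagged += [(t, "MESSAGE") for t in m.group("msg").split()]
    return tagged

line = "Jun 14 15:16:01 combo sshd[19939]: authentication failure"
for token, label in tag_syslog_line(line):
    print(f"{token}\t{label}")
```

The brittleness of such hand-written rules when a new log source appears is exactly the motivation for the learned approaches evaluated in the paper.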
Cite this paper
Chhabra, A., Branco, P., Jourdan, GV., Viktor, H.L. (2022). An Extensive Comparison of Systems for Entity Extraction from Log Files. In: Aïmeur, E., Laurent, M., Yaich, R., Dupont, B., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2021. Lecture Notes in Computer Science, vol 13291. Springer, Cham. https://doi.org/10.1007/978-3-031-08147-7_26
Print ISBN: 978-3-031-08146-0
Online ISBN: 978-3-031-08147-7