Abstract
Log parsing is the process of extracting logical units from logs generated by systems, devices, or applications. It is of central importance to log analytics and forensics. Many security analytics tools rely on logs to detect, prevent, and mitigate attacks, and it is critical for these tools to extract information from large volumes of logs produced by multiple, evolving sources. Log parsers typically require human intervention, since regular expressions or grammars must be supplied to extract knowledge. Teams of experts are needed to keep these rules up to date, a time-consuming and costly process that is prone to errors and fails when new log types are added. Strategies based on machine learning, by contrast, can automate the parsing of logs, thereby reducing time consumption and human labour. In this paper, we perform an extensive and systematic comparison of log parsing techniques and systems based on machine learning. These include baseline learners (Perceptron, Stochastic Gradient Descent, and Multinomial Naive Bayes), a graphical model (Conditional Random Fields), a pre-trained sequence-to-sequence model (NERLogParser), and a pre-trained language model (BERT). Moreover, we experiment with the Transformer neural network, modelling the named entity recognition task as a sequence-to-sequence generation task, an approach not previously tested in this domain. An extensive set of experiments is carried out on in-scope and out-of-scope datasets to estimate performance on log files from known and unknown log sources. We use multiple evaluation schemes in order to: (i) compare the different systems; and (ii) assess the quality of the extracted information, providing deeper insight into the advantages and disadvantages of each system. Overall, we find that sequence-to-sequence models tend to perform better on both in-scope and out-of-scope data.
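To make the task concrete, the sketch below frames log parsing as token-level entity tagging, the formulation the compared systems share. The label set (TIMESTAMP, HOST, SERVICE, PID, MESSAGE) and the rule-based tagger are illustrative assumptions, standing in for the learned models (Perceptron, CRF, BERT, seq2seq) that would predict such labels instead of matching hand-written patterns; the paper's actual entity schema may differ.

```python
import re

def tag_syslog_line(line):
    """Assign one hypothetical entity label per token of a BSD-syslog-style line.

    A hand-written stand-in for the learned taggers compared in the paper:
    a trained model would predict these labels rather than match a pattern.
    """
    m = re.match(
        r"(?P<ts>\w{3}\s+\d+\s[\d:]+)\s"             # e.g. "Jun 14 15:16:01"
        r"(?P<host>\S+)\s"                           # e.g. "combo"
        r"(?P<svc>[\w.-]+)(?:\[(?P<pid>\d+)\])?:\s"  # e.g. "sshd[19939]:"
        r"(?P<msg>.*)",                              # free-text remainder
        line,
    )
    if not m:
        # Fallback: treat every token as message content.
        return [(tok, "MESSAGE") for tok in line.split()]
    tagged = [(t, "TIMESTAMP") for t in m.group("ts").split()]
    tagged.append((m.group("host"), "HOST"))
    tagged.append((m.group("svc"), "SERVICE"))
    if m.group("pid"):
        tagged.append((m.group("pid"), "PID"))
    tagged += [(t, "MESSAGE") for t in m.group("msg").split()]
    return tagged

line = "Jun 14 15:16:01 combo sshd[19939]: authentication failure"
for token, label in tag_syslog_line(line):
    print(f"{token}\t{label}")
```

The brittleness of such hand-written rules when a new log source appears is exactly the motivation for the learned approaches evaluated in the paper.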
Cite this paper
Chhabra, A., Branco, P., Jourdan, GV., Viktor, H.L. (2022). An Extensive Comparison of Systems for Entity Extraction from Log Files. In: Aïmeur, E., Laurent, M., Yaich, R., Dupont, B., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2021. Lecture Notes in Computer Science, vol 13291. Springer, Cham. https://doi.org/10.1007/978-3-031-08147-7_26
Print ISBN: 978-3-031-08146-0
Online ISBN: 978-3-031-08147-7