An Extensive Comparison of Systems for Entity Extraction from Log Files

  • Conference paper
Foundations and Practice of Security (FPS 2021)

Abstract

Log parsing is the process of extracting logical units from logs generated by systems, devices or applications. It is central to log analytics and forensics: many security analytics tools rely on logs to detect, prevent and mitigate attacks, and these tools must extract information from large volumes of logs produced by multiple, evolving sources. Log parsers typically require human intervention, because regular expressions or grammars must be provided to extract knowledge. Teams of experts are needed to keep these rules up to date, a time-consuming, costly and error-prone process that fails when new logs are added. Strategies based on machine learning, on the other hand, can automate log parsing and thereby reduce the time and human labour involved. In this paper, we perform an extensive and systematic comparison of log parsing techniques and systems based on machine learning. These include baseline learning solutions (Perceptron, Stochastic Gradient Descent and Multinomial Naive Bayes), a graphical model (Conditional Random Fields), a pre-trained sequence-to-sequence model (NERLogParser) and a pre-trained language model (BERT). Moreover, we experiment with the Transformer neural network, modelling the Named Entity Recognition task as a sequence-to-sequence generation task, an approach not previously tested in this domain. An extensive set of experiments is carried out on in-scope and out-of-scope datasets to estimate performance on log files from known and unknown log sources. We use multiple evaluation schemes in order to (i) compare the different systems and (ii) understand the quality of the information extracted, providing deeper insight into the advantages and disadvantages of the different systems. Overall, we found that sequence-to-sequence models tend to perform better on both in-scope and out-of-scope data.
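To make the sequence-to-sequence framing mentioned in the abstract concrete, the following minimal Python sketch shows one way a log line could be paired with a label sequence and scored at the token level. Everything in it is an assumption made for illustration: the log line, the label names (`timestamp`, `host`, `service`, `message`), the whitespace tokenisation and the helper functions `to_seq2seq_pair` and `token_accuracy` are hypothetical and are not taken from the paper's datasets, label set or evaluation schemes.

```python
# Illustrative sketch only: the log line, labels and helpers are hypothetical,
# not the datasets, label set or metrics used in the paper.

LOG_LINE = "Jan 12 06:25:41 host1 sshd[2212]: Failed password for root"


def to_seq2seq_pair(tokens, labels):
    """Return (source, target) strings: the log tokens and one label per token."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    return " ".join(tokens), " ".join(labels)


def token_accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Whitespace tokenisation is an assumption; real systems may tokenise differently.
tokens = LOG_LINE.split()

# Hypothetical gold labels, one per token.
gold = [
    "timestamp", "timestamp", "timestamp",
    "host", "service",
    "message", "message", "message", "message",
]

# A hypothetical model output that mislabels the service token.
predicted = list(gold)
predicted[4] = "message"

source, target = to_seq2seq_pair(tokens, gold)
accuracy = token_accuracy(gold, predicted)
```

In this framing the model reads `source` and is trained to generate `target`. Token-level accuracy is only one simple way to score such output; the abstract refers to several evaluation schemes without detailing them here.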

Author information

Corresponding author

Correspondence to Anubhav Chhabra.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Chhabra, A., Branco, P., Jourdan, GV., Viktor, H.L. (2022). An Extensive Comparison of Systems for Entity Extraction from Log Files. In: Aïmeur, E., Laurent, M., Yaich, R., Dupont, B., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2021. Lecture Notes in Computer Science, vol 13291. Springer, Cham. https://doi.org/10.1007/978-3-031-08147-7_26

  • DOI: https://doi.org/10.1007/978-3-031-08147-7_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08146-0

  • Online ISBN: 978-3-031-08147-7

  • eBook Packages: Computer Science, Computer Science (R0)
