Improving the system log analysis with language model and semi-supervised classifier

Multimedia Tools and Applications

Abstract

Mining the vast amount of server-side logging data is an essential step in boosting business intelligence and in facilitating system maintenance for multimedia or IoT-oriented services. Given the volume of the data repository, designers of log analysis systems must carefully balance processing speed against the accuracy of message classification. Conventional keyword-based log monitoring and classification is fast, but it does not scale well to complex systems, especially when the target system is built by a large group of developers who may each encode logging messages differently and whose messages often carry misleading labels. Conversely, many sophisticated approaches are so time-consuming that delayed processing jobs begin to accumulate, and they can hardly support timely decision making. We also suggest that the design of a large-scale online log analysis system should follow the principle of requiring the least prior knowledge, so that unsupervised or semi-supervised solutions are preferred. In this paper, we propose a two-stage machine-learning-based method in which the system logs are regarded as the output of a quasi-natural language: they are first pre-filtered by a perplexity score threshold and then undergo a fine-grained classification procedure. Empirical studies on our web services show that our method has clear advantages in both processing speed and classification accuracy.
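The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one-smoothed bigram model, the whitespace tokenizer, the names `BigramModel` and `two_stage`, the choice to forward only messages whose perplexity exceeds the threshold, and the pluggable `classify` callable are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(message):
    # Hypothetical tokenizer: lowercase, split on whitespace, add boundary markers.
    return ["<s>"] + message.lower().split() + ["</s>"]


class BigramModel:
    """Add-one-smoothed bigram language model (an assumed stand-in for stage one)."""

    def __init__(self, messages):
        self.contexts = Counter()  # how often each token appears as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair appears
        self.vocab = set()
        for message in messages:
            tokens = tokenize(message)
            self.vocab.update(tokens)
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def perplexity(self, message):
        tokens = tokenize(message)
        outcomes = len(self.vocab) + 1  # known tokens plus one bucket for unseen tokens
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + outcomes)
            log_prob += math.log(p)
        transitions = len(tokens) - 1
        return math.exp(-log_prob / transitions)


def two_stage(messages, model, threshold, classify):
    # Stage one: keep only messages the language model finds surprising
    # (assumption: high perplexity marks messages worth a closer look).
    # Stage two: hand the survivors to a finer-grained classifier.
    results = []
    for message in messages:
        if model.perplexity(message) > threshold:
            results.append((message, classify(message)))
    return results
```

In this sketch the language model is trained on routine log lines, so the cheap perplexity check discards most traffic and the more expensive classifier only sees the remainder. The threshold would need to be tuned on real data.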

Notes

  1. We use the term “target system” to refer to the system that produces the logging data to be analyzed.

  2. It may be confusing to use the term “unlabeled”, as log messages commonly carry labels when they are first generated; these labels may be highly obscure and unrelated to their actual meaning.

  3. https://www.splunk.com/en_us/homepage.html

  4. https://radimrehurek.com/gensim/index.html

  5. In the baseline system, we use the keyword-set {“Exception”} to capture the system error log entries, and the keyword-set {“Error”, “Failure”} to capture the operation error log entries.
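The keyword baseline described in note 5 can be sketched as follows. The two keyword sets come from the note; the case-sensitive substring matching, the precedence of system errors over operation errors, the label strings, and the function name `keyword_baseline` are assumptions made for illustration.

```python
# Keyword sets taken from note 5.
SYSTEM_ERROR_KEYWORDS = {"Exception"}
OPERATION_ERROR_KEYWORDS = {"Error", "Failure"}


def keyword_baseline(message):
    # Assumption: case-sensitive substring match, system errors checked first.
    if any(keyword in message for keyword in SYSTEM_ERROR_KEYWORDS):
        return "system_error"
    if any(keyword in message for keyword in OPERATION_ERROR_KEYWORDS):
        return "operation_error"
    return "other"
```

A matcher of this kind is fast, but it relies on developers using the expected words, which is the weakness the abstract attributes to keyword-based monitoring.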

Acknowledgments

This research is supported by the Shanghai University Youth Teacher Training Funding Scheme (ZZslg16054), the National Social Science Foundation (16BXW031), and the Shandong Province Vocational Education Educational Reform Research Project grant “Study on Vocational Colleges’ Professional Building Service Regional Upgrade Industries” (2017209).

Author information

Corresponding author

Correspondence to Ning Cao.

About this article

Cite this article

Li, G., Zhu, P., Cao, N. et al. Improving the system log analysis with language model and semi-supervised classifier. Multimed Tools Appl 78, 21521–21535 (2019). https://doi.org/10.1007/s11042-018-7020-3

  • DOI: https://doi.org/10.1007/s11042-018-7020-3
