Improving the system log analysis with language model and semi-supervised classifier

Li, Guofu; Zhu, Pengjia; Cao, Ning; Wu, Mei; Chen, Zhiyi; Cao, Guangsheng; Li, Hongjun; Gong, Chenjing

doi:10.1007/s11042-018-7020-3

Published: 23 March 2019

Volume 78, pages 21521–21535, (2019)

Multimedia Tools and Applications

Guofu Li¹,²,
Pengjia Zhu³,
Ning Cao⁴ (ORCID: orcid.org/0000-0001-6430-3586),
Mei Wu⁵,
Zhiyi Chen¹,
Guangsheng Cao⁶,
Hongjun Li⁶ &
Chenjing Gong⁶

Abstract

Mining the vast amount of server-side logging data is an essential step in building business intelligence and in facilitating system maintenance for multimedia and IoT-oriented services. Given the sheer volume of the data, designers of log analysis systems must carefully balance processing speed against the accuracy of message classification. Conventional keyword-based log monitoring and classification is sufficiently fast, but it does not scale well to complex systems, especially when the target system is built by a large group of developers who each encode logging messages differently and whose messages often carry misleading labels. Conversely, many sophisticated approaches are slow enough that delayed processing jobs begin to accumulate, so they can hardly support timely decision making. We also suggest that the design of a large-scale online log analysis system should follow the principle of requiring the least prior knowledge, which favors unsupervised or semi-supervised solutions. In this paper, we propose a two-stage machine-learning-based method in which the system logs are regarded as the output of a quasi-natural language: they are first pre-filtered by a perplexity score threshold and then undergo a fine-grained classification procedure. Empirical studies on our web services show that our method has a clear advantage in both processing speed and classification accuracy.
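
The first stage described in the abstract, pre-filtering by a perplexity score threshold, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the add-one smoothed bigram model, the whitespace tokenizer, the names `BigramModel` and `pre_filter`, the sample log lines, and in particular the choice to forward entries whose perplexity is above the threshold to the second stage.

```python
import math
from collections import Counter


def tokenize(line):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return line.lower().split()


class BigramModel:
    """Add-one smoothed bigram language model (illustrative assumption)."""

    def __init__(self, lines):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = {"</s>"}
        for line in lines:
            tokens = ["<s>"] + tokenize(line) + ["</s>"]
            vocab.update(tokens)
            # Every token except the last acts as a bigram context.
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # One extra slot stands in for all unseen tokens.
        self.vocab_size = len(vocab) + 1

    def perplexity(self, line):
        tokens = ["<s>"] + tokenize(line) + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            log_prob += math.log(numerator / denominator)
        # Perplexity is the exponential of the average negative log-likelihood.
        return math.exp(-log_prob / (len(tokens) - 1))


def pre_filter(model, lines, threshold):
    # Assumption: entries the model finds surprising (high perplexity)
    # are the ones passed on to the fine-grained classifier.
    return [line for line in lines if model.perplexity(line) > threshold]


routine = ["user login ok", "user login ok", "user logout ok"]
model = BigramModel(routine)
incoming = ["user login ok", "NullPointerException at handler"]
forwarded = pre_filter(model, incoming, threshold=5.0)
```

In this sketch the routine entry scores a low perplexity and is dropped, while the unfamiliar entry exceeds the threshold and is forwarded. The threshold value is arbitrary here and would need tuning on real data.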

Notes

We use the term “target system” to refer to the system that produces the logging data to be analyzed.
It may be confusing to use the term “unlabeled”, as log messages commonly carry labels when they are first generated, and these labels may be highly obscure and unrelated to the actual meaning of the messages.
https://www.splunk.com/en_us/homepage.html
https://radimrehurek.com/gensim/index.html
In the baseline system, we use the keyword-set {“Exception”} to capture the system error log entries, and the keyword-set {“Error”, “Failure”} to capture the operation error log entries.
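
The keyword baseline in the last note can be sketched as follows. Only the two keyword sets come from the note. The function name `classify_by_keyword`, the category labels, case-sensitive substring matching, and the rule that the system error check runs first are assumptions made for illustration.

```python
# Keyword sets as given in the note on the baseline system.
SYSTEM_ERROR_KEYWORDS = {"Exception"}
OPERATION_ERROR_KEYWORDS = {"Error", "Failure"}


def classify_by_keyword(entry):
    # Assumption: case-sensitive substring match, system errors checked first.
    if any(keyword in entry for keyword in SYSTEM_ERROR_KEYWORDS):
        return "system_error"
    if any(keyword in entry for keyword in OPERATION_ERROR_KEYWORDS):
        return "operation_error"
    # Hypothetical catch-all label for entries that match no keyword.
    return "other"


labels = [
    classify_by_keyword("java.lang.NullPointerException at Handler.run"),
    classify_by_keyword("Payment Failure for order 1042"),
    classify_by_keyword("user login ok"),
]
```

A matcher like this is fast, but it misses any entry whose wording differs from the keywords, for example a lowercase "error", which is the kind of brittleness the abstract attributes to keyword-based classification.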

Acknowledgments

This research is supported by the Shanghai University Youth Teacher Training Funding Scheme (ZZslg16054), the National Social Science Foundation (16BXW031), and a grant from the Shandong Province Vocational Education Educational Reform Research Project “Study on Vocational Colleges’ Professional Building Service Regional Upgrade Industries” (2017209).

Author information

Authors and Affiliations

College of Communication and Art Design, University of Shanghai for Science and Technology, Shanghai, China
Guofu Li & Zhiyi Chen
Computer Science and Informatics, University College Dublin, Dublin, Ireland
Guofu Li
State Street Corporation, Boston, MA, USA
Pengjia Zhu
College of Information Engineering, Sanming University, Sanming, China
Ning Cao
School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan, China
Mei Wu
College of Information Engineering, Qingdao Binhai University, Qingdao, China
Guangsheng Cao, Hongjun Li & Chenjing Gong

Corresponding author

Correspondence to Ning Cao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, G., Zhu, P., Cao, N. et al. Improving the system log analysis with language model and semi-supervised classifier. Multimed Tools Appl 78, 21521–21535 (2019). https://doi.org/10.1007/s11042-018-7020-3

Received: 12 July 2018
Revised: 22 November 2018
Accepted: 03 December 2018
Published: 23 March 2019
Issue Date: 15 August 2019
DOI: https://doi.org/10.1007/s11042-018-7020-3
