Abstract
Mining the vast amount of server-side logging data is an essential step in improving business intelligence and in facilitating system maintenance for multimedia- or IoT-oriented services. Given the sheer volume of the data repository, designers of log analysis systems need to carefully balance processing speed against the accuracy of message classification. Conventional keyword-based log monitoring and classification is sufficiently fast but does not scale well to complex systems, especially when the target system is built by a large group of developers, each of whom may encode logging messages differently and whose messages often carry misleading labels. Conversely, many sophisticated approaches are so time-consuming that delayed processing jobs begin to accumulate, and they can hardly support timely decision-making. We also suggest that the design of a large-scale online log analysis system should follow the principle of requiring the least prior knowledge, so that unsupervised or semi-supervised solutions are preferred. In this paper, we propose a two-stage machine-learning-based method in which system logs are regarded as the output of a quasi-natural language: they are first pre-filtered by a perplexity score threshold and then undergo a fine-grained classification procedure. Empirical studies on our web services show that our method has a clear advantage in both processing speed and classification accuracy.
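To make the first stage more concrete, the following is a minimal sketch of perplexity-based pre-filtering. It is not the authors' implementation: the add-one smoothed unigram model, whitespace tokenization, the rule that high-perplexity lines are the ones passed on, the threshold of 10.0, and the sample log lines are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (an illustrative stand-in for the language model)."""

    def __init__(self, lines):
        # Count whitespace-separated tokens over the training lines.
        self.counts = Counter(tok for line in lines for tok in line.split())
        self.total = sum(self.counts.values())
        # Reserve one extra vocabulary slot for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def perplexity(self, line):
        tokens = line.split()
        if not tokens:
            return float("inf")
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in tokens
        )
        # Perplexity is the exponential of the average negative log-likelihood.
        return math.exp(-log_prob / len(tokens))


def prefilter(lm, lines, threshold):
    """Keep only lines whose perplexity exceeds the (assumed) threshold."""
    return [line for line in lines if lm.perplexity(line) > threshold]


# Hypothetical routine log lines used to fit the model.
routine = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /index.html 200",
    "POST /login 200",
]
lm = UnigramLM(routine)

# Hypothetical incoming lines: one routine, one unusual.
incoming = [
    "GET /index.html 200",
    "NullPointerException at Handler.process line 42",
]
suspicious = prefilter(lm, incoming, threshold=10.0)
```

In this sketch only the unusual line survives the filter, so a slower fine-grained classifier would need to examine a much smaller set of messages.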







Notes
We use the term “target system” to refer to the system that produces the logging data to be analyzed.
The term “unlabeled” may be confusing, since log messages commonly carry labels when they are first generated; these labels may be highly obscure and unrelated to the messages’ actual meaning.
In the baseline system, we use the keyword-set {“Exception”} to capture the system error log entries, and the keyword-set {“Error”, “Failure”} to capture the operation error log entries.
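The keyword baseline described in this note can be sketched as follows. The two keyword sets come from the note itself; the case-sensitive substring matching, the precedence of system errors over operation errors, the label names, and the "other" fallback are assumptions made for illustration.

```python
# Keyword sets as given in the note on the baseline system.
SYSTEM_ERROR_KEYWORDS = {"Exception"}
OPERATION_ERROR_KEYWORDS = {"Error", "Failure"}


def classify_by_keyword(entry):
    """Assign a coarse class to a log entry by substring matching.

    Matching is case-sensitive and system errors are checked first;
    both choices are assumptions, not details taken from the paper.
    """
    if any(keyword in entry for keyword in SYSTEM_ERROR_KEYWORDS):
        return "system_error"
    if any(keyword in entry for keyword in OPERATION_ERROR_KEYWORDS):
        return "operation_error"
    return "other"


# Hypothetical log entries.
print(classify_by_keyword("java.lang.NullPointerException at Foo.bar"))
print(classify_by_keyword("Payment Failure for order 17"))
print(classify_by_keyword("User logged in"))
```

A matcher of this kind is fast, but it flags any entry that happens to contain a keyword, which is how misleading labels end up misclassified.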
Acknowledgments
This research is supported by the Shanghai University Youth Teacher Training Funding Scheme (ZZslg16054), the National Social Science Foundation (16BXW031), and a grant of the Shandong Province Vocational Education Educational Reform Research Project, “Study on Vocational Colleges’ Professional Building Service Regional Upgrade Industries” (2017209).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Li, G., Zhu, P., Cao, N. et al. Improving the system log analysis with language model and semi-supervised classifier. Multimed Tools Appl 78, 21521–21535 (2019). https://doi.org/10.1007/s11042-018-7020-3
DOI: https://doi.org/10.1007/s11042-018-7020-3