research-article

LogFlow: Simplified Log Analysis for Large Scale Systems

Authors:

Benoit Pelletier,

Noel De Palma

ICDCN '21: Proceedings of the 22nd International Conference on Distributed Computing and Networking

Pages 116 - 125

https://doi.org/10.1145/3427796.3427808

Published: 05 January 2021

Abstract

Distributed infrastructures generate huge amounts of logs that can provide useful information about the state of the system but can be challenging to analyze. The paper presents LogFlow, a tool that helps human operators analyze logs by automatically constructing graphs of correlations between log entries. The core of LogFlow is an interpretable predictive model based on a Recurrent Neural Network augmented with a state-of-the-art attention layer, from which correlations between log entries are deduced. To handle huge amounts of data, LogFlow also relies on a new log parsing algorithm that can be orders of magnitude faster than the best existing log parsers. Experiments run on several system logs generated by supercomputers and cloud systems show that LogFlow achieves more than 96% accuracy in most cases.
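
The abstract does not detail how correlations are read out of the attention layer, so the following is only a minimal sketch of the general idea, not LogFlow's implementation. It assumes plain dot-product attention over a window of earlier log entries and treats entries whose attention weight meets a threshold as correlation candidates. The function names, the example log templates, the two-dimensional state vectors, and the 0.25 threshold are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, states):
    """Dot-product attention of one query vector over a window of state vectors."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    return softmax(scores)


def correlated_entries(window, weights, threshold):
    """Return (entry, weight) pairs whose attention weight meets the threshold."""
    return [(entry, w) for entry, w in zip(window, weights) if w >= threshold]


# Hypothetical window of parsed log templates with made-up state vectors
# (in a real model these would be recurrent hidden states).
window = ["link down", "job started", "fan speed low", "node unreachable"]
states = [[2.0, 0.0], [0.0, 1.0], [0.1, 0.2], [1.5, 0.1]]
# Stands in for the model state used to predict the next log entry.
query = [1.0, 0.0]

weights = attention_weights(query, states)
edges = correlated_entries(window, weights, threshold=0.25)

for entry, weight in edges:
    print(f"{entry}: {weight:.3f}")
```

In this toy window only "link down" and "node unreachable" pass the threshold; in a correlation graph they would become the candidate predecessors of the predicted entry. A trained model would learn the state vectors, and the threshold would need tuning per dataset.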



Published In

ICDCN '21: Proceedings of the 22nd International Conference on Distributed Computing and Networking

January 2021

252 pages

ISBN:9781450389334

DOI:10.1145/3427796

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

ICDCN '21

ICDCN '21: International Conference on Distributed Computing and Networking 2021

January 5 - 8, 2021

Nara, Japan
