Log mining to re-construct system behavior: An exploratory study on a large telescope system

https://doi.org/10.1016/j.infsof.2019.06.011

Abstract

Context

A large amount of information about system behavior is stored in logs that record system changes. Such information can be exploited to discover anomalies of a system and the operations that cause them. Given their large size, manual inspection of logs is hard and infeasible in a desired timeframe (e.g., real-time), especially for critical systems.

Objective

This study proposes a semi-automated method for reconstructing sequences of tasks of a system, revealing system anomalies, and associating tasks and anomalies to code components.

Method

The proposed approach uses unsupervised machine learning (Latent Dirichlet Allocation) to discover latent topics in the messages of log events and introduces a novel technique based on pattern recognition to derive the semantics of such topics (topic labelling). The approach has been applied to the big data generated by the ALMA telescope system, consisting of more than 2000 log events collected in about five hours of telescope operation.

Results

With the application of our approach to such data, we were able to model the behavior of the telescope over 16 different observations. We found five different behavior models and three different types of errors. We use the models to interpret each error and discuss its cause.

Conclusions

With this work, we have also been able to discuss some of the known challenges in log mining. The experience we gathered is summarized as lessons learned.

Introduction

System behavior is determined by a complex interplay of technical and non-technical factors in an operational environment (i.e., a set of conditions under which the system operates). Low performance or unscheduled downtime can carry huge costs that seriously concern system managers and engineers. Therefore, understanding and, eventually, forecasting system behavior is of paramount importance, especially in critical systems that must be operational 24/7.

System logs consist of events that keep track of system execution. Log events are automatically generated by the logging service of a system and are typically archived as semi-structured short texts (event texts). Log mining is the process of discovering patterns in log events. Such patterns have been used to model or predict a system’s behavior [11], [17], [22], [38], [47]. Although there are several proposals to standardize the event format [32], [47], [48], no specific standard has been widely adopted, and mining logs becomes a sort of art with the following known challenges:

Short texts. The unstructured part of the log text (i.e., the log message) may be very short (e.g., one or two lines) and may include domain terms (e.g., acronyms). Thus, automated text analysis on individual log texts may not be very efficient [56].

Context investigation. Log data is big and distributed. Data can be generated every millisecond and triggered by any application running in the system. As such, log mining requires a deep understanding of the system context and careful data preprocessing [47].

Sequences of events. System behavior (e.g., system operations) can be encapsulated in more than one log event. Therefore, log mining has recently focused on sequences of log events. Mining logs requires context analysis to define and identify such sequences [17], [47], and the involvement of domain experts is of paramount importance [47].

Observation time window. Over the course of a system’s lifetime, anything from software upgrades to minor configuration changes can drastically alter the meaning or character of the logs. Thus, a system must be observed in a period in which no specific upgrade, test, or drastic change happens [43].

Fully-automated log mining. Over the years, several tools have been developed to support log mining (e.g., [38], [46]), although no fully-automated method is yet available and most of the mining effort is still manual and left to researchers, who may have limited knowledge of the system and the environment in which it operates [12].

Fault localization. Existing log mining techniques provide useful insights about the possible locations of faults, although developers must still invest significant effort to identify exactly the faults to be fixed [20], [48].

In this study, we illustrate the case of a system orchestrating the Atacama Large Millimeter Array telescope (ALMA). ALMA is the largest astronomical project in existence. It consists of a single telescope of revolutionary design, composed of 66 high precision antennas. The system comprises 200 computers controlling the 66 antennas, positioned in the Atacama desert in Chile at an altitude of 5000 m. The system is operational 24/7 and its logging service generates logs every millisecond, for a total of up to 30 GB per day (the log database size is measured in terabytes), from six different subsystems triggered by over twenty heterogeneous teams at twelve participating institutes around the world [21]. As such, mining the logs of the ALMA system poses additional challenges that are specific to its context but that may also occur in other similar large, complex, or critical systems.

Log mining at ALMA (Fig. 3) is performed by software engineers with the goal of finding the most ‘optimal’ query to interrogate the log dataset and reconstruct the cause of an error by visualizing time series of log events that occur in a time window containing the timestamp of the error reported in the ticket. All activities are performed manually, and the ‘optimum’ is defined and reached when experts decide so (Table 3). Automating the procedure is therefore a real need, and it poses specific log mining challenges:

Incomplete information. Deciphering the information in log events might not be simple. One reason concerns the user interface used to enter the event description. Such an interface may not incorporate all relevant information [12] or may not adapt to change at the same pace as the ALMA system. As a result, some events may carry incomplete information about the system operations.

Identical timestamp. Log events can also occur less than a millisecond apart, in which case the logging system may not be able to distinguish one from the other. Such events are reported with an identical timestamp and their order of occurrence is lost. It is therefore not straightforward to follow the logic of the system just by reading the individual log events sequentially, and it becomes important to analyze them in clusters.
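
The manual procedure and the identical-timestamp issue described above can be illustrated with a minimal Python sketch. It is not the tooling used at ALMA: the event records, the message texts, the five-second window, and the helper names `events_in_window` and `cluster_by_timestamp` are all hypothetical. The sketch first keeps the events that fall in a time window around an error timestamp, then groups events that share a timestamp into unordered clusters, since their relative order cannot be recovered.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical events: (timestamp with millisecond precision, message).
EVENTS = [
    ("2019-01-01T10:00:00.001", "Antenna pointing started"),
    ("2019-01-01T10:00:00.001", "Subscan configured"),
    ("2019-01-01T10:00:00.250", "Correlator armed"),
    ("2019-01-01T10:00:02.000", "Timeout waiting for data"),
    ("2019-01-01T10:05:00.000", "Observation ended"),
]

def events_in_window(events, error_time, half_width):
    """Keep the events whose timestamp lies within error_time +/- half_width."""
    lo, hi = error_time - half_width, error_time + half_width
    return [(ts, msg) for ts, msg in events
            if lo <= datetime.fromisoformat(ts) <= hi]

def cluster_by_timestamp(events):
    """Group messages sharing a timestamp; order inside a group is unknown."""
    clusters = defaultdict(set)
    for ts, msg in events:
        clusters[ts].add(msg)
    # ISO 8601 strings in the same format sort chronologically.
    return [(ts, clusters[ts]) for ts in sorted(clusters)]

# Assumed error timestamp taken from a (hypothetical) ticket.
error_time = datetime.fromisoformat("2019-01-01T10:00:02.000")
window = events_in_window(EVENTS, error_time, timedelta(seconds=5))
clusters = cluster_by_timestamp(window)
```

Representing each cluster as a set rather than a list makes the loss of ordering explicit: any later analysis has to treat same-timestamp events as a group.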

The goal of the study is to design an approach to reconstruct from system logs the (anomalous) behavior of a system as sequences of system’s tasks and identify the low-level software components (i.e., files and methods) involved in the tasks.

Once a system anomaly is detected, maintainers can automatically associate the telescope’s tasks and code components with the misbehavior and avoid the complex manual analysis of large sets of individual log events, such as the one described in Fig. 3 for the ALMA operations. To achieve this goal, we propose a semi-automated approach that combines an existing machine learning method for log mining, Latent Dirichlet Allocation (LDA), with pattern recognition for topic labelling. LDA is one of the most popular tools for text mining and is used to discover a hidden thematic structure (i.e., latent topics) in collections of textual documents [9]. LDA is trained on a set of textual documents and produces a set of latent topics, as bags of words extracted from the documents, and a model that gives the probability that a new textual document belongs to any of these topics (i.e., the posterior probability). We use LDA to discover the tasks of the ALMA telescope as latent topics of documents defined by sets of event messages. Topics are therefore represented as bags of words from the ALMA log vocabulary.

The process of associating semantics with bags of words is called topic labelling. It is typically performed manually by domain experts, who interpret the words of a topic within the domain context [37]. The semantics expressed by the bags of words may not always be enough to distinguish one topic from another. Thus, in our approach, the information associated with each topic is enriched: each topic is represented as a set of patterns of sequences of event messages recurring over documents of the ALMA corpus. The richer semantics of such patterns may guide experts and researchers to suitable labels better than individual messages or words. As an illustrative example, we applied our overall method to a sample of logs of the ALMA system and reflect on the lessons learned and on how we overcame some of the challenges in log mining.
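
To make the role of LDA more concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the implementation used in the study, and a real analysis would normally rely on an established library. The function name `lda_gibbs`, the hyperparameter values, and the toy documents (each a list of tokens standing in for the words of event messages) are assumptions made for illustration. The sketch returns the per-document topic proportions for the training documents and the per-topic word counts that form the bags of words; it does not score new documents.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                           # tokens in topic k
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional p(z = t | all other tokens).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic proportions for each training document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Hypothetical documents: tokens taken from sets of event messages.
DOCS = [
    ["antenna", "pointing", "antenna", "slew", "pointing"],
    ["correlator", "dump", "correlator", "data", "dump"],
    ["antenna", "slew", "pointing", "antenna"],
    ["data", "correlator", "dump", "data"],
]
theta, topic_word = lda_gibbs(DOCS, n_topics=2)
```

Each row of `theta` is a probability distribution over topics for one document, and sorting the counts in `topic_word[k]` gives the most relevant words of topic `k`, which is the bag-of-words view that topic labelling starts from.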

Summarizing, the contributions of this work are the following:

  • An iterative method to identify the number of latent topics in log messages based on the coherence and independence of topics as sets of most relevant messages.

  • A novel approach to label latent topics in log messages that exploits the natural time-ordering of sequences of log events and uses pattern recognition on such sequences. By reading patterns of sequences of messages instead of bags of words, researchers can better determine the semantics of a topic and associate a label with it.

  • A novel approach to model a system’s behavior as a set of system tasks by means of latent topics in sequences of log messages. By leveraging the information carried by log events, the model can also be used to localize the software classes and methods that are involved in each task. This is particularly useful for testers to trace in software the cause of system misbehavior.

  • An application of the approach to the real case study of the complex system orchestrating the ALMA telescope. With the proposed model of system behavior, we are also able to describe the system’s errors and their cause that would otherwise have been hidden behind the highly technical language of the logs.

  • A final reflection on the challenges in log mining existing in literature and arising from the specific context of research.
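
The idea behind the second contribution, labelling topics from recurring patterns of message sequences rather than from bags of words, can be sketched as follows. The excerpt does not specify the pattern recognition technique, so this sketch assumes the simplest possible notion of a pattern: a contiguous run of messages of fixed length that appears in several documents. The function name `recurring_patterns`, its parameters, and the sample messages are hypothetical.

```python
from collections import defaultdict

def recurring_patterns(docs, length=2, min_docs=2):
    """Return the contiguous message sequences of the given length that
    occur in at least min_docs documents, with their document counts."""
    seen_in = defaultdict(set)
    for d, messages in enumerate(docs):
        for i in range(len(messages) - length + 1):
            seen_in[tuple(messages[i:i + length])].add(d)
    return {p: len(ds) for p, ds in seen_in.items() if len(ds) >= min_docs}

# Hypothetical documents: time-ordered event messages of three observations.
DOCS = [
    ["start scan", "point antenna", "acquire data", "end scan"],
    ["start scan", "point antenna", "timeout", "end scan"],
    ["calibrate", "point antenna", "acquire data", "end scan"],
]
patterns = recurring_patterns(DOCS, length=2, min_docs=2)
```

A pattern such as `("point antenna", "acquire data")` keeps the order of the messages, so it tells a reader more about the underlying task than the unordered words "antenna", "data", and "acquire" would.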

The paper is organized as follows. We discuss the related work in Section 2. In Section 4, we introduce the research questions and illustrate the context analysis and the study samples. The approach is described in Section 5. In Section 6, we answer the research questions, and we summarize the lessons learned and the threats to validity in Sections 7 and 8, respectively. Finally, we conclude in Section 9.

Section snippets

Related work

Three major research areas are relevant for the present study: 1) log mining, 2) topic modeling, and 3) topic labelling. In the following, we review the relevant literature according to these major perspectives.

Topic modeling and Latent Dirichlet Allocation

A topic model is a statistical algorithm for text mining used to discover the hidden thematic structure (i.e., latent topics) of a collection of documents. In this work, the use of topic models is motivated by their capability to reduce dimensionality, which may be useful to raise the level of abstraction in logs from low-level event messages to higher-level system tasks. Specifically, documents are first built to represent the behavior of a system and then topic modeling is used to discover

Study design

Our work uses an experimental protocol for exploratory analysis [53] on how to apply topic modeling to system’s log mining. To this aim, in this section, we illustrate the protocol as 1) the research questions derived from the goal in Section 1.1, 2) the study context and the decision taken thereafter that drove the experiment settings and the analysis of the data, 3) the data collection and selection and description of the study samples, 4) the results are qualitatively described in Section 6

Study approach

The approach proposed as the contribution of this paper is overviewed in Fig. 4 and discussed in the following sections.

The approach is designed to be used for any system whose log events carry information and structure (e.g., code data) similar to those of our SUT and whose behavior can be inferred from sequences of events (e.g., with begin and end events). Domain experts also have a key role in the proposed approach as the domain is highly technical and the validation of the model requires

Results

In this section, we illustrate the application of the method to the 1752 log events of the testbed sample (Table 1) and answer the research questions.

RQ1. Can the behavior of a system be reconstructed from system logs? With the procedure described in Section 5.10, we reconstruct an SB as a set of tasks performed by the ACS system, where an SB corresponds to a document D and tasks correspond to the topics Ti found by the LDA model. The behavior model is then defined by a specific set of tasks

Lessons learned in mining a large corpus of log events

In this section, we briefly summarize how we overcame new and old challenges of mining logs to extract system behavior.

Short texts. To overcome the limited vocabulary of short text messages, we implemented two strategies: 1) the behavior of a system is modelled by hidden topics in sequences of event messages rather than by individual messages, and 2) topics are labelled using patterns of such messages rather than bags of words.

Context investigation. A thorough context analysis and validation of the results

Threats to validity

Given the explorative nature of this work, the relevant class of threats falls under construct validity.

Firstly, the event messages may still contain a vocabulary specific to non-ordinary activities of the system (e.g., ad hoc maintenance), although we carefully selected the logs from the Observation phase of the telescope. To understand whether this is the case, the vocabularies of the messages of the testbed and the validation sample have been compared. The testbed vocabulary contains

Conclusion

In this work, we propose a method that exploits the information contained in log events to reconstruct system behavior as a set of telescope tasks. The work uses LDA analysis to identify the number of such tasks and a pooling schema improvement based on pattern recognition to label such topics. The application of our method illustrates how to mine about 2000 events and reconstruct the tasks of 16 sequences of events, each describing an observation of the ALMA telescope. The method is also able

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work has been partially supported by the GAUSS Italian research project (funded by MIUR, PRIN 2015, Contract: 2015KWREMX) and DEBASS research project (funded by Free University of Bolzano).

References (56)

  • K. Benoit, D. Muhr, K. Watanabe, D. Muhr, Smart, 2017....
  • C. Bertero et al.

    Experience report: Log mining using natural language processing and application to anomaly detection

    2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE)

    (2017)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • J. Chang et al.

    Reading tea leaves: How humans interpret topic models

    Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS’09

    (2009)
  • E. Chuah et al.

    Diagnosing the root-causes of failures from cluster log files

    2010 International Conference on High Performance Computing

    (2010)
  • M. Cinque et al.

    Event logs for the analysis of software failures: a rule-based approach

    IEEE Trans. Softw. Eng.

    (2013)
  • K. Damevski et al.

    Interactive exploration of developer interaction traces using a hidden Markov model

    Proceedings of the 13th International Conference on Mining Software Repositories, MSR ’16

    (2016)
  • K. Damevski et al.

    Predicting future developer behavior in the ide using topic models

    IEEE Trans. Softw. Eng.

    (2018)
  • ESO, Error definition, 2016....
  • R.W. Featherstun et al.

    Using syslog message sequences for predicting disk failures

    Proceedings of the 24th International Conference on Large Installation System Administration, LISA’10

    (2010)
  • I. Fronza et al.

    Failure prediction based on log files using random indexing and support vector machines

    J. Syst. Softw.

    (2013)
  • L. Gazzola et al.

    Automatic software repair: A survey

    IEEE Trans. Softw. Eng.

    (2019)
  • J.P. Gil et al.

    Operational logs analysis at alma observatory based on elk stack

  • M. Goldstein et al.

    Experience report: Log-based behavioral differencing

    2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE)

    (2017)
  • A.D. Gordon

    Classification (2nd ed.)

    (1999)
  • T.L. Griffiths et al.

    Finding scientific topics

    Proc. Natl. Acad. Sci.

    (2004)
  • T.L. Griffiths et al.

    Finding scientific topics

    PNAS 2004

    (2004)
  • B. Gruen et al.

    topicmodels: An r package for fitting topic models

    J. Statistical Softw.

    (2011)