An ensemble-based approach to the security-oriented classification of low-level log traces

https://doi.org/10.1016/j.eswa.2020.113386

Highlights

  • We propose to classify traces as insecure/secure based on example logs and security-breach models

  • We face a setting where the traces are sequences of events that do not refer to the models' activities

  • A meta-classification scheme is used to mix two example-driven classifiers and a model-driven one

  • The proposed framework was empirically proven to improve on example-driven and model-driven approaches

Abstract

Traditionally, Expert Systems have found a natural application in the behavioral analysis of processes. In fact, they have proved effective in the tasks of interpreting the data collected during process executions and of analyzing these data with the aim of diagnosing/detecting anomalies. In this context, we focus on log data generated by executions of business processes, and consider the issue of detecting “insecure” process instances, involving some kind of security breach (e.g. attacks, frauds). We propose a hybrid framework for accomplishing a security-oriented classification of activity-unaware traces, i.e., traces consisting of “low-level” events with no explicit reference to the “high-level” activities the analysts are typically familiar with. The framework integrates two classification approaches traditionally used as alternative ways to decide on the “secureness” of process traces: (i) a model-driven approach, using knowledge of behavioral models expressed at the abstraction level of the activities, and (ii) an example-driven approach, exploiting the availability of event sequences labeled by experts as symptomatic of “secure” or “insecure” behavior. The core of our solution is a meta-classifier combining (i) and (ii) thanks to a probabilistic Montecarlo mechanism that allows the traces to be simultaneously viewed as sequences of low-level events and of high-level activities. The framework has been empirically proven effective in jointly exploiting the two aforementioned forms of knowledge/expertise, typically coming from different experts, and in acting as a sort of “super-expert” classification tool. Its accuracy and efficiency make it a solid basis for implementing a novel kind of expert system for the security-oriented monitoring/analysis of business processes.

Introduction

Interpreting, predicting, repairing, and monitoring system behaviors are among the prominent application contexts of Expert Systems. In this regard, both the industry and the research community of Business Process Management (BPM) have recently devoted a great deal of attention to solutions for assisting/replacing human experts in the security-oriented analysis of business process logs, i.e., the problem of analyzing the traces collected by systems monitoring the business process enactments to detect the process instances that involved security breaches of different kinds (such as frauds, attacks, misuses). As a matter of fact, “insecure” process instances may cause severe damage to organizations/enterprises, including loss of image and reputation as well as fines and penalties. This is the main reason behind the recent research efforts (Accorsi, Stocker, 2012, Accorsi, Stocker, Müller, 2013, Bose, van der Aalst, 2013, Jans, van der Werf, Lybaert, Vanhoof, 2011) to employ process mining techniques (van der Aalst, 2016) as a support for a security-oriented analysis of process logs and, in particular, as the core of Expert Systems for auditing applications (Zerbino, Aloini, Dulmin, & Mininno, 2018).

The approaches in the literature addressing the classification problem of recognizing process instances as “secure” or “insecure” on the basis of what is reported in the corresponding log traces can be partitioned into the following two classes:

  • (i)

    Example-driven methods (Bose, van der Aalst, 2013, Cuzzocrea, Folino, Guarascio, Pontieri, 2016b, Leontjeva, Conforti, Di Francescomarino, Dumas, Maggi, 2015, Nguyen, Dumas, Rosa, Maggi, Suriadi, 2014): these approaches require the presence of example data, in the form of a set LAET of Annotated Example Traces. The annotation of a trace says whether some security breach is known to have occurred during the process enactment that generated the trace. These annotations are then exploited to induce a (preferably interpretable, e.g. rule-based) classification model that, when applied to a “new” trace, decides whether the corresponding process instance was affected by a security breach or not.

  • (ii)

    Model-driven methods (Fazzinga, Flesca, Furfaro, & Pontieri, 2018a): these approaches require the presence of security-breach models, i.e. descriptions of the behaviors that are known to be symptomatic of risks for the security (or, alternatively, of compliance models representing the allowed behaviors as proposed in (Accorsi & Stocker, 2012)). Then, a process instance is classified as “secure” or “insecure” depending on whether the corresponding trace conforms to these models.

Clearly, example-driven and model-driven methods use two mutually independent sources of knowledge/information (namely, annotated log data and explicit security-breach models) in isolation, as a basis for their respective classification mechanisms. As a matter of fact, these two different sources often co-exist in real-life application contexts, and they tend to separately provide partial and different views of insecure process-execution patterns. That is, models and annotated data can be simultaneously available, and they can be complementary, since what is described by the models as insecure behavior may not be captured by any annotated example trace, and vice versa (we will elaborate more on this complementarity aspect in the core of the paper, when discussing where models and example traces come from). Our work stems from these arguments: its core idea is to devise a joint framework where models and annotated data are used synergistically.

Problem Setting: High-Level Activities and Low-Level Events. In the scenario above, we tackle the security-oriented classification problem in a challenging setting: the events stored in the log have no clear reference to the activities that caused them. This situation occurs frequently in practice, and in fact several works in the process-mining literature have dealt with this issue (Baier, Mendling, Weske, 2014, Fazzinga, Flesca, Furfaro, Pontieri, 2018a, van der Aalst, Leopold, Reijers, 2017). For instance, this happens when the tracing/enactment system (in charge of generating the log) records “low-level” operations, instead of their translation to the “high-level” activities in terms of which analysts typically reason. Thus, there is a mismatch between the alphabet of symbols used to describe the actions in the log and the alphabet of activities used by the analysts to refer, in the behavioral models, to the actions performed during the process enactments. Furthermore, the correspondence between the types of activities (in terms of which the process models are defined) and the types of events (recorded in the log traces) might not be one-to-one, as described in the following example.

Example 1

Consider the case of an administrative office where the high-level activities are “Receive a paperwork”, “Analyze a paperwork”, “Reject a paperwork”, “Request more information about a paperwork”, “Acquire more information about a paperwork”, and “Approve a paperwork”. If the enactment of the process is not guided by a structured process-aware system (e.g., a Workflow Management System), it may happen that the log is generated by a tracing system that just captures generic low-level operations, such that there is no bijective mapping between these operations and the activities above. For instance, while activity Analyze corresponds one-to-one to the low-level operation format check (meaning that a suitable tool is run to perform a preliminary validation of the paperwork), it may happen that different executions of the same activity result in different operations and are thus stored as instances of different events in the log: Request may be stored as an occurrence of the event send email or of make a call (depending on whether the person who issued the paperwork was contacted via electronic mail or phone call). On the other hand, different activities may result in the same operation. Thus, executions of activities Receive and Acquire may both be encoded in the log as occurrences of the event receive email; likewise, executions of Reject, Approve, and Request could all be stored in terms of the event send email. In such a case, given a log trace ϕ = ⟨receive email, format check, send email⟩, there is some uncertainty in reconstructing the actual sequence of activities underlying ϕ. In fact, ϕ admits six interpretations in terms of high-level activities, among which ⟨Receive, Analyze, Reject⟩ and ⟨Receive, Analyze, Approve⟩.
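To make this interpretation ambiguity concrete, the following minimal Python sketch enumerates all candidate activity sequences for the trace ϕ above. It is not taken from the paper; the event-to-activity dictionary is simply a hypothetical rendering of Example 1.

```python
from itertools import product

# Hypothetical many-to-many mapping mirroring Example 1: each low-level event
# may have been generated by one of several high-level activities.
event_to_activities = {
    "receive email": ["Receive", "Acquire"],
    "format check":  ["Analyze"],
    "send email":    ["Reject", "Approve", "Request"],
}

def interpretations(trace):
    """Enumerate every activity sequence that could have produced `trace`."""
    candidate_sets = [event_to_activities[e] for e in trace]
    return [list(combo) for combo in product(*candidate_sets)]

phi = ["receive email", "format check", "send email"]
for interp in interpretations(phi):
    print(interp)  # 2 * 1 * 3 = 6 candidate interpretations, as in Example 1
```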

Limits of the current approaches and Challenges. The state-of-the-art classification approaches that can be used in Expert Systems targeted at deciding on the “secureness” of business process instances are not able to jointly use the complementary knowledge on the insecure behaviors encoded in the annotated traces and the breach models, respectively. As a matter of fact, combining classifiers of types (i) and (ii) is far from straightforward. The point is that leveraging the availability of the behavioral models in (ii) requires reasoning in terms of activities, while exploiting annotated event traces in (i) requires reasoning in terms of events. Unfortunately, simultaneously considering both these abstraction levels is a hard task. This complexity derives from the fact that, in the general case, the mapping between activities and events is many-to-many, since there can be shared functionalities (i.e., events that may be generated by the executions of different activities) as well as “polymorphic” activities (i.e., activities whose different executions can generate different events). This implies that a trace (i.e., sequence of events) admits different interpretations (i.e., it can result from different sequences of activity instances). Therefore, the combination of classifiers exploiting sources of knowledge/information at different abstraction levels requires a suitable mechanism for “unifying” these standpoints.

Our contribution: a joint security-oriented classification framework. Our contribution is the definition and validation of a novel framework that is able to classify new low-level event traces as either insecure or secure by jointly considering the following information/background knowledge:

  • (1)

    a log LAET of annotated example traces, whose traces are sequences of low-level events providing examples of correct interpretation and classification. That is, each trace ϕ in LAET is specified along with its correct interpretation I*(ϕ) (i.e., the sequence of high-level activities that actually generated ϕ) and a security flag SF(ϕ), which says whether the process instance encoded by ϕ should be classified as insecure;

  • (2)

    two sets W and SBM of process models and security-breach models, describing some knowledge about the processes’ behaviors and about the behaviors symptomatic of security breaches, respectively. To cope with lowly-structured process management contexts (where the event-activity mismatch issue discussed above is more likely to occur), these models are expressed according to the popular declarative paradigm introduced in (van der Aalst, Pesic, & Schonenberg, 2009a), using presence/absence constraints and/or precedence/causality constraints over the activities to encode what is known about these different kinds of behaviors;

  • (3)

    a probabilistic many-to-many mapping μ between events and activities, which encodes in probabilistic terms the correspondence between the events reported in the logs and the activities describing the actions in the models. As we will see, the mapping μ may be viewed as derived information, since it can be obtained by computing statistics over the log LAET (a minimal sketch of this estimation is given right after this list).
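As an illustration of how μ could be derived from LAET, the sketch below estimates, for every event type e, the relative frequency with which e was generated by each activity in the ground-truth interpretations. This is only a plausible reading under simplifying assumptions (position-aligned event/activity pairs, plain frequency estimation); the data and function names are illustrative, not the authors' actual code.

```python
from collections import Counter, defaultdict

def estimate_mu(annotated_log):
    """annotated_log: iterable of (event_trace, activity_trace, security_flag)
    triples, where the two traces are position-aligned and of equal length.
    Returns mu as a nested dict: mu[event][activity] = estimated probability."""
    counts = defaultdict(Counter)                 # counts[event][activity]
    for events, activities, _flag in annotated_log:
        for e, a in zip(events, activities):
            counts[e][a] += 1
    mu = {}
    for e, act_counts in counts.items():
        total = sum(act_counts.values())
        mu[e] = {a: c / total for a, c in act_counts.items()}
    return mu

# Hypothetical annotated traces in the spirit of Example 1:
laet = [
    (["receive email", "format check", "send email"],
     ["Receive", "Analyze", "Reject"], True),
    (["receive email", "format check", "send email"],
     ["Receive", "Analyze", "Approve"], False),
]
print(estimate_mu(laet)["send email"])   # {'Reject': 0.5, 'Approve': 0.5}
```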

To accomplish this security-classification task, we propose to adopt a novel ensemble-based kind of classification model, named Hybrid Security-Oriented Trace Classifier (or HSOTC for short), that incorporates and integrates three separate classifiers: a model-driven classifier CM and two example-driven classifiers CE and CA, trained to classify process instances represented as sequences of events and of high-level process activities, respectively. An HSOTC constitutes the core of our solution approach, which technically consists in first constructing an HSOTC (by exploiting both the annotated log data and the given process/breach models), and then applying it to any novel log trace to be classified. Essentially, on the one hand, the two classifiers CE and CA are induced by mainly applying standard machine learning methods (Witten, Frank, Hall, & Pal, 2016) to the event traces and to their ground-truth interpretations in LAET, respectively. On the other hand, CM implements the model-driven classification logic, which relies on checking whether activity traces conform to the given security-breach models. The three heterogeneous base classifiers CE, CA, and CM are integrated by means of an induced meta-classifier, which is meant to classify new event traces by looking at them both as they are (i.e., as event sequences) and through their possible interpretations as activity sequences. To use the meta-classifier on new traces (i.e., mere event sequences, lacking a ground-truth interpretation), the classification process is embedded in a Montecarlo scheme that generates a statistically representative sample of activity-level interpretations of the trace, consistently with the mapping μ and the given process models.
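The sketch below outlines, in simplified form, how such a Montecarlo combination could work at classification time. It is not the HSOTC algorithm of the paper: it assumes independent per-event sampling from μ, plain averaging of the sampled scores, and a generic meta-classifier interface, and it ignores the additional requirement that sampled interpretations be consistent with the process models W.

```python
import random

def classify_trace(event_trace, mu, C_E, C_A, C_M, meta, n_samples=100):
    """Toy combination of an event-level classifier C_E, an activity-level
    classifier C_A, and a model-driven checker C_M via Montecarlo sampling
    of interpretations drawn from mu. All classifier arguments are assumed
    to be callables returning a probability/score of 'insecure'."""
    # (1) Event-level view: apply the example-driven classifier directly.
    p_event = C_E(event_trace)

    # (2) Activity-level view: sample candidate interpretations from mu and
    #     average the scores of the activity-level and model-driven classifiers.
    p_act_samples, p_model_samples = [], []
    for _ in range(n_samples):
        interp = [
            random.choices(list(mu[e].keys()),
                           weights=list(mu[e].values()))[0]
            for e in event_trace
        ]
        p_act_samples.append(C_A(interp))    # example-driven, activity level
        p_model_samples.append(C_M(interp))  # e.g. 1.0 if interp matches a breach model
    p_activity = sum(p_act_samples) / n_samples
    p_model = sum(p_model_samples) / n_samples

    # (3) Meta level: an induced meta-classifier combines the three scores
    #     into the final probability that the trace is insecure.
    return meta(p_event, p_activity, p_model)
```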

The framework has been validated experimentally in terms of efficiency and efficacy, over logs conforming to real-life process models. The experimental analysis has addressed different aspects: besides investigating the gain in accuracy compared with the separate base classifiers CE, CA, and CM, we have studied the efficiency of the approach (showing that the proposed approach is feasible in real-life settings), and we have conducted a thorough sensitivity analysis. Herein, the gain in effectiveness (w.r.t. the base classifiers) and the feasibility of our approach have been shown to be preserved when moving across different technical choices of the core components of an HSOTC, such as the trace encoding and machine learning methods that are employed to train the example-driven classifiers CE and CA.

Relevance of the contribution. We here summarize the major characteristics of the proposed framework that make it a relevant contribution in the field of Expert Systems (these points will be further discussed at the end of the article, in Section 7):

  • novelty: Our two-level classification scheme somehow represents a “super-expert”, since it integrates the ways of reasoning on security issues that characterize two distinct types of security experts. In fact, human experts who are familiar with the behavior of processes at the high abstraction level of activities are rarely familiar with the semantics of the low-level log events and their relationships. Vice versa, the latter aspects are typically clearer to other stakeholders, dealing with operational details and/or involved in the tracing process;

  • usefulness: nowadays, there is a strong interest in effective solutions for assisting/replacing experts in diagnosing security issues in the field of business process management. In particular, a smart automated tool for classifying the large amounts of log traces produced by a business process is a fast and convenient alternative to the traditional semi-automated auditing methods. Indeed, besides helping the analyst quickly retrieve and inspect the history of the most suspicious process instances, both the breach-detection models considered in our framework and the traces automatically classified with it can be further exploited as a rich source of knowledge for second-level security analyses (e.g., security-breach instance explanation). Moreover, if used to classify a stream of process traces in a continuous fashion, our framework can be a basis for implementing added-value run-time services, such as notifying alerts, monitoring attack statistics and security-oriented KPIs (i.e., Key Performance Indicators), and triggering auditing procedures or risk-mitigation counter-measures;

  • technical depth: simultaneously reasoning at both the “high” and “low” abstraction levels of activities and events is a complex task (due to the many-to-many mapping between events and activities). Thus, the design of the proposed framework has required an elaborate solution, where a classical classification schema induced from low-level traces is augmented with a rule-based classifier exploiting a Montecarlo mechanism for generating alternative high-level interpretations of the log events;

  • effectiveness: our classification framework is shown to be effective in a variety of settings. In particular, it is able to provide a more accurate classification than those provided by employing classifiers separately working at the abstraction levels of activities and events.

These characteristics have pushed us to pursue the direction of using the proposed framework as the core of a novel Expert System for the security-oriented analysis of business process logs. The architecture of the system (whose design and implementation are ongoing activities conducted in the Technological District on Cyber Security, supported by the Italian Ministry of University and Research) is sketched in Fig. 1. Here, our hybrid framework is the main component of the back-end side, where it plays the key roles of managing the knowledge of secure/insecure behaviors and using it to automatically infer a classification of log traces. The result of the classification is used by the front-end side, where a “security cockpit” is provided for the analyst, along with suitable interfaces for feeding the knowledge base. More details on the components of the architecture will be provided in the discussion of the future work (Section 7.4).

Organization. The remainder of this paper is structured as follows. Section 2 introduces some basic notions and notations used throughout the paper (concerning, in particular, log traces, process/breach models, the mapping between log events and process activities, and trace interpretations). Then, Section 3 provides the reader with an informal high-level description of the problem addressed by our work and of the proposed solution approach. A technically deeper illustration of the approach is given in Section 4, which both provides a formal definition of the proposed HSOTC model and presents two algorithms that allow for discovering an instance of such a model and for applying it to new log traces, respectively. The experimental analysis is discussed in detail in Section 5, while Section 6 is devoted to providing an overview of related research works and to comparing them with our proposal. Some concluding remarks (also covering a discussion of the significance, impact and limits of our proposal) and directions of future work are finally presented in Section 7.


Preliminary notions and notation

Basics: logs, traces, processes, activities and events. An instance w of a process is a sequence ⟨a1, …, an⟩ of instances of (high-level) activities. In turn, each activity instance ai generates a (low-level) event ei. We assume the presence of a tracking system that records the executions of every event. Therefore, the execution of w is recorded by the tracking system in terms of the sequence ϕ = ⟨e1, …, en⟩. In particular, ϕ is called trace, and the set of traces L recorded by the tracking system in

How to classify log traces as insecure/secure? Our solution approach

Our work aims at solving the following security-classification problem: given a trace ϕ composed of low-level events, what is the probability pins(ϕ) that ϕ is insecure, i.e., that ϕ involves a security breach? To answer this question, we exploit two sources of information on insecure behaviors: an example-based one, represented by a log LAET of annotated example traces, and a model-based one, represented by the two sets W and SBM of process models and security-breach models, respectively.

As

Algorithms for discovering and using an ensemble-based security classifier

This section first illustrates, in Section 4.1, a novel ensemble-based type of classification model, named Hybrid Security-Oriented Trace Classifier (HSOTC). An HSOTC is the core of our approach, whose fundamental steps consist in first learning an HSOTC from training data, and then applying it to novel traces, in order to support the security-oriented analysis of these traces. These steps will be discussed in detail in Section 4.2, which illustrates a Montecarlo-based algorithm that employs

Experiments

The framework proposed here was tested over log data concerning real (lowly-structured) business processes, enacted within a service agency. Essentially, the tests performed were conceived having the following main research questions in mind:

  • RQ1:

    What is the accuracy of the “baseline”, that is, the accuracy of the classification obtained by separately using current example-driven and model-driven security classification methods? And, specifically, to what extent does the accuracy worsen for each of

Related work

The idea of addressing the security-oriented classification of log traces by combining example-driven and model-driven approaches was first discussed in (Fazzinga, Folino, Furfaro, & Pontieri, 2018b), that is a very preliminary version of this article. In (Fazzinga et al., 2018b), the motivations for resorting to such a combined approach were debated, but no concrete solutions were devised and experimentally validated.

In the rest of this section, major related research works are overviewed,

Summary and value of the contribution

Proposed classification framework. A framework has been proposed for detecting insecure process instances, involving some kind of security breach (e.g. frauds, attacks, misuse), based on their associated log traces, in the challenging setting where the traces consist of “low-level” events that cannot be mapped deterministically to high-level process activities. Differently from previous solutions, the framework has been devised to use two sources of information: (i) explicit knowledge

CRediT authorship contribution statement

Bettina Fazzinga: Conceptualization, Investigation, Formal analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing, Validation. Francesco Folino: Conceptualization, Investigation, Formal analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review & editing, Validation. Filippo Furfaro: Conceptualization, Investigation, Formal analysis, Methodology, Software, Visualization, Writing - original draft, Writing - review &

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (55)

  • P.N. Bennett et al. Probabilistic combination of text classifiers using reliability indicators: Models and results. Proc. of the 25th Annual Intl. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2002)

  • R.J.C. Bose et al. Abstractions in process mining: A taxonomy of patterns. Proc. of the 7th Intl. Conf. on Business Process Management (BPM) (2009)

  • R.J.C. Bose et al. Discovering signature patterns from event logs. Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM) (2013)

  • L. Breiman. Random forests. Machine Learning (2001)

  • C. Cortes et al. Support-vector networks. Machine Learning (1995)

  • A. Cuzzocrea et al. A multi-view multi-dimensional ensemble learning approach to mining business process deviances. IEEE Intl. Joint Conf. on Neural Networks (IJCNN) (2016)

  • A. Cuzzocrea et al. A robust and versatile multi-view learning framework for the detection of deviant business process instances. Intl. Journal of Cooperative Information Systems (2016)

  • M. Diligenti et al. Bridging logic and kernel machines. Machine Learning (2012)

  • G. Ditzler et al. Learning in nonstationary environments: A survey. IEEE Computational Intelligence Magazine (2015)

  • J.E. van Engelen et al. A survey on semi-supervised learning. Machine Learning (2019)

  • B. Fazzinga et al. Online and offline classification of traces of event logs on the basis of security risks. Journal of Intelligent Information Systems (2018)

  • B. Fazzinga et al. Combining model- and example-driven classification to detect security breaches in activity-unaware logs. Proc. of the 26th OTM 2018 Conferences - CoopIS (2018)

  • F. Folino et al. Business process deviance mining

  • P. Fournier-Viger et al. A survey of itemset mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2017)

  • E. Frank et al. Weka - A machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook, 2nd ed. (2010)

  • J. Gama et al. A survey on concept drift adaptation. ACM Computing Surveys (CSUR) (2014)

  • M. Kubat et al. Learning when negative examples abound. Proc. of the 9th European Conf. on Machine Learning (ECML) (1997)