
Information Systems

Volume 73, March 2018, Pages 1-24

Efficiently interpreting traces of low level events in business process logs

https://doi.org/10.1016/j.is.2017.11.001

Highlights

  • Interpret the traces of low-level events in multi-process logs as high-level activity instances.

  • Many-to-many mapping between events and activities requires dealing with uncertainty.

  • Capability of dealing with complex activities (generating multiple events).

Abstract

Process mining methods have been proven effective in turning historical log data into actionable process knowledge. However, most of them work under the assumption that the events reported in the logs can be easily mapped to well-defined process activities, which are the terms in which analysts are used to reasoning about the processes’ behaviors. We here consider the challenging scenario where this assumption does not hold: the log traces are sequences of low-level operations with no explicit reference to the corresponding high-level process activities. In this setting, we face the fundamental problem of bringing the log traces to the abstraction level of the analyst’s perspective. Formally, given a trace Φ, and on the basis of a high-level behavioral description of the processes, we search for every possible interpretation ⟨σ, W⟩ of Φ, where σ is a sequence of high-level activities whose execution may have generated the sequence of low-level operations Φ, and, in turn, W is a process that may have triggered the execution of σ. We address this problem probabilistically, and propose a framework that builds a compact representation of Φ’s interpretations, each associated with a probability score. This probability measures how likely it is that the associated interpretation is the correct one, and it is evaluated by adopting a revision paradigm guided by the background knowledge provided by the processes’ models. Notably, our approach can deal with “complex” activities (i.e., each generating a sequence of low-level operations, rather than a single one), and with the case that the traces encode process instances exhibiting some deviation from the expected behaviors encoded in the process models.

Introduction

The analysis of log data describing executions of business processes has attracted the attention of researchers for decades [2], [3], with the ultimate goal of improving the effectiveness and the performance of processes. Thanks to the increasing diffusion of automated tracing systems and the abundance of log data, this topic has gained momentum with the growth of the Process Mining research field, which addresses the “confrontation between event data and process models” [4] through new process-aware data analysis tasks, such as: inducing a process model [5], detecting deviations from a normative process model [6], quantifying “how much” a log and a model conform to each other [7], and supporting advanced query-based process analytics [8], [9], [10], [11]. However, all the approaches and tools developed in this field require that each log event can be mapped to well-defined activity concepts, corresponding to some high-level view of the process. Unfortunately, this assumption often does not hold in practice: in the logs of many lowly-structured processes, the events just represent low-level operations, with no clear reference to the business activities that were carried out through these operations, as shown in the following example.

Example 1 Running example

Consider the case (inspired by a case study discussed in Section 9) of a phone company, where two business processes are carried out: a process W1 for the activation of services, and an issue-management process W2.

An abstract description of the behaviors of these processes is available in the form of loose process models, which specify which high-level activities compose each process, as well as several temporal constraints over the execution of the activities. The activities of processes W1 and W2 are shown in Fig. 1 via process-activity links. In addition, assume that the following constraints are known to be satisfied by the two processes’ instances: (i) every instance of either process starts and ends with an execution of activities R and N, respectively, and these activities cannot be executed multiple times in the instance; (ii) in every instance of W1, any execution of activity P (resp. D) must be followed (resp. cannot be followed) by an execution of activity D (resp. P).

All these activities are performed by executing low-level operations (e.g., email exchanges, phone calls, database accesses), supported by a (non workflow-based) IT system, which also stores each execution of these operations as the occurrence of an event (type) for some process instance.

Fig. 1 reports all the operations, each regarded as an event and denoted by a Greek letter, as well as their mapping to the high-level activities. In particular, the graph on the left side summarizes all the possible mappings between activities and events, as well as process-activity relationships. Specifically, a link between an activity and an event means that the activity can generate an instance of that event when executed (e.g., an instance of activity N can produce an instance of event δ or ξ). A link between a process and an activity means that instances of the activity might be executed in any instance of the process (e.g., an instance of W1 might generate instances of D).

Notably, the event-activity mapping is many to many: for example, activity G can produce an instance of either event δ or event γ, while an instance of event δ can be generated by the execution of either activity G or N. Thus, any execution of a process activity generates an instance of one of the events associated with the activity, and the sequence of all the event instances that are produced for an entire process instance is stored in the log as the trace of the process instance. For example, an instance of W1 corresponding to the activity sequence R I G P N might be stored in the log in the form of a trace (i.e., a sequence of event instances) α¯ β¯ γ¯ β¯ ξ¯ or of a trace α¯ β¯ δ¯ β¯ δ¯, where α¯ (resp. β¯, γ¯, etc.) denotes an instance of event α (resp. β, γ, etc.).
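This generation mechanism can be sketched in code. The dictionary below reconstructs only the event-activity links explicitly mentioned in the example (the full mapping of Fig. 1 is not reproduced here), and a Cartesian product enumerates every trace an activity sequence can emit, since each activity independently produces one of its associated events.

```python
from itertools import product

# Partial event-activity links, reconstructed from the running example only
# (Fig. 1 in the paper defines the complete relation).
events_of = {
    "R": ["alpha"],
    "I": ["beta"],
    "P": ["beta"],
    "G": ["gamma", "delta"],
    "N": ["delta", "xi"],
}

def possible_traces(activity_seq):
    """Enumerate every low-level trace an activity sequence can generate:
    each activity independently emits one of its candidate events."""
    return [list(combo) for combo in product(*(events_of[a] for a in activity_seq))]

traces = possible_traces(["R", "I", "G", "P", "N"])
# R I G P N can emit 1*1*2*1*2 = 4 traces, including the two from the example:
assert ["alpha", "beta", "gamma", "beta", "xi"] in traces
assert ["alpha", "beta", "delta", "beta", "delta"] in traces
```

The converse direction, recovering the activity sequence from an observed trace, is the interpretation problem the paper addresses: the same product structure read backwards, pruned by the process models.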

In a lowly-structured process management scenario like that illustrated above, where no workflow models are used to control and trace the execution of business activities, we want to address a novel process mining problem, stated informally as follows.

Given a log L, containing traces generated by an arbitrary number of business processes, and a set W of (partial) behavioral models for the processes, we want to interpret each trace Φ in L by establishing: (1) the process whose execution generated the sequence of events stored in Φ, and (2) for each low-level event e in Φ, which activity generated e as a result of its execution.

The problem is faced in the following challenging setting: (i) the events in the log do not directly refer to any process activity, and the candidate event-activity mappings form a many-to-many relation (in particular, a low-level operation can be used, as a shared functionality, to execute different activities, even in the same process instance); (ii) each process is executed in a lowly-structured flexible way, and its associated model just encodes simple behavioral constraints (similar to those in the example and in some compliance checking [12] and declarative modeling [13], [14] frameworks) that are known to hold in all (or most of) the process’ instances; (iii) different processes may share many activities, and may produce many common activity sequences. Consequently, a trace may have multiple activity sequences “explaining” it and, in turn, each of these sequences may comply with multiple process models.

For instance, with regard to Example 1, assume that we want to interpret a trace Φex=α¯ β¯ ξ¯ ξ¯ (representing a sequence of instances of the events α, β, ξ, and ξ, respectively) as an execution of either W1 or W2. Solving this problem amounts to replacing each event in Φex with one of the activities that could have generated it (according to the mapping in Fig. 1), while ensuring that the resulting activity sequence complies with one of the process models. Only three such activity sequences exist for Φex: R I D N, R I A N, and R P D N. Indeed, although the sequence R P A N also complies with the event-activity mapping, it is not a valid interpretation, since it does not conform to either W1 (no occurrence of D appears in the sequence after that of P) or W2 (P is not an activity of W2).
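This pruning can be reproduced with a brute-force sketch. The candidate-activity sets and the model checks below are partial reconstructions from the running example (in particular, compliance with W2 is approximated here as “contains no P”), not the paper's actual formalization:

```python
from itertools import product

# Candidate activities per event, reconstructed from the example (partial).
cand_act = {"alpha": ["R"], "beta": ["I", "P"], "xi": ["A", "D", "N"]}

def complies(seq):
    """Stand-ins for the two loose models of Example 1:
    (i) both models: R starts and N ends the instance, each occurring once;
    (ii) W1: every P must be followed by a D, and no D may be followed by a P;
    (iii) W2: assumed here to simply contain no activity P."""
    if seq[0] != "R" or seq[-1] != "N":
        return False
    if seq.count("R") != 1 or seq.count("N") != 1:
        return False
    if "P" not in seq:        # P-free sequences pass via (the assumed) W2
        return True
    for i, a in enumerate(seq):        # W1's ordering constraints on P and D
        if a == "P" and "D" not in seq[i + 1:]:
            return False
        if a == "D" and "P" in seq[i + 1:]:
            return False
    return True

trace = ["alpha", "beta", "xi", "xi"]
interpretations = [s for s in product(*(cand_act[e] for e in trace)) if complies(s)]
# Exactly the three sequences discussed in the text survive:
assert sorted(interpretations) == [("R", "I", "A", "N"),
                                   ("R", "I", "D", "N"),
                                   ("R", "P", "D", "N")]
```

Of the 18 candidate replacements, the start/end constraint leaves four, and the P/D ordering rule then discards R P A N, matching the discussion above.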

In general, the given process models can help prune “unrealistic” interpretations for a given trace. However, as these models encode only loose descriptions of the processes’ behaviors, many “compliant” interpretations may exist for a trace. Indeed, the flexibility of the process models and the ambiguity affecting the event-activity and activity-process mappings produce a certain level of uncertainty, which makes current (deterministic) abstraction/interpretation solutions unsuitable for reuse, as explained in the following.

The need to bridge the abstraction gap between log events and process activities has been addressed by several recent works [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], since process mining/analytics techniques are of little use on a log that has no clear correspondence to recognizable process activities. However, none of these approaches can actually solve the interpretation problem stated above. In brief, two features common to all these approaches make them unsuitable for our setting: (i) all the given traces are assumed to come from a single business process; and (ii) just one “optimal” interpretation is returned for each trace. In a scenario affected by a high level of uncertainty (e.g., due to the combined presence of flexible processes and of ambiguous event-activity mappings), the latter feature may lead to losing information whenever a trace may be explained via different, similarly plausible alternative interpretations, and the analyst’s expertise does not suffice to identify “the right interpretation” and definitely discard the remaining ones. By contrast, our approach provides the analyst with a rich representation of the different interpretations available for the traces, which can support advanced probability-aware analyses like those described in Section 8. Most of the approaches in the literature also suffer from further limitations: (iii) they cannot take into account a-priori knowledge of the behaviors of the processes (e.g., encoded in the form of business rules or process models, or described in process documentation), and may thus return meaningless interpretations; (iv) they rely on the strong assumption that there are no shared functionalities (thus, all the instances of the same event are mapped to instances of the same activity).

As a matter of fact, the problem stated here has been only considered in [1] and in [25] in simplified settings: the former is an abridged preliminary version of our current proposal that just introduced a naïve exhaustive trace interpretation technique, while the latter proposes a Monte-Carlo trace interpretation technique for the detection of security breaches. Both these interpretation techniques have been considered as terms of comparison in the experimental evaluation of our approach.

A deeper discussion on the works in [1], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25] can be found in Section 10.

To deal with the inherent uncertainty of the problem stated above, we propose a probabilistic interpretation approach that solves the problem at both mapping levels (i.e., events vs. activities, and traces vs. process models). For the sake of presentation, our approach is first described in the simpler scenario where the activities are “elementary” (the execution of each activity results in a single low-level event) and the behavioral constraints associated with each process are “hard”, i.e., assumed to be fulfilled by all the instances of the process. These two assumptions (both made in the toy scenario of Example 1) will be removed later on.

Technically, in Section 2, we introduce the notations used to describe (via “composition rules”) the structure of each process and to model the event-activity and trace-process mappings. Then, we formally state the interpretation problem in probabilistic terms in Section 3. The ad-hoc data structure, named cas-graph, that allows all the valid interpretations of a trace to be concisely represented (and constitutes the core of our interpretation approach) is introduced in Section 4. Next, Section 5 presents an algorithm that computes a cas-graph for a given trace in a two-phase incremental way, leveraging a probabilistic conditioning strategy. Both the structure of the cas-graph and the algorithm for building it are intended for the scenario where the process behaviors never deviate from the process models and the activities are “elementary”. These limitations are lifted in the subsequent two sections, where the approach is extended to the more general cases where there is some chance that process models are violated (Section 6), and/or there may be “complex” activities, each of which can generate a sequence of events instead of a single one (Section 7). After illustrating, in Section 8, how a cas-graph can be used for the analysis of log traces, we discuss an empirical analysis conducted on a real-life case study in Section 9. Related works are presented and contrasted to our approach in Section 10, while several important remarks and directions of future work are finally given in Section 11.

Before leaving this section, let us emphasize two features of our current proposal that make it different from our previous work [1], where a simplified version of the interpretation problem and an exhaustive solution approach were presented: (a) we here devise an articulated solution method (based on the calculation of a cas-graph) that is far more scalable than the one proposed in [1] (as shown in our tests); and (b) we generalize the basic problem setting of [1], so as to accommodate situations where: (i) each process can contain “complex” activities, and (ii) the instances of a process may exhibit violations with respect to (the constraints encoded in) its model, as a further form of noise/uncertainty.

Section snippets

Logs, traces, processes, activities and events

A log is a set of traces. Each trace Φ describes a process instance at the abstraction level of basic events, each generated by the execution of an activity. That is, an instance w of a process W is the execution of a sequence A1⋯Am of activities; in turn, the execution of each activity Ai yields an instance ei of an event Ei; hence, the trace describing w consists of the sequence e1⋯em of event instances. For any event ei occurring in a trace, we assume that the starting time of its execution

Problem statement

The problem addressed in this paper is that of “interpreting” an input trace Φ = e1⋯em from an event log. This means deciding, for each ei of Φ, the activity Ai ∈ cand-act(ei) whose execution generated ei, and, in turn, the process W whose execution caused the sequence of activities A1⋯Am to be performed. When deciding this, the models of the processes encoded by the composition rules in CR must be taken into account.

More formally, the solution of an instance Φ = e1⋯em of this problem is called

Conditioned activity sequence graphs

The conditioned activity sequence graph (cas-graph) over a trace Φ is the core of our framework: it will be used to concisely represent the sequence-interpretations and the process-interpretations of Φ consistent with the given set CR of composition rules.

The nodes of a cas-graph are said to be activity nodes. Each activity node n is defined over an event ei of Φ, and represents an interpretation of ei as an instance of some activity A within the execution of some process W. In a cas-graph, an

The interpretation algorithm

We here introduce our algorithm for building a cas-graph G representing all and only the sequence- and process-interpretations for an input trace that are consistent with a set of composition rules. In G, each valid sequence-interpretation σ (resp. process-interpretation W) is encoded by a source-to-target path π (resp. a source node), and vice versa; moreover, the conditioned probabilities of the valid sequence- and process-interpretations are encoded by assigning suitable probability values
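The paper describes the construction only at a high level in this snippet. As a rough intuition (a simplifying sketch, not the actual cas-graph algorithm), one can picture a layered DAG with one node per candidate activity of each event and an edge between adjacent layers whenever a pairwise ordering constraint admits the transition; a dynamic-programming sweep then counts the encoded sequence-interpretations without enumerating them:

```python
# Minimal layered-DAG sketch (NOT the paper's exact cas-graph): layer i holds
# one node per candidate activity of event e_i; edges link nodes of adjacent
# layers when a (hypothetical) pairwise constraint admits the succession.
def count_paths(layers, allowed):
    """layers: list of lists of candidate activities, one list per event.
    allowed(a, b): True iff activity b may directly follow activity a.
    Returns the number of source-to-target paths, i.e., of encoded
    sequence-interpretations, via a single left-to-right sweep."""
    counts = {a: 1 for a in layers[0]}              # one path reaches each source
    for layer in layers[1:]:
        counts = {b: sum(c for a, c in counts.items() if allowed(a, b))
                  for b in layer}
    return sum(counts.values())

# Toy instance with "D may not be followed by P" as the only pairwise rule.
layers = [["R"], ["I", "P"], ["A", "D"], ["N"]]
no_d_then_p = lambda a, b: not (a == "D" and b == "P")
assert count_paths(layers, no_d_then_p) == 4
```

The real construction additionally handles non-pairwise (global) constraints and attaches conditioned probabilities to nodes and edges, which is what the two-phase algorithm of Section 5 is for.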

Dealing with process instances deviating from the models

The approach presented so far deals with the composition rules in CR in a strict way: all the interpretations inconsistent with CR are discarded. This leads to producing an empty cas-graph in the extreme case that there is no interpretation of the input trace that is consistent with CR. As a matter of fact, in our original setting CR was meant to encode a partial set of behavioral rules that are known to be satisfied in every execution of the given business processes, e.g. by virtue of strong

Dealing with complex activities

In this section, we show how our interpretation framework can be extended to deal with the case that the activities are complex, i.e., they generate sequences of events, rather than single events. For the sake of simplicity, we present this extension in the case that no deviations are allowed from the process models. However, the combination of the extensions described in this section and in the previous one is straightforward and has been experimentally validated (see the results over the data
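The combinatorial core of the complex-activity case is that a trace must first be split into contiguous segments, each attributable to a single activity. The sketch below (an illustrative reconstruction, not the paper's method) enumerates such segmentations; in the full framework each segment would then be matched against the event sequences a single complex activity can generate:

```python
def segmentations(trace, max_len=3):
    """Enumerate all ways to split a trace into contiguous segments of length
    at most max_len (a hypothetical bound on how many events one complex
    activity may emit)."""
    if not trace:
        return [[]]
    result = []
    for k in range(1, min(max_len, len(trace)) + 1):
        head, rest = trace[:k], trace[k:]
        result.extend([head] + tail for tail in segmentations(rest, max_len))
    return result

# A 4-event trace with segments of length <= 3 admits 7 segmentations
# (the compositions of 4 into parts of size at most 3).
assert len(segmentations(["a", "b", "x", "x"])) == 7
```

Since the number of segmentations grows exponentially with trace length, a compact representation such as the cas-graph, rather than explicit enumeration, is what keeps the extended problem tractable.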

Usage of the cas-graph

This section is meant to discuss a natural usage of our approach for interpreting and analyzing the events in a given log.

As illustrated in detail so far, a cas-graph compactly represents all the possible interpretations of a log trace that comply with the composition rules associated with the given processes (and possibly with the activities, when these are complex). Diverse kinds of information can be extracted from a cas-graph, which represent different facets of the interpretations encoded
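One natural query over such a structure is the single most probable interpretation. The Viterbi-style sweep below is a generic sketch over a layered interpretation graph with assumed per-node scores and 0/1 transition weights, not the paper's actual probability model:

```python
def most_probable_path(layers, prob, step):
    """layers[i]: candidate activities of event e_i; prob[(i, a)]: assumed
    score of interpreting e_i as activity a; step(a, b): transition weight
    (1.0 if b may follow a, 0.0 otherwise). Returns (score, activity path)."""
    best = {a: (prob[(0, a)], [a]) for a in layers[0]}
    for i, layer in enumerate(layers[1:], start=1):
        best = {b: max((p * step(a, b) * prob[(i, b)], path + [b])
                       for a, (p, path) in best.items())
                for b in layer}                 # keep the best way to reach b
    return max(best.values())

# Toy scores for the four-event trace of the running example.
layers = [["R"], ["I", "P"], ["D", "A"], ["N"]]
prob = {(0, "R"): 1.0, (1, "I"): 0.7, (1, "P"): 0.3,
        (2, "D"): 0.6, (2, "A"): 0.4, (3, "N"): 1.0}
p, path = most_probable_path(layers, prob, lambda a, b: 1.0)
assert path == ["R", "I", "D", "N"]
```

Because all interpretations live in one graph, other analyses (e.g., the marginal probability that a given event maps to a given activity, or that the trace belongs to a given process) can be answered by similar sweeps without materializing the interpretation set.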

Experiments

This section consists of two parts. In the first part, we will validate our framework over noise-free data, i.e., traces encoding process executions exhibiting no deviations from the expected behaviors encoded in the composition rules. In the second part (Section 9.4), we will report the experimental analysis validating the efficacy and efficiency of the variant of our framework introduced in Section 6, that deals with the presence of noise in the traces.

Related work

The only work in the literature that directly faces our interpretation problem is a recent short paper [1] of ours, where the possibility of using an exhaustive approach for computing the interpretations was explored. This exhaustive approach has been used as a term of comparison in this paper (see Section 9), and shown to be largely outperformed by our cas-graph construction algorithm and to be infeasible for traces longer than 20 steps. Moreover, the study in [1] was targeted at a simplified

Conclusion, discussion and future work

We have presented a novel probabilistic approach to the problem of interpreting a low-level log trace as an instance of one or more given process models, encoding a partial collection of behavioral constraints. The approach relies on computing, for each model, different probability-aware event-activity mappings, which all comply (as much as possible) with the behavioral constraints encoded in the model. As a result, a data structure is returned that keeps information on different alternative


References (41)

  • W.M.P. van der Aalst et al.

    Process mining and verification of properties: An approach based on temporal logic

    Proceedings of the Confederated International Conference “On the Move to Meaningful Internet Systems”: CoopIS, DOA, and ODBASE

    (2005)
  • B. Fazzinga et al.

    A compression-based framework for the efficient analysis of business process logs

    Proceedings of the Twenty-Seventh International Conference on Scientific and Statistical Database Management (SSDBM)

    (2015)
  • B. Fazzinga et al.

    A framework supporting the analysis of process logs stored in either relational or NoSQL DBMSs

    Proceedings of the 2015 International Symposium on Methodologies for Intelligent Systems (ISMIS)

    (2015)
  • B. Fazzinga et al.

    How, who and when: enhancing business process warehouses by graph based queries

    Proceedings of the Twentieth International Database Engineering & Applications Symposium (IDEAS ’16)

    (2016)
  • W.M.P. van der Aalst et al.

    Declarative workflows: balancing between flexibility and support

    Comput. Sci. R&D

    (2009)
  • M. Westergaard et al.

    Looking into the future

    Proceedings of the Confederated International Conference “On the Move to Meaningful Internet Systems”: CoopIS, DOA-SVI, and ODBASE

    (2012)
  • C. Günther et al.

    Activity mining by global trace segmentation

    Proceedings of the 2009 Workshops of International Conference on Business Process Management (BPM)

    (2009)
  • R.P.J.C. Bose et al.

    Abstractions in process mining: A taxonomy of patterns

    Proceedings of the 2009 International Conference on Business Process Management (BPM)

    (2009)
  • F. Folino et al.

    Mining multi-variant process models from low-level logs

    Proceedings of the Eighteenth International Conference on Business Information Systems (BIS)

    (2015)
  • F. Folino et al.

    Mining predictive process models out of low-level multidimensional logs

    Proceedings of the 2014 International Conference on Advanced Information Systems Engineering (CAiSE)

    (2014)

    Bettina Fazzinga is a researcher at the Institute for High Performance Computing and Networks (ICAR-CNR) of the National Research Council of Italy. She is also adjunct professor at University of Calabria. She received a Laurea degree in Computer Engineering and a Ph.D. degree in Computer Science from University of Calabria. She was visiting researcher at Computing Laboratory of Oxford University several times since 2008. Her main research interests are information retrieval, approximate XML querying, semantic web search and argumentation frameworks.

    Sergio Flesca is an associate professor at University of Calabria. He received a Ph.D. degree in Computer Science from University of Calabria. His research interests include databases, web and semi-structured data management, information extraction, inconsistent data management, and approximate query answering.

    Filippo Furfaro received the Ph.D. degree in computer science from the University of Calabria. He is an associate professor at the University of Calabria. His research interests include databases, web and semi-structured data management, multidimensional data compression, inconsistent data management, computation in grid, and p2p environments.

    Elio Masciari is currently a senior researcher at the Institute for High Performance Computing and Networks (ICAR-CNR) of the National Research Council of Italy. His research interests include Database Management, Semistructured Data and Big Data. He has been the advisor of several master’s theses at the University of Calabria and at University Magna Graecia in Catanzaro. He was the advisor of a Ph.D. thesis in computer engineering at the University of Calabria. He has served as a member of the program committee of several international conferences, and as a reviewer for several international scientific journals. He is the author of more than 80 publications in journals and in both national and international conferences. He also holds the “Abilitazione Scientifica Nazionale” for the Associate Professor role.

    Luigi Pontieri is a senior researcher at the High Performance Computing and Networks Institute (ICAR-CNR) of the National Research Council of Italy, and contract professor at the University of Calabria, Italy. He received the Laurea Degree in Computer Engineering, in July 1996, and the Ph.D. in System Engineering and Computer Science, in April 2001, from the University of Calabria. His current research interests include Knowledge Discovery and Data Mining, Process Mining and Business Process Intelligence.

    A preliminary version of this paper appeared in [1].
