1 Introduction

Information systems monitor and support business processes in real-time, while recording process transactions into event logs [1]. Process mining strives for analysis of business processes based on such event logs [2]. Specifically, various algorithms for the discovery of process models from event logs have been developed in recent years. Each of these algorithms strikes different trade-offs in process discovery, e.g., related to the scalability of the approach or the accuracy and complexity of the resulting models [3, 4]. Specifically, due to differences in how noise and incompleteness in the event log is handled and which representational bias [5] is adopted, algorithms differ in terms of the behavioural structures that they recognise in an event log and, thus, represent in a process model.

In this work, we argue that the results of process discovery can be improved by combining several of the existing algorithms, and we suggest adopting the idea of ensemble methods from domains such as statistics and machine learning. That is, by combining the strengths of different discovery algorithms, the resulting model shall be of higher quality than a model discovered by any of the base algorithms in isolation. Such an effect stems from one algorithm outperforming the others on some parts of the event log, while a different algorithm turns out to be superior on other parts.

To realise this idea, we present an approach for process discovery, referred to as FuseDisc, that is based on the notion of model fusion. The FuseDisc framework takes as input an event log and a set of discovery algorithms, and produces a process model by fusing results from several discovery algorithms. The fused model should then yield an improvement over the models that would have been discovered by the algorithms in isolation. This improvement is assessed using common quality measures for process discovery, which evaluate different aspects of the resulting models [6], such as their ability to replay the event log or the extent of generalisation beyond the behaviour recorded in the log. We capture instantiations of the FuseDisc framework that come with guarantees on the improvement with respect to a quality measure using the notion of a proper fusion.

Following the above idea, we develop two novel algorithms to discover process trees using the FuseDisc framework: First, the Exhaustive Noise-aware Inductive Miner, exNoise for short, employs a set of discovery algorithms that are given as different variants of the Inductive Miner [7, 8]. These variants differ in their handling of noise in the event log: By applying different thresholds in noise filtering, they are more or less aggressive in discarding infrequent behaviour. Second, given the high run-time complexity of exNoise, we also introduce the Adaptive Noise-aware Inductive Miner, adaNoise, a computationally tractable, heuristics-driven version of exNoise. That is, adaNoise uses a pre-defined discovery quality measure as a black-box heuristic.

Both algorithms are shown to generalise the Inductive Miner Infrequent [8], a variant of the Inductive Miner that is robust against noise. It relies on a single, global threshold for noise filtering that has to be configured upfront. Our algorithms, in turn, select the appropriate intensity of noise handling locally, i.e., per part of the event log.

Below, we summarise our contributions, while providing the structure of this paper after the definition of preliminaries in the next section:

  • We propose FuseDisc, a framework for process discovery based on model fusion and define the notion of a proper fusion (Sect. 3).

  • Using this framework, we propose two discovery algorithms, exNoise and adaNoise (Sect. 4). We prove both algorithms to be proper fusions and elaborate on their computational complexity.

Section 5 evaluates our algorithms empirically, using both synthetic logs and a real-world healthcare log. Our results indicate that the proposed algorithms improve over each of the base discovery methods, in terms of combined quality measures. Finally, Sects. 6 and 7 review related work and present concluding remarks, respectively.

2 Preliminaries

This section introduces preliminaries for process discovery. We formalise the notion of an event log (Sect. 2.1), before turning to process trees, their automated discovery from event logs, and common quality measures for process discovery (Sect. 2.2).

2.1 Event Logs

We adopt a common model of event logs that is grounded in sequences of activity labels that denote the executions of activities as part of a single process instance, also known as case. Let \(\mathcal {A}\) be a universe of activity labels (activities for short). A trace \(\sigma = \langle a_1, \ldots , a_n \rangle \in \mathcal {A}^*\) is a finite sequence of activities. The universe of traces is denoted by \(\mathcal {T} \). An event log \(L: \mathcal {T} \rightarrow \mathbb {N}\) is a multi-set of traces. We write \(\mathcal {L} \) for the universe of event logs. Furthermore, let \(|\sigma |\) denote the length of a trace, and |L| be the number of unique traces in the event log. The set representation \(\bar{L} \in \mathcal {L} \) of a log \(L \) is the set of traces that occur at least once in \(L \), which is defined as \(\bar{L} =\{ \sigma \in \mathcal {T} \mid L(\sigma )>0 \}\).

For example, \(L = \{ \langle a,b,c \rangle ^3, \langle b,c \rangle ^1 \}\) is an event log that comprises three traces \(\langle a,b,c\rangle \) of length three and one trace \(\langle b,c \rangle \) of length two. Its set representation contains two traces and is given as \(\bar{L} = \{\langle a,b,c\rangle , \langle b,c \rangle \}\).

Furthermore, we define a trace operator set \(\bigoplus \) with \(\oplus \in \bigoplus \) being a function that maps n traces into a new trace, i.e., \(\oplus : \mathcal {T} ^n \rightarrow \mathcal {T} \). For example, the concatenation operator \(\oplus _{\rightarrow } \in \bigoplus \) concatenates two given traces. Let \(\sigma _1 = \langle a,b,c \rangle \) and \(\sigma _2 = \langle d,e \rangle \) be two traces. Then, it holds that \(\sigma _1 \oplus _{\rightarrow } \sigma _2 = \langle a,b,c,d,e\rangle \).
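To make these definitions concrete, the following sketch (hypothetical Python representations, not part of the formalism) encodes an event log as a multi-set of tuples and implements the set representation and the concatenation operator:

```python
from collections import Counter

# An event log as a multi-set of traces: each trace is a tuple of
# activity labels, mapped to its frequency in the log.
L = Counter({("a", "b", "c"): 3, ("b", "c"): 1})

def set_representation(log):
    """Set representation of a log: the traces occurring at least once."""
    return {sigma for sigma, count in log.items() if count > 0}

def concat(sigma1, sigma2):
    """The concatenation operator from the trace operator set."""
    return sigma1 + sigma2

print(set_representation(L))                # the two unique traces
print(concat(("a", "b", "c"), ("d", "e")))  # ('a', 'b', 'c', 'd', 'e')
```

Note that |L| in the sense above is the number of unique traces, i.e., the size of the set representation, while the multi-set also records frequencies.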

A partition of an event log L into sublogs with respect to a set of operators \(\bigoplus \) is denoted by \(\pi _{\bigoplus }(L)\). Formally, it can be written as follows:

$$\begin{aligned} \pi _{\bigoplus }(L) = \left\{ L_1,\ldots , L_M \in \mathcal {L} \mid \bar{L} = \bigoplus _{i=1}^M \bar{L} _i \right\} \subseteq 2^\mathcal {L}. \end{aligned}$$

That is, we may reconstruct the set representation \(\bar{L} \) of the log by applying a sequence of (possibly different) operators from \(\bigoplus \) on the set representations of \(L_1,\ldots , L_M\).

2.2 Process Trees and Their Discovery

In this work, we focus on the combination of discovery algorithms that adopt the same representational bias, but differ in their handling of noise in event logs. Specifically, we consider algorithms that discover process trees, such as the Inductive Miner [7] and the Evolutionary Tree Miner [9].

Process Trees. A process tree represents a process as a rooted tree, in which the leaf nodes are activities and all non-leaf nodes are control-flow operators, see [7]. Common control-flow operators include sequences of activities (\(\rightarrow \)), exclusive choice (\(\times \)), concurrency (\(\wedge \)), and structured loops (\(\circlearrowleft \)). Process trees are defined recursively, as follows. Let \(\varPhi = \{\rightarrow , \times , \wedge , \circlearrowleft \}\) be a set of operators and \(\epsilon \notin \mathcal {A}\) be the silent activity. Then, \(a\in \mathcal {A} \cup \{\epsilon \}\) is a process tree; and \(\phi (T_1,\ldots , T_n)\), \(n>0\), with \(T_1,\ldots , T_n\) being process trees and \(\phi \in \varPhi \) being an operator, is a process tree (\(n>1\) if \(\phi = \circlearrowleft \)). The universe of process trees is denoted by \(\mathcal {M}_T \). The semantics of a process tree T is defined by a set of traces, which is also constructed recursively: A function \(\nu : \mathcal {M}_T \rightarrow 2^\mathcal {T} \) assigns a set of traces to a process tree. Trivially, \(\nu (a) = \{ \langle a \rangle \}\) for \(a\in \mathcal {A}\) and \(\nu (\epsilon ) = \{ \langle \rangle \}\). The interpretation of an operator \(\phi \in \varPhi \) is grounded in a language join function \(\phi _l:2^\mathcal {T} \times \ldots \times 2^\mathcal {T} \rightarrow 2^\mathcal {T} \). Then, the semantics of a process tree \(\phi (T_1,\ldots , T_n)\) is defined as \(\nu (\phi (T_1,\ldots , T_n))=\phi _l(\nu (T_1),\ldots , \nu (T_n))\). For instance, the traces induced by the exclusive choice operator \(\times _l(L_1, \ldots , L_n)\) are given by the union of the traces of its children, \(\bigcup _{1\le i\le n} L_i\). See [7] for formalisations of all operators in \(\varPhi \).

A process tree for the aforementioned example log \(L \) could be one that describes a sequence of a choice between a and a silent activity \(\epsilon \), followed by activity b, and then c:

\(\rightarrow (\times (a, \epsilon ), b, c)\)
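The recursive semantics \(\nu \) can be sketched for the sequence and exclusive-choice operators as follows; the tuple-based tree encoding and the operator names `seq`/`xor` are illustrative choices, not the paper's notation:

```python
from itertools import product

EPS = None  # the silent activity

def nu(tree):
    """Trace-set semantics of a process tree, defined recursively.

    A tree is an activity label (leaf), EPS (silent), or a tuple
    (operator, children) with operator in {'seq', 'xor'}.
    """
    if tree is EPS:
        return {()}                      # silent activity: the empty trace
    if isinstance(tree, str):
        return {(tree,)}                 # activity leaf: a singleton trace
    op, children = tree
    langs = [nu(child) for child in children]
    if op == "xor":                      # exclusive choice: union of languages
        return set().union(*langs)
    if op == "seq":                      # sequence: concatenation of languages
        return {sum(combo, ()) for combo in product(*langs)}
    raise ValueError(op)

# The example tree: a sequence of a choice between a and silent, then b, then c.
T = ("seq", [("xor", ["a", EPS]), "b", "c"])
print(nu(T))  # exactly the traces of the example log: ('a','b','c') and ('b','c')
```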

Process Tree Discovery. With \(\mathcal {L} \) as the universe of event logs and \(\mathcal {M}\) as the universe of process models, e.g., \(\mathcal {M}_T \) for all process trees, we capture the essence of process discovery as follows:

Definition 1

(Discovery Algorithm). A discovery algorithm is a function \(\gamma : \mathcal {L} \rightarrow \mathcal {M}\), i.e., \(\gamma \) produces a process model from an event log.

Fig. 1. Schematic view of the IM.

A prominent approach for discovering process trees from event logs is the Inductive Miner (IM) [7, 8]. To balance between over- and under-fitting, it is parametrised by a noise filtering technique that uses a predefined threshold \(\tau \in [0,1]\) [8]. The workings of the Inductive Miner are summarised as follows. The algorithm recursively applies a select function \(\eta : \mathcal {L} \times [0,1] \rightarrow \varPhi \times 2^\mathcal {L} \), which, given an event log and a noise threshold, selects a trace operator and partitions the log based on the selected operator.

An overview of the algorithmic steps is given in the bipartite tree of Fig. 1. The algorithm starts by applying the select function \(\eta \) to the given event log L, using a fixed noise threshold. A log partitioning operator \(\oplus \) is returned (dashed circle), along with its corresponding sub-logs, \(L_1^1,\ldots , L_{v_1}^1\) (solid circles). The select function is then applied recursively to every resulting event log \(L_i^1\), until the base case of a log containing only single-activity traces is reached. The algorithm is guaranteed to terminate [7].

We observe that for the Inductive Miner, trace operators \(\bigoplus \) are strongly coupled with process tree operators \(\varPhi \). In fact, partitioning based on a log operator \(\oplus \) yields a corresponding tree operator, denoted \(\phi _\oplus \in \varPhi \). Hence, the construction of a process tree from the result of the recursive application of \(\eta \) is uniquely defined: The respective operators are appended according to the bipartite tree shown in Fig. 1.
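The recursive scheme can be sketched as follows; `toy_select` is a hypothetical stand-in for the actual select function \(\eta \), whose real cut-detection logic [7] is beyond this sketch:

```python
def inductive_miner(log, tau, select):
    """Schematic recursion of the Inductive Miner (simplified sketch).

    `select` plays the role of the select function: given a log and a noise
    threshold, it returns an operator and a partition of the log into
    sub-logs. A base-case log becomes a leaf of the process tree.
    """
    base = base_case(log)
    if base is not None:
        return base
    operator, sublogs = select(log, tau)
    # the log operator maps to its corresponding tree operator, so the
    # tree is built by recursing into the sub-logs
    return (operator, [inductive_miner(sl, tau, select) for sl in sublogs])

def base_case(log):
    """A log whose only trace has at most one activity becomes a leaf."""
    traces = set(log)
    if len(traces) == 1:
        sigma = next(iter(traces))
        if len(sigma) <= 1:
            return sigma[0] if sigma else "tau"  # activity leaf or silent
    return None

def toy_select(log, tau):
    # hypothetical select: split every trace after its first activity
    return "seq", [[s[:1] for s in log], [s[1:] for s in log]]

print(inductive_miner([("a", "b")], 0.2, toy_select))  # ('seq', ['a', 'b'])
```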

An alternative approach for mining process trees is rooted in the concept of searching for an improved model with respect to some quality criteria. A well-known representative of this approach is the Evolutionary Tree Miner [9], which applies genetic mining to elicit the best model. However, this approach has a major disadvantage: Even though any returned solution is guaranteed to be of high quality, there is no guarantee that such a solution will be found in a finite period of time, a common drawback of genetic algorithms. As shown in experiments in [8], this limitation is of practical relevance: for several real-world event logs, the miner did not discover any process tree within a reasonable amount of time.

Quality Measures for Process Discovery. Once a process model has been discovered from an event log, its quality shall be assessed. To this end, it has been argued that there is a common set of evaluation dimensions that shall be considered with according measures [6]. With \(\mathcal {L} \) and \(\mathcal {M}\) as the universe of event logs and process models, respectively, we capture these quality measures as follows:

Definition 2

(Discovery Quality Measure). A discovery quality measure is a function \(\psi : \mathcal {L} \times \mathcal {M} \rightarrow \mathbb {R}^{0+}\).

Applied to a log L and a model T (a process tree in our case), a measure \(\psi \) potentially quantifies several dimensions of the relation between the log and the model. Note that the definition does not require that the process tree T has been discovered from L; instead, T could be a normative model that represents the process.

In the remainder, we focus on the following three dimensions: Fitness (can the model represent all of the behaviour that is observed in the log?), precision (does a model allow only for the behaviour observed in the log?), and generalisation (does a model generalise beyond the behaviour that was observed in the log?) [6]. Technically, we consider the three measures jointly, based on a weighted scoring function:

$$\begin{aligned} \psi _{Score} (L,T) = \omega _{Fit} \psi _{Fit}(L,T) + \omega _{Prec}\psi _{Prec}(L,T) + \omega _{Gen}\psi _{Gen}(L,T), \end{aligned}$$

with \(\sum _i \omega _i = 1, \omega _i \in [0,1]\). In our experiments, we shall test various values of \(\omega _i\). Note that we omitted simplicity due to the selected representational bias. That is, the algorithms used in the remainder construct process trees with uniquely labelled leaf nodes, and hence we assume that they achieve maximal simplicity.
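The scoring function is a plain convex combination of the three dimension scores; a direct transcription (the helper name `psi_score` is ours, with uniform weights as a default):

```python
def psi_score(fitness, precision, generalisation, weights=(1/3, 1/3, 1/3)):
    """Weighted scoring function over the three quality dimensions.

    The weights must lie in [0, 1] and sum to 1, as in the definition.
    """
    w_fit, w_prec, w_gen = weights
    assert abs(w_fit + w_prec + w_gen - 1.0) < 1e-9
    return w_fit * fitness + w_prec * precision + w_gen * generalisation

print(psi_score(0.9, 0.6, 0.75))                           # uniform weights
print(psi_score(0.9, 0.6, 0.75, weights=(0.2, 0.6, 0.2)))  # emphasise precision
```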

3 Process Discovery Based on Model Fusion

Different discovery algorithms, understood also in the sense of varying configurations of a single algorithm, have particular strengths and limitations. Hence, discovery shall not be restricted to the selection of the most suitable algorithm for an event log, but rely on a combination of various algorithms, selecting a suitable one for each specific part of a log. For obvious reasons, such a combination should not provide worse results compared to applying any of the base algorithms to the complete log. In such a case, the overhead of considering multiple algorithms in the first place would not be justified.

To realize the above idea, we define FuseDisc, a framework for process discovery that is based on model fusion. An instance of this framework is given by a set of process discovery algorithms and a quality measure. The latter is used to determine the suitability of the algorithms for particular parts of the log, thereby guiding how the algorithms are combined. An instance of the framework is proper, if indeed the combined application of several algorithms leads to results that are at least as good as those obtained with any individual algorithm, in terms of the given quality measure.

Definition 3

(Fusion-based Process Discovery (FuseDisc); Proper Fusion). Let \(\varGamma = \{\gamma _1,\ldots , \gamma _n\}\) be a finite set of process discovery algorithms. Given an event log L, and a discovery quality measure \(\psi \), a fusion-based discovery (FuseDisc) algorithm \(\gamma ^{(\varGamma , \psi )}\) produces a process model using the discovery algorithms in \(\varGamma \), potentially guided by \(\psi \). Such an algorithm is called proper, if and only if it holds that

$$\begin{aligned} \psi \left( L,\gamma ^{(\varGamma , \psi )}(L)\right) \ \ge \ \psi \left( L,\gamma (L)\right) , \ \forall \ \gamma \in \varGamma . \end{aligned}$$
(1)

Clearly, a trivial way to achieve properness of a FuseDisc algorithm would be to define it as \(\gamma ^{(\varGamma , \psi )} = {\text {argmax}}_{\gamma \in \varGamma } \psi (L,\gamma (L))\). However, this trivial algorithm is uninteresting, since it can never satisfy Eq. 1 with strict inequality. In practice, the expectation is that a FuseDisc algorithm yields a strictly better model. Yet, for the sake of flexibility, we do not enforce this requirement in Definition 3. We later demonstrate the actual benefit of a FuseDisc algorithm empirically.
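The trivial proper algorithm amounts to a single argmax over the base algorithms; a minimal sketch with toy stand-ins for the algorithm set and the quality measure (all names hypothetical):

```python
def trivial_fusion(log, algorithms, psi):
    """The trivial proper FuseDisc algorithm: run every base discovery
    algorithm on the full log and keep the model with the best psi value.
    It satisfies the properness condition, but never strictly."""
    return max((gamma(log) for gamma in algorithms),
               key=lambda model: psi(log, model))

# toy setup: two 'algorithms' returning tagged models, scored by a lookup
algorithms = [lambda log: "m1", lambda log: "m2"]
quality = {"m1": 0.4, "m2": 0.8}
psi = lambda log, model: quality[model]

print(trivial_fusion([("a",)], algorithms, psi))  # 'm2'
```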

4 Inductive Mining with Adaptive Noise Filtering

This section introduces two discovery algorithms based on model fusion. To this end, Sect. 4.1 outlines how to instantiate the FuseDisc framework using a divide-and-conquer scheme. Following this idea, Sect. 4.2 presents exNoise, a specific discovery algorithm that relies on the Inductive Miner and is proven to yield a proper fusion. Due to its exponential runtime complexity, Sect. 4.3 then introduces adaNoise, a greedy and thus more efficient version of exNoise.

4.1 Fusion-Based Discovery Based on a Divide-and-Conquer Scheme

The idea of the FuseDisc framework as defined in Sect. 3 is to combine a set of discovery algorithms \(\varGamma \) in the construction of a process model for a given event log. While this combination shall be guided by some discovery quality measure \(\psi \), the framework does not enforce any assumptions on how to combine the algorithms from \(\varGamma \).

In this section, we argue that one way of organising the combination of different discovery algorithms is by means of a divide-and-conquer scheme. That is, the given event log is decomposed into sub-logs, and the algorithms from \(\varGamma \) are applied to each of them. This results in a set of sub-models per sub-log. Guided by the measure \(\psi \), such sub-models are composed again to obtain a single model, which represents the result of fusion-based discovery.

Specifically, a log L is split into a set of sub-logs \(L_1,\ldots ,L_m\). For each sub-log \(L_i\), \(1\le i\le m\), either the algorithms from \(\varGamma = \{\gamma _1,\ldots , \gamma _n\}\) are applied to create sub-models \(M^1_i, \ldots , M^n_i\), or the split is applied again, splitting \(L_i\) into \(L_{i,1},\ldots ,L_{i,m'}\). In the composition step, in turn, one sub-model \(M^j_i\), derived by applying discovery algorithm \(\gamma _j\) to sub-log \(L_i\), is selected for each sub-log \(L_i\), and the selected sub-models are composed into a single model.

The above idea requires \(\varGamma \) to contain discovery algorithms, so that the resulting models can be composed correctly in a hierarchical manner. In the remainder, we populate \(\varGamma \) with configurations of the Inductive Miner, which is motivated as follows:

  • The Inductive Miner internally splits the event log hierarchically into sub-logs and associates a tree operator to each split. This tree operator provides an immediate means to compose the models obtained for the respective sub-logs.

  • The Inductive Miner further guarantees the absence of behavioural anomalies, such as deadlocks. Consequently, behavioural anomalies cannot be introduced as part of the composition of models obtained for sub-logs.

The above points illustrate that the definition of fusion-based discovery by means of a divide-and-conquer scheme is closely related to notions of model compositionality: The representational bias adopted by the discovery algorithms in \(\varGamma \) must enable correct composition of a model from sub-models. As an example, one may also consider populating \(\varGamma \) with different discovery algorithms that construct a Petri-net [10]. Then, composition may be approached based on existing notions of Petri-net composition and refinement [11]. However, most existing discovery algorithms constructing Petri-nets, e.g., the \(\alpha \)-algorithms [12] or the ILP-miner [13] lack guarantees on both the structure of the resulting net and the absence of behavioural anomalies. This makes model composition challenging in the general case.

4.2 The exNoise Algorithm

The Exhaustive Noise-aware Inductive Miner, exNoise, is a FuseDisc algorithm that is parametrised by a process quality measure \(\psi \) and a set \(\varGamma = \{ \gamma _1,\ldots , \gamma _n\}\) of n variants of the Inductive Miner Infrequent, see Sect. 2.2, where each algorithm \(\gamma _i\) applies a different threshold \(\tau _i\) for noise filtering.

The idea of the algorithm is depicted in Fig. 2a. Adopting the above divide-and-conquer scheme, the log is split into sub-logs hierarchically. For each sub-log (solid circle), exNoise considers the n trace operators that originate from the application of the n algorithms in \(\varGamma \) to the sub-log. For example, starting at L, exNoise runs all select functions \(\eta _i\) of the respective algorithms \(\gamma _i\) on the event log. Each of these functions determines a particular trace operator \(\oplus _i\). Next, the algorithm considers all log partitions \(\pi _{\oplus _1}, \ldots , \pi _{\oplus _n}\) that stem from these trace operators in a recursive manner, until a base case is reached. A path of operators from the root of the graph to its leaves corresponds to a process tree, as every trace operator \(\oplus \) is associated with a tree operator \(\phi _{\oplus }\).

By means of the above procedure, exNoise constructs a search space of process trees. From this set, denoted by \(\mathcal {T}_{ex}\), exNoise then selects a tree \(T^*_{ex}\) as follows:

$$\begin{aligned} T^*_{ex} = \mathop {\text {argmax}}\limits _{T\in \mathcal {T}_{ex}} \ \psi (L,T). \end{aligned}$$
(2)

Note that the Inductive Miner Infrequent, as discussed in Sect. 2, is a special case of exNoise. It corresponds to an instantiation of exNoise with \(\varGamma \) containing solely a single variant of the discovery algorithm with a single, pre-defined noise filtering threshold.
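The exhaustive search can be sketched as follows, assuming toy select functions and a toy quality measure (the actual algorithm uses the Inductive Miner's select functions and a real measure \(\psi \); all names below are hypothetical):

```python
from itertools import product

def ex_noise(log, selects, psi, base_case):
    """Exhaustive fusion sketch: enumerate every recursive choice of
    select function, collect all resulting trees, return the best one."""
    trees = list(enumerate_trees(log, selects, base_case))
    return max(trees, key=lambda T: psi(log, T))

def enumerate_trees(log, selects, base_case):
    base = base_case(log)
    if base is not None:
        yield base
        return
    for select in selects:                     # one select function per variant
        operator, sublogs = select(log)
        subtree_choices = [list(enumerate_trees(sl, selects, base_case))
                           for sl in sublogs]
        # every combination of sub-trees for the sub-logs yields one tree
        for children in product(*subtree_choices):
            yield (operator, list(children))

def base_case(log):
    traces = set(log)
    if len(traces) == 1 and len(next(iter(traces))) == 1:
        return next(iter(traces))[0]
    return None

# two hypothetical select functions: split after the first activity,
# or before the last activity, of every trace
head_split = lambda log: ("seq", [[s[:1] for s in log], [s[1:] for s in log]])
tail_split = lambda log: ("seq", [[s[:-1] for s in log], [s[-1:] for s in log]])

# toy quality measure: prefer trees whose first child is a single activity
psi = lambda log, T: 1 if isinstance(T, tuple) and isinstance(T[1][0], str) else 0

best = ex_noise([("a", "b", "c")], [head_split, tail_split], psi, base_case)
print(best)  # ('seq', ['a', ('seq', ['b', 'c'])])
```

The enumeration mirrors the bipartite tree of Fig. 2a: every path of operator choices from the root to the leaves corresponds to one candidate tree in \(\mathcal {T}_{ex}\).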

To show properness of exNoise, we note that the result of applying any of the considered algorithms in isolation is contained in the search space of exNoise.

Proposition 1

Let \(\mathcal {T}_{ex}\) be the set of process trees explored by exNoise. Then, it holds that \(\gamma (L) \in \mathcal {T}_{ex}\), \(\forall \ \gamma \in \varGamma \).

Proof

We prove the proposition by construction. Let \(\gamma \in \varGamma \) be the Inductive Miner Infrequent with noise threshold \(\tau \). At every level of the recursion, exNoise includes the option of applying the select function \(\eta \) of \(\gamma \). Choosing this option at every level is a feasible solution of exNoise and reconstructs \(\gamma (L)\) exactly. Hence, it holds that \(\gamma (L) \in \mathcal {T}_{ex}\).     \(\square \)

Fig. 2. Two fusion-based discovery algorithms.

Corollary 1

The exNoise algorithm is proper.

Proof

Assume, by contradiction, that there exists an algorithm \(\gamma \in \varGamma \) such that \(\psi (L,T^*_{ex}) < \psi (L,\gamma (L))\), i.e., that the properness condition is violated. By combining Proposition 1, \(\gamma (L) \in \mathcal {T}_{ex}\), and Eq. 2, \(T^*_{ex} = \text {argmax}_{T\in \mathcal {T}_{ex}} \psi (L,T)\), however, we get that \(\psi (L,T^*_{ex}) \ge \psi (L,\gamma (L))\), which contradicts the assumption.     \(\square \)

We now turn to the computational complexity of exNoise. Denote by \(v = \max \left( |L|,|\sigma _{max}|\right) \), with \(\sigma _{max} = \text {argmax}_{\sigma \in \bar{L}}|\sigma |\), the maximum between the event log size, i.e., the number of traces, and the length of L's longest trace.

For every event log, exNoise considers \(n = |\varGamma |\) select functions (dashed nodes of the bipartite tree in Fig. 2a), and creates a log partitioning for each of them. In the worst-case, this partitioning step needs v calculations: The worst-case may result either from a horizontal partition of the event log into |L| separate event logs (every event log is a single trace of the originating log), or from a vertical partitioning of \(\sigma _{max}\) into traces with single activity (e.g., due to the sequence operator). Therefore, for every level in the search space (dashed circles in Fig. 2a), exNoise performs at most \(v\times n\) steps.

Since we recursively select every trace operator, the time complexity of exNoise is \(\mathcal {O}((v\times n)^k)\), with k being the depth of the bipartite tree. In the worst case, the latter is \(k=v\), since the maximum number of recursive splits, again, depends on the length of the longest trace in the originating log and the size of the original event log [7]. Hence, the complexity of exNoise is \(\mathcal {O}((v\times n)^v)\).

In practice, event logs may become very large, see [14], so that one cannot guarantee to have a bound on v. Hence, we consider the time complexity of exNoise to be exponential in the size of the log, so that exNoise quickly becomes intractable in practice.

4.3 The adaNoise Algorithm

Given the above results, the Adaptive Noise-aware Inductive Miner, adaNoise, is a quality-aware, greedy version of exNoise. It attempts to overcome the computational complexity of exNoise by moving from an exhaustive to a heuristic search. To this end, the given quality measure \(\psi \) is not only used to select the best combination of discovery algorithms for a log partitioning; rather, it directly guides the exploration of combinations of discovery algorithms. Like exNoise, the adaNoise algorithm is parametrised by a quality measure \(\psi \) and a set \(\varGamma \) of n algorithms, given as variants of the Inductive Miner Infrequent.

We explain the intuition of adaNoise by means of Fig. 2b. Compared to the approach taken by the Inductive Miner (Fig. 1), adaNoise applies selection based on a (locally) optimal discovery algorithm \(\gamma ^*\), where optimality is measured using \(\psi \). Since we consider a set of inductive mining algorithms, this corresponds to the (local) optimisation of the noise filtering threshold. In each step of deriving the next log partitioning, adaNoise chooses the select function \(\eta \) of the algorithm \(\gamma ^*\), such that

$$\begin{aligned} \gamma ^* = \mathop {\text {argmax}}\limits _{\gamma \in \varGamma } \ \psi (L,\gamma (L)), \end{aligned}$$
(3)

with L being the log on which \(\eta \) is applied in the respective step.

While proceeding according to this heuristic search, adaNoise further constructs the process models for each intermediate step. To do so, it considers the operators that were selected prior to the current log L (i.e., the path from the root of the bipartite tree to L), concatenated with the result of applying each of the possible discovery algorithms to the current log L. The resulting n intermediate models are added to a solution set \(\mathcal {T}_{ada}\), and their \(\psi \) values are maintained for future use.

The trees in Fig. 2 illustrate the difference between exNoise and adaNoise. For every log, adaNoise proceeds with one trace operator and recursively splits the log with a single select function. In contrast, exNoise considers all trace operators for all logs.
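A minimal sketch of the greedy recursion, with hypothetical variant records holding a `discover` and a `select` function; the bookkeeping of intermediate models described above is omitted for brevity:

```python
def ada_noise(log, variants, psi, base_case):
    """Greedy fusion sketch: at every split, keep only the select function
    of the locally best variant (Eq. 3), then recurse into its sub-logs."""
    base = base_case(log)
    if base is not None:
        return base
    # Eq. 3: evaluate psi for every variant on the current sub-log
    gamma_star = max(variants, key=lambda g: psi(log, g["discover"](log)))
    operator, sublogs = gamma_star["select"](log)
    return (operator, [ada_noise(sl, variants, psi, base_case) for sl in sublogs])

def base_case(log):
    traces = set(log)
    if len(traces) == 1 and len(next(iter(traces))) == 1:
        return next(iter(traces))[0]
    return None

# two hypothetical variants; the toy measure prefers the model tagged 'fine'
variants = [
    {"discover": lambda log: "coarse",
     "select": lambda log: ("seq", [[s[:1] for s in log], [s[1:] for s in log]])},
    {"discover": lambda log: "fine",
     "select": lambda log: ("seq", [[s[:-1] for s in log], [s[-1:] for s in log]])},
]
psi = lambda log, model: {"coarse": 0.4, "fine": 0.7}[model]

print(ada_noise([("a", "b", "c")], variants, psi, base_case))
# ('seq', [('seq', ['a', 'b']), 'c'])
```

In contrast to the exhaustive enumeration of exNoise, only one select function is followed per sub-log, so the recursion explores a single path through the bipartite tree of Fig. 2b.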

When adaNoise reaches a leaf node, the recursion stops, and the corresponding process tree is computed and added to the solution set \(\mathcal {T}_{ada}\). Once leaf nodes have been reached for all logs, adaNoise returns \(T^*_{ada}\), as follows:

$$\begin{aligned} T^*_{ada} = \mathop {\text {argmax}}\limits _{T \in \mathcal {T}_{ada}} \ \psi (L,T). \end{aligned}$$
(4)

Similar to exNoise, the Inductive Miner Infrequent is a special case of adaNoise, when \(\varGamma \) is a singleton set containing only one configuration of the algorithm, with a single noise filtering threshold. Next, we provide theoretical guarantees for adaNoise.

Proposition 2

Let \(T_\varGamma = \{ \gamma (L) \mid \gamma \in \varGamma \}\) be the set of process trees discovered by applying the algorithms in \(\varGamma \) to an event log L. Then, it holds that \(T_\varGamma \subseteq \mathcal {T}_{ada}\).

Proof

The result is due to the computation of intermediate models for every event log that is considered for partitioning. Given a log L, adaNoise always obtains \(\gamma (L)\) for all \(\gamma \in \varGamma \) and these intermediate models are added to \(\mathcal {T}_{ada}\).     \(\square \)

Corollary 2

The adaNoise algorithm is proper.

Proof

By Proposition 2, the solution space \( \mathcal {T}_{ada}\) contains \(T_\varGamma \). Then, due to Eq. 4, we derive that the following holds true: \(\psi (L,T^*_{ada}) \ge \psi (L,\gamma (L)), \forall \ \gamma \in \varGamma . \)     \(\square \)

Note that \(\psi (L,T^*_{ada})\) is not guaranteed to attain \(\psi (L,T^*_{ex})\), as the greedy search in adaNoise may miss out on solutions that are found by exNoise.

The adaNoise algorithm attempts to provide a computationally feasible alternative to the exhaustive search of exNoise. Yet, its computational complexity depends on the cost of computing \(\psi \). Specifically, the algorithm evaluates the heuristic n times at every level, for up to v event logs per level in the worst case, i.e., \(n\times v\) evaluations per level. Furthermore, the maximal depth of the bipartite tree is v. This yields a time complexity of \(\mathcal {O}(n\times v^2)\) heuristic evaluations, and the cost of computing \(\psi \) multiplies this expression. Hence, if \(\psi \) is exponential in v (e.g., when \(\psi \) is computed based on alignments [15]), adaNoise has exponential runtime. In our experiments, however, we show that, in practice, adaNoise runs efficiently even if \(\psi \) has exponential time complexity.

5 Evaluation

We evaluate the FuseDisc framework by comparing adaNoise against the plain Inductive Miner Infrequent (IMi). The latter is indeed dominated by adaNoise, with the difference being large when focusing on precision as a quality dimension. This is explained by the fact that, unlike IMi, our adaNoise algorithm is able to improve precision, even when it is weighted against fitness and generalization.

5.1 Benchmarks and Experimental Setup

We run experiments on three benchmarks. First, we tested the algorithms with synthetic event logs from two BPM (2016–2017) discovery contest benchmarks. The two benchmarks comprise 10 event logs from 10 corresponding process models. These models include complex control-flow constructs and log phenomena such as recurrent activities, loops, and inclusive choices. Each log contains 1000 traces, which we split into training-validation-test sets, as described below. The third benchmark is a real-world hospital log, which comprises one month of event data from an outpatient cancer clinic in the US. The dataset comprises about 25,000 treatment paths (\(\sim 1000\) patients per day) that consist of a total of 68,800 events. Log behaviour includes parallelism, loops, and exclusive choices.

For the fusion, we considered the Inductive Miner Infrequent (IMi) [8] with noise filtering thresholds of increasing size, with a step size of 0.05. In other words, we set \(\varGamma = \{\gamma _1, \ldots , \gamma _n\}\) with the threshold of \(\gamma _i\) set to \(\tau _i = 0.05\times (i-1)\). We always compared the results obtained with adaNoise against the best base miner, i.e., we compare against \(\gamma ^* = \text {argmax}_{\gamma \in \varGamma } \psi (L,\gamma (L))\). As such, we consider the most challenging baseline in our evaluation, even though, in practice, there would be no means to know a priori which of the baseline algorithms performs best.

In our experiments, we measured three quality metrics, namely fitness, precision and generalization, along with the total score. For the evaluation, the adaNoise algorithm was implemented as a ProM plugin, using the EDU-ProM [16] variant of ProM.

To measure fitness, we employ measures that are grounded in the notion of an alignment between model and log [15]. Specifically, an alignment is defined based on steps \((x,y)\in \mathcal {A}^\bot \times \mathcal {A}^\bot \), where \(\mathcal {A}^\bot =\mathcal {A}\cup \{\bot \}\) is constructed from the universe of activities and a symbol \(\bot \notin \mathcal {A}\). A step \((x,y)\) is legal if \(x\in \mathcal {A}\) or \(y\in \mathcal {A}\), and is interpreted such that an alignment 'moves in both' traces (\((x,y)\in \mathcal {A}\times \mathcal {A}\)), 'moves in first' (\(y=\bot \)), or 'moves in second' (\(x=\bot \)). Given two traces \(\sigma , \sigma '\), an alignment is a sequence of legal steps, and the alignment cost is the sum of the costs of its steps. A common cost model assigns unit cost if either \(x=\bot \) or \(y=\bot \); zero cost if \(x=y\); and infinite cost if \(x\ne y\) with \(x,y \in \mathcal {A}\). To obtain the fitness score, the costs per log trace are aggregated, weighted by the trace frequency in the log, and normalised by the maximal possible cost.
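Under this cost model, the optimal alignment cost between two traces can be computed by dynamic programming; a minimal sketch (note that the actual fitness computation aligns log traces against the model, not two traces, and aggregates as described above):

```python
def alignment_cost(sigma, sigma_prime):
    """Optimal alignment cost of two traces under the standard cost model:
    synchronous moves on equal activities cost 0, moves on only one trace
    cost 1, and mismatched synchronous moves are forbidden."""
    n, m = len(sigma), len(sigma_prime)
    # dp[i][j] = cheapest alignment of sigma[:i] with sigma_prime[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, dp[i - 1][j] + 1)        # move in first only
            if j > 0:
                best = min(best, dp[i][j - 1] + 1)        # move in second only
            if i > 0 and j > 0 and sigma[i - 1] == sigma_prime[j - 1]:
                best = min(best, dp[i - 1][j - 1])        # synchronous move
            dp[i][j] = best
    return dp[n][m]

print(alignment_cost(("a", "b", "c"), ("a", "c")))  # 1: 'b' moves in first only
print(alignment_cost(("a", "b"), ("a", "b")))       # 0: perfect fit
```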

To measure precision, we rely on a log traversal technique proposed in [17]: Every trace in the log is mapped to a state in the model, and escaping edges (traces and activities allowed by the model, while missing in the log) are computed. The precision score is defined as a function of the ratio of escaping edges over all allowed activities.
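The escaping-edges idea can be illustrated with the following simplified sketch. It is not the exact measure of [17]: the `enabled` oracle, which returns the set of activities the model allows after a given prefix, is a hypothetical stand-in for the mapping of log traces to model states:

```python
from collections import defaultdict

def etc_precision(log, enabled):
    """Simplified escaping-edges precision (sketch in the spirit of [17]).

    `log` is a list of traces (tuples of activities); `enabled(prefix)`
    is an assumed oracle giving the activities allowed by the model
    after replaying `prefix`."""
    observed = defaultdict(set)  # prefix -> activities seen next in the log
    weight = defaultdict(int)    # prefix -> number of log events at that state
    for trace in log:
        for k, activity in enumerate(trace):
            prefix = tuple(trace[:k])
            observed[prefix].add(activity)
            weight[prefix] += 1
    allowed = escaping = 0
    for prefix, w in weight.items():
        model_enabled = enabled(prefix)
        allowed += w * len(model_enabled)
        # escaping edges: allowed by the model, but never observed in the log
        escaping += w * len(model_enabled - observed[prefix])
    return 1.0 - escaping / allowed if allowed else 1.0
```

A model that allows many continuations never seen in the log accumulates many escaping edges and thus scores low on precision.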

Generalization was measured similarly to the method that was used in past discovery contests, assessing the model’s capability to allow for legal behaviour that was not observed in the event log. Formally, let \(\sigma _0\) be a legal trace which was never observed in L. Denote by R a trace replay function that given a trace \(\sigma \) and a process tree T, returns 1, if \(\sigma \) can be replayed on T, and 0 otherwise. Since \(\sigma _0\) is random, generalization is written in expectation as \(\mathbb {E} R(\sigma _0,T)\). This corresponds to the probability that T is able to parse a new legal trace \(\sigma _0\). A generalization measure \(\psi _{Gen}\) should estimate \(\mathbb {E} R(\sigma _0,T)\) explicitly from data. To this end, we use the validation set approach, which is a standard method to quantify the generalization error in the Machine Learning literature [18].

The event log L is partitioned into a training set \(L_T\) and a validation set \(L_V\). We align every \(\sigma \) in \(L_V\) on T, evaluating R via a standard tree and log alignment procedure [6]. The generalization is given by the following unbiased estimator for \(\mathbb {E} R(\sigma _0,T)\):

$$\begin{aligned} \psi _{Gen}(L,T) = \ \widehat{\mathbb {E} R(\sigma _0,T)} = \ \frac{1}{|L_V|} {\sum _{i=1}^{|L_V|} R(\sigma _i,T) }. \end{aligned}$$

Clearly, higher values of \(\psi _{Gen}(L,T)\) imply greater generalization.
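The validation-set estimator is straightforward to state in code. The sketch below assumes a hypothetical predicate `replays` standing in for \(R(\cdot,T)\); the split ratio and seed are illustrative choices, not fixed by the paper, and discovery itself must use only \(L_T\):

```python
import random

def split_log(log, frac=0.7, seed=0):
    """Partition the log L into a training set L_T and a validation set L_V.
    The split fraction is a hypothetical choice for illustration."""
    traces = list(log)
    random.Random(seed).shuffle(traces)  # fixed seed for reproducibility
    cut = int(frac * len(traces))
    return traces[:cut], traces[cut:]

def psi_gen(validation_log, replays):
    """psi_Gen(L, T) = (1/|L_V|) * sum_i R(sigma_i, T), where `replays(s)`
    is an assumed predicate: can the discovered tree T replay trace s?"""
    return sum(1 if replays(s) else 0 for s in validation_log) / len(validation_log)
```

Because the validation traces are held out from discovery, the sample mean is an unbiased estimate of the probability that T parses a fresh legal trace.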

Our experiments were divided into two parts. First, we considered the synthetic datasets, setting \(\omega _i\) of \(\psi _{Score}\), see Sect. 2, to be uniform. Then, we altered the weight of \(\omega _{Prec}\) to demonstrate the major advantage of fusion-based discovery. Second, we validated our approach on a real-world hospital log, testing in particular the sensitivity of adaNoise to the weights in the definition of \(\psi _{Score}\).
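The weighting scheme used when varying \(\omega_{Prec}\) can be sketched as follows; the function name and the exact normalisation are assumptions for illustration, matching the description that the remaining weight is split uniformly over fitness and generalization:

```python
def psi_score(fitness, precision, generalization, w_prec):
    """Total score with precision weight w_prec; the remaining weight
    1 - w_prec is split uniformly between fitness and generalization
    (w_prec = 1/3 recovers the uniform weighting)."""
    w_rest = (1.0 - w_prec) / 2.0
    return w_rest * fitness + w_prec * precision + w_rest * generalization
```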

5.2 Results

Before assessing the quality of the discovered models, we note that even when considering an alignment-based calculation of \(\psi \) for fitness and precision, adaNoise had reasonable response times. Discovery took 80 s on average (stdev 60 s) for the synthetic data, and 27 s (stdev 7.8 s) for the real-life log. While this is less efficient than the plain IMi, which runs in less than one second on average, we argue that the dominance of adaNoise in terms of discovery quality justifies this difference in run-time.

Table 1. Results obtained for the BPM 2016/2017 contest event logs.

Part I: Synthetic Logs. Table 1 shows the results for the two BPM contest benchmarks. Specifically, we compare the discovery quality of the best algorithm in \(\varGamma \) (denoted IMi), to the quality of models created by the adaNoise algorithm for the BPM 2016 and BPM 2017 benchmarks. Furthermore, we present the results for all three quality measures, along with the total score measure obtained with a uniform weighting of the three measures. For every miner, IMi and adaNoise, we present the results for each of the 10 individual logs per benchmark.

First, the theoretical guarantee formalised as Corollary 2 is visible in the results. The total quality score obtained with adaNoise is always higher than (or equal to) the one observed for IMi. Furthermore, adaNoise better balances precision against the other two measures. It is a known property of the IMi to sacrifice precision by adding generality, in the form of ‘flower’ constructs, while preserving fitness [8]. These results show that the adaNoise algorithm indeed follows the FuseDisc paradigm by compensating for the weaknesses of one algorithm by using other algorithms from \(\varGamma \). Specifically, adaNoise flexibly tunes the noise thresholds that shall be applied to different parts of the event log. This yields improved results, without sacrificing precision.

We further vary the weight of precision for two selected event logs, namely \(L_{8}\) of either benchmark. Figure 3 illustrates the results obtained with both algorithms. Here, the vertical axis represents \(\psi _{Score}\), the total quality score, while the horizontal axis corresponds to the weight assigned to precision, with the remaining weights being set uniformly to the complement of the precision weight. We observe that as the importance of precision grows, adaNoise’s dominance increases, as the IMi is insensitive to \(\omega _{Prec}\).

Fig. 3. Detailed results for the 8th event logs of the BPM 2016/2017 benchmarks.

Part II: Hospital Log with Varying Weights. Next, we explored the sensitivity to the weights in the definition of model quality with the real-life hospital log. Figure 4 shows the overall score \(\psi _{Score}\) as a function of changes in the respective weights. For example, in Fig. 4a, when the fitness weight is set to 0, the other two weights are set uniformly to 1/2. We observe throughout the results that adaNoise dominates the IMi, in line with the guarantee formalised in Corollary 2. Furthermore, the largest improvement is observed when the emphasis is on precision. This is expected, since, as mentioned above, IMi tends to sacrifice precision, but is not guided by a comprehensive quality measure. A consequence of this behaviour is that, when the weight of precision is 0, the two algorithms yield the same value for \(\psi _{Score}\). For virtually all other configurations, however, adaNoise provides a considerable improvement over the result of IMi in terms of model quality.

Fig. 4. Exploring the model quality with varying weights.

6 Related Work

Our approach falls in the area of process discovery, see a recent review by Augusto et al. [4]. We instantiated our framework with the Inductive Miner, introduced by Leemans et al. in [7] and later extended with a noise filtering mechanism [8]. The latter filters an event log representation (e.g., a directly-follows graph) using a single noise threshold. Similar noise filtering strategies have been incorporated in many other discovery algorithms, e.g., those extracting simple heuristic nets, see [19, 20]. Another example is the Split-Miner [21], which generates BPMN models and reduces noise by several threshold-based pruning techniques, while ensuring appropriate levels of fitness. We argue that our fusion-based algorithms are able to overcome the limitations of such individual configurations and tune the noise filtering strategy for each part of the event log.

We instantiated the FuseDisc framework with a divide-and-conquer scheme that partitions an event log. This is similar to splitting a log into passages for process discovery [22]. Yet, there is a fundamental difference: Our methods split the log to optimise accuracy in terms of fitness, precision, and generalization, whereas passage-based discovery is purely driven by performance considerations. In fact, it requires the selection of a particular discovery algorithm. Once this choice is taken, the model discovered from the complete log and the model obtained by combining the results discovered for the sub-logs are expected to be equivalent. While one may try to optimise accuracy, in the spirit of exNoise and adaNoise, for each sub-log obtained from minimal passages, this would require a careful selection of considered discovery algorithms to achieve a correct composition. Without that, applying [22] in the scheme defined in Sect. 4.1 would not even guarantee connectedness of the resulting model.

Discovery based on the theory of regions, see [24], may also give hints on how to split a log for fusion-based discovery. However, the regions in a transition system representing a log may be overlapping, so that it is unclear how the composition of models obtained for the respective sub-logs would have to be done.

Divide-and-conquer schemes may speed-up the computation of alignments [23]. However, unlike our methods, this decomposes a model into fragments, not an event log.

Our approach is also related to discovery based on trace clustering. Respective approaches typically proceed in two steps: (1) traces of an event log are clustered based on a pre-defined similarity measure, and (2) a model is derived for each of the clustered sub-logs. A large number of such trace clustering methods has been proposed in the literature, see [25, 26], and references therein. Recently, trace clustering has also been conducted such that the impact of clustering on the discovery result is taken into account to improve the discovered models [27]. Model improvement by trace clustering can further be achieved by slicing of discovered models [28], i.e., models are split based on identified trace clusters. The main difference between these works on trace clustering and our approach is that trace clustering partitions a log horizontally, i.e., per trace. Our FuseDisc framework, however, targets both vertical and horizontal partitioning of traces, selecting a suitable algorithm for the log partitioning obtained with particular trace operators.

Finally, search-based process discovery methods, e.g., the Genetic Miner [29] and the Evolutionary Tree Miner [30] are related, as they explore a space of models when attempting to find an optimal one, given a set of quality criteria. Yet, these algorithms are not guaranteed to terminate in finite time. Recently, it was also suggested to consider the question of conformance between a log and a model as a search problem [31]. In our work, we avoid a direct search in the space of process models, yet adopt search ideas when partitioning the event log and combining different discovery algorithms.

7 Conclusion

To improve the quality of discovered process models, we introduced FuseDisc, a framework that combines the results of multiple discovery algorithms by fusing process models obtained for parts of an event log. Within this framework, we defined the notion of properness, which guarantees that the fused result will be at least as good as the results of all base algorithms. We then argued that fusion-based discovery may exploit divide-and-conquer schemes and presented two specific algorithms, namely exNoise and adaNoise. We showed that they are proper, and discussed their computational complexity. To evaluate our approach, we ran experiments using synthetic and real-world event logs. The results illustrate that the presented techniques improve over a state-of-the-art discovery algorithm, the Inductive Miner Infrequent, in terms of combined quality measures that include fitness, precision, and generalization.