
1 Introduction

Most information systems produce event logs as evidence of process execution. An event log consists of a set of events, each representing an executed activity in a business process. An event has a case identifier and an execution timestamp, and may carry further context data such as the human resource or the lifecycle transition.

Postmortem analysis techniques for event logs, e.g., conformance checking [1], enhancement [2], and process performance analysis [14], require that events are associated with case identifiers. A case identifier is needed to correlate the events of a process instance. However, case identifiers genuinely exist only in event logs produced by centrally orchestrated systems, so-called labeled event logs.

There are many reasons why the execution of business processes may produce event logs with missing information or errors [5, 7, 12]: for example, some events are collected and recorded by humans, or there is no central orchestration system that is aware of the process model. The latter case is the most common in real life. It is called unmanaged execution of processes and corresponds to the middle level of the event log categories in [2], as well as to level 4 or lower of logging information as in [5]. Such logs do not contain case identifiers and are called unlabeled event logs.

The problem of labeling unlabeled logs has received little attention in the business process management community [1]. The work in [5, 6, 12, 16] has addressed the issue in the form of directly mining process models from unlabeled event logs. However, these approaches do not support mining logs generated from cyclic processes.

In previous work [4], we introduced the Deduce Case IDs (DCI) approach, which deduces the case identifiers of unlabeled events and generates a set of labeled event logs, assuming acyclic processes. In this paper, we extend DCI to support unlabeled events generated from cyclic processes. We call the extension DCIc. In DCIc, we introduce a preprocessing step for the process model that constructs a relationship matrix representing the relations among activities while taking the cyclic behavior of the model into account. We also modify the deduction process to support labeling possibilities that arise from cyclic behavior. For our approach to work, we require as input, in addition to the unlabeled log, the executed business process model and heuristic information about the execution duration of the activities within the model. The output is a set of ranked labeled event logs, where the ranking score indicates the degree of trust in the labeling of events within each log.

The remainder of this paper is organized as follows: preliminaries and foundational concepts are discussed in Sect. 2. The overview of the approach and a running example are presented in Sect. 3. In Sect. 4, we discuss the preparation and construction of the relationship matrix from the process model. In Sect. 5, we present the details of the DCIc approach. Implementation details and an experimental evaluation on real-life logs are discussed in Sect. 6. Related work is discussed in Sect. 7. Finally, we conclude the paper in Sect. 8 with a critical discussion of our approach.

2 Preliminaries

In this section, we discuss the fundamental concepts that are used in our approach. In Sect. 2.1, we define the decision tree used in DCIc. In Sect. 2.2 we explain the behavioral profile used to construct the relationship matrix. In Sect. 2.3, we discuss heuristic data that are used by DCIc.

2.1 Decision Tree

In general, a decision tree represents decisions and their possible consequences. Each node carries a conditional probability with respect to its parent, which affects the decisions in the tree. In the context of our work [4], a decision tree is used to represent the possible labelings of each input unlabeled event. An unlabeled event can be represented by a set of different nodes in the tree, one per possible labeling of the event.

Definition 1

(Case Decision Tree). A case decision tree is a tuple CTree=\(\langle \)Node, F, root, Leaves\(\rangle \)

  • Node is the set of nodes within a tree. Each node is further attributed with a caseId, a timestamp, an activity, an event identifier, and a probability,

  • \(F \subset (Node \times Node)\) is the relation between nodes,

  • \(root \in Node\) is the root node of the tree, defined with \(caseId=0\),

  • \(Leaves \subset Node\) is the set of leaf nodes in the tree.

A branch \(\sigma = branch(n_i)\) in the tree is the sequence of nodes visited when traversing the tree from the node \(n_i\) to the root: \(\sigma = n_i,n_{i-1},\dots,n_1,root\) such that \((root,n_1)\in F \wedge \forall _{k=2}^{i} (n_{k-1},n_k)\in F\).

Definition 1 describes the structure of the decision tree used in DCIc. Each child of the root represents a case. The set of branches within the same case describes the possible execution behaviors for this case. Each node carries its conditional probability w.r.t. its parent node.
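To make Definition 1 concrete, the following is a minimal Python sketch of a CTree node; the field and method names are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CTreeNode:
    """One node of the case decision tree (Definition 1). Field names are
    illustrative; the actual DCIc implementation may organize the tree
    differently."""
    case_id: int                          # 0 for the root, > 0 for case nodes
    timestamp: Optional[int]              # execution timestamp of the event
    activity: Optional[str]               # activity name of the event
    event_id: Optional[int]               # identifier of the unlabeled event
    probability: float = 1.0              # conditional probability w.r.t. the parent
    parent: Optional["CTreeNode"] = None
    children: List["CTreeNode"] = field(default_factory=list)

    def add_child(self, child: "CTreeNode") -> None:
        child.parent = self
        self.children.append(child)

    def branch(self) -> List["CTreeNode"]:
        """Return the branch from this node up to the root (sigma in Definition 1)."""
        nodes, n = [], self
        while n is not None:
            nodes.append(n)
            n = n.parent
        return nodes

# The root is defined with caseId = 0; each direct child of the root opens a new case.
root = CTreeNode(case_id=0, timestamp=None, activity=None, event_id=None)
```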

2.2 Behavioral Profile

A behavioral profile (BP) describes a business process model (BPM) in terms of relations between activities of the model [17].

Definition 2

(Behavioral Profile). Let A be the set of activities within a process model BPM. The behavioral profile is a function that, for any pair of activities, defines the behavioral relation as none \(\bot \), reverse \(\rightsquigarrow ^{-1}\), sequence \(\rightsquigarrow \), exclusive \(+\), or parallel \(\parallel \).

A behavioral profile returns one of these relations for any pair of activities (a, b) that belong to the process model under investigation [4, 17]. To illustrate what these relations mean, imagine two traces of the form abcd and acbe. \(BP(a,b) = \rightsquigarrow \) because a was observed directly followed by b; consequently, the reverse relation \(BP(b,a) = \rightsquigarrow ^{-1}\) holds. \(BP(b,c)=\parallel \) because both bc and cb were observed. \(BP(d,e) = +\) because d and e never occur in the same trace. Finally, \(BP(a,d)=\bot \) because we never observe the sequence ad nor da in any trace, while both activities appear in at least one trace. The \(\parallel \) relation is identified between two activities a and b whenever two or more traces \(\dots ,a,b,\dots \) and \(\dots ,b,a,\dots \) are observed. In terms of the process structure, such traces can be observed either because tasks a and b belong to two concurrent branches or because a and b belong to a cyclic component of the process. Within the context of this paper, behavioral profiles are used to filter out incorrect labelings of events. However, it is crucial for correct labeling to distinguish between concurrent behavior and cyclic behavior. We elaborate on that in Sect. 4.
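The following sketch derives the pairwise relations from example traces under the directly-follows reading used in the example above; it is an illustration only, not the behavioral profile computation of [17], which works on the process model rather than on traces.

```python
from itertools import combinations

def behavioral_relations(traces):
    """Derive pairwise relations from example traces, following the
    directly-follows reading used in the running example above. Returns a
    dict mapping (a, b) to one of '~>' (sequence), '<~' (reverse),
    '||' (parallel), '+' (exclusive), or '_|_' (none)."""
    activities = {a for t in traces for a in t}
    follows = set()        # directly-follows pairs
    co_occur = set()       # unordered pairs appearing together in some trace
    for t in traces:
        follows.update(zip(t, t[1:]))
        co_occur.update(frozenset((a, b)) for a in t for b in t if a != b)

    rel = {}
    for a, b in combinations(sorted(activities), 2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and ba:
            rel[(a, b)] = rel[(b, a)] = '||'
        elif ab:
            rel[(a, b)], rel[(b, a)] = '~>', '<~'
        elif ba:
            rel[(a, b)], rel[(b, a)] = '<~', '~>'
        elif frozenset((a, b)) not in co_occur:
            rel[(a, b)] = rel[(b, a)] = '+'     # never in the same trace
        else:
            rel[(a, b)] = rel[(b, a)] = '_|_'   # co-occur, but never adjacent
    return rel

bp = behavioral_relations([list("abcd"), list("acbe")])
print(bp[("a", "b")], bp[("b", "c")], bp[("d", "e")], bp[("a", "d")])
# -> ~> || + _|_
```

As the comment at the end shows, the two example traces reproduce the relations discussed in the text; note that this trace-level view cannot distinguish concurrency from cyclic repetition, which is exactly what the relationship matrix of Sect. 4 adds.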

2.3 Activity Heuristics

Activity heuristics are statistical data about the execution duration of each activity. The duration is represented as a range [min, max] together with an average (avg) value. This information is useful in building the case decision tree, as this extra data allows us to deduce the labeling possibilities of an unlabeled event.
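A minimal sketch of how such heuristics can be represented and checked follows; the values and the tolerance around the average are assumptions for illustration, the concrete values of the running example are those of Fig. 3b, and the 'avg'/'otherRanges' categories anticipate their use in Sect. 5.

```python
# Hypothetical heuristic values for illustration; the concrete values used in
# the running example are those shown in Fig. 3b.
ACTIVITY_HEURISTICS = {
    "A": {"min": 1, "max": 5, "avg": 3},
    "B": {"min": 2, "max": 8, "avg": 4},
}

def duration_category(activity, duration, heur=ACTIVITY_HEURISTICS, tolerance=1):
    """Classify an observed execution duration against the activity heuristics.

    Returns 'avg' when the duration is close to the average, 'otherRanges'
    when it still lies within [min, max], and None when it is out of range
    (the corresponding labeling possibility is discarded). The tolerance
    around the average is an assumption of this sketch."""
    h = heur[activity]
    if not (h["min"] <= duration <= h["max"]):
        return None
    return "avg" if abs(duration - h["avg"]) <= tolerance else "otherRanges"

print(duration_category("A", 3), duration_category("A", 5), duration_category("A", 9))
# -> avg otherRanges None
```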

Fig. 1. Approach overview

3 Approach Overview

Figure 1 shows the main components of DCI [4] and the proposed extension DCIc. DCIc has three main inputs: the unlabeled event log (S), the heuristic data, and the process model. It also has an optional input, the ranking-score threshold, which restricts the displayed labeled logs based on the user-specified value. As output, DCIc generates a set of ranked labeled event logs, reflecting the uncertainty in the labeling of each single unlabeled event. As shown in Fig. 1, there is a preprocessing step that produces the so-called relationship matrix (RM), which describes the relations between the activities of the process model. The relationship matrix supports the representation of cyclic behavior, which is one of the main contributions of this paper. Details about the preprocessing step of creating the relationship matrix are discussed in Sect. 4.

The case ID deduction process starts with the “Build Case Decision Tree” step, which uses the unlabeled event log to construct the CTree. During the CTree construction, the “Filtering process” step is applied, using both the relationship matrix and the heuristic data to retain only valid labeling possibilities. The last step in DCIc is “Build Event Logs”, introduced in [4]. This step generates the different consistent combinations of the cases represented by the CTree branches and writes each combination into a separate labeled event log along with its ranking score, cf. [4]. Details about how DCIc works and how cyclic behavior is handled are presented in Sect. 5. The rest of this section discusses a running example to explain the inputs to DCIc.

Fig. 2. Order business process

Our approach as described in Fig. 1 needs the following inputs:

  1. A process model, cf. Fig. 2, which is used later to produce the relationship matrix.

  2. An unlabeled event log S with activity and timestamp, where the case ID is unknown. A sample unlabeled log is shown in Fig. 3a.

  3. Activity heuristics heur, i.e., data about the execution duration of each activity. As discussed in [5], extra data about the execution behavior is required to be able to correlate the unlabeled events. DCIc requires heuristic execution data to reduce the case identifier possibilities for the unlabeled events. Example values of these heuristics for the activities of the process in Fig. 2 are shown in Fig. 3b.

  4. The ranking-score threshold (optional), used to suppress generated labeled logs with a ranking score below the threshold; by default all generated labeled logs are displayed, i.e., threshold = 0. A minimal sketch of these input structures follows this list.
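The following sketch shows one possible in-memory representation of these inputs; all values are hypothetical placeholders, since the actual values of the running example are those given in Fig. 3.

```python
# Hypothetical values for illustration; the concrete running-example values
# are those shown in Fig. 3.
S = [                         # unlabeled log: (timestamp, activity), case ID unknown
    (1, "A"),
    (3, "B"),
    (5, "B"),
    (6, "C"),
]

heur = {                      # activity heuristics: execution duration per activity
    "A": {"min": 1, "max": 4, "avg": 2},
    "B": {"min": 1, "max": 5, "avg": 3},
    "C": {"min": 2, "max": 6, "avg": 4},
}

ranking_threshold = 0         # optional; 0 (the default) keeps all labeled logs
```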

Fig. 3. Required input for example in Fig. 2

As shown in Fig. 1, the result of DCIc is a set of ranked labeled logs. These logs are categorized as either (a) complete logs, which include all events existing in S, or (b) noisy logs, which violate the model, are inconsistent with the heuristic execution data, or fall below the user-defined ranking threshold.

There are some assumptions for DCIc to work. First, there is no waiting time between activities: each event in S has a timestamp that represents the completion time of an activity and the start time of the next activity. Considering waiting times would require explicit start and complete lifecycle transition events, which is not always the case for logs from unmanaged executions. The second assumption is that the input process model must be deadlock- and livelock-free. The third assumption is that there is exactly one start activity that is not contained in any loop, see G3 in [9].

4 Generating the Relationship Matrix

DCIc uses a so-called relationship matrix among activities to deduce the case identifiers of unlabeled events. This matrix is based on the behavioral profile, cf. Sect. 2.2. However, a limitation of the behavioral profile, w.r.t. this paper, is that a BP does not distinguish the concurrent execution of two activities from a cyclic execution. For example, cf. Figs. 2 and 6a, \(BP(B,F)=\parallel \), the same as the relation BP(E, D). However, when checking these relations in the acyclic form of the model, cf. Fig. 6b, \(BP(B,F)=\rightsquigarrow \) whereas \(BP(E,D)= \parallel \), since the latter results from concurrent rather than cyclic behavior. We are interested in having that distinction. Thus, we generate the relationship matrix in a four-step approach. The first step is to obtain the behavioral profile of the original, possibly cyclic, input process model. The second step is to iteratively detect loops and remove the connecting branch from the exit to the entry nodes of the loop; these elements are maintained in a separate structure. The second step is repeated until all loops are removed, in order to account for nested and unstructured/irreducible loops. The third step is to obtain the behavioral profile of the acyclic model. The fourth and final step is to merge the two profiles from steps one and three to obtain the relationship matrix. Figure 4 shows the steps of generating RM.

Fig. 4. Relationship matrix creation steps

Definition 3

(Relationship Matrix). Let A be the set of activities within a process model. The relationship matrix is a function that, for any pair of activities, defines the relation as none \(\bot \), reverse \(\rightsquigarrow ^{-1}\), sequence \(\rightsquigarrow \), exclusive \(+\), or parallel \(\parallel \). We define some auxiliary functions:

  • \(Predecessors(b) =\{P \subseteq A: \forall a \in P~BP(a,b) = \rightsquigarrow \wedge \forall a, c \in P ~BP(a,c) = \parallel \wedge ((\exists P' \in Predecessors(b): P \ne P') \rightarrow \forall a \in P~\forall a' \in P'~BP(a,a') \ne \parallel )\}\)

  • Loop(b) returns the loops that contain b, where each \(loop=(Is,Ss,Es,Bs)\): Is is the set of activities within the loop, \(Ss \subset Is\) is the set of start activities, \(Es \subset Is\) is the set of end activities, and \(Bs \subset Is\) contains the activities within the loop branch.

  • \(StartActivities()=\{a:a \in A \wedge Predecessors(a) = \emptyset \}\)

The predecessors of an activity b, Predecessors(b), are represented as a set of sets, where the relation between members of the same set is parallel whereas the relation between members of different sets is not parallel. For example, in Fig. 2, \(Predecessors(F)=\{\{E,D\}\}\): the predecessors of F are the activities E and D, which are in a \(\parallel \) relation. The loops within the model are represented by a set of loop objects, cf. Fig. 6c. For example, in Fig. 2, C is part of nested loops, so \(Loop(C)=\{L1=( Is=\{B,C,D,E,F,H\},Ss=\{B\},Es=\{F\},Bs=\{H\}), L2=(Is=\{C,D,E\},Ss=\{C,D\},Es=\{D,E\},Bs=\{\})\}\).
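A minimal sketch of how Predecessors(b) can be derived from a behavioral profile follows; it assumes the profile is available as a dictionary of pairwise relations as in the earlier sketch, and the greedy grouping is a simplification of the formal condition of Definition 3.

```python
def predecessors(b, bp, activities):
    """Group the sequence-predecessors of activity b into sets of mutually
    parallel activities, cf. Predecessors(b) in Definition 3. Greedy grouping;
    a simplification of the formal condition that suffices for
    well-structured models."""
    preds = [a for a in activities if bp.get((a, b)) == '~>']
    groups = []
    for a in preds:
        for g in groups:
            if all(bp.get((a, c)) == '||' for c in g):
                g.add(a)
                break
        else:
            groups.append({a})
    return groups

# For the acyclic profile of Fig. 6b one would expect
# predecessors("F", bp_a, activities) to yield [{"E", "D"}],
# matching Predecessors(F) = {{E, D}} in the text.
```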

Fig. 5. Apply the second step of generating RM for the original model in Fig. 2

4.1 Detecting and Breaking Loops

We use Tarjan’s algorithm [13] to identify the loops by detecting the strongly connected components that contain a cycle. For each loop, we identify the start activities, the end activities, and the loop branch. The loop branch is the back-edge flow between the loop end activities, i.e., activities with successors outside the loop, and the loop start activities, i.e., activities with predecessors outside the loop. As shown in Fig. 5, this step is repeated until all loops are detected and removed by cutting the loop branch each time. Hence, if a loop is unstructured or irreducible, i.e., a loop with multiple entries, our approach is still able to handle it, because one branch is removed in one iteration and the other branch in a later iteration.

The results of the ‘remove loops’ step are: (a) an acyclic model, cf. Fig. 5c, and (b) the set of loops, cf. Fig. 6c. For acyclic input process models, the resulting model is identical to the input and no loops are detected.
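The following sketch illustrates this loop-detection and loop-breaking step on an activity-level graph, using networkx strongly connected components as a stand-in for Tarjan's algorithm; the way the branch activities are identified and the edges are cut is a simplifying assumption, not the authors' implementation.

```python
import networkx as nx
from collections import deque

def detect_and_break_loops(model: nx.DiGraph):
    """Iteratively detect loops as strongly connected components (a stand-in
    for Tarjan's algorithm [13]) and cut the back edges that close each loop.
    Works on an activity-level graph and returns the acyclic graph plus, per
    loop, the sets Is, Ss, Es, Bs of Definition 3. A simplified sketch."""
    g = model.copy()
    loops = []
    while True:
        cyclic = [c for c in nx.strongly_connected_components(g)
                  if len(c) > 1 or g.has_edge(next(iter(c)), next(iter(c)))]
        if not cyclic:
            return g, loops                              # g is now acyclic
        for Is in cyclic:
            Es = {n for n in Is if any(s not in Is for s in g.successors(n))}
            Ss = {n for n in Is if any(p not in Is for p in g.predecessors(n))}
            # branch activities: reachable inside the loop from an end
            # activity without passing through a start activity
            Bs, frontier = set(), deque(Es)
            while frontier:
                for s in g.successors(frontier.popleft()):
                    if s in Is and s not in Ss | Es | Bs:
                        Bs.add(s)
                        frontier.append(s)
            loops.append({"Is": set(Is), "Ss": Ss, "Es": Es, "Bs": Bs})
            # cut the back edges from end/branch activities into the starts;
            # the fallbacks guarantee progress on unusual loop structures
            cut = ([(u, v) for u, v in g.edges() if v in Ss and u in Es | Bs]
                   or [(u, v) for u, v in g.edges() if v in Ss and u in Is]
                   or [(u, v) for u, v in g.edges() if u in Is and v in Is])
            g.remove_edges_from(cut)
        # repeating the outer loop handles nested and irreducible loops
```

Repeating the outer loop until no cyclic component remains mirrors the iterative removal described above; for the model of Fig. 2 one would expect the outer loop to be detected first and the inner loop in a subsequent iteration.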

4.2 Generating the RM

This step generates the relationship matrix by merging the behavioral profile of the original model, BP, the behavioral profile of the acyclic model, \(BP_a\), and the identified loops. The output of the generation process is the relationship matrix (RM), cf. Fig. 6d, as defined in Definition 3. RM is later used as an input for DCIc. The ‘Generate relationship matrix’ step determines the relation between a pair of activities a, b as follows:

  1. \(BP(a,b)=\parallel \wedge (\exists L\in Loops:a,b\in L.Bs)\wedge a\in predecessors(b)\rightarrow RM(a,b)=\rightsquigarrow \)

  2. \(BP(a,b)=\parallel \wedge (\exists L\in Loops:a,b\in L.Bs)\wedge a\notin predecessors(b)\rightarrow RM(a,b)=\rightsquigarrow ^{-1}\)

  3. \(BP(a,b) = \parallel \wedge (\exists L\in Loops: a \in L.Es \wedge b \in L.Bs \wedge a\in predecessors(b)) \rightarrow RM(a,b) = \rightsquigarrow \)

  4. \(BP(a,b) = \parallel \wedge (\exists L\in Loops: a \in L.Bs \wedge b \in L.Ss \wedge a\in predecessors(b)) \rightarrow RM(a,b) = \rightsquigarrow \)

  5. \(BP(a,b) = \parallel \wedge (\exists ~L\in Loops: a \in L.Es \wedge b \in L.Ss \wedge L.Bs=\emptyset ) \rightarrow RM(a,b) = \rightsquigarrow \)

  6. \(BP(a,b) = \parallel \wedge (\exists ~L\in Loops: a \in L.Ss \wedge b \in L.Es) \wedge a \in predecessors(b) \rightarrow RM(a,b) = \rightsquigarrow \)

  7. \(BP(a,b) = \parallel \wedge (\exists ~L\in Loops: a,b \in L.Is \wedge ((a\vee b) \notin L.Es \cup L.Ss \cup L.Bs \vee a,b \in L.Ss \vee a,b \in L.Es) ) \rightarrow RM(a,b) = BP_a(a,b)\)

  8. \(BP(a,b) = \parallel \wedge (\forall ~L\in Loops: a,b \notin L.Is ) \rightarrow RM(a,b) = BP(a,b)\)

  9. \(BP(a,b) \ne \parallel \rightarrow RM(a,b) = BP(a,b)\)

Items 1 to 7 state the conditions for activities within loops. Items 1 and 2 handle the case where both a and b exist within the loop branch: if a is a predecessor of b, then the relation in RM is sequence, otherwise it is reverse. In item 3, if a is one of the end activities of the loop, b is one of the branch activities, and a is a predecessor of b, then the relation is sequence. For example, for the relation between F and H we have \(F \in L1.Es \wedge H \in L1.Bs\), so \(RM(F,H)=\rightsquigarrow \), cf. Fig. 6. In item 4, if a is part of the branch, b is part of the loop start activities, and a is a predecessor of b, then \(RM(a,b)=\rightsquigarrow \), as for the relation between H and B, where \(H \in L1.Bs \wedge B \in L1.Ss \wedge H \in Predecessors(B)\), so \(RM(H,B)=\rightsquigarrow \). In item 5, the relation between the loop end and loop start activities is sequence when there are no branch activities, as for the relation between D and E, where \(D \in L2.Es \wedge E \in L2.Ss \wedge L2.Bs= \emptyset \). Item 6 handles the case where a loop start activity is a predecessor of a loop end activity; in that case the relation is sequence, as for the relation between C and D, where \(C \in L2.Ss \wedge D \in L2.Es \wedge C \in Predecessors(D)\), hence \(RM(C,D)=\rightsquigarrow \). Item 7 handles the default loop case, where at least one of a and b is a plain loop activity (i.e., not among the start, end, or branch activities), or both belong to the start activities, or both belong to the end activities. In that case, the relation is the same as the acyclic relation between the two activities; for example, the relation between D and F is \(\rightsquigarrow \), and the relation between C and E is parallel, based on the acyclic relations. In item 8, if the relation between two activities is parallel and neither is included in any loop, then the relation remains parallel. The last condition covers the case where the relation is not parallel; then the relation remains as in the original model, cf. item 9.
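A simplified sketch of this merge step follows, assuming the two profiles are dictionaries of pairwise relations (with the symbols used in the earlier sketches), the loops are dictionaries with the sets of Definition 3, and preds(a, b) is a helper telling whether a is a predecessor of b; the default handling of item 7 is slightly coarser than the formal condition.

```python
SEQ, REV, PAR, EXC, NONE = '~>', '<~', '||', '+', '_|_'

def relationship_matrix(bp, bp_a, loops, preds):
    """Merge the behavioral profiles of the cyclic model (bp) and of the
    acyclic model (bp_a) with the detected loops into the relationship
    matrix, following items 1-9 above in a simplified form."""
    rm = {}
    for (a, b), rel in bp.items():
        if rel != PAR:                                    # item 9
            rm[(a, b)] = rel
            continue
        if not any(a in L['Is'] and b in L['Is'] for L in loops):
            rm[(a, b)] = rel                              # item 8
            continue
        for L in loops:
            Is, Ss, Es, Bs = L['Is'], L['Ss'], L['Es'], L['Bs']
            if a in Bs and b in Bs:                       # items 1 and 2
                rm[(a, b)] = SEQ if preds(a, b) else REV
            elif a in Es and b in Bs and preds(a, b):     # item 3
                rm[(a, b)] = SEQ
            elif a in Bs and b in Ss and preds(a, b):     # item 4
                rm[(a, b)] = SEQ
            elif a in Es and b in Ss and not Bs:          # item 5
                rm[(a, b)] = SEQ
            elif a in Ss and b in Es and preds(a, b):     # item 6
                rm[(a, b)] = SEQ
            elif a in Is and b in Is:                     # item 7 (default)
                rm[(a, b)] = bp_a.get((a, b), NONE)
            if (a, b) in rm:
                break
    return rm
```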

Fig. 6. The ‘Generate relationship matrix’ process input and output for the process model in Fig. 2

5 Deducing Case IDs

In this section, we explain in detail how DCIc works, cf. Fig. 1. The first step is to build the CTree, i.e., the case decision tree, cf. Definition 1, for the unlabeled log S. During this step, the filtering process takes place to avoid incorrect combinations based on the RM and the heuristic data, cf. Figs. 3 and 6d and Definition 3, respectively. The second step is to generate the set of ranked labeled event logs, while ignoring redundant cases or events within the same log, cf. [4].

5.1 Building Case Decision Tree

The first step in generating labeled event logs is deducing the case identifier (caseId) for each unlabeled event in S by building the case decision tree (CTree). Algorithm 5.1 builds the CTree based on the unlabeled event log S, the relationship matrix RM, and the heuristic data Heur, cf. Figs. 3 and 6d. While building the CTree, unlabeled events are allocated to their respective locations using a filtering process based on the model and the heuristic data. According to the results of the filtering process, i.e., the possible parents, new nodes are created for the event with different probabilities w.r.t. each node's parent.

Algorithm 5.1. Build the case decision tree (CTree)

In Algorithm 5.1, line 5, the filtering process is applied to obtain the candidate parent nodes in the tree for each unlabeled event. The filtering process discards combinations that are invalid w.r.t. RM or whose execution time is out of range w.r.t. the heuristic data. The output of Algorithm 5.2 is a dictionary, Parents, that categorizes the event's possible parent nodes w.r.t. its heuristic range \(\in \) (‘avg’, ‘otherRanges’). In lines 6–15, for each possible parent, a new child node is created to represent the event in the tree. Lines 7–11 assign the caseId of the event based on its parent (n.caseId). If the parent is the root, i.e., \(caseId = 0\), then the event represents the start of a new case.
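Since the listing of Algorithm 5.1 is only referenced here, the following sketch outlines its construction loop as described above, reusing the CTreeNode class sketched in Sect. 2.1; the helpers filter_parents (corresponding to Algorithm 5.2) and node_probability (the probability assignment of [4]) are passed in as assumptions and are not the authors' code.

```python
from collections import namedtuple

Event = namedtuple("Event", "event_id timestamp activity")

def build_ctree(S, RM, heur, root, filter_parents, node_probability):
    """Sketch of the CTree construction loop of Algorithm 5.1: for every
    unlabeled event, candidate parents are obtained by the filtering process
    (Algorithm 5.2) and one child node is added per candidate parent."""
    next_case_id = 1
    for event_id, (timestamp, activity) in enumerate(S):
        event = Event(event_id, timestamp, activity)
        parents = filter_parents(event, root, RM, heur)   # {'avg': [...], 'otherRanges': [...]}
        for category, nodes in parents.items():
            for parent in nodes:
                if parent.case_id == 0:          # parent is the root:
                    case_id = next_case_id       # the event starts a new case
                    next_case_id += 1
                else:
                    case_id = parent.case_id     # inherit the parent's case
                child = CTreeNode(case_id, timestamp, activity, event_id,
                                  probability=node_probability(parent, category))
                parent.add_child(child)
    return root
```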

Filtering for the candidate parents of an unlabeled event is covered by Algorithm 5.2. First, it uses RM to filter for permissible combinations by retrieving the relation between the activities of the CTree leaves and the current event's activity. Second, it uses the heuristic data to check whether each candidate parent from the model step lies within the execution range of the activity, cf. [4]. Finally, the filtering process returns the possible parents categorized as {‘avg’, ‘otherRanges’}.

Algorithm 5.2. Filter for candidate parent nodes

Algorithm 5.2 starts by checking, in lines 2–4, whether the activity of the current event is a start activity. In that situation, the event represents a new case, i.e., a direct child of the root. Otherwise, the tree is traversed from the leaves to the root, in lines 6–7, looking for possible parents. Line 10 builds the set of running loops within the branch by traversing from the node to the root and splitting the branch w.r.t. loop.Ss and loop.Es, cf. Definition 3. The current branch cb begins at the last occurrence of any start activity of RM.Loop(activity) that is not followed by any of the loop end activities. Since Loop(activity) may return a set of loops containing the event's activity, the branch can be split into a set of running loops. Line 12 uses RM to retrieve the relation between the activity of the current event and the leaf node activity. There are four types of relations:

  • In case of none \((\bot )\), line 13, we ignore the branch and do not investigate it any further.

  • In case of exclusive \((+)\) or reverse \((\rightsquigarrow ^{-1})\), line 14, we traverse the tree upwards searching for labeling possibilities.

  • In case of sequence (\(\rightsquigarrow \)), lines 15–24, we get the nodes corresponding to the predecessors of the event's activity that exist within cb, i.e., the current branch, in line 16. Based on the resulting predecessor nodes, the heuristic filtering step is performed.

    Since RM represents the relation between the end and start activities of a loop as a sequence relation, if the activity is part of a loop and the node is within the execution range of the event, the branch is traversed further to explore the other possibilities arising from cyclic behavior, in line 21.

  • In case of parallel \((\parallel )\), lines 25–36. Since parallel activities can be executed in different orders, we need to make sure that the event's activity does not already exist in the current branch, i.e., cb; however, if parallel activities are part of a loop, they may be repeated within the branch. Line 28 checks the existence of the event's activity within the current branch: if it is present and the activity is not included in any loop, i.e., \(Loop(activity)=\varnothing \), then the event cannot be added to this branch. Otherwise, we get the nodes corresponding to the predecessors of the event's activity that exist within cb and proceed with the heuristic filtering step, in line 32. If the execution duration is within the heuristic execution range of the event's activity, the branch is traversed further to explore the other possibilities arising from both parallel and cyclic behavior.
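As with Algorithm 5.1, the listing is only referenced, so the following sketch outlines the case analysis described above. It elaborates the filter_parents helper assumed in the earlier sketch, here with the leaves and the remaining helpers passed explicitly; in_heuristic_range and predecessor_nodes are assumed helpers, and the loop-aware re-exploration of a branch is only hinted at in the comments, so this is an illustrative simplification, not the authors' implementation.

```python
def filter_parents(event, root, RM, heur, leaves, start_activities,
                   in_heuristic_range, predecessor_nodes):
    """Simplified sketch of Algorithm 5.2: the RM relation between each leaf
    activity and the event activity decides how a branch is explored, and the
    heuristic check classifies surviving candidates into 'avg'/'otherRanges'."""
    parents = {"avg": [], "otherRanges": []}
    if event.activity in start_activities:
        parents["avg"].append(root)          # the event opens a new case
        return parents
    for leaf in leaves:
        rel = RM[(leaf.activity, event.activity)]
        if rel == '_|_':                     # none: ignore this branch
            continue
        if rel in ('+', '<~'):               # exclusive or reverse:
            candidates = leaf.branch()       # search upwards along the branch
        else:                                # sequence or parallel:
            # predecessor nodes of the event activity in the current branch;
            # for parallel/cyclic activities this also re-explores the branch
            candidates = predecessor_nodes(event, leaf, RM)
        for node in candidates:
            category = in_heuristic_range(node, event, heur)
            if category:                     # 'avg' or 'otherRanges'
                parents[category].append(node)
    return parents
```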

Fig. 7. Case decision tree

Figure 7 visualizes the CTree produced by Algorithm 5.1 for the inputs in Figs. 3 and 6d. The tuple (id, ts, a, p) attached to each node gives the deduced case ID, the timestamp, the activity name, and the node probability, respectively. As shown in Fig. 7, the unlabeled event (5, B) in S, cf. Fig. 3a, is represented in the CTree by two nodes: case 1 includes one node with probability 0.75 for this event, and case 2 includes one node with probability 0.25. Also note that the event (12, E) is represented by six nodes: in case 1 it has three nodes with different parent nodes, and the same holds for case 2. As shown in the model, cf. Fig. 2, activity E lies within two loops and is considered part of both the end and the start activities of the inner loop, so the relation between E and itself is \(\rightsquigarrow \), cf. Fig. 6d. That is why DCIc considers the event (9, E) as a possible parent of (12, E). The final result of applying DCIc is shown in Fig. 8.

Fig. 8. Generated labeled logs by DCIc

6 Evaluation

We implemented a prototype of DCIc and of the relationship matrix (RM) generation. DCIc is implemented in Python, and RM is implemented in Java, as it builds on an existing behavioral profile implementation. We improved the performance of DCIc over DCI, cf. [4], by applying dynamic programming techniques when building the CTree and by using the producer–consumer threading pattern to build the labeled logs while the CTree is being constructed.

Fig. 9. Evaluation steps

Figure 9 shows the evaluation steps of DCIc with both synthetic and real-life logs. To generate synthetic logs, we use the ProM [15] plug-in “Perform a simple simulation of a (stochastic) Petri net”; the simulated log is then updated to reflect the heuristic data. For real-life logs, we used the ProM plug-in “Mine Petri net with Inductive Miner”, which implements the inductive mining technique [8], to obtain the process model, and we extracted the heuristic information from the real-life log using a tool we built. In either case, we remove the caseId from the labeled log to produce an unlabeled log, and we build the relationship matrix for the process model. Finally, we measure the quality of the generated labeled logs by calculating precision and recall with respect to the original labeled log. Precision measures the percentage of cases in the generated labeled logs that are correct, i.e., also exist in the original labeled log, while recall measures the percentage of the correct cases that have been found, cf. Eq. 1.

$$\begin{aligned} precision=\frac{tp}{tp+fp} ~~,~~ recall=\frac{tp}{tp+fn} \end{aligned}$$
(1)

In the above formulas, tp is the number of cases that exist in the generated labeled log and also in the original log, fp is the number of cases that exist in the generated labeled log but not in the original log, and fn is the number of cases that exist in the original log but not in the generated labeled log.
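The following sketch computes Eq. 1 over sets of cases; the representation of a case as a tuple of (timestamp, activity) events and the example values are assumptions for illustration.

```python
def precision_recall(generated_cases, original_cases):
    """Compute precision and recall over sets of cases (Eq. 1). A case is
    represented here as a tuple of (timestamp, activity) events, so that set
    operations compare whole cases; this representation is an assumption of
    the sketch."""
    generated, original = set(generated_cases), set(original_cases)
    tp = generated & original          # correctly reconstructed cases
    fp = generated - original          # cases not present in the original log
    fn = original - generated          # original cases that were missed
    precision = len(tp) / (len(tp) + len(fp)) if generated else 0.0
    recall = len(tp) / (len(tp) + len(fn)) if original else 0.0
    return precision, recall

# Example with hypothetical cases:
gen = [(("t1", "A"), ("t2", "B")), (("t3", "A"), ("t4", "C"))]
org = [(("t1", "A"), ("t2", "B")), (("t3", "A"), ("t5", "D"))]
print(precision_recall(gen, org))      # (0.5, 0.5)
```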

Table 1. DCIc results from different logs

We evaluated DCIc against different real-life and synthetic logs, as shown in Table 1. We calculated precision and recall of the top-ranked generated labeled log against the given labeled log, cf. Eq. 1. The precision and recall for BPI-2012 are lower than for the other logs because the mined model does not fit the log well. The quality of the generated logs is thus directly affected by the quality of the given model, because the deduction process is based on the model. If there is a deviation between the model and the log, i.e., activities exist in the log that do not exist in the model, the generated labeled logs are considered noisy.

Fig. 10. Execution results

Figure 10 shows different execution times for DCIc. As shown in Fig. 10a, the number of events in the unlabeled log influences the execution time of DCIc, since it affects both the breadth and the depth of the CTree used to deduce the case IDs of the unlabeled events. As shown in Table 1, the execution time for the BPI-2012 log with 262201 events is 11 h, while the execution for the same log restricted to 11388 events is around 30 min, cf. Fig. 10a. The execution time is also affected by the number of loops within the model, which likewise influences the breadth and depth of the CTree, as for the synthetic log, which takes 60 min, cf. Table 1. As shown in Fig. 10b, we deduce the CoSeLoG log using different heuristic ranges, which affects both the execution time and the quality of the labeled logs, because DCIc depends on the heuristic ranges to eliminate some of the labeling possibilities that survive the model-based filtering. When the heuristic ranges are inaccurate or unrealistic, noisy labeled logs are produced; when the ranges are too wide, the filtering is effectively based on the model only and no labeling possibilities are eliminated based on the heuristic ranges.

7 Related Work

There are several process mining techniques that discover process models from event logs or check the conformance between models and logs. Most of these techniques need a labeled event log to proceed [1]. There are also various performance analysis techniques that use labeled event logs to extract process performance indicators [14]. We see our work as an intermediate step between low-quality logs, in the sense of [2], and those mining and analysis approaches.

In [3], the authors address the problem of mapping between process model activities and the activity names of executed events. They introduce a semi-automated approach to map events to activities using declarative constraints that are extracted from the model and the labeled log. Common with our approach is the need for information beyond the log, namely the process model whose execution should have generated the log. In contrast to our work, the approach in [3] addresses the problem of correlating a labeled event to a specific activity within a process instance, whereas we address the earlier challenge of labeling the events in the log.

Handling the event correlation problem in a web service environment is addressed in [5, 11]. In [5], the authors discuss how to discover web service workflows and the difficulty of finding a rich log with specific information: web service execution logs lack workflow case identifiers, which are needed to analyze the workflow execution log. They also discuss the need for extra information in the form of execution time heuristics. In [11], the authors introduce a semi-automated approach to discover a set of process views based on finding correlation conditions, where a process view is a representation of the process model from the perspective of the process instances. Common with our approach is the need for heuristic data to eliminate some correlation options. In contrast, they correlate events using data from different layers, while we only require the process model and heuristic information.

The problem of handling unlabeled logs is also addressed in [6, 16]. In [6], an expectation-maximization approach is introduced to estimate a Markov model from an unlabeled event log. It is a greedy algorithm that finds a single solution, most often a local maximum of the likelihood function. The Markov chain is built iteratively against the unlabeled event log, and the process stops when the Markov chain stops changing. The main limitations of the approach are the lack of support for loops and the fact that parallelism may lead to mislabeling some events in the unlabeled log. In [16], a sequence partitioning approach is introduced to produce the set of partitions that represents the minimum cover of the unlabeled log. The main limitations of this approach are the lack of support for loops and the representation of parallelism, as concurrent parts of the process are put into different partitions as if they were unrelated.

In [12], the authors introduce the Correlation Miner to discover the process model from unlabeled logs, using integer linear programming to set up the constraints for finding a suitable model. In contrast to our approach, they mine the log to obtain the process model, while we use the given process model to deduce the case IDs of the unlabeled events. A limitation of their approach is the lack of support for cyclic behavior.

In [10], the authors use redo logs and the data model to construct labeled logs from a database. Common with our approach is the need for extra information to correlate the events. The differences lie in the type of the required data and in the source of the unlabeled events, as they are extracted from redo logs.

8 Discussion

In this paper, we have presented an approach to label unlabeled event logs by extending the Deduce Case IDs approach [4]; we call the extension DCIc. The extension allows labeling unlabeled event logs produced by cyclic process executions. In addition to the unlabeled event log, we use the process model and heuristic data about activity execution as input in order to generate a set of labeled event logs with ranking scores. A preprocessing step generates the relationship matrix, RM, which describes the relations between the process model activities and handles cyclic behavior.

The quality of the output labeled logs is affected by the quality of the inputs. First, if the unlabeled log deviates from the modeled process execution, this is reflected in the generation of noisy logs. If activities in the log appear in a control-flow order other than the one implied by the model, DCIc might fail to assign a label to an event, and thus the labeled log will miss some events. Likewise, if the input log is missing events, this leads to noisy logs. If the activity heuristics are inaccurate, this affects the overall score of the logs and might result in more noisy logs being generated. In this regard, our approach can also be seen as a conformance checking technique between unlabeled logs and a process model.

We assumed that an event in the unlabeled log represents the completion of an activity and the start of the next activity. Relaxing this assumption would affect the runtime of the deduction algorithm, as the CTree would contain more combinations. In that case, the accuracy of the activity heuristics would also affect the growth rate of the CTree, as the heuristics would be used to correlate start and complete events of the same activity.

Other factors that affect the runtime of the deduction step are the size of the input log, the number of concurrent execution branches and the number of (nested) loops as all these factors contribute to the growth of the depth and the breadth of the CTree.