1 Introduction

With the emerging Internet of Things, more and more assets are connected and continuously monitored. This creates opportunities, as the resulting data (events occurring in the asset and stored in event logs, together with sensor data) contains valuable information that can be explored to gain further insights into asset behaviour and to apply data analytics to predict, e.g., failures, remaining useful life or performance degradation. Exploiting this data is therefore crucial for industry to enhance performance and reduce costs.

However, these opportunities come with challenges. For instance, the heterogeneity of the data prevents the comparison of events logged by assets from different manufacturers, as manufacturers may have implemented different and incompatible formats to report these events. Another challenge is the presence of noise in the sensor data, which can hamper the accuracy of models built on this data.

One crucial challenge that is not yet widely addressed and researched is the voluminous amount of data produced by an industrial asset. The sheer number of sensor readings and reported events makes manual exploration impossible. It also slows down the exploration and creation of models with machine learning/data mining approaches, as most of them have a complexity that is at least linear in the amount of data. This is even more critical for event logs, which cannot be summarized or aggregated. As a consequence, the computational cost of most machine learning/data mining methods has recently started to sky-rocket.

To our knowledge, apart from distributed computing, very little has been done to tackle the voluminous data challenge. For software event logs, the data is usually pruned by removing images, HTTP codes known to be irrelevant and other irrelevant metadata. Beyond that, specific methods to deal with vast amounts of data are not widespread. In [2], Bonchi et al. tackle this challenge in their pattern mining method by creating a specific candidate generation phase that discards irrelevant events. However, such approaches remain rare and most machine learning/data mining methods simply rely on the cheap and extensive computational power now available through cloud computing.

We believe that another approach is possible and that preprocessing the events to discard the irrelevant ones without losing information can play a role in fast and green model building. In [7], we conceived and benchmarked 10 methods (inspired by various domains such as biomedicine or text mining) to preprocess event logs by removing the irrelevant events (i.e. the events linked to the normal behaviour of the asset). We showed that our best method was able to decrease the log size by 70% to 90% without losing information.

In this paper, we explore this method further and present a workflow that leverages it as a preprocessing step for industrial event logs. This workflow automatically detects and discards irrelevant events (those that do not point to failures, under-performance, ...). It can be applied to any industrial event log generated by assets such as cars, assembly lines or windmills.

The workflow is crucial because a reduced dataset means a reduced computation time for the methods exploiting it, e.g. a pattern mining method would be faster as it would run on shorter sequences. Therefore, the cost and the electricity consumption of most machine learning/data mining methods would decrease. This creates a more efficient environment which benefits both industry (through cost reduction) and society (through electricity savings and thus reduced CO2 emissions).

In addition, such preprocessing could also allow embedding machine learning/data mining models in the industrial assets themselves. These assets typically do not have enough computational power to run the existing models. However, by preprocessing the data to reduce its size, the computation time/power required to run the model could become affordable for the assets, allowing decentralized analytics.

We have validated this workflow in the photovoltaic (PV) domain, with the event logs of a fleet of inverters, i.e. the devices that collect and aggregate the current generated by the PV modules. We have benchmarked the computational impact of our preprocessing workflow for various sequential pattern mining methods and configurations.

The paper is organised as follows: First, the relevant literature about our preprocessing workflow and pattern mining is outlined in Sect. 2. Then, in Sect. 3, we explain our preprocessing workflow. Section 4 covers the benchmarking of the computational gain of our preprocessing workflow. We discuss the advantage of our workflow in Sect. 5. Finally, we conclude the paper in Sect. 6.

2 Literature Review

The challenges related to the voluminous character of event log data have received little attention in the literature. Current strategies usually let the process mining methods deal with it themselves, e.g. Bonchi et al. [2] defined ExAnte, a constrained pattern mining method adapted to large data that adds a pruning of the search space during the traditional pruning of frequent itemsets in the APriori method. In [7], we adapted and benchmarked 10 methods from various domains, namely: (1) the process mining domain, (2) the outlier detection domain, (3) the web log cleaning domain, (4) the static index pruning domain from the information retrieval field and (5) the diversity measures from the biological domain.

The most successful method was based on TF-IDF, a method from the static index pruning domain. Static index pruning is a field of information retrieval whose goal is to reduce the size of the index of one or several texts by removing the entries that will probably not be needed by the users, e.g. the index entries of words such as “the” or “me”, in order to reduce the memory footprint of the index. The field goes back to the seminal paper of Carmel et al. [3]. Even though some methods are domain specific, others can be adapted to industrial event logs. Billerbeck et al. [1] used TF-IDF to compute a frequency score for each word by combining the overall frequency of the word in all texts and the number of texts in which this word occurs. Given a corpus of texts, e.g. a set of technical documents, this method provides a ranking of the words for each document; e.g. for document A, the word “inverter” will have a high score if this word is frequent in this document but not in the others, which means that this word is probably discriminative of the document topic. TF-IDF combines two metrics: (1) the term frequency (TF), which measures the frequency of the term, i.e. word, in the document; (2) the inverse document frequency (IDF), which measures the (inverse of the) frequency of the term in the corpus (see formulas below).

$$\begin{aligned} TF_{w_i,d_i} = \frac{\#\text{ occurrences of word } w_i \text{ in document } d_i}{\#\text{ words in document } d_i} \end{aligned}$$
$$\begin{aligned} IDF_{w_i} = \log { \frac{\#\text{ documents in the corpus}}{\#\text{ documents containing word } w_i}} \end{aligned}$$
$$\begin{aligned} TF\text{-}IDF_{w_i,d_i} = TF_{w_i,d_i} \times IDF_{w_i} \end{aligned}$$

The intuition is to decrease the score of terms that occur in most or all of the documents, as they are probably less discriminant (they correspond to words such as “the” or “of”). Billerbeck et al. used TF-IDF to remove from the index the words with low scores, i.e. words that are frequent across the corpus.
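To make the scoring concrete, the following minimal Python sketch (hypothetical corpus and word lists, plain dictionaries rather than any particular library) computes TF-IDF scores as defined above.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute TF-IDF scores for every word of every document in `corpus`,
    a dict mapping document names to lists of words."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for words in corpus.values():
        df.update(set(words))

    scores = {}
    for doc, words in corpus.items():
        tf = Counter(words)
        scores[doc] = {
            w: (count / len(words)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
    return scores

# Tiny hypothetical corpus
corpus = {
    "doc_A": ["the", "inverter", "reports", "the", "grid", "fault"],
    "doc_B": ["the", "panel", "cleaning", "schedule"],
}
print(tf_idf(corpus)["doc_A"]["inverter"])  # high: frequent in A, absent in B
print(tf_idf(corpus)["doc_A"]["the"])       # zero: occurs in every document
```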

Sequential Pattern Mining (SPM) algorithms are particularly useful to get insights into the operational process of industrial assets. By uncovering recurring patterns, i.e. sequences of events, it is possible to gain insights into normal and anomalous behaviour. Next to the strong explorative potential of such patterns, exploitation possibilities can also be envisaged, e.g. predicting failures ahead of time by early detection of event sequences that were known to lead to failures in the past. With such forecasting information available, maintenance teams can be dispatched proactively or plant operation can be adjusted.

Mainly two types of algorithms are used for SPM. On the one hand, there are the Candidate Generation (CG) algorithms that stem from the seminal paper of Agrawal et al. [10] introducing the APriori algorithm. Other CG algorithms include e.g. GSP [10] or SPADE [10]. On the other hand, there are the Frequent Pattern Growth (FP-Growth) algorithms, such as PrefixSpan [10] or BIDE [10]. Only a few algorithms are unique in their design, such as CloSpan [10] or DISC [10]. All these SPM algorithms have been applied successfully in many cases, such as critical event prediction [5] or the prediction of the next prescribed medication [12].

Many variants of the standard SPM algorithms exist. Closed SPM, an FP-Growth algorithm, focuses on the discovery of patterns with a very low support threshold, i.e. a low minimum number of occurrences required for a pattern to be considered, or in very long sequences [4]. Time-interval SPM focuses on the time span between events, and Chen et al. [4] adapted two algorithms based on APriori (CG) and PrefixSpan (FP-Growth) without impacting their accuracy and computation times. Constraint-based SPM puts constraints on retrievable patterns, e.g. the pattern must contain a specific event, such as the FP-Growth algorithm of Pei et al. [4] that restricts patterns by means of regular expressions. Multi-dimensional SPM, as introduced by Pinto et al. [4], takes into account additional information, so-called dimensions, relative to the process that created the items, e.g. the size of a PV plant and its location.

One interesting flavour of SPM, which finds patterns in items for which hierarchical knowledge is available, is Multi-level Sequential Pattern Mining (MLSPM). These algorithms can look for patterns at different levels of conceptual detail by using a taxonomy of the concepts behind the items in a sequence. For instance, MLSPM can retrieve patterns containing Fruit and/or its lower-level instantiations such as Apple or Peach.

Chen et al. [6] defined a specific algorithm. They used a numerical re-encoding of the taxonomy, defining, for example, a 3-level taxonomy with \(1^{**}\) as root, \(11^{*}, 12^{*}, 13^{*}\) as second level and 111, 112, 123 as children of \(11^{*}\). By renaming the sequences using this encoding, it is easy to check whether a pattern matches a sequence; e.g. (111, 112, 123), 258, 235 matches \(1^{**}, 2^{**}\), which can be verified by only looking at the first digit of 111 and 258. This re-encoding allows the support of a pattern to be computed without referring to the taxonomy anymore. An adapted APriori-based algorithm is then used to find multi-level patterns. One drawback of this method, mentioned by Lianglei et al. [9], is its inability to deal with large taxonomies. They took the example of the code 111, which can denote the first node of the third level, the eleventh node of the second level, or the first node of the second level that is a child of the eleventh node of the first level. Therefore, they proposed the Prime encoding based Multi-level Sequential patterns Mining (PMSM) algorithm [9], where prime numbers are used for this numerical re-encoding.

PMSM relies on the APriori approach of candidate generation. It starts by finding all concepts (events) above the support threshold, which can be seen as frequent patterns of size 1. It then iterates, generating patterns of size n by combining the frequent patterns of size \(n-1\) among themselves (see example below), and keeping only the frequent patterns of size n. The algorithm stops when no new patterns can be found.

The candidate generation follows these rules: (1) the combination of the frequent events \(a_1\) and \(a_2\) yields \(a_1a_2\) and \(a_2a_1\); (2) the combination of the patterns \(a_1a_2a_3\) and \(a_2a_3a_4\) is the pattern \(a_1a_2a_3a_4\). Only overlapping patterns can be combined to obtain patterns of size \(n+1\); a minimal sketch is given below.
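The following Python sketch illustrates the two generation rules under the assumption that patterns are represented as tuples of event codes; it is only the candidate generation step, not the full PMSM algorithm.

```python
def generate_candidates(frequent, n):
    """Combine frequent patterns of size n-1 into candidate patterns of size n.
    `frequent` is a set of tuples of length n-1."""
    candidates = set()
    if n == 2:
        # Rule (1): two frequent events a1, a2 give both a1a2 and a2a1.
        for a in frequent:
            for b in frequent:
                if a != b:
                    candidates.add(a + b)
    else:
        # Rule (2): patterns overlapping on n-2 events are merged,
        # e.g. a1a2a3 and a2a3a4 give a1a2a3a4.
        for p in frequent:
            for q in frequent:
                if p[1:] == q[:-1]:
                    candidates.add(p + q[-1:])
    return candidates

# Rule (1): frequent events 'a' and 'b' yield the size-2 candidates ab and ba.
print(generate_candidates({("a",), ("b",)}, 2))
# Rule (2): abc and bcd overlap on bc and yield abcd.
print(generate_candidates({("a", "b", "c"), ("b", "c", "d")}, 4))
```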

As its name suggests, PMSM uses a prime-number-based encoding of the taxonomy. Each node is assigned the product of the number of its parent and a prime number not used at any upper level or by any left node, i.e. any node at the same level already treated by the algorithm. For instance, the root node is 1 and its two children are thus respectively 2 (\(1\times 2\)) and 3 (\(1\times 3\)). If node 2 has two children, they would be encoded as 10 (\(2\times 5\)) and 14 (\(2\times 7\)). Due to the properties of prime numbers, this encoding makes it easy to check ancestors: a is an ancestor of b \(\Leftrightarrow \) \(b \bmod a = 0\). By renaming the sequences using these prime numbers, it is easy to verify whether a node or pattern is the ancestor of another node or pattern without referring to the taxonomy anymore, which simplifies the computation of the support of a pattern, as it corresponds to the sum of its own number of occurrences and the number of occurrences of its children, i.e. its more specific instantiations.
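The sketch below illustrates the prime-number encoding and the ancestor test on a small, hypothetical taxonomy; it shows the encoding principle only, not the full PMSM algorithm.

```python
def encode_taxonomy(root, children):
    """Assign each node the product of its parent's code and a fresh prime.
    `children` maps a node name to the list of its children."""
    def primes():
        candidate = 2
        while True:
            if all(candidate % p for p in range(2, int(candidate ** 0.5) + 1)):
                yield candidate
            candidate += 1

    prime_gen = primes()
    codes = {root: 1}
    queue = [root]
    while queue:
        node = queue.pop(0)
        for child in children.get(node, []):
            codes[child] = codes[node] * next(prime_gen)
            queue.append(child)
    return codes

def is_ancestor(a, b):
    """a is an ancestor of b (or equal to b) iff b is divisible by a."""
    return b % a == 0

# Hypothetical taxonomy of inverter statuses
taxonomy = {"status": ["producing", "non-producing"], "producing": ["start"]}
codes = encode_taxonomy("status", taxonomy)
print(codes)  # {'status': 1, 'producing': 2, 'non-producing': 3, 'start': 10}
print(is_ancestor(codes["producing"], codes["start"]))      # True
print(is_ancestor(codes["non-producing"], codes["start"]))  # False
```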

3 Preprocessing Workflow

The proposed workflow discards irrelevant events in any industrial event log. However, the definition of an irrelevant event may vary depending on the goal to achieve. To analyse and understand the irregular behaviour of industrial assets, all the events associated with regular operation, like the start and stop sequences, should be discarded and only the events linked to failures or under-performance should be kept. Conversely, to explore the regular behaviour of the asset, only the regular events, i.e. the ones triggered by normal operation, are relevant, and all the warning/failure events are irrelevant and should be discarded. For the rest of this paper, we consider that the irregular behaviour needs to be explored, as it is the most frequent case; the irrelevant events are therefore the regular ones. Note that when the regular behaviour needs to be explored, the workflow is similar: once the irregular events have been labelled, the regular events are simply the remaining events.

The main challenge is that an event can be part of the regular behaviour at some point and part of the irregular behaviour later. This implies that the context of the event in question, i.e. the surrounding events, is crucial. It is therefore impossible to simply define a set of events associated with regular periods and remove them. To address this challenge, the relations between the events during the operation of an industrial asset should be considered and the relevancy of an event should be computed within its particular operational context. The workflow follows 4 steps:

  1. The log files are divided into atomic event logs (AEL), i.e. into traces (e.g. the drive of a car), to aggregate all events that could interact together.

  2. The statuses, i.e. the specific events indicating the current state of the asset, are removed as a first pruning step.

  3. The relevancy score of each event is computed.

  4. The events with a relevancy score below a certain threshold are removed.

3.1 Defining AEL

The first step is to divide the event logs into atomic pieces, i.e. into “traces or meaningful periods” of the asset, called atomic event logs (AEL). For instance, in the case of a car, the event logs could be divided by traces, from the start of the travel to its end. However, if only the start sequence of the car needs to be analysed, the AEL would be the start sequence, e.g. the first 2 min after the start, and the rest of the event logs could be discarded. The definition of these atomic event logs is therefore domain and goal oriented.

The ultimate goal is to have AELs containing all the relevant event correlations. For example, the interpretation of the event “temperature error” changes if it is preceded by the event “temperature sensor broken”.
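As an illustration, a minimal pandas sketch that splits a raw event log into daily AELs per asset; the column names (`asset_id`, `timestamp`, `event_type`) are assumptions, not a prescribed schema.

```python
import pandas as pd

def split_into_aels(log: pd.DataFrame) -> dict:
    """Split a raw event log into atomic event logs (AELs), here one AEL
    per asset and per calendar day. Returns {(asset_id, day): DataFrame}."""
    log = log.copy()
    log["day"] = pd.to_datetime(log["timestamp"]).dt.date
    return {key: group.sort_values("timestamp")
            for key, group in log.groupby(["asset_id", "day"])}

# Example with a tiny hypothetical log
log = pd.DataFrame({
    "asset_id": ["inv1", "inv1", "inv2"],
    "timestamp": ["2020-05-01 08:00", "2020-05-02 09:30", "2020-05-01 10:15"],
    "event_type": ["wait sun", "temperature error", "grid fault"],
})
print(list(split_into_aels(log).keys()))
# [('inv1', 2020-05-01), ('inv1', 2020-05-02), ('inv2', 2020-05-01)]
```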

3.2 Removing Statuses

Some events are status events, i.e. indications of the current state of the system such as “start” or “running”. As we focus on labelling regular events as irrelevant, and these status events describe regular behaviour, they can be discarded. Note that if the purpose is to remove irregular events and analyse the regular ones, the statuses need to be kept. As all occurrences of these specific events are removed, this step does not impact the computation of the relevancy of the remaining events in the next steps.
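In pandas this pruning step reduces to a single filter; the status codes listed below are hypothetical and would normally come from the asset documentation.

```python
import pandas as pd

# Hypothetical set of status events for the asset at hand.
STATUS_EVENTS = {"start", "running", "stop"}

def remove_statuses(ael: pd.DataFrame) -> pd.DataFrame:
    """Drop all status events from an atomic event log."""
    return ael[~ael["event_type"].isin(STATUS_EVENTS)]
```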

3.3 Computing Relevancy Score

We use a method inspired by TF-IDF: for each event type in each AEL, a relevancy score is computed. The goal is to attribute a score reflecting the degree of “abnormality” of the event, i.e. whether the event points to regular asset behaviour. For example, the event “temperature error” that occurred 2 times in the atomic log should have a high relevancy score, as it is a critical event indicating a failure, while the event “wait sun” that occurred 17 times should have a relevancy score of 0, as it represents the usual behaviour of the device. The event frequencies therefore need to be carefully exploited.

By considering the AELs as texts, text mining methods such as TF-IDF can be adapted for this purpose. Our methodology therefore relies on the computation of two frequencies: (1) the frequency of the event (type) in the AEL, and (2) the frequency of the event (type) in a well-selected corpus of AELs aligned with the analysis goal in mind.

First, the event frequency (EF) is computed, i.e. for each event type that can be reported by the asset, its frequency in the AEL is computed using the formula below.

$$\begin{aligned} EF_{e_i,a_i,l_i} = \frac{\#\text{ occurrences of event } e_i \text{ in the logs of asset } a_i \text{ for AEL } l_i}{\#\text{ events in AEL } l_i \text{ for asset } a_i} \end{aligned}$$
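A minimal sketch of the EF computation for one AEL, with the AEL assumed to be given as a plain list of event types:

```python
from collections import Counter

def event_frequency(ael_events):
    """Event frequency of each event type in one AEL,
    i.e. #occurrences of the event type / #events in the AEL."""
    counts = Counter(ael_events)
    total = len(ael_events)
    return {event: n / total for event, n in counts.items()}

print(event_frequency(["wait sun", "wait sun", "temperature error"]))
# {'wait sun': 0.666..., 'temperature error': 0.333...}
```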

The corpus frequency (CF), inspired by the inverse document frequency, needs to be adapted to the industrial event log context, as the text corpus on which IDF relies does not exist here. Therefore, the corpus definition needs to be adapted. Three approaches are possible and the choice needs to be made carefully:

  • The corpus consists of all available AELs. This allows asset behaviour to be compared over time and across assets.

    $$\begin{aligned} CF_{e_i} = \log { \frac{\#\text{ of AELs for all assets and all days}}{\#\text{ of AELs where event } e_i \text{ occurred}} } \end{aligned}$$
  • The corpus consists of all the AELs of one asset. This allows focusing on the behaviour of one asset and monitoring the evolution of its performance over time.

    $$\begin{aligned} CF_{e_i,a_i} = \log { \frac{\#\text{ of AELs for asset } a_i }{\#\text{ of AELs for asset } a_i \text{ with event } e_i} } \end{aligned}$$
  • The corpus is composed of the AELs of all assets for the same operational cycle (e.g. the same day). This allows an objective comparison of performance across assets. However, as events occurring in all AELs of the corpus are considered less relevant, a failure occurring in all assets simultaneously would be masked in this case.

    $$\begin{aligned} CF_{e_i,p_i} = \log { \frac{\#\text{ of AELs for period } p_i }{\#\text{ of AELs for period } p_i \text{ with event } e_i} } \end{aligned}$$

Subsequently, the relevancy score is computed by multiplying EF and CF using the formula below:

$$\begin{aligned} \text{Relevancy score} = EF_{e_i,a_i,l_i} \times CF \end{aligned}$$

In this way, the relevancy score uses the frequency of the event (more frequent events have higher scores) corrected by the CF, which decreases the score of events that are frequent in the corpus (if an event occurs in all AELs of the corpus, its CF is \(\log (1)=0\), which leads to a relevancy score of zero).

The relevancy scores are then normalized using min-max normalization so that the scores share the same range, which eases their comparison.
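Putting the pieces together, a sketch of the relevancy score computation using the third corpus definition (all assets, same day) followed by min-max normalization; `aels` is assumed to map (asset, day) pairs to lists of event types, as in the earlier sketches.

```python
import math
from collections import Counter

def relevancy_scores(aels):
    """Min-max-normalized relevancy scores (EF * CF) for each AEL.
    `aels` maps (asset_id, day) to the list of event types in that AEL;
    the corpus of an AEL is the set of AELs of the same day (third CF option)."""
    # Group the AELs by day to build the per-day corpora.
    corpora = {}
    for (asset, day), events in aels.items():
        corpora.setdefault(day, []).append(set(events))

    scores = {}
    for (asset, day), events in aels.items():
        corpus = corpora[day]
        counts, total = Counter(events), len(events)
        raw = {}
        for event, n in counts.items():
            ef = n / total
            cf = math.log(len(corpus) / sum(event in ael for ael in corpus))
            raw[event] = ef * cf
        # Min-max normalization so that all scores share the same range.
        lo, hi = min(raw.values()), max(raw.values())
        scores[(asset, day)] = {e: (s - lo) / (hi - lo) if hi > lo else 0.0
                                for e, s in raw.items()}
    return scores
```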

3.4 Removing Irrelevant Events

It is not straightforward to determine the threshold between relevant and irrelevant events since it is domain dependent and should be decided by domain experts. However, some simple guidelines can be drawn. A conservative model simply labels as irrelevant any event with a null score, i.e. the events that occur in all sequences. This approach is straightforward, does not require the evaluation of domain experts and is not prone to wrongly labelling relevant events as irrelevant (that would imply that a certain failure occurs in every trace). However, this approach has a very broad interpretation of relevancy and might label events with very little significance as relevant.

Another, more flexible, approach is to put the threshold at the first quartile, i.e. to consider as irrelevant the 25% of the event types with the lowest scores, which leaves 75% of the event types labelled as relevant. According to our experiments, increasing the percentage of event types considered as irrelevant may lead to losing a substantial portion of relevant events, which get discarded as irrelevant. Therefore, more aggressive approaches should be validated by domain experts.
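A sketch of the two thresholding options (null score or first quartile); the resulting list should of course still be validated by a domain expert.

```python
import numpy as np

def irrelevant_events(scores, strategy="null"):
    """Return the set of event types labelled irrelevant for one AEL.
    `scores` maps event types to normalized relevancy scores."""
    if strategy == "null":
        # Conservative: only events with a null score are discarded.
        return {e for e, s in scores.items() if s == 0.0}
    if strategy == "quartile":
        # More flexible: the 25% of event types with the lowest scores.
        threshold = np.quantile(list(scores.values()), 0.25)
        return {e for e, s in scores.items() if s <= threshold}
    raise ValueError(f"unknown strategy: {strategy}")
```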

An important aspect is that the border between relevant and irrelevant events is not fixed. Domain experts can also decide to label relevant but unimportant events as irrelevant; e.g. if they know that a warning/event is harmless although not part of the regular behaviour of the asset, it could also be labelled as irrelevant. However, this decision is not straightforward as it is a trade-off between removing more event types to decrease the computation time and potentially missing event correlations.

Finally, while the definition of relevant and irrelevant events based on their scores is expensive, as it requires the guidance of a domain expert, it is also a step that does not need to be repeated in the workflow. Two cases exist:

  • The list of irrelevant events varies from one trace to another, i.e. an event irrelevant in one trace can be relevant in another. The labelling must then rely on an automatic method that discards the events with a null relevancy score or a score below a certain threshold. This threshold needs to be defined by a domain expert after the analysis of a statistically significant number of asset traces, which can be time consuming.

  • The list of irrelevant events is similar from one trace to another. The mean relevancy score can then be computed for the whole dataset or a statistically significant subset. The domain expert simply has to analyse this ranking and define the set of events that are irrelevant, which makes this approach less time consuming.

However, in both cases, once the thresholds or the list of irrelevant events have been defined and validated by domain experts, they can be applied to any event logs generated by the same asset type with minimal computation time.

4 Validation

To benchmark the impact of our preprocessing workflow on the computation time, we have validated it on real industrial data from photovoltaic (PV) plants. These plants generate a vast amount of events that will be explored through pattern mining. We benchmark the computation time of the pattern mining algorithms for various support thresholds and dataset sizes in order to quantify the advantage of our approach.

4.1 Experimental Setup

The data is provided by our industrial partner 3E, which is active, through its Software-as-a-Service SynaptiQ, in the PV plant monitoring domain. PV plants are built around inverters, the devices that collect the electricity produced by the PV panels and convert it before sending it to the grid. Inverters are therefore continuously monitored and report events occurring in them but also in their surroundings (the PV panels attached to them, the strings between the PV panels and the inverter, ...). The events are thus stored at the inverter level.

We decided to use the sequential pattern mining (SPM) algorithm defined by Wang et al. [11], and to use the implementation of pymining, available on GitHub for reproducibility purposes. The event logs are split per day and per inverter, i.e. each sequence contains all the events occurring in one specific inverter during one specific day. They have been divided by inverter because the events reflect the specific behaviour of only one inverter, and per day because a PV plant is inactive at night. Each day can therefore be seen as a trace of the inverter; it is then customary to speak of an inverter-day to refer to such a trace.
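A minimal usage sketch, assuming pymining's `seqmining.freq_seq_enum(sequences, min_support)` interface (sequences of hashable items, absolute minimum support count); the event names are hypothetical.

```python
from pymining import seqmining

# Each sequence is one inverter-day, i.e. all events of one inverter on one day.
inverter_days = [
    ("wait sun", "sensor test", "start", "grid fault"),
    ("wait sun", "start", "grid fault"),
    ("wait sun", "freeze", "start"),
]

# Frequent sequential patterns occurring in at least 2 of the 3 inverter-days.
frequent = seqmining.freq_seq_enum(inverter_days, 2)
for pattern, support in sorted(frequent):
    print(pattern, support)
```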

We also experimented with multi-level sequential pattern mining. This method looks for patterns at multiple conceptual levels, i.e. if a sequence from a PV event log contains “under-temperature”, “over-temperature” and “grid fault”, the method will consider “under-temperature” and “over-temperature” as items but will also consider both of them as a “temperature error”, which may allow more generic patterns to be found. This kind of pattern mining is typically more time consuming and could therefore benefit more from our workflow. We used our own implementation of the PMSM algorithm of Lianglei et al. [9].

We used one year of data from one plant located in Belgium. This plant has 26 inverters which produced, on average, 31 events per day, with a standard deviation of 58 events. The minimal number of events per day is 5 and the maximum is 603 (a higher number of events is not necessarily linked to a failure). The sequence length therefore varies from 5 events to several hundred events. In total, the dataset contains 9490 inverter-days.

4.2 Benchmarking

Two aspects of the problem need to be analysed: (1) the computation time of our approach and (2) the computation time gained by our approach. If the computation time of our approach exceeds the computational gains, its benefits become dubious. A more thorough analysis of the accuracy of our workflow in discarding only the irrelevant events can be found in our paper [7]. An exploration of the interest of MLSPM for multi-level industrial event logs can be found in our paper [8]. These two topics are therefore not covered by this paper.

However, to avoid any loss of quality in the retrieved patterns, we used a conservative approach in the determination of the irrelevant events. We only removed the statuses and the three types of events with the lowest relevancy scores. These three event types were validated as irrelevant by domain experts after a thorough exploration, and they guaranteed that these events would not be involved in relevant patterns. This implies that the computational gain of our workflow can be further enhanced by adopting a less conservative approach.

Computation Time of Our Workflow. Our workflow consists of three steps, each with a different computation time. The first step is to remove the statuses. This step is straightforward, as it only requires removing some IDs from the event logs. The computation time can therefore be considered negligible; e.g. in Python, a library such as Pandas is built to perform such an operation in a minimal amount of time. In our dataset, this step was instantaneous and therefore does not impact the computation time of our workflow.

The second step is to compute the relevancy scores. It requires: (1) splitting the event logs by day, (2) computing the event frequency in each inverter-day, (3) computing the corpus frequency and (4) multiplying the event frequencies and the corpus frequencies to obtain the event relevancy scores. The last operation (4) is, again, negligible, as most data handling libraries provide fast methods for this purpose. The bottleneck therefore corresponds to operations 1–2 (as they are interleaved) and 3.

The third step consists of defining which events are irrelevant and removing them. The computation time required to remove these events can also be considered negligible, as built-in functions in Python and dedicated libraries allow this action to be performed almost instantaneously. Defining the irrelevant events, however, may be time consuming, depending on the method selected. Fortunately, this step needs to be done only once for each asset type.

Relevancy Score Computation. The part of our workflow with the most expensive computation is therefore the computation of the event and corpus frequencies. The computation time is obviously domain dependent and varies with the average number of events in a trace of the asset, i.e. a sequence. Figure 1 contains an analysis of the computation of both frequencies on our dataset. Each event frequency has been computed on an inverter-day and each corpus frequency has been computed on all the inverter-days occurring during the same day, i.e. on 26 inverter-days for our dataset. The mean, quartiles and min/max are provided in the form of a boxplot. It appears that the computation of the corpus frequency is almost negligible. The main part of the computation time is dedicated to the computation of the event frequencies, with on average 0.005 s per inverter-day. For our annual dataset of 9490 inverter-days, this corresponds to a computation time of 47.45 s.

Fig. 1. Computational time of the two steps of the relevancy score computation, namely the event frequency and the corpus frequency computations

The computation time for a dataset of around 340,000 events divided into around 10,000 sequences is therefore slightly less than one minute. However, the computation time could be significantly reduced by computing the event frequencies in parallel: this step is independent per inverter-day, and all event frequencies could be computed simultaneously. Therefore, for bigger datasets, a parallelisation using tools such as Hadoop could significantly reduce the computation time (but, in turn, increase the energy consumption).
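As a sketch, the per-inverter-day event frequencies could already be parallelised on a single machine with the Python standard library before resorting to heavier tooling such as Hadoop; the sketch assumes the AELs fit in memory and are given as lists of event types.

```python
from collections import Counter
from multiprocessing import Pool

def event_frequency(events):
    """Event frequency for one inverter-day (one AEL)."""
    counts, total = Counter(events), len(events)
    return {event: n / total for event, n in counts.items()}

def parallel_event_frequencies(aels, processes=4):
    """Compute the EF of every AEL in parallel; each AEL is independent."""
    with Pool(processes) as pool:
        return pool.map(event_frequency, aels)

if __name__ == "__main__":
    aels = [["wait sun", "start", "wait sun"], ["grid fault", "start"]]
    print(parallel_event_frequencies(aels))
```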

Irrelevant Event Labelling. The definition of the set of irrelevant events can be time consuming as it implies a manual check by a domain expert. For the PV domain, the set of irrelevant events with low scores (below 0.2) is similar from one trace to another. Events above that threshold have a relevancy that varies from one trace to another.

We used a conservative approach where we only removed the 3 types of events with the lowest scores (below 0.2) after a careful analysis by a domain expert. We therefore did not have to define a relevancy threshold, which would have been more time consuming. Moreover, the definition of the irrelevant events/threshold only needs to be done once per asset type. Once it has been done, the filtering can be applied automatically at minimal computational cost.

Computational Gain. We explored the computation time of SPM and PMSM at various support thresholds (the support of a pattern indicates its frequency in the dataset, e.g. SPM with a support threshold of 50 will return all patterns occurring in at least 50% of the sequences) and for various dataset sizes. The experiment was conducted on a MacBook Pro.
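The benchmark itself boils down to timing the same mining call on the full and the cleaned datasets for a range of support thresholds; a minimal sketch, where the `mine` callable stands for whichever SPM/PMSM implementation is benchmarked.

```python
import time

def benchmark(mine, full_dataset, cleaned_dataset, support_thresholds):
    """Relative computation time of `mine` on the cleaned vs. the full dataset.
    `mine(sequences, support)` is the SPM/PMSM implementation under test."""
    results = {}
    for support in support_thresholds:
        t0 = time.perf_counter()
        mine(full_dataset, support)
        t_full = time.perf_counter() - t0

        t0 = time.perf_counter()
        mine(cleaned_dataset, support)
        t_clean = time.perf_counter() - t0

        # Computational gain expressed as in Fig. 2: cleaned time as % of full time.
        results[support] = 100 * t_clean / t_full
    return results
```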

Figure 2 displays the computational gain of SPM applied to respectively 3, 12 and 14 months of event log data and of PMSM applied to 1 month of data, for support thresholds ranging from 10 to 100 (due to the high computation time of PMSM, we only ran it with support thresholds of 30, 50 and 80). The computational gains are expressed in percentage, i.e. the baseline at 100% corresponds to the computation time of SPM on the full dataset; the computation time of each method is then expressed relative to this baseline. For instance, SPM applied to 12 months of data with a support threshold of 20 has a computation time representing 1.37% of the computation time on the full dataset (0.1 s instead of 23 s).

It appears that the computation on the cleaned dataset usually corresponds to less than 5% of the computation on the full dataset for most support thresholds. The computational gain is lower for a support threshold of 100, as the algorithm is more selective and considers fewer patterns.

To make this more concrete, Fig. 3 shows the computation time of SPM on 24 months of data. The computation time on the cleaned dataset stays stable for all support thresholds, while the computation time on the full dataset increases significantly for support thresholds below 40. Hence, the difference in computation time between the two datasets is only significant for low support thresholds. For instance, SPM needs 26 s to retrieve the patterns on the full dataset while it only needs 0.16 s on the cleaned dataset. Low support thresholds are, however, particularly relevant in industrial contexts as they allow the detection of rare events/patterns.

The absolute difference in computation time is therefore less impressive for SPM if we include the computation time needed to clean the dataset. However, for more sophisticated SPM approaches, our workflow can provide a significant gain. Figure 4 shows that the computational gains for PMSM are substantial: for a support threshold of 30, the computation time decreases from 37 h to 6 s; for a support threshold of 80, it drops from 23 h to 5 s.

Fig. 2. Computational gain of our workflow for various SPM computations

Fig. 3. Computation time of SPM on the full and cleaned datasets of 24 months of data

The number of patterns returned also decreases significantly. PMSM with a support threshold of 50 returns 64,770 patterns on the full dataset and 122 on the cleaned dataset. The 64,648 noisy patterns (64,770 − 122) retrieved only in the full dataset are symptomatic of one of the drawbacks of applying PMSM to industrial datasets. For instance, in the PV domain, the start and stop sequences of an inverter are not “stable”: the events/statuses occurring during these sequences do not occur in the same order and do not always occur. E.g. a start sequence could be “wait sun, freeze, sensor test, start” or “wait sun, sensor test, start” or “wait sun, freeze, start”. All these variations artificially increase the number of patterns retrieved by any pattern mining method.

This behaviour is worsened by MLSPM, as this method also considers the upper conceptual classes. For instance, the event “start” has the upper class “producing status” and “wait sun” has the upper class “non-producing status”. Therefore, a pattern such as “wait sun, freeze, start” would also be returned in the forms “non-producing status, freeze, start”, “wait sun, freeze, producing status”, “non-producing status, freeze, producing status”, ... The number of patterns grows exponentially with the number of events and the complexity of the conceptual hierarchy (the more upper classes exist, the more high-level patterns are returned). Our workflow not only decreases the computation time but also removes the noisy irrelevant patterns, which greatly simplifies the analysis of the patterns.

Fig. 4. Computation time of MLSPM on the full and cleaned datasets of 1 month of data

5 Discussion

Pattern mining is a very powerful method for deriving valuable insights from the event log data generated during the operation of industrial assets. However, pattern mining can be very costly in terms of computation time, which increases with the number of events in a sequence. Our workflow has shown its effectiveness in reducing this computation time: on average, it reduced the computation time of an SPM algorithm by up to 95%. The computational gain is higher for low support thresholds and for more expensive pattern mining methods such as PMSM, where the computation time was reduced e.g. from 23 h to 5 s. Evaluating the impact on energy efficiency is not straightforward as it depends on the hardware and software used. However, a drop of a similar order of magnitude can be expected, as every second gained by reducing the computation time is a second during which no energy is consumed for computation.

For complex pattern mining methods such as MLSPM, the computational gain was significant, as it allowed relevant patterns to be obtained in a few seconds for the three support thresholds tested, while it took more than one day on the full dataset. We expect that more time consuming machine learning/data mining methods, such as deep neural networks or process mining, could also benefit from our preprocessing workflow on a similar scale.

The average computation time of our workflow was around 1 min for a year of data (not including the validation of the workflow by a domain expert, which is only required once to adapt the workflow to a new asset type). A parallelisation of the workflow could further reduce this computation time.

Another advantage of our workflow is its ability to remove noise from the results of machine learning/data mining methods. By removing the regular events that do not have any impact on failures or under-performance, it also removes the irrelevant patterns containing them. This significantly decreases the number of irrelevant patterns returned by SPM. The biggest impact was observed for the more advanced MLSPM, with a decrease of 99% in the number of returned patterns (from 64,770 patterns on the full dataset to 122 patterns on the cleaned dataset). This implies that the time needed to analyse/post-process the patterns to retrieve the relevant ones is also decreased. Our workflow is therefore a valuable preprocessing step for computationally expensive data mining/machine learning methods, not only reducing their computation time but also reducing the time needed to interpret and post-process the results.

From an industrial point of view, this preprocessing workflow is particularly interesting since a reduced computation time (to obtain the patterns and post-process them) implies a reduced electricity consumption to run/train the models, which, in turn, leads to improved cost efficiency. In addition, our workflow is particularly effective for low support thresholds, which are typically the thresholds the industry is looking for, i.e. to find the patterns leading to rare occurrences of critical events.

In addition, as our preprocessing workflow does not require high computational power, it can be embedded in smart devices. For example, it could be directly embedded in the inverters, which would then only report the relevant events. On average, the number of events reported by an inverter would decrease by 84% (from 670 events to 108 events per month). This would not only ease the monitoring of the asset but also reduce the amount of information sent from the asset to the central processing location and hence, again, reduce the energy required to transfer and store these irrelevant events.

6 Conclusion

In this paper, we expanded the work realized in [7] by defining a preprocessing workflow that uses the data pruning methods explored there to remove the irrelevant events before applying data mining/machine learning methods to industrial event logs. We validated this workflow as a preprocessing step before applying SPM to real-world PV event logs. We have shown that the computation time is reduced on average by 95%, with larger computational gains for more expensive data mining/machine learning methods, where it can e.g. decrease the computation time of MLSPM by days. In addition, the workflow reduces the noise created by these irrelevant events in the results/models and eases their interpretation. Generalizing this workflow for industrial event log exploration would therefore significantly increase the energy efficiency of the models and decrease the energy and time costs.