Abstract
An essential characteristic of data streams is the possibility of concept drift, i.e., a change in the distribution of the data in the stream over time. The capability to detect and adapt to change is thus a necessity for data stream mining methods. While methods for multi-target prediction on data streams have recently appeared, they have largely remained without such capabilities. In this paper, we propose novel methods for change detection and adaptation in the context of incremental online learning of decision trees for multi-target regression. One of the approaches we propose is ensemble-based, while the other uses the Page–Hinckley test. We perform an extensive evaluation of the proposed methods on real-world and artificial data streams and show their effectiveness. We also demonstrate their utility on a case study from spacecraft operations, where cosmic events can cause change and demand appropriate and timely repositioning of the spacecraft.
1 Introduction
Machine learning (ML) tasks can be categorized as supervised, semi-supervised, or unsupervised (Langley, 1996). The goal of supervised and semi-supervised algorithms is to learn, from a set of data examples, models that predict the values of one or more target attributes from the values of descriptive attributes. Tasks with a single target attribute are named single-target prediction tasks, while tasks with multiple target attributes are named multi-target prediction (MTP) tasks (Kocev et al., 2013). When predicting multiple discrete-valued attributes, we talk about multi-target classification (MTC) (Last et al., 2010); when predicting multiple continuous-valued attributes, about multi-target regression (MTR) (Struyf & Džeroski, 2005). In the special case of MTC where the targets are binary, we talk about multi-label classification (MLC) (Madjarov et al., 2012).
As devices that generate huge amounts of data are omnipresent, ML faces increasing data complexity – not only in the number of target or descriptive columns, but also in the number of rows and the velocity at which they become available. In the extreme case, an infinite number of rows may be continuously arriving, and storing them for future knowledge extraction is obviously impossible. In this case, we are talking about data streams (Bifet et al., 2018), commonly referred to as online data. In online learning, i.e., mining data streams (Gama, 2010), multi-target prediction has recently been addressed (Shi et al., 2014; Osojnik et al., 2017; Ikonomovska et al., 2011), in particular for the task of MTR. Here, the AMRules (Almeida et al., 2013; Sousa & Gama, 2016) and iSOUP-Tree (Osojnik et al., 2018, 2020) approaches are of note.
Examples in data streams arrive sequentially and are temporally ordered. The temporal dimension implies that the data distribution may change over time. In ML, such change is referred to as concept drift, or simply change (Gama, 2010). The task of change detection and adaptation has also been addressed in the data streams literature (Gama et al., 2014). Two major approaches deserve a mention here: the ensemble-based ADWIN approach (Bifet et al., 2009) and the FIMT-DD approach (Ikonomovska et al., 2011), based on the Page–Hinckley test (Mouss et al., 2004). However, change detection and adaptation for the task of multi-target regression on data streams have only been considered in AMRules (Almeida et al., 2013; Sousa & Gama, 2016): They have so far not been considered in tree or tree ensemble approaches to MTR on data streams.
Contributions This paper is concerned with supervised online algorithms for MTR able to detect concept drift and adapt the learned model accordingly. It considers tree-based and tree-ensemble-based approaches to MTR. The main contributions of the paper to the research field are summarized as follows:
-
1.
Two novel method families for change detection and adaptation in the context of supervised MTR on data streams – iSOUP–ADWIN, based on the ADWIN approach wrapped around iSOUP–Tree, and Adaptive–iSOUP, a direct extension of iSOUP–Tree with change detection and adaptation mechanisms based on the Page–Hinckley test.
-
2.
Extensive empirical evaluation of the developed approaches, including a comparison with competing methods, where we consider both predictive performance and performance at detecting change, by executing experiments on real-world and artificial data streams with concept drift. The Adaptive-iSOUP family performs best on real-world streams, while iSOUP-ADWIN performs best overall on artificial data streams, both without concept drift and with concept drift of different types.
-
3.
Application of the novel methods and a comparison to their competitors in a real-world practical use case, i.e., predicting the thermal power consumption of the Mars Express (MEX) spacecraft, showing that Adaptive–iSOUP performs the best.
2 Background and related work
2.1 Multi-target prediction on data streams
The task of predictive modeling is to learn a model that predicts the values of a dependent variable y from vectors of values of independent variables X, thus approximating a function \(y=f(X)\), from a set of training examples (pairs of the form \((X_i, y_i )\)). In the batch case, all examples are presented together in the form of a dataset, while in the online case they are presented sequentially in the form of a data stream. When y takes scalar values, we are talking about single-target prediction, i.e., classification (if discrete) or regression (if continuous). Predicting vectors of values \(Y = (y_1,..., y_n )\) is the task of multi-target prediction (MTP), i.e., multi-target classification (MTC) for discrete and multi-target regression (MTR) for continuous values of y.
The task of classification has been extensively addressed in stream mining (Gama, 2010; Liao et al., 2023), with tree-based approaches playing a crucial role. State-of-the-art approaches are based on algorithms for learning ensembles of models, such as Bagging (Breiman, 1996) and Random forests (Breiman, 2001). Adaptations of these algorithms to the streaming setting have been proposed by Oza and Russell (2001) and Oza (2005). Regression has received much less attention in the streaming setting, but some tree-based approaches do exist (Ikonomovska et al., 2011; Chaouki et al., 2023, 2024).
MTP has been researched extensively for the batch setting, where we have tree-based (Struyf & Džeroski, 2005; Wei et al., 2024), rule-based (Aho et al., 2012), and kernel-based (Vazquez & Walter, 2003; Zhang et al., 2023) methods. MTP for data streams, in contrast, has only received attention recently. There are a few methods for MTC (Shi et al., 2014; Osojnik et al., 2017) on data streams. For MTR on data streams, there are two state-of-the-art methods, i.e., the tree-based incremental Structured Output Prediction tree (iSOUP-Tree) (Osojnik et al., 2018) approach, based on the Fast Incremental Model Tree for Multi Target prediction (FIMT-MT) method (Ikonomovska et al., 2011) [the multi-target extension of Fast Incremental Model Trees with Drift Detection (FIMT-DD) (Ikonomovska et al., 2011)], and the rule-based approach of learning Adaptive Model Rules from High Speed Data Streams (AMRules) (Almeida et al., 2013). Ensemble approaches, which learn collections of base models, widen the space of available methods: iSOUP-Tree has been used in combination with online bagging and random forests (Oza & Russell, 2001; Oza, 2005) in the iSOUP-Bag and iSOUP-RF methods (Osojnik et al., 2018), respectively.
2.2 Change detection and adaptation
Assuming that \(\mathcal {X}\) and \(\mathcal {Y}\) are the input and output space, respectively, and that the data generating process is characterized by the joint distribution \(P(\mathcal {X},\mathcal {Y})\), we are interested in changes of this distribution over time. Decomposing the joint distribution as \(P(\mathcal {X},\mathcal {Y}) = P(\mathcal {X}) P(\mathcal {Y} | \mathcal {X})\), we can note three kinds of changes to it (Gao et al., 2007). The first is a sampling shift in \(P(\mathcal {X})\), known as virtual concept drift. The second is a change in the conditional probability \(P(\mathcal {Y} | \mathcal {X})\), known as real concept drift, or simply concept drift. Finally, change can occur in both \(P(\mathcal {X})\) and \(P(\mathcal {Y} | \mathcal {X})\). In this paper, we are interested in investigating real concept drift. Drift occurring in limited areas of the data space is called local, while global concept drift affects the whole data space.
Figure 1 visually illustrates three speeds of concept drift (Gama et al., 2014): (1) abrupt/sudden, (2) incremental, and (3) gradual. The different speeds of concept drift make it difficult to distinguish it from intrinsic noise or outliers.
2.3 Methods for concept drift detection and adaptation
Concept drift detection and adaptation are receiving significant attention in data stream mining (Gama et al., 2014). Here, we describe the two methods that are most often used in practice, the Page–Hinckley test and the ADWIN method, which are used in our approaches. We then discuss other related work on the topic.
The Page–Hinckley (PH) test Mouss et al. (2004) considers a cumulative variable \(m_T = \sum _{t=1}^T (x_t - \bar{x}_T - \alpha )\), where \(x_t\) is the observed value of the monitored variate at time t, \(\bar{x}_T = \frac{1}{T} \sum _{t=1}^T x_t\) is the mean of \(x_t\) values up to time T, and \(\alpha\) corresponds to the magnitude of allowed changes. The minimum value of this variable up to \(t=T\) is computed as \(M_T = \min _{t=1,\dots ,T}(m_t)\). When the difference \(PH_T = m_T - M_T\) exceeds a fixed value (parameter \(\lambda\)), concept drift is signalled. The PH test has been used for drift detection in single-target regression [FIMT-DD (Ikonomovska et al., 2011)] and MTR [AMRules (Almeida et al., 2013)].
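To make the test's bookkeeping concrete, below is a minimal Python sketch of an incremental PH detector. The class name and the default values of \(\alpha\) and \(\lambda\) are illustrative assumptions, and the running mean is updated incrementally, as is common in streaming implementations.

```python
class PageHinckley:
    """Minimal incremental Page-Hinckley detector (illustrative sketch)."""

    def __init__(self, alpha=0.005, lambda_=50.0):
        self.alpha = alpha         # magnitude of allowed changes
        self.lambda_ = lambda_     # detection threshold
        self.t = 0                 # number of observations seen
        self.mean = 0.0            # running mean of the monitored variable
        self.m_t = 0.0             # cumulative statistic m_T
        self.min_m = float("inf")  # running minimum M_T

    def update(self, x):
        """Feed one observation; return True if drift is signalled."""
        self.t += 1
        self.mean += (x - self.mean) / self.t
        self.m_t += x - self.mean - self.alpha
        self.min_m = min(self.min_m, self.m_t)
        return self.m_t - self.min_m > self.lambda_
```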
ADWIN Bifet and Gavalda (2007) is a parameter- and assumption-free change detection mechanism. It uses a sliding window W (with variable length), which is expanded while it has no two subwindows \(W_1\) and \(W_2\) with distinct enough means. Formally, ADWIN defines \(\epsilon _{cut} = \sqrt{\frac{2}{m} \sigma _W^2 \ln \frac{2}{\delta '}} + \frac{2}{3m} \ln \frac{2}{\delta '}\) with \(\delta ' = \delta / n\), where n, \(n_1\), and \(n_2\) are the lengths of the windows W, \(W_1\), and \(W_2\) (\(n = n_1 + n_2\)), \(m = 1/(1/n_1 + 1/n_2)\) is the harmonic mean of \(n_1\) and \(n_2\), \(\delta\) is the desired confidence, and \(\sigma _W^2\) is the variance of the target in window W. If the observed means \(\mu _1\) and \(\mu _2\) differ by more than \(\epsilon _{cut}\), ADWIN detects a change and discards the statistics of the older subwindow.
ADWIN does not keep the elements of the window W in memory. Instead, it stores only the statistics of its subwindows according to the exponential histogram technique: It stores the statistics of the new element in a subwindow/bucket of size \(1 = 2^0\) and when subwindows of sizes \(2^i\) accumulate, their statistics are merged into a subwindow of size \(2^{i+1}\). ADWIN checks every two consecutive subwindows: If their mean values are significantly different, it signals concept drift.
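As an illustration of the cut condition, the sketch below checks every split point of an in-memory window. Unlike the actual ADWIN algorithm, which stores only exponential-histogram bucket statistics, this version keeps the raw values for readability; the default value of delta is an assumption.

```python
import math

def adwin_check(window, delta=0.002):
    """Return the first split point whose subwindow means differ by more
    than eps_cut, or None if no change is detected (illustrative only)."""
    n = len(window)
    if n < 2:
        return None
    mu = sum(window) / n
    var = sum((x - mu) ** 2 for x in window) / n  # sigma_W^2
    for split in range(1, n):
        w1, w2 = window[:split], window[split:]
        n1, n2 = len(w1), len(w2)
        mu1, mu2 = sum(w1) / n1, sum(w2) / n2
        m = 1.0 / (1.0 / n1 + 1.0 / n2)        # harmonic mean of n1, n2
        ln_term = math.log(2.0 / (delta / n))  # ln(2 / delta')
        eps_cut = math.sqrt(2.0 * var * ln_term / m) + 2.0 * ln_term / (3.0 * m)
        if abs(mu1 - mu2) > eps_cut:
            return split  # statistics of the older subwindow are discarded
    return None
```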
In an ensemble setting, the occurrence of concept drift is expected to worsen the predictive performance of the base models, as they do not incorporate change detection mechanisms, and thus to raise the models' error measures. Hence, concept drift can be detected by monitoring the errors of the base models and raising an alarm if a significant change is observed. ADWIN Bagging (Bifet et al., 2009), which can also be used with Random forest ensembles, uses the ADWIN change detection mechanism (Bifet & Gavalda, 2007) to monitor the error of each base model: If a change is detected in at least one of the monitored errors, it raises an alarm and replaces the worst-performing model with a new one.
The ARF-Reg method (here referred to as ARF) (Gomes et al., 2018) is the adaptation of the Adaptive Random Forests (Gomes et al., 2017) approach to regression. It employs two ADWIN mechanisms for each tree: one with lower detection confidence \(\delta _w\), issuing warnings of possible concept drift, and another with higher confidence \(\delta _d\), confirming the occurrence of concept drift. If the former mechanism issues a warning for a given tree, the method initializes a new tree, grown in the background and in parallel with the original, to which the incoming instances are also passed. If the latter mechanism confirms a concept drift for a base model tree, the tree is replaced by its background tree.
Other methods for concept drift detection and adaptation differ from our approaches mainly in the tasks addressed. For example, Korycki and Krawczyk (2021) address change detection and adaptation in single-target classification, specifically for multi-class imbalanced data streams, while we consider multi-target regression. Sobhani and Beigy (2011) and Dehghan et al. (2016) also address single-target classification; the former imposes the additional assumption that the data arrives in batches, and the method stores the last batch of data instances. Souza et al. (2020) address change detection and adaptation in an unsupervised learning setting, while our work considers a supervised learning setting (MTR). Read (2018) provides a compelling case for using gradient-based methods when mining single-target data streams known to be susceptible to concept drift; however, gradient-based multi-target online methods are still missing. Only rare exceptions, such as AMRules (Duarte et al., 2016), consider concept drift detection and adaptation for online MTP.
2.4 Existing online MTR methods
2.4.1 Rule-based methods: AMRules
The AMRules method (Duarte et al., 2016) is a representative of rule-based approaches to online MTP that build rule sets (RS), which can detect and adapt to changes. AMRules starts with an empty RS and a default rule \(\{\} \rightarrow \mathcal {L}\), where \(\mathcal {L}\), initialized to NULL, is a modified version of the extended binary search tree (E-BST) data structure also used in iSOUP-Tree for storing statistics. The E-BST \(\mathcal {L}\) stores the statistics of the values observed so far, needed for predicting each of the target values.
When a new data example arrives, AMRules checks whether some rule in the RS covers it, i.e., whether all of the literals/conditions on the left-hand side of the rule are true for that example. A covering rule \(\mathcal {L}_r\) tests whether the example is an anomaly (noisy) by computing the ratio \(OR = \frac{1}{d}\sum _{j=1}^{d}\log \left( \frac{P(X_j=v | \mathcal {L}_r)}{1-P(X_j=v | \mathcal {L}_r)} \right)\) over all d attributes, where v is the observed value of the \(j^{th}\) attribute \(X_j\). For numerical attributes, the empirical probability \(P(X_j=v | \mathcal {L}_r)\) is estimated by using Cantelli's inequality (Bhattacharyya, 1987). After having observed more than a predetermined number \(m_{min}\) (anomaly grace period) of examples, the rule signals an anomaly if \(OR < T\), where T is an anomaly threshold parameter.
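The sketch below illustrates the anomaly score computation. The Cantelli-based probability estimate for numeric attributes follows our reading of the description above, and the clipping of the probabilities (to keep the log-odds finite) is our own addition.

```python
import math

def anomaly_score(values, means, variances):
    """Average log-odds OR over all attributes; values are the example's
    attribute values, means/variances the rule's per-attribute statistics."""
    total = 0.0
    for v, mu, var in zip(values, means, variances):
        dev2 = (v - mu) ** 2
        # Cantelli's inequality bounds the probability of a deviation this large
        p = var / (var + dev2) if (var + dev2) > 0 else 1.0
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # avoid infinite log-odds
        total += math.log(p / (1.0 - p))
    return total / len(values)  # flagged anomalous if below the threshold T
```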
If the example is not anomalous, its target values are used to update the statistics of the rule. PH change detection tests on the rule's mean squared error values are used to discover concept drift. The rule is removed from the RS if change is detected.
If a rule is not removed, it is considered for expansion. Here, again, a grace period parameter is used. The expansion procedure is almost identical to leaf node splitting in iSOUP-Tree and uses the Hoeffding bound, with the same heuristic function. Rule expansion adds the hypothetical candidate split to the literals on the rule's left-hand side. As a special case, expanding the default rule means adding a new rule with the extended literals to the RS.
The prediction and model building strategies depend on whether the rules are ordered or unordered, leading to the AMRules variants \(\text {AMRules}^o\) and \(\text {AMRules}^u\). In the former case, only the first rule that covers the example is removed, expanded, or used in prediction. The latter treats all rules that cover an example equally, independent of their order, and the final prediction is made as the aggregation (mean) of their individual predictions.
The rules learned by the AMRules method generate predictions in a similar manner as the leaves in iSOUP-Tree. They use an adaptive strategy, choosing between a perceptron’s and a mean regressor’s prediction. AMRules and iSOUP-Tree differ in their learning rate, which in AMRules is a constant.
2.4.2 Tree-based methods: iSOUP-Tree
iSOUP-Tree (Osojnik et al., 2018) is an instance-incremental method and the starting point for the development of our novel methods. iSOUP-Tree starts from an empty leaf; once enough examples have been processed (but not stored), a check is made to determine whether there is significant statistical support to split it.
All possible binary splits are evaluated by using multi-target intra-cluster variance reduction (ICVR) as a heuristic function. ICVR represents the homogeneity gain on the target values if a split is chosen.
According to the ICVR heuristic, the best candidate split \(h_{1}\) is selected, as well as the second-best \(h_{2}\). Next, the following sequence is constructed: \(\dots , \frac{h_2(k)}{h_1(k)}, \frac{h_2(k+1)}{h_1(k+1)}, \frac{h_2(k+2)}{h_1(k+2)}, \dots\), where k denotes the number of accumulated examples.
Let \(X_k\) be a random variable denoting the ratio \(\frac{h_2(k)}{h_1(k)}\), and \(x_k\) be one sample of it. Then, with S denoting the set of observed samples, the observed average can be computed as \(\bar{x} = \frac{1}{|S|}(x_1 + x_2 + \dots + x_{|S|})\), which is a sample from the random variable \(\bar{X} = \frac{1}{|S|}(X_1 + X_2 + \dots + X_{|S|})\). The Hoeffding bound (Hoeffding, 1963) is then applied to make an \((\epsilon , \delta )\)-approximation, using the standard notation of E[X] to denote the expected value of the random variable X. The Hoeffding bound is of the following form: \(P(|\bar{X} - E[\bar{X}]| > \epsilon ) \le 2e^{-2|S|\epsilon ^2}=:\delta .\) The value \(\delta\) is a parameter of the iSOUP-Tree method named splitting confidence. The value \(\epsilon\) can be expressed in terms of \(\delta\) and |S| as follows: \(\epsilon = \sqrt{\frac{1}{2|S|}\ln \frac{2}{\delta }}\).
Plugging \(\bar{x}\) in as an observation of \(\bar{X}\) in the Hoeffding bound, one gets \(E[\bar{X}] \in [\bar{x} - \epsilon , \bar{x} + \epsilon ]\) with probability \(1-\delta\). In particular, if \(\bar{x} + \epsilon < 1\), then \(E[\bar{X}] < 1\), implying \(\frac{h_2}{h_1} < 1\) (with probability \(1-\delta\)). In other words, there is significant support to accept the currently best candidate as the best one and split the leaf node. When \(\bar{x} + \epsilon \ge 1\), the leaf waits for more examples. This condition is checked only when enough examples have accumulated in the leaf, i.e., whenever the number of examples accumulated in the leaf is a multiple of the parameter GP (grace period).
To overcome a drawback of the Hoeffding bound, which occurs when the heuristic values of the two best candidates are close to each other, iSOUP-Tree introduces an additional parameter \(\tau\), the tie-breaking threshold: It determines the minimal value \(\epsilon\) can take before the leaf is split.
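The resulting split decision can be summarized in a few lines, as sketched below; ratio_mean stands for \(\bar{x}\), n_samples for |S|, and the default values of delta and tau are illustrative.

```python
import math

def should_split(ratio_mean, n_samples, delta=1e-7, tau=0.05):
    """Decide whether to split a leaf, given the mean ratio of the
    second-best to the best ICVR heuristic over n_samples checks."""
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    if ratio_mean + eps < 1.0:  # the best candidate is provably the best
        return True
    if eps < tau:               # tie: the two candidates are close enough
        return True
    return False
```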
Each leaf makes a prediction by using an adaptive multi-target model, consisting of a multi-target perceptron and a multi-target mean predictor. The perceptron updates its weights by a backpropagation rule with a given learning rate. When a leaf is constructed, its learning rate is set to the parameter \(\eta _0\), named the initial learning rate. After each incoming example, the learning rate \(\eta\) is updated by using the rule \(\eta = \frac{\eta _0}{1 + n \cdot \eta _\Delta }\), where n is the number of recorded values and \(\eta _\Delta\) is a parameter called the learning rate decay factor. Finally, a prediction is made by using the perceptron or the mean regressor, depending on which one has the lower fading mean absolute error (fMAE) for that target: \(fMAE^j(e_n) = \frac{\sum _{i=1}^n 0.95^{n-i} |\hat{y_i}^j - y_i^j|}{\sum _{i=1}^n 0.95^{n-i}}\), where \(e_n\) is the \(n^{th}\) observed example, and \(\hat{y_i}^j\) and \(y_i^j\) are the predicted and the real value of the \(j^{th}\) target for the \(i^{th}\) example.
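Since both the numerator and the denominator of the fMAE are exponentially faded sums, they can be maintained incrementally. The sketch below shows this bookkeeping and the per-target choice between the two predictors; the class and function names are our own.

```python
class FadingMAE:
    """Incrementally maintained fading mean absolute error (fade = 0.95)."""

    def __init__(self, fade=0.95):
        self.fade = fade
        self.num = 0.0  # faded sum of absolute errors
        self.den = 0.0  # faded count

    def update(self, y_hat, y):
        self.num = self.fade * self.num + abs(y_hat - y)
        self.den = self.fade * self.den + 1.0

    def value(self):
        return self.num / self.den if self.den > 0 else float("inf")

def adaptive_prediction(perc_preds, mean_preds, fmae_perc, fmae_mean):
    """Per target, use the predictor with the lower fading MAE so far."""
    return [p if fp.value() <= fm.value() else m
            for p, m, fp, fm in zip(perc_preds, mean_preds, fmae_perc, fmae_mean)]
```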
2.4.3 Tree-based ensemble methods: iSOUP-Bag and iSOUP-RF
Online bagging with iSOUP-Tree The iSOUP-Bag method combines the streaming-setting adaptation of the bagging method with iSOUP-Trees as base models. Online bagging approximates the batch bagging approach as the number of data examples grows to infinity: Assuming infinite data, the distributions over the data supplied to the base models converge.
In a given bootstrap sample of the batch bagging approach, the number of copies of an individual data example from a dataset of size n is distributed according to the Binomial distribution – \(B(n, \frac{1}{n})\). This distribution tends to Poisson(1) as the dataset size n grows to infinity. Therefore, given the infinite number of examples in the data stream, the online bagging method uses the Poisson distribution to determine the number of times each base model should be updated.
Given the number of base models, \(\mathcal {N}\), the iSOUP-Bag method is initialized to a collection of \(\mathcal {N}\) empty leaf nodes. Each iSOUP-Tree model in the ensemble is updated with an incoming data instance k times, where k is sampled at random (afresh for each model) from the Poisson distribution with parameter \(\lambda =1\), i.e., from Poisson(1). As data examples are processed, the leaves of each tree are split as described previously. The final prediction of iSOUP-Bag is made by aggregating the base models' predictions by per-target averaging.
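A minimal sketch of this update loop follows; it assumes base models that expose an update() method and samples from Poisson(1) with Knuth's method.

```python
import math
import random

def poisson1(rng=random):
    """Sample k ~ Poisson(1) using Knuth's method."""
    threshold, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def bag_update(models, example, rng=random):
    """Online bagging: each base model sees the example k ~ Poisson(1) times."""
    for model in models:
        for _ in range(poisson1(rng)):
            model.update(example)
```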
Online random forests with iSOUP-Tree The online random forest ensemble learning method builds upon online bagging and introduces larger base model diversity. To obtain the proposed diversity, the iSOUP-RF method uses the online random forest approach with a modified version of the iSOUP-Tree. In particular, it introduces randomization to the individual iSOUP-Trees. Hence, both the base models and the data provided to them are randomized.
Each leaf node, at the time of its construction, chooses a random subset of the available descriptive attributes. It calculates and stores the necessary statistics for future splits only for the attributes in this subset, ignoring the other attributes. Therefore, iSOUP-RF is more time and memory efficient than the iSOUP-Bag method.
In addition to the number \(\mathcal {N}\) of base models, iSOUP-RF also has the number of descriptive attributes sampled at each node as a parameter. It is given as a function of the total number of descriptive attributes. To make a prediction for an incoming data example, per-target averaging of the corresponding models’ predictions is performed.
3 Novel methods for online multi-target regression with change detection and adaptation
We introduce six novel tree-based online MTR methods that incorporate concept drift detection and adaptation mechanisms. We first present three novel methods that combine the iSOUP-Tree approach to MTR with the ADWIN change detector (iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF) and ARF-Reg (iSOUP-ARF). We then equip the iSOUP-Tree method with adaptation capabilities based on the PH test, obtaining three novel methods: Adaptive-iSOUP-Tree and its ensemble variants (Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF). The novel methods are implemented in the open source Massive Online Analysis (MOA) (Bifet et al., 2010) framework: The code will be publicly available upon publication.
ADWIN, which deals with change detection and adaptation in single-target classification on data streams, works with bagging and random forest tree ensembles. The error of each tree in the ensemble is continuously monitored and, once change is detected, the least accurate tree in the ensemble is dropped and the construction of a new model in the ensemble starts. ARF-Reg works with random forests for regression and uses two ADWIN detectors (one to warn of and the other to confirm drift): It starts to build an alternative tree in the background once a drift warning is issued and replaces the original tree with the alternative once the drift is confirmed.
Two of our novel approaches, iSOUP-ADWIN-Bag and -RF, build (bagging and random forest) ensembles of MTR trees with iSOUP-Tree and monitor the error of the trees (mean absolute error across all the targets), adapting to detected changes in much the same way as the original ADWIN (dropping the tree with the largest error and starting to build a new one). iSOUP-ARF follows ARF-Reg, using two ADWIN detectors and the randomized version of iSOUP-Tree to build trees for the ensemble (and alternative trees in the background).
Adaptive-iSOUP-Tree adapts the PH test approach taken by FIMT-DD for use in iSOUP-Tree. In Adaptive-iSOUP-Tree, PH tests are conducted for each target in every internal node of a multi-target regression tree. Only when change is indicated for all targets in a given node is concept drift reported overall for the node at hand: At that point, adaptation starts by learning an alternate tree rooted in that node. The Adaptive-iSOUP-Bag and -RF methods build bagging and random forest ensembles by using the Adaptive-iSOUP-Tree method.
Novel method 1: ADWIN bagging with iSOUP-Tree We propose iSOUP-ADWIN-Bag, which combines iSOUP-Trees as base models with the ADWIN approach to online bagging, i.e., an ensemble-level concept drift detection mechanism. Following the streaming-setting adaptation of bagging, iSOUP-ADWIN-Bag updates individual base models using the Poisson distribution for data example sampling. It produces the final ensemble prediction for an example using per-target aggregation of the base models' predictions for that example.
Algorithm 1 presents the update operator of iSOUP-ADWIN-Bag. ADWIN detectors monitor the standardized errors of the base iSOUP-Trees, where both the ground truth and predicted values are standardized for each target (by subtracting the mean and dividing the difference by the standard deviation calculated for that target up to the given moment). The ADWIN change detector is provided with the mean absolute error at each time point i, given as \(MAE_i = \frac{1}{M}\sum _{j=1}^M |y^j_i - \hat{y}^j_i|\), where M denotes the number of targets, and \(y^j_i\) and \(\hat{y}^j_i\) are the standardized ground truth and predicted values for the \(j^{th}\) target of the \(i^{th}\) data example. When a change is detected on any base model, the iSOUP-Tree with the highest error is replaced by an empty leaf.
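The sketch below illustrates the error signal fed to the ADWIN detectors, with the per-target running statistics maintained via Welford's algorithm; the helper names are ours, and the details of Algorithm 1 may differ.

```python
class RunningStats:
    """Welford's algorithm for a running mean and standard deviation."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0

def standardized_mae(y_true, y_pred, stats):
    """MAE_i over M targets; stats holds one RunningStats per target,
    updated here with the incoming ground-truth values."""
    errors = []
    for y, y_hat, s in zip(y_true, y_pred, stats):
        s.update(y)
        sd = s.std() or 1.0  # guard against a zero standard deviation
        errors.append(abs((y - s.mean) / sd - (y_hat - s.mean) / sd))
    return sum(errors) / len(errors)
```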
Novel methods 2 and 3: ADWIN and Adaptive random forests with iSOUP-Tree The iSOUP-ADWIN-RF method uses the modified/randomized version of iSOUP-Tree, whose leaf nodes select a random subset of the descriptive attributes when constructed. The ensemble learning and concept drift detection and adaptation mechanisms incorporated in iSOUP-ADWIN-RF closely mirror those included in iSOUP-ADWIN-Bag. Similarly, the iSOUP-ARF ensemble learning and concept drift detection and adaptation mechanisms closely mirror those of ARF-Reg. As ensemble methods, iSOUP-ADWIN-RF and iSOUP-ARF require the number of base models as an input parameter, along with the size of the attribute subset as a function of the total number of available descriptive attributes, just as iSOUP-RF (which uses the same randomized version of iSOUP-Tree).
Novel method 4: Adaptive-iSOUP-Tree In the case of local concept drift, which occurs in a subspace (rather than the whole data space), inaccurate predictions are likely to come from some subtrees of the iSOUP-Tree, while other subtrees will still model the data generating process well. Discarding the whole tree, as ADWIN does, thus seems unreasonable. We therefore propose the novel Adaptive-iSOUP-Tree method, which extends iSOUP-Tree by equipping every internal node in the tree with a concept drift detection and adaptation mechanism.
Every node in an iSOUP-Tree covers a hyperrectangle in the descriptive (feature) space. By monitoring the error within this hyperrectangle, every node detects changes in its own subspace and adapts the model to reflect those changes. Since every node, including the root, handles concept drift, Adaptive-iSOUP-Tree can detect both local and global concept drift. Algorithm 2 presents the update operator of the Adaptive-iSOUP-Tree method.
Following the FIMT-DD single-target approach and the multi-target approach of AMRules, we use PH tests in every internal node of an iSOUP-Tree for change detection. Initialization (with an empty leaf node) and split selection (using the Hoeffding bound) in Adaptive-iSOUP-Tree are the same as in iSOUP-Tree. Adaptive-iSOUP-Tree uses an adaptive prediction strategy, choosing between the MT perceptron (with a constant learning rate) and the MT mean predictor.
Predictions are calculated by the corresponding leaf node. The predictions and the true target values are first standardized; the absolute error per target is then computed and back-propagated to the ancestor nodes. The PH test statistics in every node along the path to the root are updated and queried for potential concept drift. Only when drift is indicated for all targets does a node confirm a detection.
When a node detects a change, it triggers an adaptation strategy to build an alternative subtree, grown in parallel with the original one. Data examples in that instance subspace are used for training both subtrees. Given sufficient evidence about which subtree performs better, the other is removed. In the case of true concept drift detection and proper adaptation, the alternate subtree is expected to outperform the original one, as it reflects the change, and will replace it in the original tree. Otherwise, in the case of a false alarm, the original tree will not be outperformed, and the alternate tree will be discarded. To decide which subtree outperforms the other, we monitor the log ratios of the per-target faded mean squared errors of both subtrees. We use the improved adaptation mechanism that includes a fading factor, included also in FIMT-DD, as proposed by Gama et al. (2009). Given the predicted and real values of the \(j^{th}\) target for the \(i^{th}\) example, \(\hat{y_i}^j\) and \(y_i^j\), we define the faded error \(S_i^{j}(Tree) = L_i^{j}(Tree) + f S_{i-1}^{j}(Tree)\), where \(L_i^{j}(Tree)\), \(S_{i-1}^{j}(Tree)\) and f are, resp., the current loss \((\hat{y_i}^j - y_i^j)^2\), the accumulated faded mean squared error up to time \(i-1\), and a fading factor (with a value close to 1, such as 0.95). We monitor the \(Q_i^j\) statistic defined as \(Q_i^j = \text {log}\left( \frac{S_i^j(OrgTree)}{S_i^j(AltTree)} \right)\),
where OrgTree and AltTree stand for the original and the alternate tree, resp. Positive values of this statistic for a given target imply that the original tree has a higher error on that target than the alternative.
The \(Q_i^j\) statistics are checked after every \(T_{min}\) examples. If they are positive for most of the target attributes (at least \(90\%\) of them), the alternate tree replaces the original one. If the alternate tree does not promise any improvement, its average \(Q_i^j\) values will begin to decrease. The adaptation mechanism concludes that the alarm was false and deletes the alternate subtree when the time period for growing the alternate tree has passed or when most of the \(Q_i^j\) averages (at least \(90\%\) of them) start to decrease. Following FIMT-DD, Adaptive-iSOUP-Tree considers the average statistics only after \(10T_{min}\) examples have been observed, to avoid premature discarding.
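A sketch of the bookkeeping behind this decision is given below; the \(90\%\) threshold follows the text, while the function names and the interface are illustrative.

```python
import math

def update_faded_error(s_prev, y_hat, y, f=0.95):
    """One step of S_i = L_i + f * S_{i-1} with squared loss L_i."""
    return (y_hat - y) ** 2 + f * s_prev

def q_statistic(s_org, s_alt):
    """Positive iff the original subtree has the higher faded error."""
    return math.log(s_org / s_alt)

def replace_original(q_values, fraction=0.9):
    """Swap in the alternate subtree if Q_i^j > 0 for >= 90% of targets."""
    positive = sum(1 for q in q_values if q > 0)
    return positive >= fraction * len(q_values)
```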
To ensure that the alternate tree has a large enough learning rate to learn the new concept and that the learning rates of the two subtrees do not differ vastly, Adaptive-iSOUP-Tree uses a perceptron with a constant learning rate along with the mean regressor in an adaptive prediction strategy, following iSOUP-Tree.
Novel methods 5 and 6: Online bagging and random forests with Adaptive-iSOUP-Tree The Adaptive-iSOUP-Tree has an internal concept drift detection and adaptation mechanism and does not need external ensemble-level techniques for change detection and adaptation. Therefore, we combine Adaptive-iSOUP-Tree with the vanilla online bagging approach, yielding Adaptive-iSOUP-Bag.
In addition, we also propose a modified version of Adaptive-iSOUP-Tree to be used in random forest ensembles. When initializing the model as an empty leaf node and when splitting a node into two leaf nodes, the nodes select a random subset of the descriptive attributes. The statistics needed for future splits are stored only for the selected attributes. This modification of Adaptive-iSOUP-Tree enables the development of the online random forest method Adaptive-iSOUP-RF.
Both Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF initialize as sets of empty leaf nodes. MTR predictions for an example are calculated as per-target averages of the base-model MTR predictions.
4 Experimental design
Here, we describe the framework for evaluating the performance of the novel tree-based multi-target data stream mining methods that can detect and adapt to change. More specifically, we are interested in the following research questions:
-
1.
What is the performance of different multi-target data stream methods on real-world data in a supervised learning setting?
-
2.
What is the performance of multi-target data stream mining methods on stationary, i.e., non-evolving data streams?
-
3.
What is the influence of the three possible concept drift types/speeds on the performance of the multi-target data stream mining methods on evolving data streams?
-
4.
What is the performance of multi-target methods in detecting concept drift occurring at all three possible speeds (abrupt, gradual, incremental)?
4.1 Evaluated methods
In the experimental evaluation, we perform experiments with 11 different methods, listed in Table 1 with their acronyms. The AMRules method is the sole competing state-of-the-art online MTR method that incorporates a mechanism for concept drift detection and adaptation. Here, we evaluate its two versions, ordered and unordered: \(\text {AMRules}^o\) and \(\text {AMRules}^u\).
As baseline methods, we include the iSOUP-Tree methods family. iSOUP-Tree is the foundation upon which our novel methods are built. We also include online bagging and random forests ensembles incorporating iSOUP-Tree as a base model. Finally, the evaluation includes the six novel methods introduced in this paper.
4.2 Data streams
All methods are evaluated on both real-world and artificially generated data streams. Here, we provide short descriptions of the real-world data streams employed in our study. We then elaborate on the process of generating artificial data streams and injecting concept drift at the three speeds.
Real-world data streams The evaluation uses 11 real-world data streams with a varying number of descriptive and target attributes. Table 2 presents an overview of the number of data examples and the dimensions of the input and output spaces of the real-world data streams. The descriptions of the used data streams are available in Appendix A.
Artificial data streams On one hand, the real-world data streams correspond to real-world practical problems, making them more interesting and practically relevant. On the other hand, even with the help of a domain expert, we can identify only a few points where concept drift is known to happen. To perform an extensive comparative evaluation of the ability of data stream mining algorithms to detect change, we need data streams where every change point is known.
To obtain data streams with controlled concept drift, we use an off-the-shelf MTR data generator (Mastelini et al., 2018), parametrized with the number of underlying data distributions, referred to as the number of generation groups. We generate 12 data streams of 1 million instances each, where each stream has a predefined number (45) of descriptive features, of which up to half (22) are relevant. Each stream has 6 target attributes [the first five dependencies are taken from the publication proposing the generator (Mastelini et al., 2018)]. The dependencies between the targets and inputs for the different data streams are shown in Table 10 of Appendix A.
To obtain data streams with a guarantee of no concept drift, we generate all examples from one generation group. To obtain an abrupt change, we use one generation group for the first half of the data examples and, exactly at the half, switch to a second generation group. For generating data streams with gradual drift, we take the framework presented by Bifet et al. (2009), which builds on the work of Narasimhamurthy and Kuncheva (2007): After drift starts, examples are chosen from both groups, with the probability of the new group gradually increasing from 0 to 1. Data streams with incremental drift are generated by taking the weighted arithmetic mean of examples from the two data generating processes, the weights being the probabilities used in gradual drift generation. For both gradual and incremental concept drift, we let the change begin after the \(500000^{th}\) example, with a window size of \(|W| = 100000\). The sketch below illustrates this procedure.
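The following sketch shows, under our simplifying assumptions, how the three drift types can be injected when combining two generation groups; gen_a and gen_b stand for the two underlying data generating processes and are assumed to return one numeric example (a list of values) per call.

```python
import random

def drifted_stream(gen_a, gen_b, n=1_000_000, start=500_000, width=100_000,
                   mode="gradual", rng=random):
    """Yield n examples with drift from gen_a to gen_b starting at `start`."""
    for i in range(n):
        # probability of the new concept rises linearly from 0 to 1 over `width`
        p = min(max((i - start) / width, 0.0), 1.0)
        if mode == "abrupt":
            yield gen_b() if i >= start else gen_a()
        elif mode == "gradual":
            yield gen_b() if rng.random() < p else gen_a()
        else:  # incremental: weighted mean of the two generating processes
            xa, xb = gen_a(), gen_b()
            yield [(1 - p) * a + p * b for a, b in zip(xa, xb)]
```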
4.3 Evaluating scenarios and measures
Two main approaches are available for evaluating performance on data streams (Dawid, 1984): holdout evaluation and prequential evaluation. In this work, we use prequential evaluation, where each newly received example is first used for testing the model and then immediately for training it, without waiting for further examples.
To evaluate the performance of the methods, we use two types of evaluation measures: (1) measures that estimate predictive performance and (2) measures that estimate performance at detecting concept drift.
Evaluation measures for predictive performance A prediction is generated for each incoming example. In the single-target scenario, there are multiple evaluation measures adopted from statistics, such as the mean absolute error (MAE). In the multi-target scenario, we average the individual single-target scores. We report the average relative mean absolute error (\(\overline{RMAE}\)) (Osojnik et al., 2018) over a window of length n (here, \(n=1000\)) as a multi-target evaluation measure. This measure is calculated as \(\overline{RMAE} = \frac{1}{M} \sum _{j=1}^M \frac{\sum _{i=1}^n |y_i^j - \hat{y}_i^j|}{\sum _{i=1}^n |y_i^j - \bar{y}^j(i)|}\),
where \(y_i^j\) is the actual value of target j for data example i, and \(\hat{y}_i^j\) is the value predicted by the evaluated model, while \(\bar{y}^j(i)\) is the prediction for \(y_i^j\) made by the mean regressor.
If the evaluated model were the mean regressor for each target, its \(\overline{RMAE}\) score would be 1, since the numerator and denominator would be the same expression. For any other model, its performance is compared with the performance of the mean regressor as a baseline: If the \(\overline{RMAE}\) score is below 1, the evaluated model outperforms the baseline mean regressor. Lower \(\overline{RMAE}\) scores are better, with a perfect model having a score of 0.
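A direct transcription of the \(\overline{RMAE}\) formula into Python is shown below, as a sketch for a single evaluation window; it assumes that the mean-regressor predictions are supplied alongside the model's and that the window is not degenerate (no zero denominators).

```python
def rmae(y_true, y_pred, y_mean_pred):
    """Average relative MAE over a window; each argument is a list of
    per-example target vectors of equal length."""
    n, num_targets = len(y_true), len(y_true[0])
    total = 0.0
    for j in range(num_targets):
        model_err = sum(abs(y_true[i][j] - y_pred[i][j]) for i in range(n))
        baseline_err = sum(abs(y_true[i][j] - y_mean_pred[i][j]) for i in range(n))
        total += model_err / baseline_err
    return total / num_targets  # below 1 means the model beats the baseline
```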
Evaluation measures for detecting concept drift The design of a change detection mechanism is subject to a trade-off between true detections and false alarms (Gustafsson, 2000). To formally capture and measure these characteristics, the literature proposes several evaluation metrics.
Mean Time between False Alarms (MTFA) (Basseville & Nikiforov, 1993) quantifies the frequency of false alarms when it is known that no change has occurred. MTFA is the inverse of the false positive rate: It measures the expected time between false positive detections, hence, larger values are preferred.
When we know where concept drift has happened, the Mean Time to Detection (MTD) (Basseville & Nikiforov, 1993; Gustafsson, 2000) measures the reactivity of the mechanism. As we want changes to be detected as soon as possible, the time to detect them should be low.
To avoid making false change detections, an algorithm might abstain from raising an alarm, which can cause it to miss real changes. Missed Detection Rate (MDR) (Basseville & Nikiforov, 1993; Gustafsson, 2000) represents the fraction of undetected changes. A good change detection mechanism is expected to catch as many changes as possible and yield a low value of MDR. Ideally, no change will slip undetected, leading to an MDR value of 0.
To measure the overall performance at detecting changes, the above three measures are combined into the Mean Time Ratio (MTR) (Bifet et al., 2013) as \(MTR = \frac{MTFA}{MTD} (1 - MDR)\).
4.4 Statistical comparisons
We use two types of statistical comparison of the considered methods in terms of their predictive performance.
Comparison of multiple methods The evaluation of the predictive performance of multiple algorithms uses a comparison of means of multiple random variables. We assess the statistical significance of the differences in the predictive performance of multiple algorithms by using the Friedman test (Iman & Davenport, 1980; Friedman, 1940) with post-hoc Nemenyi analysis (Nemenyi, 1963; Demšar, 2006). The results of the Nemenyi post-hoc analysis are visually represented as an average rank diagram, showing the critical distance and the average rank for each method. The groups of methods which do not exhibit a statistically significant difference in performance are connected with a line.
Pairwise method comparison The similarities across different pairs of methods are not constant: Any given method is more similar to some methods and less similar to others. To provide a more detailed analysis, we need to test whether there is significant statistical evidence of improved performance by, e.g., the methods we propose, as compared to their existing counterparts. We perform statistical tests between all compared pairs of methods; however, our analysis focuses on comparing the most similar methods. In our comparison of the novel proposed methods to their most similar counterparts, we assess the significance of the difference in performance by using the Wilcoxon signed-ranks test (Wilcoxon, 1945) with the Benjamini-Hochberg correction (Benjamini & Hochberg, 1995).
The approaches of using Friedman’s test with post-hoc Nemenyi analysis and of the Wilcoxon test are recommended and explained by Demšar (2006) and Benavoli et al. (2017).
4.5 Experimental setup
Parameter settings Because of the discussed similarities between the iSOUP-Tree, AMRules, and Adaptive-iSOUP-Tree methods, many of their parameters overlap. Table 3 provides an overview of the parameters, including their values as used in our experiments. It specifies which parameters are unique to each of the methods and which are shared. We re-used the recommended parameter values as provided by the respective authors of the existing methods in the corresponding papers that introduced the methods.
Compared pairs of methods We perform pairwise comparisons to assess whether the novel methods exhibit significantly improved predictive performance over the methods they are most similar to. Selecting the most similar rule-based counterparts of the tree-based methods is neither trivial nor obvious. Since the ordered variant of AMRules (\(\text {AMRules}^o\)) uses only the first rule that covers an incoming data example in both the learning and predicting processes, we consider it a method learning standalone models. On the other hand, the unordered variant (\(\text {AMRules}^u\)) updates all covering rules during learning and makes the final prediction in an ensemble-like process by aggregating the predictions of individual rules; hence, we compare it with ensemble learning methods.
We make the following pairwise comparisons:
-
Adaptive-iSOUP-Tree with iSOUP-Tree (\(\text {AMRules}^o\)),
-
Adaptive-iSOUP-Bag with iSOUP-Bag (\(\text {AMRules}^u\)),
-
Adaptive-iSOUP-RF with iSOUP-RF (\(\text {AMRules}^u\)),
-
iSOUP-ADWIN-Bag with iSOUP-Bag (\(\text {AMRules}^u\)), and
-
iSOUP-ADWIN-RF with iSOUP-RF (\(\text {AMRules}^u\)).
Additionally, we check if there is a difference in performance between the appropriate pairs of novel methods. Thus, we compare: Adaptive-iSOUP-Bag with iSOUP-ADWIN-Bag and Adaptive-iSOUP-RF with iSOUP-ADWIN-RF.
Finally, we also include the appropriate pairwise comparisons of the most similar existing methods. In this context, we compare: iSOUP-Tree with (\(\text {AMRules}^o\)), iSOUP-Bag with (\(\text {AMRules}^u\)), and iSOUP-RF with (\(\text {AMRules}^u\)).
Before all other comparisons (pairwise and multiple), we evaluated the ensemble-level concept drift detection strategy of ARF combined with the RF version of iSOUP-Tree, proposed in this paper and called iSOUP-ARF. We compared it to iSOUP-ADWIN-RF and, based on the results of the comparison, excluded iSOUP-ARF from all further comparisons.
5 Results and discussion
We now present and discuss the results of the empirical evaluation. We first compare the performance of the two ensemble-level concept drift detection and adaptation approaches. We then address each of the four research questions stated earlier, i.e., comparison of the predictive performance of different methods for learning models from data streams on (1) real-world data streams, (2) artificial data streams with no concept drift, (3) artificial data streams with concept drift of different speeds, and (4) comparison of the different methods’ ability to detect concept drift. At the end, we briefly discuss the time complexity of the compared methods, both theoretical and empirical.
5.1 Ensemble-level concept drift: ADWIN-RF vs. ARF
We first compare the ARF and ADWIN strategies, which are very similar to each other, by comparing iSOUP-ARF with iSOUP-ADWIN-RF (see Table 4). ARF performs statistically significantly worse (Benjamini-Hochberg-adjusted p-value \(p_{BH} < 0.05\)) when the data streams have gradual, incremental, or no concept drift, while it performs significantly better if the data stream has an abrupt change. The differences in performance are small, except in the case of no drift, where ARF performs much worse (with an \(\overline{RMAE}\) over 1). We hence exclude ARF (i.e., iSOUP-ARF) from further comparisons.
5.2 Real-world data streams
We now turn to the comparison on real-world data streams from Table 2. The results of the Friedman test and Nemenyi analysis are shown in the average rank diagram of Fig. 2.
The family of methods based on Adaptive-iSOUP-Tree outperforms the rest. The ensemble methods in the family lift the performance of a single Adaptive-iSOUP-Tree. The online random forest approach is outperformed by the online bagging approach, but is more resource-efficient as it randomly samples the attribute space.
The iSOUP-Tree ensembles, iSOUP-Bag and iSOUP-RF, are the next best-performing methods. As for the Adaptive-iSOUP-Tree ensembles, iSOUP-Bag outperforms iSOUP-RF. The iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF methods exhibit performance comparable to that of the iSOUP-Tree ensemble methods, while the iSOUP-Tree and AMRules methods have the next-to-worst and worst performance.
Based on the critical distance, only two differences in performance are significant: between the ensemble methods of Adaptive-iSOUP-Tree and AMRules, and between the Adaptive-iSOUP-Tree ensembles and the iSOUP-Tree method.
Table 5 presents the results of the pairwise comparisons. The statistical tests show significant improvements in performance by the novel methods over their most similar counterparts. The Adaptive-iSOUP-Tree family of methods shows statistically significant improvements over the family of methods based on iSOUP-Tree. Also, both families perform significantly better than the AMRules methods.
For the iSOUP-ADWIN family, no significant improvements are observed over the iSOUP ensembles family. iSOUP-ADWIN performs slightly better than \(\text {AMRules}^u\), but not significantly. Finally, the Adaptive-iSOUP ensemble approaches are not significantly better than the iSOUP-ADWIN approaches.
In sum, Adaptive-iSOUP performs the best overall and clearly (significantly) improves over iSOUP. iSOUP-ADWIN, surprisingly, does not improve over iSOUP. This deserves further investigation and discussion.
5.3 Artificial data streams with no concept drift
In contrast to the real-world data streams, where concept drift possibly occurs at an unknown point in time, we discuss here data streams that are guaranteed to contain no drift.
In Fig. 3 and Table 6, we summarize the results of the multiple method comparison (Friedman/Nemenyi) and the pairwise comparisons of the methods (Wilcoxon with Benjamini-Hochberg correction). The experiments are performed on the twelve artificial data streams listed in Table 10 in Appendix A. We compare the same set of methods as for the real-world data streams.
The iSOUP-ADWIN-Bag method shows the best performance among all compared methods (see Fig. 3). It is followed by \(\text {AMRules}^o\) and \(\text {AMRules}^u\), while the average rank of iSOUP-Bag is in-between the AMRules variants. The Adaptive-iSOUP-Tree method family is next, followed by iSOUP-Tree. The random forest methods of iSOUP-ADWIN-RF and iSOUP-RF are performing the worst.
iSOUP-ADWIN-Bag has significantly better performance than all the random forests methods (including iSOUP-ADWIN-RF) and the non-ensemble iSOUP-Tree and Adaptive-iSOUP-Tree methods. Additionally, \(\text {AMRules}^o\) and iSOUP-Bag significantly outperform iSOUP-RF.
The only significant differences in the pairwise tests confirm that iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF outperform iSOUP-Bag and iSOUP-RF, respectively. This is unexpected, as ADWIN is specifically targeted at handling change and the data streams considered contain no change. On the other hand, Adaptive-iSOUP does not significantly differ in performance from iSOUP: the former is also geared towards detecting and handling change, but the streams at hand contain no change.
5.4 Data streams with concept drift of different speeds
Here, we provide a comparison of the predictive performance of MTR methods on the artificially generated datasets with concept drift at different speeds.
General observations. We first look at the average rank diagrams (Fig. 4a–c). At all concept drift speeds, the methods that do not detect concept drift (iSOUP-Tree and ensembles thereof) are outperformed by the rest (except for iSOUP-Bag, which performs better than Adaptive-iSOUP-Tree on streams with incremental change – but note that the former is an ensemble method and the latter is not). There is no significant difference in performance among the different iSOUP-Tree based methods.
The best performing methods are Adaptive-iSOUP-Bag on streams with abrupt, and the ADWIN-based ensembles on data streams with gradual and incremental change. These results showing different behavior on streams with different drift speeds are not unexpected, given the design of the incorporated Page–Hinckley and ADWIN change detection mechanisms. Both variants of AMRules are second-best performing, and the Adaptive-iSOUP-Tree method family has lower average ranks than the iSOUP-Tree family and higher than AMRules, on all three concept drift types.
We next look at the tables with the pairwise comparisons (Table 7a–c). On abrupt and gradual drifts, the improvements of both novel method families over iSOUP-Tree (and its ensembles) are statistically significant. On streams with incremental change, iSOUP-ADWIN-Bag significantly improves over iSOUP-Bag, and iSOUP-ADWIN-RF and Adaptive-iSOUP-RF significantly improve over iSOUP-RF.
Abrupt concept drift From Fig. 4a, we can see that Adaptive-iSOUP-Bag and \(\text {AMRules}^o\) significantly outperform iSOUP-Tree on streams with abrupt drift. iSOUP-Tree is significantly worse than all methods addressing drift. In fact, the methods from the iSOUP family are clearly the worst, even though not all differences in performance are significant.
Table 7a shows that both Adaptive-iSOUP and iSOUP-ADWIN significantly improve over their iSOUP counterparts. No pairwise comparisons besides these yield significant results. In particular, the Adaptive-iSOUP approaches do not perform better than their iSOUP-ADWIN counterparts.
Incremental and gradual concept drift Under incremental drift, iSOUP-Tree is significantly worse than \(\text {AMRules}^o\) and the ADWIN-based approaches; iSOUP-RF is outperformed only by the ADWIN-based approaches (Fig. 4b). Additionally, iSOUP-ADWIN-Bag is significantly better than iSOUP-Bag, Adaptive-iSOUP-RF, and Adaptive-iSOUP-Tree.
All the significant differences on streams with incremental drift are also significant under gradual drift (Fig. 4c). Also, the differences between Adaptive-iSOUP-Bag and iSOUP-Tree; iSOUP-RF and \(\text {AMRules}^o\); iSOUP-ADWIN-Bag and Adaptive-iSOUP-RF, are significant under gradual change.
On data streams with gradual drift, the methods we propose clearly improve upon their iSOUP counterparts. From Table 7c, it is clear that this holds for both the Adaptive-iSOUP and the iSOUP-ADWIN families. None of the other pairwise comparisons indicate significant differences, especially between Adaptive-iSOUP and iSOUP-ADWIN.
As for gradual drift, on streams with incremental drift, the new methods improve the performance of the iSOUP methods. According to Table 7b, iSOUP-ADWIN significantly improves performance over iSOUP. Adaptive-iSOUP also improves performance, but the difference is only significant for Adaptive-iSOUP-RF, while the p-values are just above the threshold for Adaptive-iSOUP-Tree and Adaptive-iSOUP-Bag.
Concept drift detection Tables 8a–c show the performance of the methods with concept drift detection mechanisms with regard to the Mean Time between False Alarms (MTFA), Mean Time to Detection (MTD), and Mean Time Ratio (MTR) evaluation measures. The tables show the means of the different performance measures across the different data streams. This comparison includes all methods but the iSOUP family. All artificial data streams are used in this comparison, i.e., those without change as well as those with different speeds of concept drift. The table for the Missed Detection Rate (MDR) evaluation measure is not shown, as all of its entries are zero.
The ADWIN-based ensemble methods significantly outperform the remaining methods. The iSOUP-ADWIN-Bag method quickly detects a real concept drift, while producing no false alarms. The iSOUP-ADWIN-RF method rarely raises false alarms.
The methods based on Adaptive-iSOUP-Tree produce numerous change detections, many of which are false alarms. Consequently, they quickly detect change when it indeed happens. Since (1) the method produces local detections and (2) it might disregard a detection as a false alarm if the alternative subtree does not outperform the original, this does not hurt the predictive performance of the models based on Adaptive-iSOUP-Tree. Both variants of AMRules show mid-level performance at concept drift detection.
The methods generally have trouble detecting incremental concept drift. In contrast to incremental drift, under abrupt and gradual concept drift, two adjacent data examples can be sampled from two completely different data generating processes. Such drifts are hence easier to detect, and the methods show comparable performance at detecting them.
Time complexities Table 9 presents the theoretical worst case asymptotic time complexities of the different methods, as well as their running times on the real-world Bicycles dataset. The experiments were run on an Intel® Core™ i7-8700K CPU @ 3.70GHz, under Ubuntu 20.04.1 LTS. The ensemble methods expectedly have an order of magnitude longer run times.
6 Case study: predicting thermal power consumption of the MEX spacecraft
The 3D imagery of Mars that MEX has generated during the past 15 years has provided unprecedented information about the red planet. For MEX to continue providing valuable information, supporting ground exploration missions and other research, and to function properly without breakage, twisting, deformation, or failure of any equipment, careful power management is needed.
The available power, \(\phi _{available}\), stored in the MEX batteries or generated by its solar arrays, that is not consumed by the platform, \(\phi _{platform}\), or by the thermal subsystem, \(\phi _{thermal}\), can be used for science operations, \(\phi _{science}\). More specifically, \(\phi _{science} = \phi _{available} - \phi _{platform} - \phi _{thermal}\). Two of the three terms on the right-hand side of the equation are well known. The 200 thermistors on the spacecraft continually measure the temperatures around it and autonomously turn the electrical heaters on or off, which makes \(\phi _{thermal}\) an unknown variable that is difficult to predict.
In an initial empirical model, ESA identified and incorporated key influencing factors, such as the distance of the spacecraft to the Sun and to Mars, the orbit phase, and instrument and spacecraft operations (Lucas & Boumghar, 2017). However, the aging of the spacecraft has confronted this approach with many challenges. This motivated ESA to organize the MEX Power Challenge\(^{1}\) and reach out to the ML community.
The data for the MEX Power Challenge consist of i) raw telemetry (context) data; and ii) measurements of the electric current on the 33 thermal power lines (observation data). The time period covered in the data spans 4 Martian (or ca. 7.5 Earth) years, from 22\(^{\text {nd}}\) August 2008 to 1\(^{\text {st}}\) March 2016. The MEX data were pre-processed and subjected to feature engineering (Petković et al., 2019).
Here, we apply the data stream MTR methods from Sect. 3 to the 6 MEX data streams corresponding to 6 different time resolutions: 1, 5, 10, 15, 30, and 60 min. Table 2 presents the statistics of the streams. We compare the performance curves of the different methods for each time resolution, with one graph per time resolution (Fig. 5), using the \(\overline{RMAE}\) error measure.
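For reference, the sketch below outlines how such prequential \(\overline{RMAE}\) curves can be produced, assuming that \(\overline{RMAE}\) denotes the mean absolute error of the model relative to that of the mean regressor, averaged across the targets; the model object and its predict/learn interface are illustrative placeholders rather than the actual implementation.

```python
import numpy as np

def prequential_rmae(model, X, Y):
    """Prequential evaluation: first predict each example, then learn it.

    Returns the running mean(RMAE) curve, where the RMAE of each target is
    the cumulative absolute error of the model divided by the cumulative
    absolute error of the (incrementally updated) mean regressor.
    """
    n, m = Y.shape
    abs_err = np.zeros(m)        # cumulative |y - y_hat| per target
    abs_err_mean = np.zeros(m)   # cumulative |y - running_mean| per target
    running_mean = np.zeros(m)
    curve = []
    for t in range(n):
        y_hat = model.predict(X[t])              # test first ...
        abs_err += np.abs(Y[t] - y_hat)
        abs_err_mean += np.abs(Y[t] - running_mean)
        model.learn(X[t], Y[t])                  # ... then train
        running_mean += (Y[t] - running_mean) / (t + 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            rmae = np.where(abs_err_mean > 0, abs_err / abs_err_mean, np.nan)
        curve.append(np.nanmean(rmae))           # average over targets
    return np.array(curve)
```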
For all methods and all time resolutions, the concept drift is clearly visible: there are occasional sharp spikes in error, most likely due to concept drift, e.g., in the curves for iSOUP-Tree and iSOUP-RF in the top-left panel of Fig. 5.
6.1 Performance of AMRules
We first inspect the performance of AMRules. Overall, it achieves \(\overline{RMAE}\) values between 0.95 and 1.0 and performs consistently better than the mean regressor, with the exception of the 1 min resolution. Although the unordered variant \(\text {AMRules}^u\) takes an ensemble-like aggregation approach to making predictions, at the 5 coarsest time resolutions it does not achieve better predictive performance than its ordered counterpart \(\text {AMRules}^o\). Using all rules that cover a given data example, each of them independently detecting concept drift, leads to less sensitive detection and more challenging adaptation. At the highest (1 min) resolution, \(\text {AMRules}^o\) and \(\text {AMRules}^u\) flag ca. 1 and 2 million data examples as anomalies, respectively. Flagged examples are not used to update the rule statistics or to test for concept drift, resulting in \(\overline{RMAE}\) curves with all values above 1. Since a given example can be covered by more than one rule in \(\text {AMRules}^u\), an anomalous data example detected by one rule might still be used to update the statistics of another covering rule, which explains the higher count. At all other resolutions, the number of detected anomalies is negligible.
In the last time period, \(\text {AMRules}^o\) has a lower \(\overline{RMAE}\) than iSOUP-Tree (except at the 1 min resolution). AMRules detects and adapts to concept drift, even when changes occur over a prolonged period of time. iSOUP-Tree, on the other hand, does not address change detection explicitly, which makes it vulnerable to the concept drift that is likely to occur when learning over long periods of time. The errors of iSOUP-Tree thus increase towards the end, at all time resolutions.
6.2 Performance of iSOUP methods
If we focus on the initial time period (the first Martian year), iSOUP-Tree clearly outperforms both variants of AMRules. This is the case for all time resolutions except the 60 min one, where the \(\overline{RMAE}\) curves of the two methods are comparable and very close to each other.
The conclusions on the handling of concept drift drawn for iSOUP-Tree also hold for the ensemble methods iSOUP-Bag and iSOUP-RF. Since neither of them addresses concept drift detection, their predictive performance visibly drops (and their error increases) in the last time period.
iSOUP-Bag and iSOUP-RF struggle to produce good predictions in the initial period of time, in part due to their data sampling, but recover quickly. In the middle time period, iSOUP-Bag shows \(\overline{RMAE}\) values close to those of iSOUP-Tree for the middle 4 resolutions, while outperforming AMRules and the other iSOUP-Tree-based methods at the 1 min resolution. At the coarsest resolution, iSOUP-Bag shows performance similar to iSOUP-Tree. The feature subspace sampling of iSOUP-RF lowers its need for computational resources, at the cost of reduced predictive performance: its \(\overline{RMAE}\) curve lies between those of iSOUP-Tree and iSOUP-Bag.
6.3 Performance of ADWIN-based methods
The ADWIN-based ensemble methods also struggle in the initial period, but they do not recover as quickly as iSOUP-Bag and iSOUP-RF. Instead, their performance improves as we consider finer time resolutions with larger numbers of data examples. Completely removing a base model when concept drift is detected is a severe adaptation action: after removal, the newly initialized model requires some time to learn. False alarms of the ADWIN-based ensembles thus impair their predictive performance. At the highest (1 min) resolution, they outperform the iSOUP-Tree-based and the AMRules methods. The ADWIN online bagging approach outperforms the ADWIN online random forest, except in the final time period at the 60 min resolution.
6.4 Performance of the adaptive methods
Adaptive-iSOUP-Tree and its ensembles outperform all other methods at all time resolutions. Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF, similarly to the other ensemble methods, show high \(\overline{RMAE}\) values in the initial period. As expected, Adaptive-iSOUP-RF performs worse than Adaptive-iSOUP-Bag, due to its random sampling of the attribute space.
6.5 Summary
Overall, \(\text {AMRules}^o\) has the most stable error, while \(\text {AMRules}^u\) has difficulties detecting and handling concept drift. The iSOUP-Tree-based methods model MEX’s thermal power consumption (TPC) much better than AMRules, but are sensitive to changes in the underlying distribution. The ensemble approaches initially have a high error, but improve significantly over time; the ADWIN-based ensembles need to process many more data examples to improve their initially high error curves. The Adaptive-iSOUP-Tree-based methods perform best overall: Adaptive-iSOUP-Bag outperforms all other methods at all time resolutions.
Finally, all methods outperform the mean regressor, as they all achieve \(\overline{RMAE} < 1\) (with the exception of AMRules at the 1 min resolution and the ADWIN-based methods at resolutions coarser than 1 min). The mean regressor is a relatively strong baseline in terms of predictive performance, but it is unable to adapt to unforeseen changes in the data.
Returning to the two groups of methods proposed in this paper, the Adaptive-iSOUP group clearly performs best, with Adaptive-iSOUP-Bag at the forefront. The iSOUP-ADWIN group is second best at the highest (1 min) resolution and clearly worst at all other resolutions; AMRules is worst at the highest resolution and second worst at all others. The relative performance of the different methods is most clearly visible at the 30 min resolution.
7 Conclusion and future work
In this paper, we propose novel methods for learning from multi-target data streams that incorporate change detection and adaptation mechanisms. In particular, we introduce two MTR method families, one of which uses the ensemble-based ADWIN change detection mechanism. iSOUP-ADWIN-Bag combines the iSOUP-Tree method for MTR as a base model with online bagging and the ADWIN mechanism. iSOUP-ADWIN-RF is also based on ADWIN, but replaces the original iSOUP-Tree method with a modified version that performs feature sampling in its leaves, thus learning random forest ensembles of trees for MTR.
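As an illustration of the scheme, and not of the actual implementation, the following sketch outlines ensemble-level detection and adaptation: online bagging feeds each member Poisson(1)-weighted copies of every example, a per-member detector monitors the member's error, and a signalled change triggers the replacement of the worst member (as in ADWIN bagging; Bifet et al., 2009). The base-model interface (learn, error) and the detector interface (update, drift_detected, estimation) are assumptions made for the sketch.

```python
import math
import random

class AdwinBaggingMTR:
    """Sketch of ADWIN-style online bagging for multi-target regression.

    Each base model receives Poisson(1) copies of every example (online
    bagging; Oza, 2005). A per-member change detector monitors the
    member's error; when any detector fires, the member with the largest
    current error estimate is replaced by a fresh model. The factories
    stand in for iSOUP-Tree and ADWIN in the paper's setting.
    """

    def __init__(self, model_factory, detector_factory, n_models=10, seed=1):
        self.new_model = model_factory
        self.new_detector = detector_factory
        self.models = [model_factory() for _ in range(n_models)]
        self.detectors = [detector_factory() for _ in range(n_models)]
        self.rng = random.Random(seed)

    def learn_one(self, x, y):
        change = False
        for model, det in zip(self.models, self.detectors):
            det.update(model.error(x, y))    # feed the member's error
            change |= det.drift_detected     # assumed detector interface
            for _ in range(self._poisson1()):
                model.learn(x, y)            # online bagging update
        if change:
            # Adaptation: replace the worst member, reset its detector.
            worst = max(range(len(self.models)),
                        key=lambda i: self.detectors[i].estimation)
            self.models[worst] = self.new_model()
            self.detectors[worst] = self.new_detector()

    def _poisson1(self):
        # Knuth's algorithm for sampling Poisson(lambda = 1).
        L, k, p = math.exp(-1.0), 0, 1.0
        while True:
            k += 1
            p *= self.rng.random()
            if p <= L:
                return k - 1
```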
Next, we propose a novel MTR method family that extends iSOUP-Tree with a concept drift detection and adaptation mechanism based on the Page–Hinckley statistic. Adaptive-iSOUP-Tree can detect local as well as global changes. We then introduce the Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF methods for learning tree ensembles, using Adaptive-iSOUP-Tree as the base learner.
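For reference, a compact version of the underlying Page–Hinckley test: it accumulates the deviations of the monitored signal (e.g., the prediction error) from its running mean, allowing for a tolerance \(\delta\), and signals a change when the cumulative sum rises more than \(\lambda\) above its historical minimum. The parameter values below are illustrative, not the ones used in our experiments.

```python
class PageHinckley:
    """Page-Hinckley test for detecting an increase in a monitored signal."""

    def __init__(self, delta: float = 0.005, lam: float = 50.0):
        self.delta = delta      # tolerance for small fluctuations
        self.lam = lam          # detection threshold lambda
        self.n = 0
        self.mean = 0.0         # running mean of the observations
        self.cum = 0.0          # cumulative sum m_T
        self.cum_min = 0.0      # historical minimum M_T

    def update(self, x: float) -> bool:
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam
```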
Finally, we present extensive evaluation results and discuss the statistical significance of the differences in performance between the compared methods. The evaluation includes both real-world and artificially generated data streams. In real-world streams, concept drift may occur at unknown point(s) in time, while the artificial streams contain injected drift with a controlled speed.
Our methods explicitly detect changes and adapt accordingly. On real-world data streams, the novel Adaptive-iSOUP-Tree method and its online ensembles outperform the competitors. These methods show statistically significant improvement over their most similar counterparts.
On artificial data streams with no drift, ensemble-level concept drift detection and adaptation methods (based on ADWIN) significantly outperform ensembles without such a mechanism. Even in this case, where no change has occurred, the change detection mechanism facilitates the management of the base models in the ensemble, yielding significantly improved performance. iSOUP-ADWIN-Bag outperforms not only its most similar counterpart but all other methods as well.
The MTR methods that do not address concept drift are outperformed, at all concept drift speeds, by the methods that incorporate change detection and adaptation mechanisms. Adaptive-iSOUP-Bag is the best-performing method on abrupt changes, while the ADWIN-based ensembles perform best on gradual and incremental changes. The novel methods show statistically significant improvements over their most similar existing methods for all concept drift types.
The ADWIN-based methods show the best performance at detecting concept drift. While iSOUP-ADWIN-Bag raises no false alarms and iSOUP-ADWIN-RF does so only rarely, the Adaptive-iSOUP methods, which detect local changes, raise numerous alarms. Both families of methods quickly detect the changes that have occurred. The AMRules methods fall in the middle.
The current methods for change detection in MTR on data streams use different ways of aggregating the errors for the different targets and/or aggregating the detected change signals. The iSOUP-ADWIN methods monitor the average error across all targets and detect change based on that statistic, while Adaptive-iSOUP monitors the error per target and aggregates the detected change signals, declaring change only when it is detected on all targets. A systematic exploration of the different possibilities (e.g., declaring overall drift when at least one target, or a majority of the targets, exhibits drift) would be an excellent direction for further work.
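The alternatives can be summarized in a few lines. In the sketch below, which reuses the per-target detector outputs, the "all" rule corresponds to the policy of Adaptive-iSOUP, while "any" and "majority" are the directions suggested above.

```python
def aggregate_drift(per_target_flags, rule="all"):
    """Combine per-target change signals into one overall drift decision.

    per_target_flags: booleans, one per target, e.g., the outputs of
    per-target Page-Hinckley tests for the current data example.
    """
    if rule == "any":       # most sensitive: one drifting target suffices
        return any(per_target_flags)
    if rule == "majority":  # intermediate sensitivity
        return sum(per_target_flags) > len(per_target_flags) / 2
    if rule == "all":       # most conservative (used by Adaptive-iSOUP)
        return all(per_target_flags)
    raise ValueError(f"unknown rule: {rule}")
```

For example, `aggregate_drift([d.update(e) for d, e in zip(detectors, errors)], rule="majority")` would declare drift as soon as more than half of the targets exhibit it.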
The introduced online MTR methods can be applied to MLC problems via the MLC-via-MTR transformation: after the discrete values of the target attributes are mapped to numeric values, all of the proposed MTR methods can be applied. One line of future work is the empirical evaluation of these methods on MLC tasks.
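A minimal sketch of this transformation is given below: each binary label becomes a numeric target in {0, 1}, and the real-valued MTR predictions are thresholded to recover a label set. The 0.5 threshold is one common choice, not a prescribed one.

```python
import numpy as np

def labels_to_targets(label_sets, n_labels):
    """Encode label sets (sets of label indices) as 0/1 target vectors."""
    Y = np.zeros((len(label_sets), n_labels))
    for i, labels in enumerate(label_sets):
        Y[i, list(labels)] = 1.0
    return Y

def predictions_to_labels(y_hat, threshold=0.5):
    """Decode real-valued MTR predictions back into a label set."""
    return {j for j, v in enumerate(y_hat) if v >= threshold}
```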
Another line of future work is to apply the algorithms to real-world data stream mining problems in collaboration with domain experts. Besides accurate predictions, the aim would be to identify points in time when change has happened. The domain experts would ideally inspect the concept drift detection points and confirm or refute them. This approach can benefit 1) the data stream mining research field, by producing real-world multi-target data with identified concept drift, and 2) the domain experts, by identifying points in time where changes in the data might be taking place.
A final direction to explore in further work is change detection and adaptation in semi-supervised and unsupervised learning settings. The iSOUP-Tree method, which is at the heart of the work presented here, has recently been extended to incrementally learn predictive clustering trees, which handle supervised, semi-supervised and unsupervised learning uniformly (Osojnik et al., 2020). This opens the road to an immediate extension of our work towards semi-supervised learning, and with some additional effort, also unsupervised learning.
Materials availability
Not applicable.
Code availability
The code for this paper is available at https://github.com/BStevanoski/change-detection-adaptation-for-MTR-data-streams.
Notes
1. https://kelvins.esa.int/mars-express-power-challenge/ [Last accessed: 29 February 2024]
References
Aho, T., Ženko, B., Džeroski, S., & Elomaa, T. (2012). Multi-target regression with rule ensembles. Journal of Machine Learning Research, 13, 2367–2407.
Almeida, E., Ferreira, C., & Gama, J. (2013). Adaptive model rules from data streams. In Proc. ECML/PKDD (Machine Learning and Knowledge Discovery in Databases), 480–492.
Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: Theory and application. Prentice Hall.
Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18, 2653–2688.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
Bhattacharyya, B. (1987). One sided Chebyshev inequality when the first four moments are known. Communications in Statistics-Theory and Methods, 16, 2789–2791.
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., & Gavalda, R. (2009). New ensemble methods for evolving data streams. In Proc. 15th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 139–148.
Bifet, A., Read, J., Pfahringer, B., Holmes, G., & Žliobaitė, I. (2013). CD-MOA: Change detection framework for massive online analysis. In Proc. Intl. Symp. Intelligent Data Analysis, 92–103. Springer.
Bifet, A., & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. In Proc. SIAM Intl. Conf. Data Mining, 443–448.
Bifet, A., Gavaldà, R., Holmes, G., & Pfahringer, B. (2018). Machine learning for data streams with practical examples in MOA. MIT Press.
Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). MOA: Massive Online Analysis. Journal of Machine Learning Research, 11, 1601–1604.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Chaouki, A., Read, J., & Bifet, A. (2023). Online decision tree construction with deep reinforcement learning. In Sixteenth European Workshop on Reinforcement Learning.
Chaouki, A., Read, J., & Bifet, A. (2024). Online learning of decision trees with Thompson sampling. In International Conference on Artificial Intelligence and Statistics, 2944–2952. PMLR.
Dawid, A. P. (1984). Present position and potential developments: Some personal views. Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General), 147, 278–290.
Dehghan, M., Beigy, H., & ZareMoodi, P. (2016). A novel concept drift detection method in data streams using ensemble classifiers. Intelligent Data Analysis, 20, 1329–1350.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Duarte, J., Gama, J., & Bifet, A. (2016). Adaptive model rules from high-speed data streams. ACM Transactions on Knowledge Discovery from Data, 10, 30.
Fanaee-T, H., & Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2, 113–127.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86–92.
Gama, J., Sebastiao, R., & Rodrigues, P. P. (2009). Issues in evaluation of stream learning algorithms. In Proc. 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 329–338.
Gama, J. (2010). Knowledge discovery from data streams. CRC Press.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46, 1–37.
Gao, J., Fan, W., Han, J., & Yu, P. S. (2007). A general framework for mining concept-drifting data streams with skewed distributions. In Proc. SIAM Intl. Conf. Data Mining, 3–14. SIAM.
Gomes, H. M., Barddal, J. P., Ferreira, L. E. B., & Bifet, A. (2018). Adaptive random forests for data stream regression. In Proc. European Symp. Artificial Neural Network (ESANN), 267–272.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfharinger, B., Holmes, G., & Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106, 1469–1495.
Gustafsson, F. (2000). Adaptive filtering and change detection. Wiley.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Incremental multi-target model trees for data streams. In Proc. ACM Symp. on Applied Computing, 988–993.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23, 128–168.
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods, 9, 571–595.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46, 817–833.
Korycki, Ł., & Krawczyk, B. (2021). Concept drift detection from multi-class imbalanced data streams. In Proc. 37th IEEE Intl. Conf. Data Engineering (ICDE), 1068–1079. IEEE.
Langley, P. (1996). Elements of machine learning. Morgan Kaufmann.
Last, M., Sinaiski, A., & Subramania, H. S. (2010). Predictive maintenance with multi-target classification models. In Proc. Asian Conf. Intelligent Information and Database Systems, 368–377. Springer.
Liao, G., Zhang, P., Yin, H., Deng, X., Li, Y., Zhou, H., & Zhao, D. (2023). A novel semi-supervised classification approach for evolving data streams. Expert Systems with Applications, 215, 119273. https://doi.org/10.1016/j.eswa.2022.119273
Lucas, L., & Boumghar, R. (2017). Machine learning for spacecraft operations support - The Mars Express power challenge. In Proc. Intl. Conf. Space Mission Challenges for Information Technology, 82–87.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Mastelini, S. M., Santana, E. J., Costa, V. G. T., & Barbon, S. (2018). Benchmarking multi-target regression methods. In Proc. 7th Brazilian Conference on Intelligent Systems, 396–401. IEEE.
Mouss, H., Mouss, D., Mouss, N., & Sefouhi, L. (2004). Test of Page-Hinckley, an approach for fault detection in an Agro-alimentary production system. In Proc. 5th Asian Control Conference, 2, 815–818. IEEE.
Narasimhamurthy, A. M., & Kuncheva, L. I. (2007). A framework for generating data to simulate changing environments. In Proc. 25th Intl. Conf. Artificial Intelligence and Applications, 384–389.
Nemenyi, P. B. (1963). Distribution-free multiple comparisons. Princeton University.
Osojnik, A., Panov, P., & Džeroski, S. (2017). Multi-label classification via multi-target regression on data streams. Machine Learning, 106(6), 745–770.
Osojnik, A., Panov, P., & Džeroski, S. (2018). Tree-based methods for online multi-target regression. Journal of Intelligent Information Systems, 50, 315–339.
Osojnik, A., Panov, P., & Džeroski, S. (2020). Incremental predictive clustering trees for online semi-supervised multi-target regression. Machine Learning, 109, 2121–2139.
Oza, N. C. (2005). Online bagging and boosting. In Proc. IEEE Intl. Conf. on Systems, Man and Cybernetics, 3, 2340–2345.
Oza, N. C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. In Proc. 7th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 359–364.
Petković, M., Boumghar, R., Breskvar, M., Džeroski, S., Kocev, D., Levatić, J., Lucas, L., Osojnik, A., Ženko, B., & Simidjievski, N. (2019). Machine learning for predicting thermal power consumption of the Mars Express spacecraft. IEEE Aerospace and Electronic Systems Magazine, 34, 46–60.
Read, J. (2018). Concept-drifting data streams are time series: The case for continuous adaptation. arXiv:1810.02266.
Shi, Z., Wen, Y., Feng, C., & Zhao, H. (2014). Drift detection for multi-label data streams based on label grouping and entropy. In Proc. ICDM (Intl. Conf. Data Mining) Workshops, 724–731. IEEE.
Sobhani, P., & Beigy, H. (2011). New drift detection method for data streams. In Adaptive and Intelligent Systems, 88–97. Springer.
Sousa, R., & Gama, J. (2016). Online semi-supervised learning for multi-target regression in data streams using AMRules. In Proc. Intl. Symp. Intelligent Data Analysis, 123–133.
Souza, V. M., Chowdhury, F. A., & Mueen, A. (2020). Unsupervised drift detection on high-speed data streams. In Proc. Intl. Conf. Big Data, 102–111. IEEE.
Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., & Vlahavas, I. (2016). Multi-target regression via input space expansion: Treating targets as inputs. Machine Learning, 104, 55–98.
Stevanoski, B., Kocev, D., Osojnik, A., Dimitrovski, I., & Džeroski, S. (2019). Predicting thermal power consumption of the Mars Express satellite with data stream mining. In Proc. Intl. Conf. Discovery Science, 186–201. Springer.
Struyf, J., & Džeroski, S. (2005). Constraint based induction of multi-objective regression trees. In Proc. Intl. Wshp. Knowledge Discovery in Inductive Databases, 222–233. Springer.
Vazquez, E., & Walter, E. (2003). Multi-output support vector regression. IFAC Proceedings Volumes, 36, 1783–1788.
Wei, H., Wang, X., Wen, Z., Li, E., & Wang, H. (2024). An ensemble-adaptive tree-based chain framework for multi-target regression problems. Information Sciences, 653, 119769. https://doi.org/10.1016/j.ins.2023.119769
Wilcoxon, F. (1945). Individual comparisons by ranking methods. In Breakthroughs in Statistics, 196–202.
Yekutieli, D., & Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82(1–2), 171–196.
Zhang, Q., Tsang, E. C. C., He, Q., & Guo, Y. (2023). Ensemble of kernel extreme learning machine based elimination optimization for multi-label classification. Knowledge-Based Systems, 278, 110817. https://doi.org/10.1016/j.knosys.2023.110817
Funding
This work was supported by: Slovenian Research Agency (grants J2-2505, P2-0103, and young researcher grant PR-09773 to AK); EU (Horizon 2020, GA No 952215, project TAILOR); Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia (scholarship to BS).
Author information
Contributions
Bozhidar Stevanoski: Conceptualization, Software, Visualization, Investigation, Writing - Original Draft, Validation; Ana Kostovska: Conceptualization, Writing - Review and Editing; Panče Panov: Conceptualization, Writing - Review and Editing, Supervision; Sašo Džeroski: Conceptualization, Methodology, Writing - Review and Editing, Supervision, Funding acquisition.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Ana Carolina Lorena, Albert Bifet, Rita P. Ribeiro.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Datasets
A.1 Artificial data streams
We use an off-the-shelf MTR data generator (Mastelini et al., 2018) to obtain data streams with controllable concept drift occurrence and speed. The dependencies between the target attributes are shown in Table 10.
A.2 Real-world datasets
We here briefly describe the real-world data streams used in the empirical evaluation.
The Bicycles (Fanaee-T & Gama, 2014) data stream is obtained from a bicycle sharing system, where a bicycle can be rented at one location in a city and returned at another. The data examples describe the hour-by-hour rentals. The descriptive attributes include information about the time of day and the day of the week, as well as weather information. The three target attributes represent the numbers of rentals by registered users and by casual users, and the total number of rentals for that hour.
Each data example in the Supply Chain Management (SCM) data streams (Spyromitros-Xioufis et al., 2016), SCM1d and SCM20d, corresponds to an observation day of the Trading Agent Competition in Supply Chain Management (TAC SCM) tournament. The descriptive attributes include the daily prices for each observed product, as well as 4 time-delayed observations with delays of 1, 2, 4, and 8 days. The 16 target attributes correspond to the mean prices on the next day (for SCM1d) or on the day 20 days in the future (for SCM20d).
The RF1 and RF2 data streams (Spyromitros-Xioufis et al., 2016) concern the river flows at 8 locations in the Mississippi River network. Each data example comprises, as descriptive attributes, the most recent observations from each of the 8 locations, plus time-lagged observations from 6, 12, 18, 24, 36, 48, and 60 hours in the past; the targets are the flow values 48 hours in the future at the 8 monitored locations. The RF1 data stream thus has 8 targets and 64 descriptive attributes. RF2 extends RF1 with descriptive attributes giving precipitation forecasts in 6-hour windows (6, 12, 18, 24, 30, 36, 42, and 48 hours ahead) that cover the 8 locations, as well as 19 other sites.
The Mars Express (MEX) (Petković et al., 2019; Stevanoski et al., 2019) data streams concern the prediction of thermal power consumption (TPC) of the MEX spacecraft over 6 different time resolutions. The target attributes represent the measurements of the electrical current/power on all 33 thermal power lines. The descriptive attributes consist of five components: (1) Solar Aspect Data, (2) Detailed Mission Operations Plans, (3) Flight dynamics TimeLine, (4) Miscellaneous events, and (5) Long Term Data.
Appendix B: Detailed results
B.1 Pairwise method comparison
In this section, we show the results for the method pairs in the pairwise comparison of predictive performance. Note that the tables in the main text of the paper give the results only for selected and statistically significant method pairs of interest.
Tables 11, 12, 13, 14 and 15 show the p-values of the Wilcoxon tests and the corresponding Benjamini–Hochberg corrections on real-world data streams, and on data streams with no concept drift and with abrupt, incremental, and gradual concept drift, respectively.
B.2 Mean evaluation metrics per data stream type
Table 16 presents the mean values of \(\overline{RMAE}\) across data stream types.
Appendix C: Statistical comparisons
C.1 Comparison of multiple methods
We use the Friedman test (Iman & Davenport, 1980; Friedman, 1940) with post-hoc Nemenyi analysis (Nemenyi, 1963; Demšar, 2006) to assess the statistical significance of the differences in predictive performance of multiple algorithms.
The Friedman test is a non-parametric statistical test that ranks the methods on each data stream. It assigns rank 1 to the best performing method, rank 2 to the second-best, etc. If multiple methods have identical evaluation metric values, they are all assigned the appropriate average rank. Denoting the rank of the \(j^{th}\) method on the \(i^{th}\) data stream by \(r_i^j\), and assuming we compare k methods over N data streams, the Friedman test defines the average algorithm rank \(R_j = \frac{1}{N} \sum _{i=1}^N r_i^j\). The statistic suggested by Friedman,

$$\chi _F^2 = \frac{12N}{k(k+1)} \left[ \sum _{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right], \qquad \text {(C1)}$$

follows the \(\chi ^2\) distribution with \(k-1\) degrees of freedom. As it has been shown to be overly conservative by Iman and Davenport (1980), we utilize their alternative corrected statistic

$$F_F = \frac{(N-1)\,\chi _F^2}{N(k-1) - \chi _F^2}, \qquad \text {(C2)}$$

which is distributed according to the Fisher-Snedecor distribution with \(k-1\) and \((k-1)(N-1)\) degrees of freedom.
The null hypothesis states that all methods are equivalent, i.e., perform equally well. It is rejected when the \(F_F\) value is greater than the critical value for the chosen significance level \(\alpha\), in which case Nemenyi post-hoc tests (Nemenyi, 1963) follow. In our experiments, we use a significance level \(\alpha\) of 0.05.
The Nemenyi post-hoc test concludes that there is a significant difference in predictive performance between two methods if their average ranks differ by at least the critical difference (CD), defined as

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}, \qquad \text {(C3)}$$

where \(q_\alpha\) is the critical value of the two-tailed Nemenyi test at significance level \(\alpha\). In the literature, the term critical difference is often referred to as critical distance, and the two terms are used interchangeably. The Nemenyi critical values are equal to the critical values of the Studentized range distribution divided by \(\sqrt{2}\). As before, our experimental setup uses a significance level of 0.05.
The results of the Nemenyi post-hoc analysis are visually represented in an average rank diagram, showing the critical distance and the average rank of each method. Groups of methods whose performance does not differ statistically significantly are connected with a line.
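For completeness, a sketch of this procedure using NumPy and SciPy is given below (the studentized_range distribution requires SciPy 1.7 or later); the large finite degrees-of-freedom value approximates the asymptotic Nemenyi critical value.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist, studentized_range

def friedman_nemenyi(errors, alpha=0.05):
    """Friedman test with the Iman-Davenport correction (C2) and the
    Nemenyi critical difference (C3).

    errors: (N, k) array; entry [i, j] is the evaluation metric value of
    method j on data stream i (lower is better).
    """
    N, k = errors.shape
    ranks = np.apply_along_axis(rankdata, 1, errors)   # ties -> average rank
    R = ranks.mean(axis=0)                             # average ranks R_j
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)    # Iman-Davenport
    p_value = f_dist.sf(f_f, k - 1, (k - 1) * (N - 1))
    # q_alpha: Studentized range critical value divided by sqrt(2); a large
    # df approximates the asymptotic (df = infinity) Nemenyi value.
    q_alpha = studentized_range.ppf(1 - alpha, k, 1e6) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    return R, p_value, cd
```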
C.2 Pairwise method comparison
We use the Wilcoxon signed-ranks test (Wilcoxon, 1945) with the Benjamini–Hochberg correction (Benjamini & Hochberg, 1995) to assess the significance of the differences in performance between the novel proposed methods and their most similar counterparts.
Like the Friedman test, the Wilcoxon signed-ranks test is a non-parametric test. It ranks the differences \(d_i = c_i^2 - c_i^1\) according to their absolute values, where \(c_i^1\) and \(c_i^2\) are the evaluation metric values of the two compared methods on the \(i^{th}\) data stream. Summing the ranks according to the signs of the differences \(d_i\) defines \(R^+\) and \(R^-\):

$$R^+ = \sum _{d_i > 0} \text {rank}(d_i) + \frac{1}{2} \sum _{d_i = 0} \text {rank}(d_i), \qquad R^- = \sum _{d_i < 0} \text {rank}(d_i) + \frac{1}{2} \sum _{d_i = 0} \text {rank}(d_i). \qquad \text {(C4)}$$

The statistic

$$z = \frac{T - \frac{1}{4} N (N+1)}{\sqrt{\frac{1}{24} N (N+1) (2N+1)}} \qquad \text {(C5)}$$

of the minimal sum of ranks \(T = \text {min}(R^+, R^-)\) and the number of data streams N is approximately normally distributed under the null hypothesis of no performance difference between the two methods, and is used for two-sided comparison testing. We are interested in whether our novel methods show improved performance over their most similar counterparts. Hence, we utilize the one-sided (one-tailed) version of the test, where the statistic z in Equation (C5) with \(T=R^+\) is normally distributed as well. The p-value is derived as \(p = 2(1-\Phi (|z|))\) for the two-sided and \(p=1-\Phi (z)\) for the one-sided comparison.
To control the false discovery rate when performing multiple Wilcoxon tests whose outcomes are not independent, we apply the Benjamini–Hochberg correction. We use the alternative formulation of the correction (Yekutieli & Benjamini, 1999), which adjusts the p-values instead of selecting the hypotheses to be rejected.
When performing \(t\) tests, we sort the obtained p-values as \(p_{1} \le p_{2} \le \dots \le p_{t}\). The Benjamini–Hochberg corrected p-values are then defined as

$$\tilde{p}_{i} = \min _{j \ge i} \left\{ \min \left( \frac{t}{j}\, p_{j},\ 1 \right) \right\}. \qquad \text {(C6)}$$
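A sketch of the full pairwise procedure, combining one-sided Wilcoxon tests (as provided by SciPy) with the step-up adjustment of Equation (C6), could look as follows.

```python
import numpy as np
from scipy.stats import wilcoxon

def one_sided_wilcoxon_bh(novel_errors, counterpart_errors, alpha=0.05):
    """One-sided Wilcoxon signed-ranks tests with the Benjamini-Hochberg
    adjustment of Equation (C6).

    novel_errors: list of 1-D arrays, one per novel method, holding its
    per-stream metric values; counterpart_errors: matching arrays for the
    most similar counterparts (lower values = better).
    """
    # H1: the novel method has lower error than its counterpart.
    p = np.array([wilcoxon(n, c, alternative="less").pvalue
                  for n, c in zip(novel_errors, counterpart_errors)])
    t = len(p)
    order = np.argsort(p)            # p_(1) <= ... <= p_(t)
    adjusted = np.empty(t)
    running_min = 1.0
    for i in range(t - 1, -1, -1):   # step-up: minimum over j >= i
        running_min = min(running_min, t * p[order[i]] / (i + 1))
        adjusted[order[i]] = running_min
    return p, adjusted, adjusted <= alpha
```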
The approaches of using Friedman’s test with post-hoc Nemenyi analysis and of the Wilcoxon test are recommended and explained in detail by Demšar (2006) and briefly by Benavoli et al. (2017).
Appendix D: Additional experiments
D.1 Impact of the number of independent variables
In this subsection, we explore the impact of the number of independent variables. We construct artificial data streams following the same process as described in Appendix A, but with half the number of independent variables (22 instead of 45).
Figure 6d shows the critical distance diagram for the case with no concept drift. The number of independent variables did not impact the results much in this case: iSOUP-ADWIN-Bag still performs best, significantly outperforming the same methods as before (iSOUP-Tree and iSOUP-RF, Adaptive-iSOUP-Tree and Adaptive-iSOUP-RF, and iSOUP-ADWIN-RF). The order of the second-best performing methods also stayed the same.
Figure 6a–c show the critical distance diagrams for abrupt, incremental, and gradual concept drift, respectively. The \(\text {AMRules}^o\) method ranks better with fewer independent variables, closing the performance gap to iSOUP-ADWIN-Bag on incremental and gradual concept drift and outperforming it on abrupt drift, although not statistically significantly. The remaining methods show stable rankings across the different numbers of independent variables.
D.2 Impact of the number of dependent variables undergoing concept drift
In this subsection, we explore the impact of the number of dependent variables affected by concept drift. Namely, we generate the same artificial data streams as described in Appendix A, but let only one target variable, Y1, undergo concept drift. Figure 7a–c show the corresponding critical distance diagrams for abrupt, incremental, and gradual concept drift. iSOUP-ADWIN-Bag remains the best performing method, followed by \(\text {AMRules}^o\), for all concept drift types. The only methods affected by the number of variables undergoing concept drift are the random forest methods, which are among the worst performing methods for all concept drift types. The remaining methods show the same performance ranking as described in Sect. 5.4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.