Abstract
An essential characteristic of data streams is the possibility of concept drift, i.e., a change in the distribution of the data in the stream over time. The capability to detect and adapt to change is thus a necessity for data stream mining methods. While methods for multi-target prediction on data streams have recently appeared, they have largely remained without such capabilities. In this paper, we propose novel methods for change detection and adaptation in the context of incremental online learning of decision trees for multi-target regression. One of the approaches we propose is ensemble-based, while the other uses the Page–Hinckley test. We perform an extensive evaluation of the proposed methods on real-world and artificial data streams and show their effectiveness. We also demonstrate their utility on a case study from spacecraft operations, where cosmic events can cause change and demand appropriate and timely repositioning of the spacecraft.
1 Introduction
Machine learning (ML) tasks can be categorized as supervised, semi-supervised, or unsupervised (Langley, 1996). The goal of supervised and semi-supervised algorithms is to learn, from a set of data examples, models that predict the values of one or more target attributes from the values of descriptive attributes. Tasks with a single target attribute are named single-target prediction tasks, while tasks with multiple target attributes are named multi-target prediction (MTP) tasks (Kocev et al., 2013). When predicting multiple discrete-valued attributes, we talk about multi-target classification (MTC) (Last et al., 2010); when predicting multiple continuous-valued attributes, about multi-target regression (MTR) (Struyf & Džeroski, 2005). In the special case of MTC where the targets are binary, we talk about multi-label classification (MLC) (Madjarov et al., 2012).
As devices that generate huge amounts of data are omnipresent, ML faces increasing data complexity – not only in the number of target or descriptive columns, but also in the number of rows and the velocity at which they become available. In the extreme case, an infinite number of rows may be continuously arriving, and storing them for future knowledge extraction is obviously impossible. In this case, we are talking about data streams (Bifet et al., 2018), commonly referred to as online data. In online learning, i.e., mining data streams (Gama, 2010), multi-target prediction has recently been addressed (Shi et al., 2014; Osojnik et al., 2017; Ikonomovska et al., 2011), in particular for the task of MTR. Here, the AMRules (Almeida et al., 2013; Sousa & Gama, 2016) and iSOUP-Tree (Osojnik et al., 2018, 2020) approaches are of note.
Examples in data streams arrive sequentially and are temporally ordered. The temporal dimension implies that the data distribution may change over time. In ML, such change is referred to as concept drift, or simply change (Gama, 2010). The task of change detection and adaptation has also been addressed in the data streams literature (Gama et al., 2014). Two major approaches deserve a mention here: the ensemble-based ADWIN approach (Bifet et al., 2009) and the FIMT-DD approach (Ikonomovska et al., 2011), based on the Page–Hinckley test (Mouss et al., 2004). However, change detection and adaptation for the task of multi-target regression on data streams have only been considered in AMRules (Almeida et al., 2013; Sousa & Gama, 2016): They have so far not been considered in tree or tree ensemble approaches to MTR on data streams.
Contributions This paper is concerned with supervised online algorithms for MTR able to detect concept drift and adapt the learned model accordingly. It considers tree-based and tree-ensemble-based approaches to MTR. The main contributions of the paper to the research field are summarized as follows:
-
1.
Two novel method families for change detection and adaptation in the context of supervised MTR on data streams – iSOUP–ADWIN, based on the ADWIN approach wrapped around iSOUP–Tree, and Adaptive–iSOUP, a direct extension of iSOUP–Tree with change detection and adaptation mechanisms based on the Page–Hinckley test.
-
2.
Extensive empirical evaluation of the developed approaches, including a comparison with competing methods, where we consider both predictive performance and performance at detecting change, by executing experiments on real-world and artificial data streams with concept drift. The Adaptive-iSOUP family performs best on real-world streams, while iSOUP-ADWIN performs best overall on artificial data streams, both without concept drift and with concept drift of different types.
-
3.
Application of the novel methods and a comparison to their competitors in a real-world practical use case, i.e., predicting the thermal power consumption of the Mars Express (MEX) spacecraft, showing that Adaptive–iSOUP performs the best.
2 Background and related work
2.1 Multi-target prediction on data streams
The task of predictive modeling is to learn a model that predicts the values of a dependent variable y from vectors of values of independent variables X, thus approximating a function \(y=f(X)\), from a set of training examples (pairs of the form \((X_i, y_i )\)). In the batch case, all examples are presented together in the form of a dataset, while in the online case they are presented sequentially in the form of a data stream. When y takes scalar values, we are talking about single-target prediction, i.e., classification (if discrete) or regression (if continuous). Predicting vectors of values \(Y = (y_1,..., y_n )\) is the task of multi-target prediction (MTP), i.e., multi-target classification (MTC) for discrete and multi-target regression (MTR) for continuous values of y.
The task of classification has been extensively addressed in stream mining (Gama, 2010; Liao et al., 2023), with tree-based approaches playing a crucial role. State-of-the-art approaches are based on algorithms for learning ensembles of models, such as Bagging (Breiman, 1996) and Random forests (Breiman, 2001). Adaptations of these algorithms to the streaming setting have been proposed by Oza and Russell (2001) and Oza (2005). Regression has received much less attention in the streaming setting, but some tree-based approaches do exist (Ikonomovska et al., 2011; Chaouki et al., 2023, 2024).
MTP has been researched extensively for the batch setting, where we have tree-based (Struyf & Džeroski, 2005; Wei et al., 2024), rule-based (Aho et al., 2012), and kernel-based (Vazquez & Walter, 2003; Zhang et al., 2023) methods. MTP for data streams, in contrast, has only received attention recently. There are a few methods for MTC (Shi et al., 2014; Osojnik et al., 2017) on data streams. For MTR on data streams, there are two state-of-the-art methods, i.e., the tree-based incremental Structured Output Prediction tree (iSOUP-Tree) (Osojnik et al., 2018) approach, based on the Fast Incremental Model Tree for Multi Target prediction (FIMT-MT) method (Ikonomovska et al., 2011) [the multi-target extension of Fast Incremental Model Trees with Drift Detection (FIMT-DD) (Ikonomovska et al., 2011)], and the rule-based approach of learning Adaptive Model Rules from High Speed Data Streams (AMRules) (Almeida et al., 2013). Ensemble approaches, which learn collections of base models, widen the space of available methods: iSOUP-Tree has been used in combination with online bagging and random forests (Oza & Russell, 2001; Oza, 2005) in the iSOUP-Bag and iSOUP-RF methods (Osojnik et al., 2018), respectively.
2.2 Change detection and adaptation
Assuming that \(\mathcal {X}\) and \(\mathcal {Y}\) are the input and output space, respectively, and that the data generating process is characterized by the joint distribution \(P(\mathcal {X},\mathcal {Y})\), we are interested in changes of this distribution over time. Decomposing the joint distribution as \(P(\mathcal {X},\mathcal {Y}) = P(\mathcal {X}) P(\mathcal {Y} | \mathcal {X})\), we can note three kinds of changes to it (Gao et al., 2007). The first is a sampling shift in \(P(\mathcal {X})\), known as virtual concept drift. The second is a change in the conditional probability \(P(\mathcal {Y} | \mathcal {X})\), known as real concept drift, or simply concept drift. Finally, change can occur in both \(P(\mathcal {X})\) and \(P(\mathcal {Y} | \mathcal {X})\). In this paper, we are interested in investigating real concept drift. Drift occurring in limited areas of the data space is called local, while global concept drift affects the whole data space.
Figure 1 visually illustrates three speeds of concept drift (Gama et al., 2014): (1) abrupt/sudden, (2) incremental, and (3) gradual. The different speeds of concept drift make it difficult to distinguish it from intrinsic noise or outliers.
2.3 Methods for concept drift detection and adaptation
Concept drift detection and adaptation are receiving significant attention in data stream mining (Gama et al., 2014). Here, we describe the two methods that are most often used in practice, the Page–Hinckley test and the ADWIN method, which are used in our approaches. We then discuss other related work on the topic.
The Page–Hinckley (PH) test Mouss et al. (2004) considers a cumulative variable \(m_T = \sum _{t=1}^T (x_t - \bar{x}_T - \alpha )\), where \(x_t\) is the observed value of the monitored variate at time t, \(\bar{x}_T = \frac{1}{T} \sum _{t=1}^T x_t\) is the mean of \(x_t\) values up to time T, and \(\alpha\) corresponds to the magnitude of allowed changes. The minimum value of this variable up to \(t=T\) is computed as \(M_T = \min _{t=1,\dots ,T}(m_t)\). When the difference \(PH_T = m_T - M_T\) exceeds a fixed value (parameter \(\lambda\)), concept drift is signalled. The PH test has been used for drift detection in single-target regression [FIMT-DD (Ikonomovska et al., 2011)] and MTR [AMRules (Almeida et al., 2013)].
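To make the test's bookkeeping concrete, below is a minimal Python sketch of an incremental PH detector. The class name and the default values of \(\alpha\) and \(\lambda\) are illustrative assumptions, and the running mean is updated incrementally, as is common in streaming implementations.

```python
class PageHinckley:
    """Minimal incremental Page-Hinckley detector (illustrative sketch)."""

    def __init__(self, alpha=0.005, lambda_=50.0):
        self.alpha = alpha         # magnitude of allowed changes
        self.lambda_ = lambda_     # detection threshold
        self.t = 0                 # number of observations seen
        self.mean = 0.0            # running mean of the monitored variable
        self.m_t = 0.0             # cumulative statistic m_T
        self.min_m = float("inf")  # running minimum M_T

    def update(self, x):
        """Feed one observation; return True if drift is signalled."""
        self.t += 1
        self.mean += (x - self.mean) / self.t
        self.m_t += x - self.mean - self.alpha
        self.min_m = min(self.min_m, self.m_t)
        return self.m_t - self.min_m > self.lambda_
```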
ADWIN Bifet and Gavalda (2007) is a parameter- and assumption-free change detection mechanism. It uses a sliding window W (with variable length), which is expanded while it has no two subwindows \(W_1\) and \(W_2\) with distinct enough means. Formally, ADWIN defines \(\epsilon _{cut} = \sqrt{\frac{2}{m} \sigma _W^2 \ln \frac{2}{\delta '}} + \frac{2}{3m} \ln \frac{2}{\delta '}\) with \(\delta ' = \delta / n\), where n, \(n_1\), and \(n_2\) are the lengths of the windows W, \(W_1\), and \(W_2\) (\(n = n_1 + n_2\)), \(m = 1/(1/n_1 + 1/n_2)\) is the harmonic mean of \(n_1\) and \(n_2\), \(\delta\) is the desired confidence, and \(\sigma _W^2\) is the variance of the target in window W. If the observed means \(\mu _1\) and \(\mu _2\) differ by more than \(\epsilon _{cut}\), ADWIN detects a change and discards the statistics of the older subwindow.
ADWIN does not keep the elements of the window W in memory. Instead, it stores only the statistics of its subwindows according to the exponential histogram technique: It stores the statistics of the new element in a subwindow/bucket of size \(1 = 2^0\) and when subwindows of sizes \(2^i\) accumulate, their statistics are merged into a subwindow of size \(2^{i+1}\). ADWIN checks every two consecutive subwindows: If their mean values are significantly different, it signals concept drift.
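As an illustration of the cut condition, the sketch below checks every split point of an in-memory window. Unlike the actual ADWIN algorithm, which stores only exponential-histogram bucket statistics, this version keeps the raw values for readability; the default value of delta is an assumption.

```python
import math

def adwin_check(window, delta=0.002):
    """Return the first split point whose subwindow means differ by more
    than eps_cut, or None if no change is detected (illustrative only)."""
    n = len(window)
    if n < 2:
        return None
    mu = sum(window) / n
    var = sum((x - mu) ** 2 for x in window) / n  # sigma_W^2
    for split in range(1, n):
        w1, w2 = window[:split], window[split:]
        n1, n2 = len(w1), len(w2)
        mu1, mu2 = sum(w1) / n1, sum(w2) / n2
        m = 1.0 / (1.0 / n1 + 1.0 / n2)        # harmonic mean of n1, n2
        ln_term = math.log(2.0 / (delta / n))  # ln(2 / delta')
        eps_cut = math.sqrt(2.0 * var * ln_term / m) + 2.0 * ln_term / (3.0 * m)
        if abs(mu1 - mu2) > eps_cut:
            return split  # statistics of the older subwindow are discarded
    return None
```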
In an ensemble setting, the occurrence of concept drift is expected to worsen the predictive performance of the base models, as they do not incorporate change detection mechanisms, and thus to raise the models' error measures. Hence, concept drift can be detected by monitoring the errors of the base models and raising an alarm if a significant change is observed. ADWIN Bagging (Bifet et al., 2009), which can also be used with Random forest ensembles, uses the ADWIN change detection mechanism (Bifet & Gavalda, 2007) to monitor the error of each base model: If a change is detected in at least one of the monitored errors, it raises an alarm and replaces the worst-performing model with a new one.
The ARF-Reg method (here referred to as ARF) (Gomes et al., 2018) is the adaptation of the Adaptive Random Forests (Gomes et al., 2017) approach to regression. It employs two ADWIN mechanisms for each tree: one with lower detection confidence \(\delta _w\), issuing warnings of possible concept drift, and another with higher confidence \(\delta _d\), confirming the occurrence of concept drift. If the former mechanism issues a warning for a given tree, the method initializes a new tree, grown in the background and in parallel with the original, to which the incoming instances are also passed. If the latter mechanism confirms a concept drift for a base model tree, the tree is replaced by its background tree.
Other methods for concept drift detection and adaptation differ from our approaches mainly in the tasks addressed. For example, Korycki and Krawczyk (2021) address change detection and adaptation in single-target classification, specifically for multi-class imbalanced data streams, while we consider multi-target regression. Sobhani and Beigy (2011) and Dehghan et al. (2016) also address single-target classification; the former imposes the additional assumption that the data arrives in batches, and the method stores the last batch of data instances. Souza et al. (2020) address change detection and adaptation in an unsupervised learning setting, while our work considers a supervised learning setting (MTR). Read (2018) provides a compelling case for using gradient-based methods when mining single-target data streams known to be susceptible to concept drift; however, gradient-based multi-target online methods are still missing. Only rare exceptions, such as AMRules (Duarte et al., 2016), consider concept drift detection and adaptation for online MTP.
2.4 Existing online MTR methods
2.4.1 Rule-based methods: AMRules
The AMRules method (Duarte et al., 2016) is a representative of rule-based approaches to online MTP that build rule sets (RS), which can detect and adapt to changes. AMRules starts with an empty RS and a default rule \(\{\} \rightarrow \mathcal {L}\), where \(\mathcal {L}\), initialized to NULL, is a modified version of the extended binary search tree (E-BST) data structure also used in iSOUP-Tree for storing statistics. The E-BST \(\mathcal {L}\) stores the statistics of the values observed so far, needed for predicting each of the target values.
When a new data example arrives, AMRules checks whether some rule in the RS covers it, i.e., whether all of the literals/conditions on the left-hand side of the rule are true for that example. A covering rule \(\mathcal {L}_r\) tests whether the example is an anomaly (noisy) by computing the ratio \(OR = \frac{1}{d}\sum _{j=1}^{d}\log \left( \frac{P(X_j=v | \mathcal {L}_r)}{1-P(X_j=v | \mathcal {L}_r)} \right)\) over all d attributes, where v is the observed value of the \(j^{th}\) attribute \(X_j\). For numerical attributes, the empirical probability \(P(X_j=v | \mathcal {L}_r)\) is estimated by using Cantelli's inequality (Bhattacharyya, 1987). After having observed more than a predetermined number \(m_{min}\) (anomaly grace period) of examples, the rule signals an anomaly if \(OR < T\), where T is an anomaly threshold parameter.
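The sketch below illustrates the anomaly score computation. The Cantelli-based probability estimate for numeric attributes follows our reading of the description above, and the clipping of the probabilities (to keep the log-odds finite) is our own addition.

```python
import math

def anomaly_score(values, means, variances):
    """Average log-odds OR over all attributes; values are the example's
    attribute values, means/variances the rule's per-attribute statistics."""
    total = 0.0
    for v, mu, var in zip(values, means, variances):
        dev2 = (v - mu) ** 2
        # Cantelli's inequality bounds the probability of a deviation this large
        p = var / (var + dev2) if (var + dev2) > 0 else 1.0
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # avoid infinite log-odds
        total += math.log(p / (1.0 - p))
    return total / len(values)  # flagged anomalous if below the threshold T
```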
If the example is not anomalous, its target values are used to update the statistics of the rule. PH change detection tests on the rule's mean squared error values are used to discover concept drift. The rule is removed from the RS if change is detected.
If a rule is not removed, it is considered for expansion. Here, again, a grace period parameter is used. The expansion procedure is almost identical to leaf node splitting in iSOUP-Tree and uses the Hoeffding bound, with the same heuristic function. Rule expansion adds the hypothetical candidate split to the literals on the rule's left-hand side. As a special case, expanding the default rule means adding a new rule with the extended literals to the RS.
The prediction and model building strategies depend on whether the rules are ordered or unordered, leading to the AMRules variants \(\text {AMRules}^o\) and \(\text {AMRules}^u\). In the former case, only the first rule that covers the example is removed, expanded, or used in prediction. The latter treats all rules that cover an example equally, independent of their order, and the final prediction is made as the aggregation (mean) of their individual predictions.
The rules learned by the AMRules method generate predictions in a similar manner as the leaves in iSOUP-Tree. They use an adaptive strategy, choosing between a perceptron’s and a mean regressor’s prediction. AMRules and iSOUP-Tree differ in their learning rate, which in AMRules is a constant.
2.4.2 Tree-based methods: iSOUP-Tree
iSOUP-Tree (Osojnik et al., 2018) is an instance-incremental method and the starting point for the development of our novel methods. iSOUP-Tree starts from an empty leaf; once enough examples have been processed (but not stored), a check is made to determine whether there is significant statistical support to split it.
All possible binary splits are evaluated by using multi-target intra-cluster variance reduction (ICVR) as a heuristic function. ICVR represents the homogeneity gain on the target values if a split is chosen.
According to the ICVR heuristic, the best candidate split \(h_{1}\) is selected, as well as the second-best \(h_{2}\). Next, the following sequence is constructed: \(\dots , \frac{h_2(k)}{h_1(k)}, \frac{h_2(k+1)}{h_1(k+1)}, \frac{h_2(k+2)}{h_1(k+2)}, \dots\), where k denotes the number of accumulated examples.
Let \(X_k\) be a random variable denoting the ratio \(\frac{h_2(k)}{h_1(k)}\), and \(x_k\) be one sample of it. Then, with S denoting the set of observed samples, the observed average can be computed as \(\bar{x} = \frac{1}{|S|}(x_1 + x_2 + \dots + x_{|S|})\), which is a sample from the random variable \(\bar{X} = \frac{1}{|S|}(X_1 + X_2 + \dots + X_{|S|})\). The Hoeffding bound (Hoeffding, 1963) is then applied to make an \((\epsilon , \delta )\)-approximation, using the standard notation of E[X] to denote the expected value of the random variable X. The Hoeffding bound is of the following form: \(P(|\bar{X} - E[\bar{X}]| > \epsilon ) \le 2e^{-2|S|\epsilon ^2}=:\delta .\) The value \(\delta\) is a parameter of the iSOUP-Tree method named splitting confidence. The value \(\epsilon\) can be expressed in terms of \(\delta\) and |S| as follows: \(\epsilon = \sqrt{\frac{1}{2|S|}\ln \frac{2}{\delta }}\).
Plugging \(\bar{x}\) in as an observation of \(\bar{X}\) in the Hoeffding bound, one gets \(E[\bar{X}] \in [\bar{x} - \epsilon , \bar{x} + \epsilon ]\) with probability \(1-\delta\). In particular, if \(\bar{x} + \epsilon < 1\), then \(E[\bar{X}] < 1\), implying \(\frac{h_2}{h_1} < 1\) (with probability \(1-\delta\)). In other words, there is significant support to accept the currently best candidate as the best one and split the leaf node. When \(\bar{x} + \epsilon \ge 1\), the leaf waits for more examples. This condition is checked only when enough examples have accumulated in the leaf, i.e., whenever the number of examples accumulated in the leaf is a multiple of the parameter GP (grace period).
To overcome a drawback of the Hoeffding bound, which occurs when the heuristic values of the two best candidates are close to each other, iSOUP-Tree introduces an additional parameter \(\tau\), the tie-breaking threshold: It determines the minimal value \(\epsilon\) can take before the leaf is split.
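The resulting split decision can be summarized in a few lines, as sketched below; ratio_mean stands for \(\bar{x}\), n_samples for |S|, and the default values of delta and tau are illustrative.

```python
import math

def should_split(ratio_mean, n_samples, delta=1e-7, tau=0.05):
    """Decide whether to split a leaf, given the mean ratio of the
    second-best to the best ICVR heuristic over n_samples checks."""
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    if ratio_mean + eps < 1.0:  # the best candidate is provably the best
        return True
    if eps < tau:               # tie: the two candidates are close enough
        return True
    return False
```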
Each leaf makes a prediction by using an adaptive multi-target model, consisting of a multi-target perceptron and a multi-target mean predictor. The perceptron updates its weights by a backpropagation rule with a given learning rate. When a leaf is constructed, its learning rate is set to the parameter \(\eta _0\), named the initial learning rate. After each incoming example, the learning rate \(\eta\) is updated by using the rule \(\eta = \frac{\eta _0}{1 + n \cdot \eta _\Delta }\), where n is the number of recorded values and \(\eta _\Delta\) is a parameter called the learning rate decay factor. Finally, a prediction is made by using the perceptron or the mean regressor, depending on which one has the lower fading mean absolute error (fMAE) for that target: \(fMAE^j(e_n) = \frac{\sum _{i=1}^n 0.95^{n-i} |\hat{y_i}^j - y_i^j|}{\sum _{i=1}^n 0.95^{n-i}}\), where \(e_n\) is the \(n^{th}\) observed example, and \(\hat{y_i}^j\) and \(y_i^j\) are the predicted and the real value of the \(j^{th}\) target for the \(i^{th}\) example.
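Since both the numerator and the denominator of the fMAE are exponentially faded sums, they can be maintained incrementally. The sketch below shows this bookkeeping and the per-target choice between the two predictors; the class and function names are our own.

```python
class FadingMAE:
    """Incrementally maintained fading mean absolute error (fade = 0.95)."""

    def __init__(self, fade=0.95):
        self.fade = fade
        self.num = 0.0  # faded sum of absolute errors
        self.den = 0.0  # faded count

    def update(self, y_hat, y):
        self.num = self.fade * self.num + abs(y_hat - y)
        self.den = self.fade * self.den + 1.0

    def value(self):
        return self.num / self.den if self.den > 0 else float("inf")

def adaptive_prediction(perc_preds, mean_preds, fmae_perc, fmae_mean):
    """Per target, use the predictor with the lower fading MAE so far."""
    return [p if fp.value() <= fm.value() else m
            for p, m, fp, fm in zip(perc_preds, mean_preds, fmae_perc, fmae_mean)]
```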
2.4.3 Tree-based ensemble methods: iSOUP-Bag and iSOUP-RF
Online bagging with iSOUP-Tree The iSOUP-Bag method combines the streaming-setting adaptation of the bagging method with iSOUP-Trees as base models. Online bagging approximates the batch bagging approach as the number of data examples grows to infinity: Assuming infinite data, the distributions over the data supplied to the base models converge.
In a given bootstrap sample of the batch bagging approach, the number of copies of an individual data example from a dataset of size n is distributed according to the Binomial distribution – \(B(n, \frac{1}{n})\). This distribution tends to Poisson(1) as the dataset size n grows to infinity. Therefore, given the infinite number of examples in the data stream, the online bagging method uses the Poisson distribution to determine the number of times each base model should be updated.
Given the number of base models, \(\mathcal {N}\), the iSOUP-Bag method is initialized to a collection of \(\mathcal {N}\) empty leaf nodes. Each iSOUP-Tree model in the ensemble is updated with an incoming data instance k times, where k is sampled at random (afresh for each model) from the Poisson distribution with parameter \(\lambda =1\), i.e., from Poisson(1). As data examples are processed, the leaves of each tree are split as described previously. The final prediction of iSOUP-Bag is made by aggregating the base models' predictions by per-target averaging.
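A minimal sketch of this update loop follows; it assumes base models that expose an update() method and samples from Poisson(1) with Knuth's method.

```python
import math
import random

def poisson1(rng=random):
    """Sample k ~ Poisson(1) using Knuth's method."""
    threshold, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def bag_update(models, example, rng=random):
    """Online bagging: each base model sees the example k ~ Poisson(1) times."""
    for model in models:
        for _ in range(poisson1(rng)):
            model.update(example)
```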
Online random forests with iSOUP-Tree The online random forest ensemble learning method builds upon online bagging and introduces larger base model diversity. To obtain the proposed diversity, the iSOUP-RF method uses the online random forest approach with a modified version of the iSOUP-Tree. In particular, it introduces randomization to the individual iSOUP-Trees. Hence, both the base models and the data provided to them are randomized.
Each leaf node, at the time of its construction, chooses a random subset of the available descriptive attributes. It calculates and stores the necessary statistics for future splits only for the attributes in this subset, ignoring the other attributes. Therefore, iSOUP-RF is more time and memory efficient than the iSOUP-Bag method.
In addition to the number \(\mathcal {N}\) of base models, iSOUP-RF also has the number of descriptive attributes sampled at each node as a parameter. It is given as a function of the total number of descriptive attributes. To make a prediction for an incoming data example, per-target averaging of the corresponding models’ predictions is performed.
3 Novel methods for online multi-target regression with change detection and adaptation
We introduce six novel tree-based online MTR methods that incorporate concept drift detection and adaptation mechanisms. We first present three novel methods that combine the iSOUP-Tree approach to MTR with the ADWIN change detector (iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF) and ARF-Reg (iSOUP-ARF). We then equip the iSOUP-Tree method with adaptation capabilities based on the PH test, obtaining three novel methods: Adaptive-iSOUP-Tree and its ensemble variants (Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF). The novel methods are implemented in the open source Massive Online Analysis (MOA) (Bifet et al., 2010) framework: The code will be publicly available upon publication.
ADWIN, which deals with change detection and adaptation in single-target classification on data streams, works with bagging and random forest tree ensembles. The error of each tree in the ensemble is continuously monitored and, once change is detected, the least accurate tree in the ensemble is dropped and the construction of a new model in the ensemble starts. ARF-Reg works with random forests for regression and uses two ADWIN detectors (one to warn of and the other to confirm drift): It starts to build an alternative tree in the background once a drift warning is issued and replaces the original tree with the alternative once the drift is confirmed.
Two of our novel approaches, iSOUP-ADWIN-Bag and -RF, build (bagging and random forest) ensembles of MTR trees with iSOUP-Tree and monitor the error of the trees (mean absolute error across all the targets), adapting to detected changes in much the same way as the original ADWIN (dropping the tree with the largest error and starting to build a new one). iSOUP-ARF follows ARF-Reg, using two ADWIN detectors and the randomized version of iSOUP-Tree to build trees for the ensemble (and alternative trees in the background).
Adaptive-iSOUP-Tree adapts the PH test approach taken by FIMT-DD for use in iSOUP-Tree. In Adaptive-iSOUP-Tree, PH tests are conducted for each target in every internal node of a multi-target regression tree. Only when change is indicated for all targets in a given node is concept drift reported overall for the node at hand: At that point, adaptation starts by learning an alternate tree rooted in that node. The Adaptive-iSOUP-Bag and -RF methods build bagging and random forest ensembles by using the Adaptive-iSOUP-Tree method.
Novel method 1: ADWIN bagging with iSOUP-Tree We propose iSOUP-ADWIN-Bag, which combines iSOUP-Trees as base models with the ADWIN approach to online bagging, i.e., an ensemble-level concept drift detection mechanism. Following the streaming-setting adaptation of bagging, iSOUP-ADWIN-Bag updates individual base models using the Poisson distribution for data example sampling. It produces the final ensemble prediction for an example using per-target aggregation of the base models' predictions for that example.
Algorithm 1 presents the update operator of iSOUP-ADWIN-Bag. ADWIN detectors monitor the standardized errors of the base iSOUP-Trees, where both the ground truth and predicted values are standardized for each target (by subtracting the mean and dividing the difference by the standard deviation calculated for that target up to the given moment). The ADWIN change detector is provided with the mean absolute error at each time point i, given as \(MAE_i = \frac{1}{M}\sum _{j=1}^M |y^j_i - \hat{y}^j_i|\), where M denotes the number of targets, and \(y^j_i\) and \(\hat{y}^j_i\) are the standardized ground truth and predicted values for the \(j^{th}\) target of the \(i^{th}\) data example. When a change is detected on any base model, the iSOUP-Tree with the highest error is replaced by an empty leaf.
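The sketch below illustrates the error signal fed to the ADWIN detectors, with the per-target running statistics maintained via Welford's algorithm; the helper names are ours, and the details of Algorithm 1 may differ.

```python
class RunningStats:
    """Welford's algorithm for a running mean and standard deviation."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 1.0

def standardized_mae(y_true, y_pred, stats):
    """MAE_i over M targets; stats holds one RunningStats per target,
    updated here with the incoming ground-truth values."""
    errors = []
    for y, y_hat, s in zip(y_true, y_pred, stats):
        s.update(y)
        sd = s.std() or 1.0  # guard against a zero standard deviation
        errors.append(abs((y - s.mean) / sd - (y_hat - s.mean) / sd))
    return sum(errors) / len(errors)
```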
Novel methods 2 and 3: ADWIN and Adaptive random forests with iSOUP-Tree The iSOUP-ADWIN-RF method uses the modified/randomized version of iSOUP-Tree, whose leaf nodes select a random subset of the descriptive attributes when constructed. The ensemble learning and concept drift detection and adaptation mechanisms incorporated in iSOUP-ADWIN-RF closely mirror those included in iSOUP-ADWIN-Bag. Similarly, the iSOUP-ARF ensemble learning and concept drift detection and adaptation mechanisms closely mirror those of ARF-Reg. As ensemble methods, iSOUP-ADWIN-RF and iSOUP-ARF require the number of base models as an input parameter, along with the size of the attribute subset as a function of the total number of available descriptive attributes, just as iSOUP-RF (which uses the same randomized version of iSOUP-Tree).
Novel method 4: Adaptive-iSOUP-Tree In the case of local concept drift, which occurs in a subspace (rather than the whole data space), inaccurate predictions are likely to come from some subtrees of the iSOUP-Tree, while other subtrees will still model the data generating process well. Discarding the whole tree, as ADWIN does, thus seems unreasonable. We therefore propose the novel Adaptive-iSOUP-Tree method, which extends iSOUP-Tree by equipping every internal node in the tree with a concept drift detection and adaptation mechanism.
Every node in an iSOUP-Tree covers a hyperrectangle in the descriptive (feature) space. By monitoring the error within this hyperrectangle, every node detects changes in its own subspace and adapts the model to reflect those changes. Since every node, including the root, handles concept drift, Adaptive-iSOUP-Tree can detect both local and global concept drift. Algorithm 2 presents the update operator of the Adaptive-iSOUP-Tree method.
Following the FIMT-DD single-target approach and the multi-target approach of AMRules, we use PH tests in every internal node of an iSOUP-Tree for change detection. Initialization (with an empty leaf node) and split selection (using the Hoeffding bound) in Adaptive-iSOUP-Tree are the same as in iSOUP-Tree. Adaptive-iSOUP-Tree uses an adaptive prediction strategy, choosing between the MT perceptron (with a constant learning rate) and the MT mean predictor.
Predictions are calculated by the corresponding leaf node. The predictions and the true target values are first standardized; the absolute error per target is then computed and back-propagated to the ancestor nodes. The PH test statistics in every node along the path to the root are updated and queried for potential concept drift. Only when drift is indicated for all targets does a node confirm a detection.
When a node detects a change, it triggers an adaptation strategy to build an alternative subtree, grown in parallel with the original one. Data examples in that instance subspace are used for training both subtrees. Given sufficient evidence about which subtree performs better, the other is removed. In the case of true concept drift detection and proper adaptation, the alternate subtree is expected to outperform the original one, as it reflects the change, and will replace it in the original tree. Otherwise, in the case of a false alarm, the original tree will not be outperformed, and the alternate tree will be discarded. To decide which subtree outperforms the other, we monitor the log ratios of the per-target faded mean squared errors of both subtrees. We use the improved adaptation mechanism that includes a fading factor, included also in FIMT-DD, as proposed by Gama et al. (2009). Given the predicted and real values of the \(j^{th}\) target for the \(i^{th}\) example, \(\hat{y_i}^j\) and \(y_i^j\), we define the faded error \(S_i^{j}(Tree) = L_i^{j}(Tree) + f S_{i-1}^{j}(Tree)\), where \(L_i^{j}(Tree)\), \(S_{i-1}^{j}(Tree)\) and f are, resp., the current loss \((\hat{y_i}^j - y_i^j)^2\), the accumulated faded mean squared error up to time \(i-1\), and a fading factor (with a value close to 1, such as 0.95). We monitor the \(Q_i^j\) statistic defined as \(Q_i^j = \text {log}\left( \frac{S_i^j(OrgTree)}{S_i^j(AltTree)} \right)\),
where OrgTree and AltTree stand for the original and the alternate tree, resp. Positive values of this statistic for a given target imply that the original tree has a higher error on that target than the alternative.
The \(Q_i^j\) statistics are checked after every \(T_{min}\) examples. If they are positive for most of the target attributes (at least \(90\%\) of them), the alternate tree replaces the original one. If the alternate tree does not promise any improvement, its average \(Q_i^j\) values will begin to decrease. The adaptation mechanism concludes that the alarm was false and deletes the alternate subtree when the time period for growing the alternate tree has passed or when most of the \(Q_i^j\) averages (at least \(90\%\) of them) start to decrease. Following FIMT-DD, Adaptive-iSOUP-Tree considers the average statistics only after \(10T_{min}\) examples have been observed, to avoid premature discarding.
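A sketch of the bookkeeping behind this decision is given below; the \(90\%\) threshold follows the text, while the function names and the interface are illustrative.

```python
import math

def update_faded_error(s_prev, y_hat, y, f=0.95):
    """One step of S_i = L_i + f * S_{i-1} with squared loss L_i."""
    return (y_hat - y) ** 2 + f * s_prev

def q_statistic(s_org, s_alt):
    """Positive iff the original subtree has the higher faded error."""
    return math.log(s_org / s_alt)

def replace_original(q_values, fraction=0.9):
    """Swap in the alternate subtree if Q_i^j > 0 for >= 90% of targets."""
    positive = sum(1 for q in q_values if q > 0)
    return positive >= fraction * len(q_values)
```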
To ensure that the alternate tree has a large enough learning rate to learn the new concept and that the learning rates of the two subtrees do not differ vastly, Adaptive-iSOUP-Tree uses a perceptron with a constant learning rate along with the mean regressor in an adaptive prediction strategy, following iSOUP-Tree.
Novel methods 5 and 6: Online bagging and random forests with Adaptive-iSOUP-Tree The Adaptive-iSOUP-Tree has an internal concept drift detection and adaptation mechanism and does not need external ensemble-level techniques for change detection and adaptation. Therefore, we combine Adaptive-iSOUP-Tree with the vanilla online bagging approach, yielding Adaptive-iSOUP-Bag.
In addition, we also propose a modified version of Adaptive-iSOUP-Tree to be used in random forest ensembles. When initializing the model as an empty leaf node and when splitting a node into two leaf nodes, the nodes select a random subset of the descriptive attributes. The statistics needed for future splits are stored only for the selected attributes. This modification of Adaptive-iSOUP-Tree enables the development of the online random forest method Adaptive-iSOUP-RF.
Both Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF initialize as sets of empty leaf nodes. MTR predictions for an example are calculated as per-target averages of the base-model MTR predictions.
4 Experimental design
Here, we describe the framework for evaluating the performance of the novel tree-based multi-target data stream mining methods that can detect and adapt to change. More specifically, we are interested in the following research questions:
-
1.
What is the performance of different multi-target data stream methods on real-world data in a supervised learning setting?
-
2.
What is the performance of multi-target data stream mining methods on stationary, i.e., non-evolving data streams?
-
3.
What is the influence of the three possible concept drift types/speeds on the performance of the multi-target data stream mining methods on evolving data streams?
-
4.
What is the performance of multi-target methods in detecting concept drift occurring at all three possible speeds (abrupt, gradual, incremental)?
4.1 Evaluated methods
In the experimental evaluation, we perform experiments with 11 different methods, listed in Table 1 with their acronyms. The AMRules method is the sole competing state-of-the-art online MTR method that incorporates a mechanism for concept drift detection and adaptation. Here, we evaluate its two versions, ordered and unordered: \(\text {AMRules}^o\) and \(\text {AMRules}^u\).
As baseline methods, we include the iSOUP-Tree methods family. iSOUP-Tree is the foundation upon which our novel methods are built. We also include online bagging and random forests ensembles incorporating iSOUP-Tree as a base model. Finally, the evaluation includes the six novel methods introduced in this paper.
4.2 Data streams
All methods are evaluated on both real-world and artificially generated data streams. Here, we provide short descriptions of the real-world data streams employed in our study. We then elaborate on the process of generating artificial data streams and injecting concept drift at the three speeds.
Real-world data streams The evaluation uses 11 real-world data streams with a varying number of descriptive and target attributes. Table 2 presents an overview of the number of data examples and the dimensions of the input and output spaces of the real-world data streams. The descriptions of the used data streams are available in Appendix A.
Artificial data streams On one hand, the real-world data streams correspond to real-world practical problems, making them more interesting and practically relevant. On the other hand, even with the help of a domain expert, we can identify only a few points where concept drift is known to happen. To perform an extensive comparative evaluation of the ability of data stream mining algorithms to detect change, we need data streams where every change point is known.
To obtain data streams with controlled concept drift, we use an off-the-shelf MTR data generator (Mastelini et al., 2018), parametrized with the number of underlying data distributions, referred to as the number of generation groups. We generate 12 data streams of 1 million instances each, where each stream has a predefined number (45) of descriptive features, of which up to half (22) are relevant. Each stream has 6 target attributes [the first five dependencies are taken from the publication proposing the generator (Mastelini et al., 2018)]. The dependencies between the targets and inputs for the different data streams are shown in Table 10 of Appendix A.
To obtain data streams with a guarantee of no concept drift, we generate all examples from one generation group. To obtain an abrupt change, we use one generation group for the first half of the data examples and, exactly at the half, switch to a second generation group. For generating data streams with gradual drift, we take the framework presented by Bifet et al. (2009), which builds on the work of Narasimhamurthy and Kuncheva (2007): After drift starts, examples are chosen from both groups, with the probability of the new group gradually increasing from 0 to 1. Data streams with incremental drift are generated by taking the weighted arithmetic mean of examples from the two data generating processes, the weights being the probabilities used in gradual drift generation. For both gradual and incremental concept drift, we let the change begin after the \(500000^{th}\) example, with a window size of \(|W| = 100000\). The sketch below illustrates this procedure.
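The following sketch shows, under our simplifying assumptions, how the three drift types can be injected when combining two generation groups; gen_a and gen_b stand for the two underlying data generating processes and are assumed to return one numeric example (a list of values) per call.

```python
import random

def drifted_stream(gen_a, gen_b, n=1_000_000, start=500_000, width=100_000,
                   mode="gradual", rng=random):
    """Yield n examples with drift from gen_a to gen_b starting at `start`."""
    for i in range(n):
        # probability of the new concept rises linearly from 0 to 1 over `width`
        p = min(max((i - start) / width, 0.0), 1.0)
        if mode == "abrupt":
            yield gen_b() if i >= start else gen_a()
        elif mode == "gradual":
            yield gen_b() if rng.random() < p else gen_a()
        else:  # incremental: weighted mean of the two generating processes
            xa, xb = gen_a(), gen_b()
            yield [(1 - p) * a + p * b for a, b in zip(xa, xb)]
```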
4.3 Evaluating scenarios and measures
Two main approaches are available for evaluating performance on data streams (Dawid, 1984): holdout evaluation and prequential evaluation. In this work, we use prequential evaluation, where each newly received example is first used for testing the model and then immediately for training it, without waiting for further examples.
To evaluate the performance of the methods, we use two types of evaluation measures: (1) measures that estimate predictive performance and (2) measures that estimate performance at detecting concept drift.
Evaluation measures for predictive performance A prediction is generated for each incoming example. In the single-target scenario, there are multiple evaluation measures adopted from statistics, such as the mean absolute error (MAE). In the multi-target scenario, we average the individual single-target scores. We report the average relative mean absolute error (\(\overline{RMAE}\)) (Osojnik et al., 2018) over a window of length n (here, \(n=1000\)) as a multi-target evaluation measure. This measure is calculated as \(\overline{RMAE} = \frac{1}{M} \sum _{j=1}^M \frac{\sum _{i=1}^n |y_i^j - \hat{y}_i^j|}{\sum _{i=1}^n |y_i^j - \bar{y}^j(i)|}\),
where \(y_i^j\) is the actual value of target j for data example i, and \(\hat{y}_i^j\) is the value predicted by the evaluated model, while \(\bar{y}^j(i)\) is the prediction for \(y_i^j\) made by the mean regressor.
If the evaluated model were the mean regressor for each target, its \(\overline{RMAE}\) score would be 1, since the numerator and denominator would be the same expression. For any other model, its performance is compared with the performance of the mean regressor as a baseline: If the \(\overline{RMAE}\) score is below 1, the evaluated model outperforms the baseline mean regressor. Lower \(\overline{RMAE}\) scores are better, with a perfect model having a score of 0.
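A direct transcription of the \(\overline{RMAE}\) formula into Python is shown below, as a sketch for a single evaluation window; it assumes that the mean-regressor predictions are supplied alongside the model's and that the window is not degenerate (no zero denominators).

```python
def rmae(y_true, y_pred, y_mean_pred):
    """Average relative MAE over a window; each argument is a list of
    per-example target vectors of equal length."""
    n, num_targets = len(y_true), len(y_true[0])
    total = 0.0
    for j in range(num_targets):
        model_err = sum(abs(y_true[i][j] - y_pred[i][j]) for i in range(n))
        baseline_err = sum(abs(y_true[i][j] - y_mean_pred[i][j]) for i in range(n))
        total += model_err / baseline_err
    return total / num_targets  # below 1 means the model beats the baseline
```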
Evaluation measures for detecting concept drift The design of a change detection mechanism is subject to a trade-off between true detections and false alarms (Gustafsson, 2000). To formally capture and measure these characteristics, the literature proposes several evaluation metrics.
Mean Time between False Alarms (MTFA) (Basseville & Nikiforov, 1993) quantifies the frequency of false alarms when it is known that no change has occurred. MTFA is the inverse of the false positive rate: It measures the expected time between false positive detections, hence, larger values are preferred.
When we know where concept drift has happened, the Mean Time to Detection (MTD) (Basseville & Nikiforov, 1993; Gustafsson, 2000) measures the reactivity of the mechanism. As we want changes to be detected as soon as possible, the time to detect them should be low.
To avoid making false change detections, an algorithm might abstain from raising an alarm, which can cause it to miss real changes. Missed Detection Rate (MDR) (Basseville & Nikiforov, 1993; Gustafsson, 2000) represents the fraction of undetected changes. A good change detection mechanism is expected to catch as many changes as possible and yield a low value of MDR. Ideally, no change will slip undetected, leading to an MDR value of 0.
To measure the overall performance at detecting changes, the above three measures are combined into the Mean Time Ratio (MTR) (Bifet et al., 2013) as \(MTR = \frac{MTFA}{MTD} (1 - MDR)\).
4.4 Statistical comparisons
We use two types of statistical comparison of the considered methods in terms of their predictive performance.
Comparison of multiple methods The evaluation of the predictive performance of multiple algorithms uses a comparison of means of multiple random variables. We assess the statistical significance of the differences in the predictive performance of multiple algorithms by using the Friedman test (Iman & Davenport, 1980; Friedman, 1940) with post-hoc Nemenyi analysis (Nemenyi, 1963; Demšar, 2006). The results of the Nemenyi post-hoc analysis are visually represented as an average rank diagram, showing the critical distance and the average rank for each method. The groups of methods which do not exhibit a statistically significant difference in performance are connected with a line.
Pairwise method comparison The similarities across different pairs of methods are not constant: Any given method is more similar to some methods and less similar to others. To provide a more detailed analysis, we need to test whether there is significant statistical evidence of improved performance by, e.g., the methods we propose, as compared to their existing counterparts. We perform statistical tests between all compared pairs of methods; however, our analysis focuses on comparing the most similar methods. In our comparison of the novel proposed methods to their most similar counterparts, we assess the significance of the difference in performance by using the Wilcoxon signed-ranks test (Wilcoxon, 1945) with the Benjamini-Hochberg correction (Benjamini & Hochberg, 1995).
The approaches of using Friedman’s test with post-hoc Nemenyi analysis and of the Wilcoxon test are recommended and explained by Demšar (2006) and Benavoli et al. (2017).
4.5 Experimental setup
Parameter settings Because of the discussed similarities between the iSOUP-Tree, AMRules, and Adaptive-iSOUP-Tree methods, many of their parameters overlap. Table 3 provides an overview of the parameters, including their values as used in our experiments. It specifies which parameters are unique to each of the methods and which are shared. We re-used the recommended parameter values as provided by the respective authors of the existing methods in the corresponding papers that introduced the methods.
Compared pairs of methods We perform pairwise comparisons to assess whether the novel methods exhibit significantly improved predictive performance over the methods they are most similar to. Selecting the most similar rule-based counterparts of the tree-based methods is neither trivial nor obvious. Since the ordered variant of AMRules (\(\text {AMRules}^o\)) uses only the first rule that covers an incoming data example in both the learning and predicting processes, we consider it a method learning standalone models. On the other hand, the unordered variant (\(\text {AMRules}^u\)) updates all covering rules during learning and makes the final prediction in an ensemble-like process by aggregating the predictions of individual rules; hence, we compare it with ensemble learning methods.
We make the following pairwise comparisons:
-
Adaptive-iSOUP-Tree with iSOUP-Tree (\(\text {AMRules}^o\)),
-
Adaptive-iSOUP-Bag with iSOUP-Bag (\(\text {AMRules}^u\)),
-
Adaptive-iSOUP-RF with iSOUP-RF (\(\text {AMRules}^u\)),
-
iSOUP-ADWIN-Bag with iSOUP-Bag (\(\text {AMRules}^u\)), and
-
iSOUP-ADWIN-RF with iSOUP-RF (\(\text {AMRules}^u\)).
Additionally, we check if there is a difference in performance between the appropriate pairs of novel methods. Thus, we compare: Adaptive-iSOUP-Bag with iSOUP-ADWIN-Bag and Adaptive-iSOUP-RF with iSOUP-ADWIN-RF.
Finally, we also include the appropriate pairwise comparisons of the most similar existing methods. In this context, we compare: iSOUP-Tree with (\(\text {AMRules}^o\)), iSOUP-Bag with (\(\text {AMRules}^u\)), and iSOUP-RF with (\(\text {AMRules}^u\)).
Before all other comparisons (pairwise and multiple), we evaluated the ensemble-level concept drift detection strategy of ARF combined with the RF version of iSOUP-Tree, proposed in this paper and called iSOUP-ARF. We compared it to iSOUP-ADWIN-RF and, based on the results of the comparison, excluded iSOUP-ARF from all further comparisons.
5 Results and discussion
We now present and discuss the results of the empirical evaluation. We first compare the performance of the two ensemble-level concept drift detection and adaptation approaches. We then address each of the four research questions stated earlier, i.e., comparison of the predictive performance of different methods for learning models from data streams on (1) real-world data streams, (2) artificial data streams with no concept drift, (3) artificial data streams with concept drift of different speeds, and (4) comparison of the different methods’ ability to detect concept drift. At the end, we briefly discuss the time complexity of the compared methods, both theoretical and empirical.
5.1 Ensemble-level concept drift: ADWIN-RF vs. ARF
We first compare the ARF and ADWIN strategies, which are very similar to each other, by comparing iSOUP-ARF with iSOUP-ADWIN-RF (see Table 4). ARF performs statistically significantly worse (Benjamini-Hochberg-adjusted p-value \(p_{BH} < 0.05\)) when the data streams have gradual, incremental, or no concept drift, while it performs significantly better if the data stream has an abrupt change. The differences in performance are small, except in the case of no drift, where ARF performs much worse (with an \(\overline{RMAE}\) over 1). We hence exclude ARF (i.e., iSOUP-ARF) from further comparisons.
5.2 Real-world data streams
We now turn to the comparison on real-world data streams from Table 2. The results of the Friedman test and Nemenyi analysis are shown in the average rank diagram of Fig. 2.
The family of methods based on Adaptive-iSOUP-Tree outperforms the rest. The ensemble methods in the family lift the performance of a single Adaptive-iSOUP-Tree. The online random forest approach is outperformed by the online bagging approach, but is more resource-efficient as it randomly samples the attribute space.
The iSOUP-Tree ensembles, iSOUP-Bag and iSOUP-RF, are the next best-performing methods. As for the Adaptive-iSOUP-Tree ensembles, iSOUP-Bag outperforms iSOUP-RF. The iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF methods exhibit performance comparable to that of the iSOUP-Tree ensemble methods, while the iSOUP-Tree and AMRules methods have the next-to-worst and worst performance.
Based on the critical distance, only two differences in performance are significant: between the ensemble methods of Adaptive-iSOUP-Tree and AMRules, and between the Adaptive-iSOUP-Tree ensembles and the iSOUP-Tree method.
Table 5 presents the results of the pairwise comparisons. The statistical tests show significant improvements in performance by the novel methods over their most similar counterparts. The Adaptive-iSOUP-Tree family of methods shows statistically significant improvements over the family of methods based on iSOUP-Tree. Also, both families perform significantly better than the AMRules methods.
For the iSOUP-ADWIN family, no significant improvements are observed over the iSOUP ensembles family. iSOUP-ADWIN performs slightly better than \(\text {AMRules}^u\), but not significantly. Finally, the Adaptive-iSOUP ensemble approaches are not significantly better than the iSOUP-ADWIN approaches.
In sum, Adaptive-iSOUP performs the best overall and clearly (significantly) improves over iSOUP. iSOUP-ADWIN, surprisingly, does not improve over iSOUP. This deserves further investigation and discussion.
5.3 Artificial data streams with no concept drift
In contrast to the real-world data streams, where concept drift possibly occurs at an unknown point in time, we discuss here data streams that are guaranteed to contain no drift.
In Fig. 3 and Table 6, we summarize the results of the multiple method comparison (Friedman/Nemenyi) and the pairwise comparisons of the methods (Wilcoxon with Benjamini-Hochberg correction). The experiments are performed on the twelve artificial data streams listed in Table 10 in Appendix A. We compare the same set of methods as for the real-world data streams.
The iSOUP-ADWIN-Bag method shows the best performance among all compared methods (see Fig. 3). It is followed by \(\text {AMRules}^o\) and \(\text {AMRules}^u\), while the average rank of iSOUP-Bag is in-between the AMRules variants. The Adaptive-iSOUP-Tree method family is next, followed by iSOUP-Tree. The random forest methods of iSOUP-ADWIN-RF and iSOUP-RF are performing the worst.
iSOUP-ADWIN-Bag has significantly better performance than all the random forests methods (including iSOUP-ADWIN-RF) and the non-ensemble iSOUP-Tree and Adaptive-iSOUP-Tree methods. Additionally, \(\text {AMRules}^o\) and iSOUP-Bag significantly outperform iSOUP-RF.
The only significant differences in the pairwise tests confirm that iSOUP-ADWIN-Bag and iSOUP-ADWIN-RF outperform iSOUP-Bag and iSOUP-RF, respectively. This is unexpected, as ADWIN is specifically targeted at handling change and the data streams considered contain no change. On the other hand, Adaptive-iSOUP does not significantly differ in performance from iSOUP: the former is also geared towards detecting and handling change, but the streams at hand contain no change.
5.4 Data streams with concept drift of different speeds
Here, we provide a comparison of the predictive performance of MTR methods on the artificially generated datasets with concept drift at different speeds.
General observations. We first look at the average rank diagrams (Fig. 4a–c). At all concept drift speeds, the methods that do not detect concept drift (iSOUP-Tree and ensembles thereof) are outperformed by the rest (except for iSOUP-Bag, which performs better than Adaptive-iSOUP-Tree on streams with incremental change – but note that the former is an ensemble method and the latter is not). There is no significant difference in performance among the different iSOUP-Tree based methods.
The best performing methods are Adaptive-iSOUP-Bag on streams with abrupt, and the ADWIN-based ensembles on data streams with gradual and incremental change. These results showing different behavior on streams with different drift speeds are not unexpected, given the design of the incorporated Page–Hinckley and ADWIN change detection mechanisms. Both variants of AMRules are second-best performing, and the Adaptive-iSOUP-Tree method family has lower average ranks than the iSOUP-Tree family and higher than AMRules, on all three concept drift types.
We next look at the tables with the pairwise comparisons (Table 7a–c). On abrupt and gradual drifts, the improvements of both novel method families over iSOUP-Tree (and its ensembles) are statistically significant. On streams with incremental change, iSOUP-ADWIN-Bag significantly improves over iSOUP-Bag, and iSOUP-ADWIN-RF and Adaptive-iSOUP-RF significantly improve over iSOUP-RF.
Abrupt concept drift From Fig. 4a, we can see that Adaptive-iSOUP-Bag and \(\text {AMRules}^o\) significantly outperform iSOUP-Tree on streams with abrupt drift. iSOUP-Tree is significantly worse than all methods addressing drift. In fact, the methods from the iSOUP family are clearly the worst, even though not all differences in performance are significant.
Table 7a shows that both Adaptive-iSOUP and iSOUP-ADWIN significantly improve over their iSOUP counterparts. No pairwise comparisons besides these yield significant results. In particular, the Adaptive-iSOUP approaches do not perform better than their iSOUP-ADWIN counterparts.
Incremental and gradual concept drift Under incremental drift, iSOUP-Tree is significantly worse than \(\text {AMRules}^o\) and the ADWIN-based approaches; iSOUP-RF is outperformed only by the ADWIN-based approaches (Fig. 4b). Additionally, iSOUP-ADWIN-Bag is significantly better than iSOUP-Bag, Adaptive-iSOUP-RF, and Adaptive-iSOUP-Tree.
All the significant differences on streams with incremental drift are also significant under gradual drift (Fig. 4c). Also, the differences between Adaptive-iSOUP-Bag and iSOUP-Tree; iSOUP-RF and \(\text {AMRules}^o\); iSOUP-ADWIN-Bag and Adaptive-iSOUP-RF, are significant under gradual change.
On data streams with gradual drift, the methods we propose clearly improve upon their iSOUP counterparts. From Table 7c, it is clear that this holds for both the Adaptive-iSOUP and the iSOUP-ADWIN families. None of the other pairwise comparisons indicate significant differences, especially between Adaptive-iSOUP and iSOUP-ADWIN.
As for gradual drift, on streams with incremental drift, the new methods improve the performance of the iSOUP methods. According to Table 7b, iSOUP-ADWIN significantly improves performance over iSOUP. Adaptive-iSOUP also improves performance, but the difference is only significant for Adaptive-iSOUP-RF, while the p-values are just above the threshold for Adaptive-iSOUP-Tree and Adaptive-iSOUP-Bag.
Concept drift detection Tables 8a–c show the performance of the methods with concept drift detection mechanisms with regard to the Mean Time between False Alarms (MTFA), Mean Time to Detection (MTD), and Mean Time Ratio (MTR) evaluation measures. The tables show the means of the different performance measures across the different data streams. This comparison includes all methods but the iSOUP family. All artificial data streams are used in this comparison, i.e., those without change as well as those with different speeds of concept drift. The table for the Missed Detection Rate (MDR) evaluation measure is not shown, as all of its entries are zero.
The ADWIN-based ensemble methods significantly outperform the remaining methods. The iSOUP-ADWIN-Bag method quickly detects a real concept drift, while producing no false alarms. The iSOUP-ADWIN-RF method rarely raises false alarms.
The methods based on Adaptive-iSOUP-Tree produce numerous change detections, many of which are false alarms. Consequently, they quickly detect change when it indeed happens. Since (1) the method produces local detections and (2) it might disregard a detection as a false alarm if the alternative subtree does not outperform the original, this does not hurt the predictive performance of the models based on Adaptive-iSOUP-Tree. Both variants of AMRules show mid-level performance at concept drift detection.
The methods generally have trouble detecting incremental concept drift. In contrast to incremental drift, under abrupt and gradual concept drift, two adjacent data examples can be sampled from two completely different data generating processes. Such drifts are hence easier to detect, and the methods show comparable performance at detecting them.
Time complexities Table 9 presents the theoretical worst case asymptotic time complexities of the different methods, as well as their running times on the real-world Bicycles dataset. The experiments were run on an Intel® Core™ i7-8700K CPU @ 3.70GHz, under Ubuntu 20.04.1 LTS. The ensemble methods expectedly have an order of magnitude longer run times.
6 Case study: predicting thermal power consumption of the MEX spacecraft
The 3D imagery of Mars that MEX has generated during the past 15 years has provided unprecedented information about the red planet. For MEX to continue providing valuable information, supporting ground exploration missions and other research, and to function properly without breakage, twisting, deformation, or failure of any equipment, careful power management is needed.
The available power, \(\phi _{available}\), stored in the MEX batteries or generated by its solar arrays, that is not consumed by the platform, \(\phi _{platform}\), or by the thermal subsystem, \(\phi _{thermal}\), can be used for science operations, \(\phi _{science}\). More specifically, \(\phi _{science} = \phi _{available} - \phi _{platform} - \phi _{thermal}\). Two of the three terms on the right-hand side of the equation are well known. The 200 thermistors on the spacecraft continually measure the temperatures around it and autonomously turn the electrical heaters on or off, which makes \(\phi _{thermal}\) an unknown variable that is difficult to predict.
In an initial empirical model, ESA identified and incorporated key influencing factors, such as the distance of the spacecraft to the Sun and to Mars, the orbit phase, and instrument and spacecraft operations (Lucas & Boumghar, 2017). However, the aging of the spacecraft has confronted this approach with many challenges. This motivated ESA to organize the MEX Power Challenge\(^{1}\) and reach out to the ML community.
The data for the MEX Power Challenge consist of i) raw telemetry (context) data; and ii) measurements of the electric current on the 33 thermal power lines (observation data). The time period covered in the data spans 4 Martian (or ca. 7.5 Earth) years, from 22\(^{\text {nd}}\) August 2008 to 1\(^{\text {st}}\) March 2016. The MEX data were pre-processed and subjected to feature engineering (Petković et al., 2019).
Here, we apply the data stream MTR methods from Sect. 3 to the 6 MEX data streams corresponding to 6 different time resolutions: 1, 5, 10, 15, 30, and 60 min. Table 2 presents the statistics of the streams. We compare the performance curves of the different methods for each time resolution, with one graph per time resolution (Fig. 5), using the \(\overline{RMAE}\) error measure.
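For reference, the sketch below outlines how such prequential \(\overline{RMAE}\) curves can be produced, assuming that \(\overline{RMAE}\) denotes the mean absolute error of the model relative to that of the mean regressor, averaged across the targets; the model object and its predict/learn interface are illustrative placeholders rather than the actual implementation.

```python
import numpy as np

def prequential_rmae(model, X, Y):
    """Prequential evaluation: first predict each example, then learn it.

    Returns the running mean(RMAE) curve, where the RMAE of each target is
    the cumulative absolute error of the model divided by the cumulative
    absolute error of the (incrementally updated) mean regressor.
    """
    n, m = Y.shape
    abs_err = np.zeros(m)        # cumulative |y - y_hat| per target
    abs_err_mean = np.zeros(m)   # cumulative |y - running_mean| per target
    running_mean = np.zeros(m)
    curve = []
    for t in range(n):
        y_hat = model.predict(X[t])              # test first ...
        abs_err += np.abs(Y[t] - y_hat)
        abs_err_mean += np.abs(Y[t] - running_mean)
        model.learn(X[t], Y[t])                  # ... then train
        running_mean += (Y[t] - running_mean) / (t + 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            rmae = np.where(abs_err_mean > 0, abs_err / abs_err_mean, np.nan)
        curve.append(np.nanmean(rmae))           # average over targets
    return np.array(curve)
```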
For all methods and all time resolutions, the concept drift is clearly visible: there are occasional sharp spikes in error, most likely due to concept drift, e.g., in the curves for iSOUP-Tree and iSOUP-RF in the top-left panel of Fig. 5.
6.1 Performance of AMRules
We first inspect the performance of AMRules. Overall, it achieves \(\overline{RMAE}\) values between 0.95 and 1.0 and performs consistently better than the mean regressor, with the exception of the 1 min resolution. Although the unordered variant \(\text {AMRules}^u\) takes an ensemble-like aggregation approach to making predictions, at the 5 coarsest time resolutions it does not achieve better predictive performance than its ordered counterpart \(\text {AMRules}^o\). Using all rules that cover a given data example, each of them independently detecting concept drift, leads to less sensitive detection and more challenging adaptation. At the highest (1 min) resolution, \(\text {AMRules}^o\) and \(\text {AMRules}^u\) flag ca. 1 and 2 million data examples as anomalies, respectively. Flagged examples are not used to update the rule statistics or to test for concept drift, resulting in \(\overline{RMAE}\) curves with all values above 1. Since a given example can be covered by more than one rule in \(\text {AMRules}^u\), an anomalous data example detected by one rule might still be used to update the statistics of another covering rule, which explains the higher count. At all other resolutions, the number of detected anomalies is negligible.
In the last time period, \(\text {AMRules}^o\) has a lower \(\overline{RMAE}\) than iSOUP-Tree (except at the 1 min resolution). AMRules detects and adapts to concept drift, even when changes occur over a prolonged period of time. iSOUP-Tree, on the other hand, does not address change detection explicitly, which makes it vulnerable to the concept drift that is likely to occur when learning over long periods of time. The errors of iSOUP-Tree thus increase towards the end, at all time resolutions.
6.2 Performance of iSOUP methods
If we focus on the initial time period (the first Martian year), iSOUP-Tree clearly outperforms both variants of AMRules. This is the case for all time resolutions except the 60 min one, where the \(\overline{RMAE}\) curves of the two methods are comparable and very close to each other.
The conclusions on the handling of concept drift drawn for iSOUP-Tree also hold for the ensemble methods iSOUP-Bag and iSOUP-RF. Since neither of them addresses concept drift detection, their predictive performance visibly drops (and their error increases) in the last time period.
iSOUP-Bag and iSOUP-RF struggle to produce good predictions in the initial period of time, in part due to their data sampling, but recover quickly. In the middle time period, iSOUP-Bag shows \(\overline{RMAE}\) values close to those of iSOUP-Tree for the middle 4 resolutions, while outperforming AMRules and the other iSOUP-Tree-based methods at the 1 min resolution. At the coarsest resolution, iSOUP-Bag shows performance similar to iSOUP-Tree. The feature subspace sampling of iSOUP-RF lowers its need for computational resources, at the cost of reduced predictive performance: its \(\overline{RMAE}\) curve lies between those of iSOUP-Tree and iSOUP-Bag.
6.3 Performance of ADWIN-based methods
The ADWIN-based ensemble methods also struggle in the initial period, but they do not recover as quickly as iSOUP-Bag and iSOUP-RF. Instead, their performance improves as we consider finer time resolutions with larger numbers of data examples. Completely removing a base model when concept drift is detected is a severe adaptation action: after removal, the newly initialized model requires some time to learn. False alarms of the ADWIN-based ensembles thus impair their predictive performance. At the highest (1 min) resolution, they outperform the iSOUP-Tree-based and the AMRules methods. The ADWIN online bagging approach outperforms the ADWIN online random forest, except in the final time period at the 60 min resolution.
6.4 Performance of the adaptive methods
Adaptive-iSOUP-Tree and its ensembles outperform all other methods at all time resolutions. Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF, similarly to the other ensemble methods, show high \(\overline{RMAE}\) values in the initial period. As expected, Adaptive-iSOUP-RF performs worse than Adaptive-iSOUP-Bag, due to its random sampling of the attribute space.
6.5 Summary
Overall, \(\text {AMRules}^o\) has the most stable error, while \(\text {AMRules}^u\) has difficulties detecting and handling concept drift. The iSOUP-Tree-based methods model MEX’s thermal power consumption (TPC) much better than AMRules, but are sensitive to changes in the underlying distribution. The ensemble approaches initially have a high error, but improve significantly over time; the ADWIN-based ensembles need to process many more data examples to improve their initially high error curves. The Adaptive-iSOUP-Tree-based methods perform best overall: Adaptive-iSOUP-Bag outperforms all other methods at all time resolutions.
Finally, all methods outperform the mean regressor, as they all achieve \(\overline{RMAE} < 1\) (with the exception of AMRules at the 1 min resolution and the ADWIN-based methods at resolutions coarser than 1 min). The mean regressor is a relatively strong baseline in terms of predictive performance, but it is unable to adapt to unforeseen changes in the data.
Returning to the two groups of methods proposed in this paper, the Adaptive-iSOUP group clearly performs best, with Adaptive-iSOUP-Bag at the forefront. The iSOUP-ADWIN group is second best at the highest (1 min) resolution and clearly worst at all other resolutions; AMRules is worst at the highest resolution and second worst at all others. The relative performance of the different methods is most clearly visible at the 30 min resolution.
7 Conclusion and future work
In this paper, we propose novel methods for learning from multi-target data streams that incorporate change detection and adaptation mechanisms. In particular, we introduce two MTR method families, one of which uses the ensemble-based ADWIN change detection mechanism. iSOUP-ADWIN-Bag combines the iSOUP-Tree method for MTR as a base model with online bagging and the ADWIN mechanism. iSOUP-ADWIN-RF is also based on ADWIN, but replaces the original iSOUP-Tree method with a modified version that performs feature sampling in its leaves, thus learning random forest ensembles of trees for MTR.
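As an illustration of the scheme, and not of the actual implementation, the following sketch outlines ensemble-level detection and adaptation: online bagging feeds each member Poisson(1)-weighted copies of every example, a per-member detector monitors the member's error, and a signalled change triggers the replacement of the worst member (as in ADWIN bagging; Bifet et al., 2009). The base-model interface (learn, error) and the detector interface (update, drift_detected, estimation) are assumptions made for the sketch.

```python
import math
import random

class AdwinBaggingMTR:
    """Sketch of ADWIN-style online bagging for multi-target regression.

    Each base model receives Poisson(1) copies of every example (online
    bagging; Oza, 2005). A per-member change detector monitors the
    member's error; when any detector fires, the member with the largest
    current error estimate is replaced by a fresh model. The factories
    stand in for iSOUP-Tree and ADWIN in the paper's setting.
    """

    def __init__(self, model_factory, detector_factory, n_models=10, seed=1):
        self.new_model = model_factory
        self.new_detector = detector_factory
        self.models = [model_factory() for _ in range(n_models)]
        self.detectors = [detector_factory() for _ in range(n_models)]
        self.rng = random.Random(seed)

    def learn_one(self, x, y):
        change = False
        for model, det in zip(self.models, self.detectors):
            det.update(model.error(x, y))    # feed the member's error
            change |= det.drift_detected     # assumed detector interface
            for _ in range(self._poisson1()):
                model.learn(x, y)            # online bagging update
        if change:
            # Adaptation: replace the worst member, reset its detector.
            worst = max(range(len(self.models)),
                        key=lambda i: self.detectors[i].estimation)
            self.models[worst] = self.new_model()
            self.detectors[worst] = self.new_detector()

    def _poisson1(self):
        # Knuth's algorithm for sampling Poisson(lambda = 1).
        L, k, p = math.exp(-1.0), 0, 1.0
        while True:
            k += 1
            p *= self.rng.random()
            if p <= L:
                return k - 1
```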
Next, we propose a novel MTR method family that extends iSOUP-Tree with a concept drift detection and adaptation mechanism based on the Page–Hinckley statistic. Adaptive-iSOUP-Tree can detect local as well as global changes. We then introduce the Adaptive-iSOUP-Bag and Adaptive-iSOUP-RF methods for learning tree ensembles, using Adaptive-iSOUP-Tree as the base learner.
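For reference, a compact version of the underlying Page–Hinckley test: it accumulates the deviations of the monitored signal (e.g., the prediction error) from its running mean, allowing for a tolerance \(\delta\), and signals a change when the cumulative sum rises more than \(\lambda\) above its historical minimum. The parameter values below are illustrative, not the ones used in our experiments.

```python
class PageHinckley:
    """Page-Hinckley test for detecting an increase in a monitored signal."""

    def __init__(self, delta: float = 0.005, lam: float = 50.0):
        self.delta = delta      # tolerance for small fluctuations
        self.lam = lam          # detection threshold lambda
        self.n = 0
        self.mean = 0.0         # running mean of the observations
        self.cum = 0.0          # cumulative sum m_T
        self.cum_min = 0.0      # historical minimum M_T

    def update(self, x: float) -> bool:
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam
```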
Finally, we present extensive evaluation results and discuss the statistical significance of the differences in performance between the compared methods. The evaluation includes both real-world and artificially generated data streams. In real-world streams, concept drift may occur at unknown point(s) in time, while the artificial streams contain injected drift with a controlled speed.
Our methods explicitly detect changes and adapt accordingly. On real-world data streams, the novel Adaptive-iSOUP-Tree method and its online ensembles outperform the competitors. These methods show statistically significant improvement over their most similar counterparts.
On artificial data streams with no drift, ensemble-level concept drift detection and adaptation methods (based on ADWIN) significantly outperform ensembles without such a mechanism. Even in this case, where no change has occurred, the change detection mechanism facilitates the management of the base models in the ensemble, yielding significantly improved performance. iSOUP-ADWIN-Bag outperforms not only its most similar counterpart but all other methods as well.
The MTR methods that do not address concept drift are outperformed, at all concept drift speeds, by the methods that incorporate change detection and adaptation mechanisms. Adaptive-iSOUP-Bag is the best-performing method on abrupt changes, while the ADWIN-based ensembles perform best on gradual and incremental changes. The novel methods show statistically significant improvements over their most similar existing methods for all concept drift types.
The ADWIN-based methods show the best performance at detecting concept drift. While iSOUP-ADWIN-Bag raises no false alarms and iSOUP-ADWIN-RF does so only rarely, the Adaptive-iSOUP methods, which detect local changes, raise numerous alarms. Both families of methods quickly detect the changes that have occurred. The AMRules methods fall in the middle.
The current methods for change detection in MTR on data streams use different ways of aggregating the errors for the different targets and/or aggregating the detected change signals. The iSOUP-ADWIN methods monitor the average error across all targets and detect change based on that statistic, while Adaptive-iSOUP monitors the error per target and aggregates the detected change signals, declaring change only when it is detected on all targets. A systematic exploration of the different possibilities (e.g., declaring overall drift when at least one target, or a majority of the targets, exhibits drift) would be an excellent direction for further work.
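The alternatives can be summarized in a few lines. In the sketch below, which reuses the per-target detector outputs, the "all" rule corresponds to the policy of Adaptive-iSOUP, while "any" and "majority" are the directions suggested above.

```python
def aggregate_drift(per_target_flags, rule="all"):
    """Combine per-target change signals into one overall drift decision.

    per_target_flags: booleans, one per target, e.g., the outputs of
    per-target Page-Hinckley tests for the current data example.
    """
    if rule == "any":       # most sensitive: one drifting target suffices
        return any(per_target_flags)
    if rule == "majority":  # intermediate sensitivity
        return sum(per_target_flags) > len(per_target_flags) / 2
    if rule == "all":       # most conservative (used by Adaptive-iSOUP)
        return all(per_target_flags)
    raise ValueError(f"unknown rule: {rule}")
```

For example, `aggregate_drift([d.update(e) for d, e in zip(detectors, errors)], rule="majority")` would declare drift as soon as more than half of the targets exhibit it.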
The introduced online MTR methods can be applied to MLC problems via the MLC-via-MTR transformation: after the discrete values of the target attributes are mapped to numeric values, all of the proposed MTR methods can be applied. One line of future work is the empirical evaluation of these methods on MLC tasks.
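A minimal sketch of this transformation is given below: each binary label becomes a numeric target in {0, 1}, and the real-valued MTR predictions are thresholded to recover a label set. The 0.5 threshold is one common choice, not a prescribed one.

```python
import numpy as np

def labels_to_targets(label_sets, n_labels):
    """Encode label sets (sets of label indices) as 0/1 target vectors."""
    Y = np.zeros((len(label_sets), n_labels))
    for i, labels in enumerate(label_sets):
        Y[i, list(labels)] = 1.0
    return Y

def predictions_to_labels(y_hat, threshold=0.5):
    """Decode real-valued MTR predictions back into a label set."""
    return {j for j, v in enumerate(y_hat) if v >= threshold}
```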
Another line of future work is to apply the algorithms to real-world data stream mining problems in collaboration with domain experts. Besides accurate predictions, the aim would be to identify points in time when change has happened. The domain experts would ideally inspect the concept drift detection points and confirm or refute them. This approach can benefit 1) the data stream mining research field, by producing real-world multi-target data with identified concept drift, and 2) the domain experts, by identifying points in time where changes in the data might be taking place.
A final direction to explore in further work is change detection and adaptation in semi-supervised and unsupervised learning settings. The iSOUP-Tree method, which is at the heart of the work presented here, has recently been extended to incrementally learn predictive clustering trees, which handle supervised, semi-supervised and unsupervised learning uniformly (Osojnik et al., 2020). This opens the road to an immediate extension of our work towards semi-supervised learning, and with some additional effort, also unsupervised learning.
Materials availability
Not applicable.
Code availability
The code for this paper is available at https://github.com/BStevanoski/change-detection-adaptation-for-MTR-data-streams.
Notes
1. https://kelvins.esa.int/mars-express-power-challenge/ [Last accessed: 29 February 2024]
References
Aho, T., Ženko, B., Džeroski, S., & Elomaa, T. (2012). Multi-target regression with rule ensembles. Journal of Machine Learning Research, 13, 2367–2407.
Almeida, E., Ferreira, C., & Gama, J. (2013). Adaptive model rules from data streams. In Proc. ECML/PKDD (Machine Learning and Knowledge Discovery in Databases), 480–492.
Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: Theory and application. Prentice Hall.
Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18, 2653–2688.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289–300.
Bhattacharyya, B. (1987). One sided Chebyshev inequality when the first four moments are known. Communications in Statistics-Theory and Methods, 16, 2789–2791.
Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., & Gavalda, R. (2009). New ensemble methods for evolving data streams. In Proc. 15th ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining, 139–148.
Bifet, A., Read, J., Pfahringer, B., Holmes, G., & Žliobaitė, I. (2013). CD-MOA: Change detection framework for massive online analysis. In Proc. Intl. Symp. Intelligent Data Analysis, 92–103. Springer.
Bifet, A., & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. In Proc. SIAM Intl. Conf. Data Mining, 443–448.
Bifet, A., Gavaldà, R., Holmes, G., & Pfahringer, B. (2018). Machine learning for data streams with practical examples in MOA. MIT Press.
Bifet, A., Holmes, G., Kirkby, R., & Pfahringer, B. (2010). MOA: Massive Online Analysis. Journal of Machine Learning Research, 11, 1601–1604.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Chaouki, A., Read, J., & Bifet, A. (2023). Online decision tree construction with deep reinforcement learning. In Sixteenth European Workshop on Reinforcement Learning.
Chaouki, A., Read, J., & Bifet, A. (2024). Online learning of decision trees with Thompson sampling. In International Conference on Artificial Intelligence and Statistics, 2944–2952. PMLR.
Dawid, A. P. (1984). Present position and potential developments: Some personal views. Statistical theory: The prequential approach. Journal of the Royal Statistical Society: Series A (General), 147, 278–290.
Dehghan, M., Beigy, H., & ZareMoodi, P. (2016). A novel concept drift detection method in data streams using ensemble classifiers. Intelligent Data Analysis, 20, 1329–1350.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Duarte, J., Gama, J., & Bifet, A. (2016). Adaptive model rules from high-speed data streams. ACM Transactions on Knowledge Discovery from Data, 10, 30.
Fanaee-T, H., & Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2, 113–127.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86–92.
Gama, J., Sebastiao, R., & Rodrigues, P. P. (2009). Issues in evaluation of stream learning algorithms. In Proc. 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 329–338.
Gama, J. (2010). Knowledge discovery from data streams. CRC Press.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46, 1–37.
Gao, J., Fan, W., Han, J., & Yu, P. S. (2007). A general framework for mining concept-drifting data streams with skewed distributions. In Proc. SIAM Intl. Conf. Data Mining, 3–14. SIAM.
Gomes, H. M., Barddal, J. P., Ferreira, L. E. B., & Bifet, A. (2018). Adaptive random forests for data stream regression. In Proc. European Symp. Artificial Neural Network (ESANN), 267–272.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfharinger, B., Holmes, G., & Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106, 1469–1495.
Gustafsson, F. (2000). Adaptive filtering and change detection. Wiley.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13–30.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Incremental multi-target model trees for data streams. In Proc. ACM Symp. on Applied Computing, 988–993.
Ikonomovska, E., Gama, J., & Džeroski, S. (2011). Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23, 128–168.
Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods, 9, 571–595.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46, 817–833.
Korycki, Ł., & Krawczyk, B. (2021). Concept drift detection from multi-class imbalanced data streams. In Proc. 37th IEEE Intl. Conf. Data Engineering (ICDE), 1068–1079. IEEE.
Langley, P. (1996). Elements of machine learning. Morgan Kaufmann.
Last, M., Sinaiski, A., & Subramania, H. S. (2010). Predictive maintenance with multi-target classification models. In Proc. Asian Conf. Intelligent Information and Database Systems, 368–377. Springer.
Liao, G., Zhang, P., Yin, H., Deng, X., Li, Y., Zhou, H., & Zhao, D. (2023). A novel semi-supervised classification approach for evolving data streams. Expert Systems with Applications, 215, 119273. https://doi.org/10.1016/j.eswa.2022.119273
Lucas, L., & Boumghar, R. (2017). Machine learning for spacecraft operations support - The Mars Express power challenge. In Proc. Intl. Conf. Space Mission Challenges for Information Technology, 82–87.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 3084–3104.
Mastelini, S. M., Santana, E. J., Costa, V. G. T., & Barbon, S. (2018). Benchmarking multi-target regression methods. In Proc. 7th Brazilian Conference on Intelligent Systems, 396–401. IEEE.
Mouss, H., Mouss, D., Mouss, N., & Sefouhi, L. (2004). Test of Page-Hinckley, an approach for fault detection in an Agro-alimentary production system. In Proc. 5th Asian Control Conference, 2, 815–818. IEEE.
Narasimhamurthy, A. M., & Kuncheva, L. I. (2007). A framework for generating data to simulate changing environments. In Proc. 25th Intl. Conf. Artificial Intelligence and Applications, 384–389.
Nemenyi, P. B. (1963). Distribution-free multiple comparisons. Princeton University.
Osojnik, A., Panov, P., & Džeroski, S. (2017). Multi-label classification via multi-target regression on data streams. Machine Learning, 106(6), 745–770.
Osojnik, A., Panov, P., & Džeroski, S. (2018). Tree-based methods for online multi-target regression. Journal of Intelligent Information Systems, 50, 315–339.
Osojnik, A., Panov, P., & Džeroski, S. (2020). Incremental predictive clustering trees for online semi-supervised multi-target regression. Machine Learning, 109, 2121–2139.
Oza, N. C. (2005). Online bagging and boosting. In Proc. IEEE Intl. Conf. on Systems, Man and Cybernetics, 3, 2340–2345.
Oza, N. C., & Russell, S. (2001). Experimental comparisons of online and batch versions of bagging and boosting. In Proc. 7th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 359–364.
Petković, M., Boumghar, R., Breskvar, M., Džeroski, S., Kocev, D., Levatić, J., Lucas, L., Osojnik, A., Ženko, B., & Simidjievski, N. (2019). Machine learning for predicting thermal power consumption of the Mars Express spacecraft. IEEE Aerospace and Electronic Systems Magazine, 34, 46–60.
Read, J. (2018). Concept-drifting data streams are time series: The case for continuous adaptation. arXiv:1810.02266.
Shi, Z., Wen, Y., Feng, C., & Zhao, H. (2014). Drift detection for multi-label data streams based on label grouping and entropy. In Proc. ICDM (Intl. Conf. Data Mining) Workshops, 724–731. IEEE.
Sobhani, P., & Beigy, H. (2011). New drift detection method for data streams. In Adaptive and Intelligent Systems, 88–97. Springer.
Sousa, R., & Gama, J. (2016). Online semi-supervised learning for multi-target regression in data streams using AMRules. In Proc. Intl. Symp. Intelligent Data Analysis, 123–133.
Souza, V. M., Chowdhury, F. A., & Mueen, A. (2020). Unsupervised drift detection on high-speed data streams. In Proc. Intl. Conf. Big Data, 102–111. IEEE.
Spyromitros-Xioufis, E., Tsoumakas, G., Groves, W., & Vlahavas, I. (2016). Multi-target regression via input space expansion: Treating targets as inputs. Machine Learning, 104, 55–98.
Stevanoski, B., Kocev, D., Osojnik, A., Dimitrovski, I., & Džeroski, S. (2019). Predicting thermal power consumption of the Mars Express satellite with data stream mining. In Proc. Intl. Conf. Discovery Science, 186–201. Springer.
Struyf, J., & Džeroski, S. (2005). Constraint based induction of multi-objective regression trees. In Proc. Intl. Wshp. Knowledge Discovery in Inductive Databases, 222–233. Springer.
Vazquez, E., & Walter, E. (2003). Multi-output support vector regression. IFAC Proceedings Volumes, 36, 1783–1788.
Wei, H., Wang, X., Wen, Z., Li, E., & Wang, H. (2024). An ensemble-adaptive tree-based chain framework for multi-target regression problems. Information Sciences, 653, 119769. https://doi.org/10.1016/j.ins.2023.119769
Wilcoxon, F. (1945). Individual comparisons by ranking methods. In Breakthroughs in Statistics, 196–202.
Yekutieli, D., & Benjamini, Y. (1999). Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference, 82(1–2), 171–196.
Zhang, Q., Tsang, E. C. C., He, Q., & Guo, Y. (2023). Ensemble of kernel extreme learning machine based elimination optimization for multi-label classification. Knowledge-Based Systems, 278, 110817. https://doi.org/10.1016/j.knosys.2023.110817
Funding
This work was supported by: Slovenian Research Agency (grants J2-2505, P2-0103, and young researcher grant PR-09773 to AK); EU (Horizon 2020, GA No 952215, project TAILOR); Public Scholarship, Development, Disability and Maintenance Fund of the Republic of Slovenia (scholarship to BS).
Author information
Contributions
Bozhidar Stevanoski: Conceptualization, Software, Visualization, Investigation, Writing - Original Draft, Validation; Ana Kostovska: Conceptualization, Writing - Review and Editing; Panče Panov: Conceptualization, Writing - Review and Editing, Supervision; Sašo Džeroski: Conceptualization, Methodology, Writing - Review and Editing, Supervision, Funding acquisition.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethical approval
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Ana Carolina Lorena, Albert Bifet, Rita P. Ribeiro.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Datasets
A.1 Artificial data streams
We use an off-the-shelf MTR data generator (Mastelini et al., 2018) to obtain data streams with controllable concept drift occurrence and speed. The dependencies between the target attributes are shown in Table 10.
A.2 Real-world datasets
We here briefly describe the real-world data streams used in the empirical evaluation.
The Bicycles (Fanaee-T & Gama, 2014) data stream is obtained from a bicycle sharing system, where a bicycle can be rented at one location in a city and returned at another. The data examples describe the hour-by-hour rentals. The descriptive attributes include information about the time of day and the day of the week, as well as weather information. The three target attributes represent the numbers of rentals by registered users and by casual users, and the total number of rentals for that hour.
Each data example in the Supply Chain Management (SCM) data streams (Spyromitros-Xioufis et al., 2016), SCM1d and SCM20d, corresponds to an observation day of the Trading Agent Competition in Supply Chain Management (TAC SCM) tournament. The descriptive attributes include the daily prices for each observed product, as well as 4 time-delayed observations with delays of 1, 2, 4, and 8 days. The 16 target attributes correspond to the mean prices on the next day (for SCM1d) or on the day 20 days in the future (for SCM20d).
The RF1 and RF2 data streams (Spyromitros-Xioufis et al., 2016) concern the river flows at 8 locations in the Mississippi River network. Each data example comprises, as descriptive attributes, the most recent observations from each of the 8 locations, plus time-lagged observations from 6, 12, 18, 24, 36, 48, and 60 hours in the past; the targets are the flow values 48 hours in the future at the 8 monitored locations. The RF1 data stream thus has 8 targets and 64 descriptive attributes. RF2 extends RF1 with descriptive attributes giving precipitation forecasts in 6-hour windows (6, 12, 18, 24, 30, 36, 42, and 48 hours ahead) that cover the 8 locations, as well as 19 other sites.
The Mars Express (MEX) (Petković et al., 2019; Stevanoski et al., 2019) data streams concern the prediction of thermal power consumption (TPC) of the MEX spacecraft over 6 different time resolutions. The target attributes represent the measurements of the electrical current/power on all 33 thermal power lines. The descriptive attributes consist of five components: (1) Solar Aspect Data, (2) Detailed Mission Operations Plans, (3) Flight dynamics TimeLine, (4) Miscellaneous events, and (5) Long Term Data.
Appendix B: Detailed results
B.1 Pairwise method comparison
In this section, we show the results for the method pairs in the pairwise comparison of predictive performance. Note that the tables in the main text of the paper give the results only for selected and statistically significant method pairs of interest.
Tables 11, 12, 13, 14 and 15 show the p-values of the Wilcoxon tests and the corresponding Benjamini–Hochberg corrections on real-world data streams, and on data streams with no concept drift and with abrupt, incremental, and gradual concept drift, respectively.
B.2 Mean evaluation metrics per data stream type
Table 16 presents the mean values of \(\overline{RMAE}\) across data stream types.
Appendix C: Statistical comparisons
C.1 Comparison of multiple methods
We use the Friedman test (Iman & Davenport, 1980; Friedman, 1940) with post-hoc Nemenyi analysis (Nemenyi, 1963; Demšar, 2006) to assess the statistical significance of the differences in predictive performance of multiple algorithms.
The Friedman test is a non-parametric statistical test that ranks the methods on each data stream. It assigns rank 1 to the best performing method, rank 2 to the second-best, etc. If multiple methods have identical evaluation metric values, they are all assigned the appropriate average rank. Denoting the rank of the \(j^{th}\) method on the \(i^{th}\) data stream by \(r_i^j\), and assuming we compare k methods over N data streams, the Friedman test defines the average algorithm rank \(R_j = \frac{1}{N} \sum _{i=1}^N r_i^j\). The statistic suggested by Friedman,

$$\chi _F^2 = \frac{12N}{k(k+1)} \left[ \sum _{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \right], \qquad \text {(C1)}$$

follows the \(\chi ^2\) distribution with \(k-1\) degrees of freedom. As it has been shown to be overly conservative by Iman and Davenport (1980), we utilize their alternative corrected statistic

$$F_F = \frac{(N-1)\,\chi _F^2}{N(k-1) - \chi _F^2}, \qquad \text {(C2)}$$

which is distributed according to the Fisher-Snedecor distribution with \(k-1\) and \((k-1)(N-1)\) degrees of freedom.
The null hypothesis states that all methods are equivalent, i.e., perform equally well. It is rejected when the \(F_F\) value is greater than the critical value for the chosen significance level \(\alpha\), in which case Nemenyi post-hoc tests (Nemenyi, 1963) follow. In our experiments, we use a significance level \(\alpha\) of 0.05.
The Nemenyi post-hoc test concludes that there is a significant difference in predictive performance between two methods if their average ranks differ by at least the critical difference (CD), defined as

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}, \qquad \text {(C3)}$$

where \(q_\alpha\) is the critical value of the two-tailed Nemenyi test at significance level \(\alpha\). In the literature, the term critical difference is often referred to as critical distance, and the two terms are used interchangeably. The Nemenyi critical values are equal to the critical values of the Studentized range distribution divided by \(\sqrt{2}\). As before, our experimental setup uses a significance level of 0.05.
The results of the Nemenyi post-hoc analysis are visually represented in an average rank diagram, showing the critical distance and the average rank of each method. Groups of methods whose performance does not differ statistically significantly are connected with a line.
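For completeness, a sketch of this procedure using NumPy and SciPy is given below (the studentized_range distribution requires SciPy 1.7 or later); the large finite degrees-of-freedom value approximates the asymptotic Nemenyi critical value.

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist, studentized_range

def friedman_nemenyi(errors, alpha=0.05):
    """Friedman test with the Iman-Davenport correction (C2) and the
    Nemenyi critical difference (C3).

    errors: (N, k) array; entry [i, j] is the evaluation metric value of
    method j on data stream i (lower is better).
    """
    N, k = errors.shape
    ranks = np.apply_along_axis(rankdata, 1, errors)   # ties -> average rank
    R = ranks.mean(axis=0)                             # average ranks R_j
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)    # Iman-Davenport
    p_value = f_dist.sf(f_f, k - 1, (k - 1) * (N - 1))
    # q_alpha: Studentized range critical value divided by sqrt(2); a large
    # df approximates the asymptotic (df = infinity) Nemenyi value.
    q_alpha = studentized_range.ppf(1 - alpha, k, 1e6) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    return R, p_value, cd
```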
C.2 Pairwise method comparison
We use the Wilcoxon signed-ranks test (Wilcoxon, 1945) with the Benjamini–Hochberg correction (Benjamini & Hochberg, 1995) to assess the significance of the differences in performance between the novel proposed methods and their most similar counterparts.
Like the Friedman test, the Wilcoxon signed-ranks test is a non-parametric test. It ranks the differences \(d_i = c_i^2 - c_i^1\) according to their absolute values, where \(c_i^1\) and \(c_i^2\) are the evaluation metric values of the two compared methods on the \(i^{th}\) data stream. Summing the ranks according to the signs of the differences \(d_i\) defines \(R^+\) and \(R^-\):

$$R^+ = \sum _{d_i > 0} \text {rank}(d_i) + \frac{1}{2} \sum _{d_i = 0} \text {rank}(d_i), \qquad R^- = \sum _{d_i < 0} \text {rank}(d_i) + \frac{1}{2} \sum _{d_i = 0} \text {rank}(d_i). \qquad \text {(C4)}$$

The statistic

$$z = \frac{T - \frac{1}{4} N (N+1)}{\sqrt{\frac{1}{24} N (N+1) (2N+1)}} \qquad \text {(C5)}$$

of the minimal sum of ranks \(T = \text {min}(R^+, R^-)\) and the number of data streams N is approximately normally distributed under the null hypothesis of no performance difference between the two methods, and is used for two-sided comparison testing. We are interested in whether our novel methods show improved performance over their most similar counterparts. Hence, we utilize the one-sided (one-tailed) version of the test, where the statistic z in Equation (C5) with \(T=R^+\) is normally distributed as well. The p-value is derived as \(p = 2(1-\Phi (|z|))\) for the two-sided and \(p=1-\Phi (z)\) for the one-sided comparison.
To control the false discovery rate when performing multiple Wilcoxon tests whose outcomes are not independent, we apply the Benjamini–Hochberg correction. We use the alternative formulation of the correction (Yekutieli & Benjamini, 1999), which adjusts the p-values instead of selecting the hypotheses to be rejected.
When performing \(t\) tests, we sort the obtained p-values as \(p_{1} \le p_{2} \le \dots \le p_{t}\). The Benjamini–Hochberg corrected p-values are then defined as

$$\tilde{p}_{i} = \min _{j \ge i} \left\{ \min \left( \frac{t}{j}\, p_{j},\ 1 \right) \right\}. \qquad \text {(C6)}$$
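A sketch of the full pairwise procedure, combining one-sided Wilcoxon tests (as provided by SciPy) with the step-up adjustment of Equation (C6), could look as follows.

```python
import numpy as np
from scipy.stats import wilcoxon

def one_sided_wilcoxon_bh(novel_errors, counterpart_errors, alpha=0.05):
    """One-sided Wilcoxon signed-ranks tests with the Benjamini-Hochberg
    adjustment of Equation (C6).

    novel_errors: list of 1-D arrays, one per novel method, holding its
    per-stream metric values; counterpart_errors: matching arrays for the
    most similar counterparts (lower values = better).
    """
    # H1: the novel method has lower error than its counterpart.
    p = np.array([wilcoxon(n, c, alternative="less").pvalue
                  for n, c in zip(novel_errors, counterpart_errors)])
    t = len(p)
    order = np.argsort(p)            # p_(1) <= ... <= p_(t)
    adjusted = np.empty(t)
    running_min = 1.0
    for i in range(t - 1, -1, -1):   # step-up: minimum over j >= i
        running_min = min(running_min, t * p[order[i]] / (i + 1))
        adjusted[order[i]] = running_min
    return p, adjusted, adjusted <= alpha
```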
The approaches of using Friedman’s test with post-hoc Nemenyi analysis and of the Wilcoxon test are recommended and explained in detail by Demšar (2006) and briefly by Benavoli et al. (2017).
Appendix D: Additional experiments
D.1 Impact of the number of independent variables
In this subsection, we explore the impact of the number of independent variables. We construct artificial data streams following the same process as described in Appendix A, but with half the number of independent variables (22 instead of 45).
Figure 6d shows the critical distance diagram for the case with no concept drift. The number of independent variables did not impact the results much in this case: iSOUP-ADWIN-Bag still performs best, significantly outperforming the same methods as before (iSOUP-Tree and iSOUP-RF, Adaptive-iSOUP-Tree and Adaptive-iSOUP-RF, and iSOUP-ADWIN-RF). The order of the second-best performing methods also stayed the same.
Figure 6a–c show the critical distance diagrams for abrupt, incremental, and gradual concept drift, respectively. The \(\text {AMRules}^o\) method ranks better with fewer independent variables, closing the performance gap to iSOUP-ADWIN-Bag on incremental and gradual concept drift and outperforming it on abrupt drift, although not statistically significantly. The remaining methods show stable rankings across the different numbers of independent variables.
D.2 Impact of the number of dependent variables undergoing concept drift
In this subsection, we explore the impact of the number of dependent variables affected by concept drift. Namely, we generate the same artificial data streams as described in Appendix A, but let only one target variable, Y1, undergo concept drift. Figure 7a–c show the corresponding critical distance diagrams for abrupt, incremental, and gradual concept drift. iSOUP-ADWIN-Bag remains the best performing method, followed by \(\text {AMRules}^o\), for all concept drift types. The only methods affected by the number of variables undergoing concept drift are the random forest methods, which are among the worst performing methods for all concept drift types. The remaining methods show the same performance ranking as described in Sect. 5.4.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.