1 Introduction

Imbalanced datasets are often encountered in real-world applications. For classification tasks, this issue has been extensively studied (Haixiang et al. 2017; Krawczyk 2016; Johnson and Khoshgoftaar 2019); nonetheless, it is also present in regression tasks (Branco et al. 2016). Branco et al. (2017) define imbalanced problems based on the simultaneity of two factors: (i) a non-uniform preference of the user across the domain of the target variable, and (ii) insufficient representation, in the available data, of the cases most relevant to the user. In classification tasks, an imbalanced dataset is identified by the presence of a class with smaller representation (minority class) than another (majority class). In regression problems, however, the target value is continuous, which makes the definition more complex: the target is not constrained to a limited set of discrete values, unlike in classification, where it corresponds to specific categories or classes. Figure 1 presents the distribution and frequency of examples drawn from an imbalanced dataset (FuelCons) with target values ranging from 2.7 to 17.3. To analyze this range, we employed a bin width of approximately 0.2, resulting in a total of 74 bins. The values at the chart’s edges have low frequency and are considered rare examples. In this context, Ribeiro (2011) proposes the concept of a relevance function, which maps continuous target values to relevance scores and thereby defines certain examples as rare and others as normal. This definition makes it possible to verify an imbalance between instances considered rare and those considered normal.

Fig. 1
figure 1

Distribution and frequency of the target value Y from the FuelCons dataset

Standard regression tasks assume that all values of the domain are of equal importance, and models are typically evaluated based on their performance on the most frequent values. However, poorly represented values are often extremely relevant, not only to the user but also in the prediction process. For example, in software engineering, prediction mistakes in large projects are associated with higher development costs (Rathore and Kumar 2017), whereas in meteorological applications, errors made when predicting extreme conditions (e.g., very high temperatures) are far more costly (Ribeiro and Moniz 2020). This scenario presents particular difficulties for learning algorithms, which tend to fit the interval of values with the greatest quantity of examples while neglecting the rare ones in the distribution, and hence fail to obtain good predictive performance for these particular examples.

Solutions for imbalanced regression problems have received relatively little attention when compared to those for classification problems (Haixiang et al. 2017). The most common approach used to address this gap has been to modify the distribution of examples by balancing the training data before the actual learning process begins. Some of these strategies are Random Under-sampling (Torgo et al. 2013), which removes examples from the intervals with the greatest quantities, Random Over-sampling (Branco et al. 2019), which replicates rare values in the dataset, and the WEighted Relevance-based Combination Strategy (WERCS) (Branco et al. 2019), which creates a weighted combination of biased versions of the under- and over-sampling strategies. In addition, several real-world imbalanced regression problems rely on resampling strategies to properly deal with rare and extreme cases, such as software defect prediction (Bal and Kumar 2018, 2020; Rathore and Kumar 2017) and Enzyme Optimum Temperature prediction (Gado et al. 2020), as well as detecting arsenic concentration in soil using satellite imagery (Agrawal and Petersen 2021). Hence, the variety of problems and the increased interest in this field demonstrate the need for studies on imbalanced regression techniques.

Another difficulty encountered in such scenarios is that traditional performance metrics, such as the Mean Squared Error (MSE) and the Mean Absolute Error (MAE), do not adequately capture user-defined criteria (Branco et al. 2019). In response, recent works have proposed new performance metrics for evaluating regression models under imbalanced target distributions, which place greater emphasis on errors occurring in rare cases. In these cases, the Precision, Recall, and F1-score metrics, as described for regression tasks (Torgo and Ribeiro 2009), and the squared error-relevance area (SERA) metric proposed in Ribeiro and Moniz (2020) are commonly used. Nevertheless, a comparison of multiple imbalanced regression strategies under these performance metrics, and of how the metrics differ in their approach to assessing model performance, is still an open question.

Therefore, our main goal is to analyze the effects of resampling strategies for dealing with imbalanced regression problems from different perspectives. To this end, we conduct an extensive experimental study employing different resampling strategies and learning algorithms. In addition, we use metrics that can assess the models’ performance in imbalanced regression tasks, such as the F1-score for regression and SERA (Ribeiro and Moniz 2020). To the best of our knowledge, this is the first work that performs a comprehensive empirical analysis of resampling techniques for imbalanced regression tasks. In contrast, for imbalanced classification tasks, numerous surveys and empirical studies have evaluated resampling algorithms in different scenarios, such as binary problems (García et al. 2020; Kovács 2019; Wojciechowski and Wilk 2017; Roy et al. 2018; Ali et al. 2019; Del Rio et al. 2015; Díez-Pastor et al. 2015; Moniz and Monteiro 2021), multiclass classification (Cruz et al. 2019; Sáez et al. 2016), and data streams (Aguiar et al. 2022; Zyblewski et al. 2019).

The broad scope of our experimental analysis, which considers multiple resampling strategies, regression models, and performance metrics, is at the core of the uniqueness of our research, since it allowed us to assess the relationship among these three variables. Our study thus differs from Branco et al. (2016), which addresses only theoretical aspects of imbalanced problems in general. Moreover, regarding performance metrics, the use of the SERA metric (Ribeiro and Moniz 2020) stands out, since no other work has evaluated all of these resampling strategies with it.

The following research questions guide this study: (i) Is it worth using resampling strategies? (ii) Which resampling strategies influence predictive performance the most? (iii) Does the choice of best strategy depend on the problem, the learning model, and the metrics used? (iv) Does the number of training examples resulting from each strategy influence the results? (v) Do the features of the data (percentage of rare cases, number of rare cases, dataset size, number of attributes and imbalance ratio) impact the predictive performance of the models? The experimental analysis revealed that resampling strategies are beneficial to the vast majority of regression models. The best strategies include GN, RO, and WERCS. Another important point is that choosing the best strategy depends on the dataset, the regression model, and the metric used when evaluating the system’s performance. Furthermore, we found that the dataset size, the number of rare cases, the number of attributes and the imbalance ratio significantly influence the results. The smallest datasets and those with the fewest rare cases are the most challenging. Models demonstrate superior performance on datasets with fewer features. Lastly, concerning the imbalance ratio, regression models encounter greater challenges as the imbalance ratio increases.

Contributions

  • We propose a novel taxonomy for imbalanced regression tasks according to the regression model, learning strategy and metrics.

  • We review the main strategies used for imbalanced regression tasks.

  • We conduct an extensive experimental study comparing the performance of state-of-the-art resampling strategies and their effects on multiple learning algorithms and novel performance metrics proposed in the literature.

  • We analyze the impact of dataset characteristics (e.g., dataset size and the number of rare cases) on the model’s predictive performance.

This work is organized as follows: Sect. 2 presents the basic concepts and proposes a taxonomy for imbalanced regression problems. Section 3 describes the resampling approaches evaluated in this study, highlighting their advantages and disadvantages. Section 4 presents the experimental methodology by describing the data, algorithms, parameters, and performance metrics used in this work. Results are shown in Sect. 5. Section 6 presents the lessons learned by revisiting and answering the research questions. Finally, Sect. 7 presents our conclusions.

2 Basic concepts and proposed taxonomy

Some fundamental concepts must be grasped in order to understand the notion of imbalanced regression. This section first presents the relevance function, a fundamental concept in imbalanced regression, as it defines the importance of each sample in the dataset. It then proposes a taxonomy that categorizes the approaches used to address imbalanced regression problems, providing a way to organize the existing literature. Based on this taxonomy, we review the main strategies for dealing with imbalanced regression problems.

2.1 Relevance function

The concept of relevance function is crucial for understanding the imbalanced regression problem and some of the strategies for dealing with it. Proposed by Ribeiro (2011), the relevance function (\(\phi : Y \rightarrow [0,1]\)) automatically assigns a relevance score to each example in the dataset. These scores determine which examples are normal and which are rare, the rare ones being the least represented in the dataset. The relevance function thus serves as the foundation both for evaluating models in the context of imbalanced regression and for resampling the data; consequently, using a different relevance function alters both the model evaluation and the data resampling.

To the best of our knowledge, this definition of relevance function is unique in the literature. In Ribeiro (2011) and Ribeiro and Moniz (2020), the relevance function is instantiated using the Piecewise Cubic Hermite Interpolating Polynomials (pchip) and cubic spline methods. However, cubic spline interpolation does not provide precise control over the function: it fails to confine the relevance function within the specified [0, 1] interval. The pchip method rectifies this limitation by employing suitable derivatives at the control points, thereby ensuring properties such as positivity, monotonicity, and convexity. Consequently, the relevance function proposed by Ribeiro (2011) uses the pchip method, and the works in the field follow this choice.

The relevance function (\(\phi\)) is calculated using Piecewise Cubic Hermite Interpolating Polynomials (pchip) (Dougherty et al. 1989) over a set of control points (Algorithm 1). The algorithm receives as input the control points (S) with their respective relevance values (\(\varphi (y_k)\)) and derivatives (\(\varphi '(y_k)\)). The condition \(y_1< y_2< \ldots < y_s\) ensures that the data points are ordered in ascending order of their y-values; this ordering is fundamental for the proper functioning of the pchip algorithm. As a result, the algorithm produces a separate \(\phi (y)\) polynomial for each interval \([y_k, y_{k+1}]\), with coefficients calculated based on the control points and their derivatives within that specific interval, where k indexes the control points in the input set S.

Algorithm 1
figure a

pchip(S): Piecewise cubic Hermite interpolating polynomials

The control points can be defined based on domain knowledge or provided by an automated method. When control points are defined based on domain knowledge, their selection is guided by the expertise and understanding of the specific problem or dataset; this approach relies on the insights and experience of individuals familiar with the data and its context. Ideally, access to domain knowledge for defining control points would be preferred. However, this knowledge is often unavailable or nonexistent (Ribeiro and Moniz 2020), and the use of an automatic method for control point definition becomes necessary. An example of defining control points based on domain knowledge for the NO2 emissions problem is presented in Table 1. Control points are determined based on Directive 2008/50/EC. The objective is to keep the LNO2 (target) hourly concentration values below a limit of \(\ln (150\,\mu g/m^3) \approx 5.0\), to which maximum relevance is assigned. The annual average guideline of \(\ln (40\,\mu g/m^3) \approx 3.7\) and the lowest LNO2 concentration value of \(\ln (3\,\mu g/m^3) \approx 1.1\) are both assigned minimal relevance.

Table 1 Control points of LNO2 concentration thresholds according to Directive 2008/50/EC (Ribeiro and Moniz 2020)

In this work, we employ the automatic method proposed by Ribeiro (2011) to define the control points. This method is based on Tukey’s boxplot (Tukey 1970), a graphical representation that displays the distribution of a dataset through five summary statistics: the adjacent limits \(adj_L\) (Eq. 1) and \(adj_H\) (Eq. 2), the first quartile (Q1), the third quartile (Q3) and the median \(\tilde{Y}\) (Eq. 3). In turn, the control points are defined by the adjacent limits and the median value. The input to the pchip algorithm consists of the control points, their relevance values and their derivatives. For this purpose, the adjacent values (\(adj_L\), \(adj_H\)) are assigned maximum relevance, equal to 1, and the median value (\(\tilde{Y}\)) is assigned a relevance of zero. All control points are initialized with derivative \(\phi '(y_k)\) equal to 0. In addition to defining the control points using Tukey’s boxplot, Ribeiro and Moniz (2020) propose the use of the adjusted boxplot of Hubert and Vandervieren (2008).

$$\begin{aligned}{} & {} adj_L = Q1 - 1.5 \cdot IQR \end{aligned}$$
(1)
$$\begin{aligned}{} & {} adj_H = Q3 + 1.5 \cdot IQR \end{aligned}$$
(2)
$$\begin{aligned}{} & {} \tilde{Y} = \text{ median } \text{ of } Y \end{aligned}$$
(3)

where Q1 and Q3 are the first and third quartile, respectively, and \(IQR = Q3 - Q1\).
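To make this construction concrete, the sketch below builds such a relevance function in Python with SciPy's PchipInterpolator, using the boxplot-derived control points of Eqs. 1-3. It is a minimal approximation of the procedure described above (function names are illustrative): the original algorithm also fixes the derivatives at the control points to zero, which SciPy's shape-preserving pchip only approximates.

import numpy as np
from scipy.interpolate import PchipInterpolator

def boxplot_relevance(y):
    """Automatic relevance function phi: Y -> [0, 1] from Tukey's boxplot.

    Control points: (adj_L, 1), (median, 0), (adj_H, 1), interpolated with pchip.
    Assumes IQR > 0 so that the control points are strictly increasing.
    """
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    adj_l, adj_h = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Eqs. (1) and (2)

    # Relevance is 1 at the adjacent limits and 0 at the median (Eq. 3).
    interp = PchipInterpolator([adj_l, med, adj_h], [1.0, 0.0, 1.0])

    def phi(v):
        # Saturate outside the adjacent limits and clip to [0, 1].
        return np.clip(interp(np.clip(v, adj_l, adj_h)), 0.0, 1.0)

    return phi

Calling phi = boxplot_relevance(y) and then phi(y) yields a relevance score for every target value, reproducing the overall shape shown in Fig. 2.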

Figure 2 illustrates the relevance function resulting from the pchip algorithm, for the fuelCons dataset. The points approaching \(\tilde{Y}\) have negligible relevance, whereas points that move away from \(\tilde{Y}\) and approach \(adj_L\) or \(adj_H\) have maximum relevance.

Fig. 2
figure 2

Relevance function of the fuelCons dataset

Algorithm 2
figure b

check_slopes (\(\Phi , \Delta\)) Fritsch and Carlson (1980)

The interpolation generates a function that passes through the control points. One of the main goals is to determine the correct slopes at the data points such that the interpolant is piecewise monotonic. To this end, a method that implements the Monotone Cubic Spline (Fritsch and Carlson 1980) (line 6) is used. The check_slopes method (Algorithm 2) ensures that the derivative is zero whenever a control point is a local maximum or minimum (Ribeiro and Moniz 2020).

A relevance threshold (\(t_R\)) defined by the user is employed to divide the data into rare (\(D_R\)) and normal (\(D_N\)) values. Given a dataset D, the sets \(D_R\) and \(D_N\) are defined with respect to this threshold as follows: \(D_R = \{\langle {{\textbf {x}}},y \rangle \in D: \phi (y) \ge t_R\}\) and \(D_N = \{\langle {{\textbf {x}}},y \rangle \in D:\phi (y) < t_R\}\).
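For instance, assuming X and y are NumPy arrays and phi is a relevance function such as the one sketched earlier in this subsection (variable names are illustrative), this split is a direct thresholding step:

t_R = 0.8                       # user-defined relevance threshold
relevance = phi(y)              # relevance of each target value
rare = relevance >= t_R         # membership in D_R
X_rare, y_rare = X[rare], y[rare]      # D_R: rare cases
X_norm, y_norm = X[~rare], y[~rare]    # D_N: normal cases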

2.2 Proposed taxonomy

In the context of class imbalance problems, solutions are often classified into four groups: Algorithmic level, Cost-sensitive, Ensemble learning, and Data preprocessing (Galar et al. 2011; López et al. 2013). However, one problem with this classification is that there is a significant overlap between the ensemble learning, data preprocessing, and cost-sensitive groups. Ensemble learning approaches can be combined with any of the other approaches, for instance by accounting for target imbalance at the algorithmic level when learning the base models, or by applying data preprocessing prior to training each base model in the ensemble. Therefore, to better understand the different approaches for dealing with imbalanced regression problems, we categorize the strategies into three main groups: (i) Regression Models, (ii) Learning Process Modification, and (iii) Evaluation Metrics.

Fig. 3
figure 3

Proposed taxonomy for imbalanced regression problems

The first group of strategies comprises regression models, such as single models and ensembles, which can be used to address imbalanced regression problems. However, their performance can be further improved by incorporating data preprocessing, cost-sensitive learning, and algorithmic-level modifications. The second group describes these additional strategies, which help adjust the learning process to deal with target imbalance and thus lead to better results than using the models alone. The third group comprises the evaluation metrics and is divided into local and global subgroups. The local metrics require a relevance threshold to distinguish extreme values and conduct a local evaluation, so cases with a relevance score lower than the threshold are disregarded. Conversely, global metrics do not require a relevance threshold and perform a global evaluation that considers all the examples. To conclude, categorizing these strategies into three groups provides a better understanding of the approaches and enables the selection of the most suitable strategy for dealing with imbalanced regression problems. As shown in Fig. 3, data preprocessing takes the spotlight, as it is the main focus of this work. Herein, we explore and compare different data preprocessing techniques to improve the performance of regression models (single models and ensembles) in imbalanced regression problems.

2.2.1 Regression models

Regression models such as MLPRegressor, Linear Support Vector Regression (SVR) and decision trees can be used to solve problems with imbalanced regression data, but they may not perform well due to the imbalance. In such cases, it may be necessary to utilize other techniques, such as data preprocessing or cost-sensitive learning, or to modify the algorithm, to address the issue. From the same perspective, ensemble models, such as bagging, boosting, and random forest, can also be utilized in addressing these problems. Solutions based on ensemble learning combined with data preprocessing and cost-sensitive strategies have been proposed. In Branco et al. (2018), the REsampled BAGGing (REBAGG) model was proposed to integrate data resampling strategies with bagging; it has the advantage of generating a diverse set of models by taking into account the different ways training data are resampled using the Random Under-sampling, Random Over-sampling and SmoteR strategies. SMOTEBoost (Moniz et al. 2018) includes a resampling step when boosting, where SmoteR is used to direct the distribution of data towards rare cases. In the same context, Moniz et al. (2017) carried out a performance study of ensemble methods in regression tasks with imbalanced datasets.

2.2.2 Learning process modification

Learning Process Modification refers to techniques that modify the training process of machine learning algorithms to take rare cases into account. These techniques include algorithmic-level modification, as well as cost-sensitive and data preprocessing methods. At the algorithmic level, a model is introduced in Torgo and Ribeiro (2003) with new splitting criteria for regression trees that allow inducing trees focused on extreme and rare predicted values. Yang et al. (2021) proposed methods aimed at favoring the similarity between nearby targets by applying kernel smoothing to the distributions in the target and attribute spaces. Ribeiro (2011) then addressed a utility-based algorithm involving cost-sensitive learning, designed around a set of rules extracted from the generation of different regression trees and aimed at obtaining accurate and interpretable predictions for imbalanced regression. Steininger et al. (2021) proposed a density-based weighting approach to address the issue of imbalanced regression, building on the cost-sensitive method; this approach assigns higher weights to rare cases by taking into account their local densities. Finally, one of the most common approaches for treating imbalanced issues is data preprocessing, also known as resampling or balancing algorithms, which precedes the learning process and alters the distribution of examples. It works by either removing samples of common cases (i.e., under-sampling) or generating samples of rare cases (i.e., over-sampling). Data preprocessing techniques have the advantage that virtually any learning algorithm can be used afterwards, without affecting the explicability of the model (Branco et al. 2019).

Different resampling strategies have been proposed to deal with imbalanced regression problems. Most such techniques are based on existing resampling strategies proposed for classification problems. That is the case, for example, of the SmoteR algorithm, a variation of the Smote algorithm (Chawla et al. 2002) with the following main adaptations to the regression setting: (i) the definition of rare cases, (ii) the creation of synthetic examples, and (iii) the definition of target values for newly generated examples. Also based on the Smote algorithm, Camacho et al. (2022) proposed Geometric SMOTE, which generates synthetic data points within a geometric region defined around existing data points rather than strictly along the line connecting them. Other strategies adapted from imbalanced classification are: Random Under-sampling (Torgo et al. 2013), based on the idea of Kubat et al. (1997); Random Over-sampling (Branco et al. 2019), proposed for classification in Batista et al. (2004); and the Introduction of Gaussian Noise (Branco et al. 2019), adapted from Lee (1999, 2000). In contrast, the SMOGN (SmoteR with Gaussian Noise) (Branco et al. 2017) and WERCS (WEighted Relevance-based Combination Strategy) (Branco et al. 2019) strategies were originally proposed for handling imbalanced regression problems. Furthermore, Song et al. (2022) introduced a distributed version of SMOGN called DistSMOGN, which uses a weighted sampling technique to generate synthetic samples for rare cases while considering the data distribution in each node of the distributed system. In the context of imbalanced data streams for regression, Aminian et al. (2021) introduced two sampling strategies (ChebyUS, ChebyOS) based on the Chebyshev inequality to improve the performance of existing regression methods on imbalanced data streams; these approaches use a weighted learning strategy that assigns higher weights to rare cases in order to balance the training process.

Each strategy resamples data differently. However, they appear to be based on the same principles: reducing normal examples and/or increasing rare examples. Under-sampling, which reduces normal examples, is the basis of the Random Under-sampling strategy. In contrast, over-sampling, which increases rare examples, can be performed simply by replicating examples, as in Random Over-sampling, or by generating synthetic cases, as in the SmoteR algorithm and the Introduction of Gaussian Noise. Other strategies build on the aforementioned ones. Examples include SmoteR with Gaussian Noise (SMOGN), which combines the Random Under-sampling strategy with the SmoteR and Introduction of Gaussian Noise over-sampling strategies, and the WEighted Relevance-based Combination Strategy (WERCS), which combines the Random Under-sampling and Random Over-sampling strategies by using weights to perform the resampling without establishing a relevance threshold.

In our study, we analyze a variety of data preprocessing techniques to optimize the performance of single and ensemble regression models in addressing imbalanced regression problems. Our objective is to compare the effectiveness of the different approaches and identify the most suitable strategies for this situation. By carefully assessing these techniques, we aim to provide guidance on how to increase the success rate of regression models using data preprocessing techniques in imbalanced regression tasks.

2.2.3 Evaluation metrics

The choice of assessment metrics is fundamental in an imbalanced dataset scenario. Some metrics, such as the MSE, may mislead users when the focus is on the accuracy of rare values of the target variable (Moniz et al. 2014), since they do not consider the relevance of each test example. To show the limitations of the MSE metric and how the scores obtained by different metrics can differ significantly, we present a synthetic example (Table 2). For 10 examples of the FuelCons dataset, we present hypothetical predictions for two artificial models: \(M_1\) and \(M_2\). The True row represents the true target of each instance, obtained directly from the FuelCons dataset. The \(\phi\) row is the relevance value of each example. Meanwhile, the \(M_1\) and \(M_2\) rows showcase the predictions generated by the respective models for the individual test examples, and the \(M_1\) and \(M_2\) loss rows quantify the differences between the true target and the models’ predictions for each test example. The example shows that \(M_1\) generates more accurate predictions for the less relevant examples, which are the most represented in the dataset, while \(M_2\) performs better for the more relevant examples, which are rare. Nonetheless, if the models’ performances are assessed using the MSE metric, there will be no difference in scores between them, because the MSE metric treats all examples as having the same relevance (\(\phi\)). Therefore, in the imbalanced data scenario, where each example has a particular relevance, it is preferable to use metrics that consider the relevance of each particular example.

Table 2 Predictions of two artificial models

Other metrics consider each example as having a particular relevance score, such as Precision, Recall, and the F1-score, which were proposed for regression applications in Torgo and Ribeiro (2009). In addition, the Squared error-relevance area (SERA) metric, which was specifically created for imbalanced regression, was proposed by Ribeiro and Moniz (2020). This metric aims to effectively assess the model’s performance for predictions of extreme values while being robust to model bias. Table 3 presents the MSE, F1-score, and SERA values for the example presented in Table 2. As earlier mentioned, for the MSE, the models are regarded as equals since they both have the same error amplitude. Nonetheless, for the F1-score and SERA, which consider each example’s relevance, \(M_2\) is the best model as it presents a lower error in the most important examples.

Table 3 Performances of two artificial models

The Precision, Recall, and F1-score metrics require that a relevance threshold be defined to determine extreme values. Thus, a local evaluation is performed, since examples below the threshold are ignored. Furthermore, these metrics rely on a utility-based framework (Torgo and Ribeiro 2007; Ribeiro 2011), which combines the numeric error of the prediction with the relevance of the actual and predicted values. The utility of predicting a value \({\hat{y}}\) for y is calculated from the notions of costs and benefits of numeric predictions (Branco et al. 2019); the utility function \(U^p_\phi ({\hat{y}},y)\) is given by Eq. 4, where \({\hat{y}}\) is the predicted value and y is the actual value.

$$\begin{aligned} \begin{aligned}&U^p_\phi ({\hat{y}},y) = B_\phi ({\hat{y}},y) - C^p_\phi ({\hat{y}},y) \\&\quad =\phi (y)\cdot (1-\Gamma _B({\hat{y}},y))-\phi ^p({\hat{y}},y)\cdot \Gamma _C({\hat{y}},y) \end{aligned} \end{aligned}$$
(4)

The utility is given by the difference between the benefit (\(B_\phi ({\hat{y}},y)\)) and the cost (\(C^p_\phi ({\hat{y}},y)\)) of predicting \({\hat{y}}\) for y. The benefit is defined as a proportion of the relevance of the actual value, \(\phi (y)\cdot (1-\Gamma _B({\hat{y}},y))\), where \(\Gamma _B({\hat{y}},y)\) is a bounded loss function (Eq. 5) that quantifies the loss incurred when predicting \({\hat{y}}\) for the actual value y. This bounded loss operates on a scale from 0 to 1, where 0 represents no loss and 1 represents maximum loss.

$$\begin{aligned} \Gamma _B({\hat{y}},y) = {\left\{ \begin{array}{ll} L({\hat{y}},y)/{\dot{L}}_B({\hat{y}},y), &{} \text{ if } L({\hat{y}},y) < {\dot{L}}_B({\hat{y}},y)\\ 1, &{} \text{ if } L({\hat{y}},y) \ge {\dot{L}}_B({\hat{y}},y) \end{array}\right. } \end{aligned}$$
(5)

L is a “standard” loss function [e.g., absolute deviation (Eq. 6)] and \({\dot{L}}_B\) is the benefit threshold function (Eq. 7). The benefit threshold function identifies the point at which the predicted value ceases to provide a benefit. This can happen under two conditions: (i) surpassing the maximum admissible loss of the bump or (ii) being situated on a different bump (Ribeiro 2011).

$$\begin{aligned}{} & {} L({\hat{y}}, y) = |{\hat{y}} - y| \end{aligned}$$
(6)
$$\begin{aligned}{} & {} {\dot{L}}_B({\hat{y}}, y) = min\{b^\Delta _{\gamma (y)}, {\ddot{L}}_B({\hat{y}}, y) \} \end{aligned}$$
(7)

where \(b^\Delta _{\gamma (y)}\) is the maximum admissible loss, defined in Eq. 8. The maximum admissible loss is calculated for each bump i, where a bump refers to an interval of the domain, denoted as \(B \subseteq Y\) (Ribeiro 2011). \(b^-_i\) is the partition node of bump i, where the relevance of the target variable reaches a local minimum, and \(b^*_i\) is the value within bump i at which the target variable reaches maximum relevance. The reason for this definition is that the function depends on the smallest difference in the target variable when moving from the most relevant value within a bump (\(b^*_i\)) to an adjacent bump. These small differences can have two effects on model performance. On the positive side, they can make the model more accurate by focusing on the regions where predictions must be very close to the actual values, which is useful when high accuracy is needed in specific parts of the data. Conversely, the model might become too fixated on the training data, making it sensitive to unusual data points and poor at handling new data, leading to overfitting. Consequently, when dealing with “narrow” bumps, the sensitivity to prediction errors is heightened, whereas for broader bumps, larger disparities between the actual and predicted values are deemed acceptable (Ribeiro 2011).

$$\begin{aligned} b^\Delta _{\gamma (y)} = 2 \cdot \text{ min }\{\mid b^-_i - b^*_i\mid , \mid b^*_i - b^-_{i+1} \mid \} \end{aligned}$$
(8)

Figure 4 shows the bump partition obtained for a relevance function and the maximum admissible loss for each bump. This arbitrary relevance function, defined in the context of non-uniform utility regression, has four quite different bumps.

Fig. 4
figure 4

Bumps partition of Y with respect to relevance function \(\phi\) and the maximum admissible loss in each bump. Each bump i is characterized by its partition node \(b^-\) and by one global maximum \(b^*\). Each bump has a maximum error tolerance defined as twice the smallest amplitude in the bump between each of its bounds and its maximum value (Ribeiro 2011)

The function \({\ddot{L}}_B({\hat{y}},y)\) (Eq. 9) is defined as follows:

$$\begin{aligned} {\ddot{L}}_B({\hat{y}},y) = {\left\{ \begin{array}{ll} \mid y - b^-_{\gamma (y)}\mid , &{} \text{ if } {\hat{y}}< y\\ \mid y - b^-_{\gamma (y)+1}\mid , &{} \text{ if } {\hat{y}} \ge y \end{array}\right. } \end{aligned}$$
(9)

This definition satisfies two essential conditions: (1) the first argument of the min function bounds the maximum allowable error within the bump of the true value, guaranteeing a reasonable level of accuracy in the prediction; (2) the second argument evaluates whether the predicted value corresponds to the correct action by considering its proximity to the boundaries of the bump associated with the true value.

The cost is given by the weighted mean of relevances (\(\phi ^p({\hat{y}},y)\)) (Eq. 10), where the parameter p defines the weights assigned to the relevance of the predicted value \({\hat{y}}\) and to that of the actual value y, and \(\Gamma _C({\hat{y}},y)\) is a bounded loss function on the scale [0, 1]. The intuition is to balance the importance of the predicted and the actual values within the utility function.

$$\begin{aligned} \phi ^p({\hat{y}},y) = (1-p)\phi ({\hat{y}})+p\phi (y) \end{aligned}$$
(10)

The cost function \(\Gamma _C({\hat{y}},y)\) is calculated according to Eq. 11.

$$\begin{aligned} \Gamma _C({\hat{y}},y) = {\left\{ \begin{array}{ll} L({\hat{y}},y)/{\dot{L}}_C({\hat{y}},y), &{} \text{ if } L({\hat{y}},y) < {\dot{L}}_C({\hat{y}},y)\\ 1, &{} \text{ if } L({\hat{y}},y) \ge {\dot{L}}_C({\hat{y}},y) \end{array}\right. } \end{aligned}$$
(11)

where L is the standard loss function, and \({\dot{L}}_C\) is the cost threshold function (Eq. 12):

$$\begin{aligned} {\dot{L}}_C({\hat{y}}, y) = min\{ b^\Delta _{\gamma (y)}, {\ddot{L}}_C({\hat{y}}, y) \} \end{aligned}$$
(12)

and \({\ddot{L}}_C({\hat{y}},y)\) is defined as follows:

$$\begin{aligned} {\ddot{L}}_C({\hat{y}},y) = {\left\{ \begin{array}{ll} \mid y - b^*_{\gamma (y)-1}\mid , &{} \text{ if } {\hat{y}}< y\\ \mid y - b^*_{\gamma (y)+1}\mid , &{} \text{ if } {\hat{y}} \ge y \end{array}\right. } \end{aligned}$$
(13)

Based on the utility function, the Precision and Recall metrics for regression are defined by Eqs. 14 and 15, respectively.

$$\begin{aligned}{} & {} Precision = \frac{\sum _{\phi ({\hat{y}}_i)>t_R} (1+U^p_\phi ({\hat{y}}_i, y_i))}{\sum _{\phi ({\hat{y}}_i)>t_R}(1+\phi ({\hat{y}}_i))} \end{aligned}$$
(14)
$$\begin{aligned}{} & {} Recall = \frac{\sum _{\phi (y_i)>t_R} (1+U^p_\phi ({\hat{y}}_i, y_i))}{\sum _{\phi (y_i)>t_R}(1+\phi (y_i))} \end{aligned}$$
(15)

The relevance of the actual value \(y_i\) is defined by \(\phi (y_i)\), as defined in Sect. 2.1, and \(\phi ({\hat{y}}_i)\) is the relevance of the predicted value \({\hat{y}}_i\). \(t_R\) is a threshold defined by the user for the relevance values, and \(U^p_\phi ({\hat{y}}_i, y_i)\) is the utility function previously described.

The Precision and Recall metrics can be aggregated into compound measures, such as the F1-score, defined by Eq. 16:

$$\begin{aligned} \textit{F1-score} = \frac{(\beta ^2+1) \cdot Precision \cdot Recall}{\beta ^2 \cdot Precision + Recall} \end{aligned}$$
(16)

where \(0 \le \beta \le 1\) controls the relative importance of Recall with respect to Precision. These compound measures have the advantage of allowing comparisons between models by providing a single score (Torgo and Ribeiro 2009).
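Given utility-based Precision and Recall values, combining them as in Eq. 16 is straightforward; a minimal helper is sketched below (the harder part, computing the utility-based Precision and Recall themselves, depends on the bump partition described above and is omitted):

def f_score(precision, recall, beta=1.0):
    """Compound measure of Eq. (16); beta weights Recall relative to Precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)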

These metrics require the definition of an ad-hoc relevance threshold and do not consider examples below the threshold for model evaluation (Ribeiro and Moniz 2020). To address this, Ribeiro and Moniz (2020) proposed the SERA metric.

The SERA metric can assess a model's efficacy and optimize it for predicting rare and extreme cases. This metric does not require the definition of a relevance threshold and thus performs a global evaluation, since all data points are considered. The squared error-relevance is computed with respect to a cutoff t on a relevance function \(\phi : Y \rightarrow [0,1]\). A subset \(D^t = \{ \langle {{\textbf {x}}},y \rangle \in D: \phi (y) \ge t\}\), formed by the examples whose relevance is at least t, is considered for this estimate, as in Eq. 17:

$$\begin{aligned} SER_t = \sum \limits _{i \in D^t}(\hat{y_i}-y_i)^2 \end{aligned}$$
(17)

The squared error-relevance area (SERA) represents the area below the curve \(SER_t\), obtained through the integration presented in Eq. 18:

$$\begin{aligned} SERA = \int \limits _{0}^{1} SER_t \hspace{1mm} dt = \int \limits _{0}^{1} \sum \limits _{i\in D^t}(\hat{y_i}-y_i)^2 \hspace{1mm} dt \end{aligned}$$
(18)

The \(SER_t\) curve offers a broad view of prediction errors in the domain at various relevance cutoff values. Therefore, a smaller area under the curve (SERA) indicates a better model. It is noteworthy that, assuming uniform preferences with \(\phi (y) = 1\) for all y, SERA reduces to the sum of squared errors.
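In practice, the integral in Eq. 18 can be approximated numerically by evaluating \(SER_t\) on a grid of relevance cutoffs and applying the trapezoidal rule. The sketch below assumes a relevance function phi as in Sect. 2.1; the number of grid steps is an illustrative choice.

import numpy as np

def sera(y_true, y_pred, phi, steps=1000):
    """Approximate SERA (Eq. 18): integrate SER_t over relevance cutoffs t in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relevance = np.asarray(phi(y_true), dtype=float)
    sq_err = (y_pred - y_true) ** 2
    ts = np.linspace(0.0, 1.0, steps + 1)
    # SER_t: sum of squared errors over the examples whose relevance is >= t (Eq. 17).
    ser = np.array([sq_err[relevance >= t].sum() for t in ts])
    # Trapezoidal rule over the uniform grid of cutoffs.
    return float(np.sum((ser[:-1] + ser[1:]) * 0.5 * (ts[1] - ts[0])))

A smaller value indicates a better model, and with phi(y) = 1 for all y the result equals the sum of squared errors (up to the discretization).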

3 Resampling strategies

The most common way to deal with imbalanced datasets is to use resampling strategies, which change the data distribution to balance the targets (Moniz et al. 2017). Such strategies concentrate on three main approaches: (i) over-sampling, (ii) under-sampling, and (iii) a combination of the two. In over-sampling, rare cases are generated to compensate for the imbalanced distribution. Random Over-sampling (Branco et al. 2019) is one such technique, which works by replicating rare cases prior to training. However, it is also possible to perform over-sampling by generating synthetic cases, as in the SmoteR (Torgo et al. 2013) and Introduction of Gaussian Noise (Branco et al. 2019) strategies.

Conversely, under-sampling techniques aim to exclude data from the most frequent regions (i.e., normal examples); the Random Under-sampling algorithm (Torgo et al. 2013) follows this notion. Some strategies employ a combination of approaches: SmoteR and the Introduction of Gaussian Noise generate synthetic cases and also apply under-sampling; the WEighted Relevance-based Combination Strategy (Branco et al. 2019) combines under-sampling and over-sampling; and SMOGN (Branco et al. 2017) combines under-sampling with the generation of synthetic cases using SmoteR and GN.

Sections 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6 provide an overview of the resampling strategies evaluated in this work. These strategies were selected based on their wide adoption in the literature. Conversely, other strategies were disregarded due to an absence of publicly available source code for them, limited reproducibility, and infrequent utilization by researchers for diverse problem domains. Finally, Sect. 3.7 critically analyzes the resampling strategies with a visual example.

3.1 SmoteR

The SMOTE for regression (SmoteR) algorithm was proposed in Torgo et al. (2013) (Algorithm 3). Like the other methods addressing imbalanced regression, it requires a relevance function (\(\phi (y)\)) and a relevance threshold (\(t_R\)), from which the relevant and irrelevant examples are determined. The algorithm removes the least relevant examples (lines 4 to 7), which are considered “normal”, and then generates synthetic examples based on the most relevant ones (line 8). The generation process follows the idea of SMOTE: first, one rare case is selected from the dataset as the seed case together with one of its K-Nearest Neighbors, and a new data point is generated between the seed and the selected neighbor. Algorithm 4 presents the procedure for generating the synthetic cases using SmoteR. First, the number of synthetic examples to be generated from each selected rare case, ng, is determined based on the user-defined over-sampling percentage o and the dataset cardinality |D| (line 3). Then, for each rare case c used as a reference in the generation process, its K-Nearest Neighbors, nns, are computed (line 5). After the set of neighbors is obtained, the algorithm executes multiple iterations to generate ng synthetic examples by picking one of the examples in the nns set at random and interpolating it with the reference one. This generation process is presented in lines 8 to 15, which show how the attribute values of the synthetic case are generated. If the attributes are numeric, the difference between the attributes of the two seed cases is calculated (line 10); line 11 then multiplies this difference by a random number between 0 and 1 and adds the result to the attribute of the reference example. Otherwise, a random selection between the values of the seed cases is performed. In lines 16 to 18, the target value is generated as a weighted average of the targets of the two cases, with weights obtained from the distances between the new case and the two seed cases (lines 16 and 17). In de Oliveira Branco (2018), this strategy is extended and becomes able to handle any number of either normal or rare cases.

Algorithm 3
figure c

SmoteR

Algorithm 4
figure d

Generating synthetic cases
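For illustration, a simplified sketch of the SmoteR generation step for numeric attributes only, assuming the rare cases have already been selected with the relevance threshold. It mirrors the interpolation and target-averaging logic of Algorithm 4, but omits nominal attributes, the under-sampling phase, and the exact parameter handling (the over-sampling percentage is simplified to a number of new cases per seed).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smoter_oversample(X_rare, y_rare, n_per_seed=1, k=5, rng=None):
    """Generate synthetic rare cases by interpolating each seed with one of its
    k nearest rare neighbors; the target is a distance-weighted average."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_rare) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_rare)
    _, idx = nn.kneighbors(X_rare)                 # idx[:, 0] is the seed itself
    new_X, new_y = [], []
    for i, seed in enumerate(X_rare):
        for _ in range(n_per_seed):
            j = rng.choice(idx[i, 1:])             # pick one neighbor at random
            diff = X_rare[j] - seed
            synth = seed + rng.random() * diff     # interpolate the attributes
            # Target: average weighted by the inverse distances to the two seeds.
            d1 = np.linalg.norm(synth - seed) or 1e-12
            d2 = np.linalg.norm(synth - X_rare[j]) or 1e-12
            w1, w2 = 1.0 / d1, 1.0 / d2
            new_X.append(synth)
            new_y.append((w1 * y_rare[i] + w2 * y_rare[j]) / (w1 + w2))
    return np.array(new_X), np.array(new_y)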

3.2 Random over-sampling

The Random Over-sampling (Branco et al. 2019) strategy, presented in Algorithm 5, works by first selecting the examples that are above the relevance threshold \(t_R\) (line 2) as candidates to be duplicated, \(Bins_R\). Then, for each bin B belonging to the rare examples \(Bins_R\), the number of replicas tgtNr to generate is defined according to its cardinality |B| (the number of examples contained in that specific bin) and the over-sampling percentage o (line 4), a hyperparameter defined by the user. Random sampling is performed on line 5, and the duplicated cases are added to the new dataset (newD) on line 6. No special treatment is required to generate the target values: as the generated examples are identical to existing rare cases, the duplicates keep exactly the same target value.

Algorithm 5
figure e

Random over-sampling
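A minimal sketch of this replication step in Python, treating all rare cases as a single bin (a simplification of Algorithm 5); rel holds the precomputed relevance values as a NumPy array and the percentage is illustrative:

import numpy as np

def random_oversample(X, y, rel, t_R=0.8, over_pct=0.5, rng=None):
    """Replicate randomly chosen rare cases (phi(y) >= t_R); the number of
    copies is a percentage of the rare-bin cardinality."""
    rng = np.random.default_rng(rng)
    rare = np.flatnonzero(rel >= t_R)
    n_copies = int(over_pct * len(rare))
    picked = rng.choice(rare, size=n_copies, replace=True)
    return np.vstack([X, X[picked]]), np.concatenate([y, y[picked]])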

Algorithm 6
figure f

Random under-sampling

3.3 Random under-sampling

The Random Under-sampling strategy (Algorithm 6) was proposed by Torgo et al. (2013). In this approach, the under-sampling is performed by first using the relevance function (Sect. 2.1) and a relevance threshold \(t_R\) to define the rare cases in the dataset (line 1). The examples below \(t_R\) are considered normal and are candidates for removal from the final dataset (Branco et al. 2016) (line 2), while rare cases are kept. The removal of normal examples is performed according to an under-sampling rate u provided by the user, which defines the percentage of under-sampling applied to the dataset. For each bin B belonging to the set of normal examples \(Bins_N\), the number of examples removed from it is computed based on its cardinality and the under-sampling percentage u (line 5). Line 6 performs the under-sampling in B by randomly selecting data points to be removed, resulting in a reduced set that is used to compose the final dataset newD.
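A corresponding single-bin sketch of Algorithm 6, with the same illustrative conventions as the previous example:

import numpy as np

def random_undersample(X, y, rel, t_R=0.8, under_pct=0.5, rng=None):
    """Keep all rare cases and randomly remove a percentage of the normal ones."""
    rng = np.random.default_rng(rng)
    rare = np.flatnonzero(rel >= t_R)
    normal = np.flatnonzero(rel < t_R)
    n_remove = int(under_pct * len(normal))
    kept_normal = rng.choice(normal, size=len(normal) - n_remove, replace=False)
    idx = np.concatenate([rare, kept_normal])
    return X[idx], y[idx]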

3.4 Introduction of Gaussian noise

Generating synthetic examples through Gaussian noise (Introduction of Gaussian Noise - GN) constitutes an adaptation to the regression context of the method proposed in Lee (1999, 2000) for classification tasks. Algorithm 7 presents the GN technique. It starts by dividing the dataset into normal cases \(Bins_N\) and rare cases \(Bins_R\) according to the relevance function \(\phi (y)\) and the relevance threshold \(t_R\) (lines 1 and 2). The set of examples belonging to \(Bins_N\) (i.e., normal examples) is reduced using the Random Under-sampling technique (lines 4 to 6), and the amount of reduction is controlled by the under-sampling percentage u, a hyperparameter defined by the user.

In lines 8 to 20, the over-sampling procedure is performed using the samples in \(Bins_R\). For each seed case selected and used in the generation process, a total of ng new artificially generated examples are added to the dataset, where ng is computed based on the over-sampling percentage o and the number of examples in the corresponding set \(B \in Bins_R\) (line 9). The artificial cases are generated by introducing a small perturbation on both the attributes and the target value of the seed case. If the attributes are nominal (line 13), the generation is performed with probability proportional to the frequency of the values found in the category (lines 14 and 15). Otherwise, for the numeric attributes, a random perturbation drawn from a normal distribution is added, as indicated in lines 17 and 18, where \(\delta\) is the perturbation amplitude defined by the user and sd(a) is the standard deviation of attribute a estimated using the examples in the category. The same normal perturbation is also applied to the seed target value in order to generate the target value of the newly generated example.

Algorithm 7
figure g

Introduction of Gaussian Noise
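A sketch of the over-sampling part of GN for numeric attributes (the nominal-attribute handling and the under-sampling step of Algorithm 7 are omitted; standard deviations are estimated from the rare set itself as a simplification):

import numpy as np

def gaussian_noise_oversample(X_rare, y_rare, n_per_seed=1, delta=0.05, rng=None):
    """Perturb each seed case and its target with Gaussian noise scaled by delta * std."""
    rng = np.random.default_rng(rng)
    sd_x = X_rare.std(axis=0)
    sd_y = y_rare.std()
    new_X, new_y = [], []
    for seed_x, seed_y in zip(X_rare, y_rare):
        for _ in range(n_per_seed):
            new_X.append(seed_x + rng.normal(size=seed_x.shape) * delta * sd_x)
            new_y.append(seed_y + rng.normal() * delta * sd_y)
    return np.array(new_X), np.array(new_y)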

3.5 SmoteR with Gaussian noise

The SmoteR with Gaussian Noise (SMOGN - SG) strategy (Branco et al. 2017) (Algorithm 8) combines the Random Under-sampling strategy (lines 6 to 9) with two over-sampling strategies: SmoteR and the Introduction of Gaussian Noise. The goal is to limit SmoteR's risk of generating poor examples when the seed and its selected neighbor are not close enough, by falling back on the more conservative strategy of simply introducing Gaussian noise to generate new cases. Such poor examples may not be representative of the underlying data distribution and can introduce issues like noise, bias, or inconsistencies into the dataset. Moreover, the technique aims to allow for an increase in diversity when generating examples, which is not feasible using only the Introduction of Gaussian Noise method (Branco et al. 2017). Increasing diversity means producing examples that are not overly similar or redundant, but instead capture the different patterns, variations, and scenarios present in the data, so that the data distribution is represented comprehensively. Thus, SMOGN addresses the main drawbacks of both the SmoteR and the Introduction of Gaussian Noise techniques.

Line 11 determines the number of synthetic cases ng that will be generated according to the over-sampling percentage o and the number of existing cases in the corresponding bin B. Then, for each seed case in B, its K-Nearest Neighbors and the maximum allowed distance for generating new cases with SmoteR are computed (lines 13 to 15). When the seed case and the selected neighbor are “sufficiently near” (i.e., their distance is below the computed threshold maxD), SMOGN generates new synthetic examples with the SmoteR technique (lines 17 and 18). Otherwise, when the distance between the two examples exceeds the estimated threshold, it uses the Introduction of Gaussian Noise method (lines 20 and 21). The generated data points are then added to the new dataset, newD.

Algorithm 8
figure h

SMOGN
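A sketch of this over-sampling decision for numeric attributes, generating one synthetic case per seed. The distance threshold maxD is taken here as half the median distance to the seed's neighbors, which is one common instantiation and an assumption of this sketch; percentages, nominal attributes, and the under-sampling phase are omitted.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smogn_oversample(X_rare, y_rare, k=5, delta=0.05, rng=None):
    """For each seed, pick a random rare neighbor; interpolate (SmoteR-style) if it
    is close enough, otherwise perturb the seed with Gaussian noise."""
    rng = np.random.default_rng(rng)
    k = min(k, len(X_rare) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_rare)
    dist, idx = nn.kneighbors(X_rare)
    sd_x, sd_y = X_rare.std(axis=0), y_rare.std()
    new_X, new_y = [], []
    for i, seed in enumerate(X_rare):
        max_d = 0.5 * np.median(dist[i, 1:])     # distance threshold maxD (assumed rule)
        pos = rng.integers(1, k + 1)             # random neighbor slot (1..k)
        j = idx[i, pos]
        if dist[i, pos] < max_d:                 # close enough: SmoteR interpolation
            frac = rng.random()
            synth = seed + frac * (X_rare[j] - seed)
            target = (1 - frac) * y_rare[i] + frac * y_rare[j]
        else:                                    # too far: Gaussian noise perturbation
            synth = seed + rng.normal(size=seed.shape) * delta * sd_x
            target = y_rare[i] + rng.normal() * delta * sd_y
        new_X.append(synth)
        new_y.append(target)
    return np.array(new_X), np.array(new_y)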

3.6 WEighted relevance based combination strategy

The WEighted Relevance-based Combination Strategy (WERCS) (Branco et al. 2019) combines biased versions of the under- and over-sampling strategies that depend exclusively on the relevance function provided for the dataset, without requiring a relevance threshold. In WERCS, the relevance function and a transformation of it are used to assign weights that serve as inclusion and removal criteria for the examples. Algorithm 9 details this resampling strategy. The over-sampling and under-sampling in lines 4 and 7, respectively, are performed considering the weights obtained in lines 3 and 6, which are calculated from the relevance function. The weights associated with over-sampling, WOver, are proportional to the relevance function (line 3); therefore, the higher the relevance of a case, the higher its probability of being selected for generating new cases. Conversely, the weights associated with under-sampling, WUnd, are inversely proportional to the relevance value (line 6); thus, normal examples, which are usually associated with lower relevance values, have a higher probability of being removed rather than used in the generation process. The number of generated and removed samples is defined based on the percentages of over-sampling o and under-sampling u, respectively.

The main advantage of this technique is that, since a relevance threshold is not set a priori, each example can participate in both processes, and both the under-sampling and over-sampling strategies are applied over the entire dataset. The technique thus eliminates the dependency on the relevance threshold \(t_R\), a key component required by all the other resampling strategies reviewed in this work.

Algorithm 9
figure i

WEighted relevance-based combination strategy (WERCS)
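A minimal sketch of WERCS-style weighted resampling, with illustrative over- and under-sampling percentages; rel is a NumPy array of relevance values, the weights are normalized relevances, and an example may be both copied and dropped, since each example can participate in both processes.

import numpy as np

def wercs(X, y, rel, over_pct=0.5, under_pct=0.5, rng=None):
    """Copy cases with probability proportional to relevance and drop cases with
    probability proportional to (1 - relevance); no relevance threshold is needed."""
    rng = np.random.default_rng(rng)
    n = len(y)
    w_over = rel / rel.sum()                   # more relevant -> more likely copied
    w_under = (1 - rel) / (1 - rel).sum()      # less relevant -> more likely removed
    copies = rng.choice(n, size=int(over_pct * n), replace=True, p=w_over)
    dropped = rng.choice(n, size=int(under_pct * n), replace=False, p=w_under)
    kept = np.setdiff1d(np.arange(n), dropped)
    idx = np.concatenate([kept, copies])
    return X[idx], y[idx]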

3.7 Advantages and disadvantages of strategies

Strategies to resample data have both advantages and disadvantages, so it is crucial to understand the behavior of each strategy. While these strategies can potentially enhance learning, they can also impede the learning process of the models. Figure 5 presents the result of applying the resampling strategies to the FuelCons dataset. The following values were assigned to the algorithms’ parameters: u/o = balance and \(t_R\) = 0.8 (except for WERCS, since it does not require establishing the threshold); default values were adopted for the remaining parameters. For the visualization, the target values (Y) and the attribute X30 were considered.

Despite selecting the nearest examples to generate new cases, SmoteR still carries the risk that the selected neighbor is too far from the seed, producing a synthetic example that does not correspond to the seed very well. This phenomenon is shown in the lower left side of Fig. 5b, where the generated examples are far from the original ones. In the RO strategy, high percentages of over-sampling may cause overfitting (Branco et al. 2019). Even though the technique considerably increases the representation of rare cases, the generated dataset does not present high diversity, since the generation process consists merely of duplicating existing samples without covering the feature space well.

Fig. 5
figure 5

Distribution of the examples of the FuelCons dataset after applying the resampling strategies, considering \(t_R\)=0.8

Figure 5c shows the rare data points in a darker shade, given that RO only makes copies of the examples. This can lead learning algorithms to overfit such rare examples. In addition, if the replication rate is too high, many duplicate data points are added to the dataset, which can significantly increase the training time. In contrast to RO, in the RU strategy some meaningful information may be lost due to the removal of training data (Fig. 5d), which may hamper the learning of the model. Figure 5e shows the result of the GN strategy, which promotes over-sampling by adding normally distributed noise. Once again, in contrast to the RO strategy, examples different from the original ones are generated, and this diversity can help mitigate overfitting. For the SG strategy, even though one of its goals is to reduce the risks seen in SmoteR by creating examples different from the original ones, Fig. 5f shows that there is still a similarity with the SmoteR distribution; however, compared to GN, the diversity of generated examples is clearly higher in SG. In the WERCS strategy (Fig. 5g), it can be seen that the green data points are divided into two groups after the under-sampling, and this result can complicate the learning process. The WERCS over-sampling behaves similarly to RO, where the generated data are copies of the originals; thus, no new information is added to the training set.

The advantages and disadvantages of each resampling strategy are quite evident, as is the fact that there is no perfect strategy. We hypothesize that other variables, such as the regression model and the dataset under investigation, are required to determine the best data resampling strategy. Thus, our research allows us to understand the behavior of these strategies with different regression models and problems, which in turn makes it possible to establish directions for combinations of the three variables, namely, the resampling strategy, the regression model, and the dataset.

4 Research methodology

4.1 Datasets

Experiments were performed using 30 imbalanced regression datasets, chosen because they are frequently used in studies on imbalanced regression. The levels of imbalance in these datasets are defined from the relevance function (Sect. 2.1). A study conducted by Branco et al. (2019) involved varying the relevance threshold from 0.5 to 1; nevertheless, the findings showed a complex relationship between the number of rare cases, the learning algorithm, and the applied pre-processing strategy. Therefore, our experiments considered a commonly used threshold (\(t_R\)) of 0.8, as used in Branco et al. (2017), Branco et al. (2019) and Branco et al. (2018). Thus, we obtained datasets with different percentages of rare cases (imbalance levels), varying between 5.1% and 23.4%. The main features of these sets are presented in Table 4, in descending order of the percentage of rare cases (%Rare). It is important to clarify that the counting of rare cases is conducted across the entire dataset, as commonly practiced in the literature, since this is crucial for comprehensively understanding their rarity within the data context and allows us to analyze the model’s behavior within the original context of the dataset. However, resampling strategies are applied only to the training set to prevent data leakage during cross-validation. The nominal attributes were encoded by transforming the vector of categories into integer values between 0 and the number of categories\(-1\). As for the ordinal attributes, a pre-defined order was established (e.g., small: 1, medium: 2, large: 3).

Table 4 Characteristics of the 30 datasets used in the experiments

For each dataset, the results were calculated by applying two repetitions of 10-fold cross-validation (i.e., \(2\times 10\) cross-validation) in order to obtain the mean and standard deviation of the results. Nested 2-fold cross-validation was employed to optimize the hyperparameters of the resampling strategies, using the SERA metric as the optimization criterion. The SERA metric was chosen because it was specifically created for imbalanced regression: it evaluates the models' performance in predicting extreme values, penalizes model biases without requiring a threshold, and conducts a global assessment (Ribeiro and Moniz 2020). Unlike the F1-score, which conducts a local assessment by considering only rare examples, SERA evaluates all examples.
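To make the evaluation protocol concrete, the outline below restricts resampling to the training portion of each fold, as described above. It is a sketch under simplifying assumptions: resample stands for any strategy from Sect. 3 (a callable returning balanced training data), sera is the metric sketched in Sect. 2.2.3, and the inner 2-fold hyperparameter search is omitted.

import numpy as np
from sklearn.model_selection import RepeatedKFold

def evaluate(make_model, resample, X, y, phi, n_splits=10, n_repeats=2, seed=0):
    """2x10-fold CV in which the resampling strategy is applied only to the
    training folds, so the test folds keep the original (imbalanced) distribution."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X):
        X_bal, y_bal = resample(X[train_idx], y[train_idx])   # balance training data only
        model = make_model().fit(X_bal, y_bal)
        y_hat = model.predict(X[test_idx])
        scores.append(sera(y[test_idx], y_hat, phi))
    return float(np.mean(scores)), float(np.std(scores))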

4.2 Algorithms

The experiments were performed with the following learning algorithms: Bagging (BG), Decision Tree (DT), Multilayer Perceptron (MLP), Random Forest (RF), Support Vector Machine (SVM), and XGBoost (XG). Default hyperparameters were applied for these models. For details and descriptions of default hyperparameters and used packages, refer to Online Appendix A.

As resampling techniques, we considered the following strategies: SmoteR (SMT), Random Over-sampling (RO), Random Under-sampling (RU), Introduction of Gaussian Noise (GN), SMOGN (SG), and the WEighted Relevance-based Combination Strategy (WERCS). Details about hyperparameters and packages can be found in Table 5.

Table 5 Resampling strategies, hyperparameters, and packages used

4.3 Model evaluation

In imbalanced tasks, choosing appropriate metrics for model evaluation is essential. This work uses the F1-score and SERA metrics to evaluate the regression models, allowing model performance to be assessed from different perspectives. While the F1-score metric is based on the concept of utility-based evaluation and performs a local assessment according to the definition of a relevance threshold, the SERA metric evaluates the effectiveness of models in predicting extreme values while penalizing model biases, without the need for a threshold, and performs a global assessment (Ribeiro and Moniz 2020). The results for the RMSE and MAE metrics can be consulted in the supplementary material (Online Appendix B) for benchmarking purposes.

5 Results

The experiments aimed at answering the following research questions:

  1. Is it worth using resampling strategies?

  2. Which resampling strategies influence the predictive performance the most?

  3. Does the choice of best strategy depend on the problem, the learning model, and the metrics used?

  4. Does the number of training examples resulting from each strategy influence the results?

  5. Do the features of the data (percentage of rare cases, number of rare cases, dataset size, number of attributes, and imbalance ratio) impact the predictive performance of the models?

Tables 6 and 7 show, for each learning algorithm, the number of times each resampling strategy achieved the best result according to the F1-score and SERA metrics, respectively. In the case of a tie, each of the \(n\) tied strategies receives \(1/n\) points, so each row in these tables sums to 30, the number of datasets assessed. For both metrics, the largest number of wins occurs when some resampling strategy is used, which points to an advantage of applying such strategies. As highlighted in bold in the tables, RO and GN obtained the highest number of wins according to the F1-score, and GN and WERCS according to SERA. Another observation is that the choice of the best strategy possibly depends on the regression model used. As for the metrics, both agree regarding the GN strategy. By inspecting the rows of Tables 6 and 7, it is also clear that there is no general agreement among the datasets on a single resampling strategy: each point corresponds to a dataset, and the points are spread across different strategies. The results per learning algorithm, including mean and standard deviation, can be accessed in the supplementary material (Online Appendix B).
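The tie-handling scheme can be made concrete with a short sketch. The nested dictionary `scores[dataset][strategy]` below is a hypothetical structure holding the metric value of each strategy on each dataset for a given learner; for SERA, lower values are better, so the comparison is reversed via the flag.

```python
from collections import defaultdict

def count_wins(scores, higher_is_better=True):
    """Fractional win counts: on each dataset, the n tied best
    strategies each receive 1/n point, so the totals sum to the
    number of datasets."""
    wins = defaultdict(float)
    for dataset, by_strategy in scores.items():
        values = by_strategy.values()
        best = max(values) if higher_is_better else min(values)
        tied = [s for s, v in by_strategy.items() if v == best]
        for s in tied:
            wins[s] += 1.0 / len(tied)
    return dict(wins)
```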

Table 6 Number of times each algorithm and resampling strategy achieved the best result according to the F1-score metric
Table 7 Number of times each algorithm and resampling strategy reached the best result according to the SERA metric

To identify the best way to preprocess each dataset, Tables 8 and 9 present the best and worst results for the F1-score and SERA metrics, respectively. The results show that most datasets have distinct preferences in terms of the combination of learning model and resampling strategy, and this distinction also holds across the metrics used. As for the worst results, SVR and MLP without preprocessing are the worst combinations for both metrics; thus, balancing the dataset before applying these models is crucial to reach more promising results. It is also important to note the significant difference between the best and worst results per problem, meaning that obtaining good results depends on the correct choice of resampling strategy and learning model. The SG strategy failed to run on the california, heat, and wine-quality datasets; these are large datasets, and the cost of optimizing the strategy's hyperparameters rendered its use impractical in these cases.

Table 8 Best and worst results for each dataset based on the F1-score metric
Table 9 Best and worst results for each dataset based on the SERA metric

We applied the Friedman test to better measure the advantage of using resampling strategies (\(p < 0.05\)). The Friedman test was chosen since it can compare multiple techniques over several datasets (Demšar 2006) by comparing their ranking sequences. Tables 10 and 11 present the average rank of each algorithm combined with each resampling strategy, considering the F1-score and SERA metrics; the lower the rank, the better the performance. For all algorithms, the differences are statistically significant. In general, the best average rank of each algorithm was obtained by using one of the resampling strategies evaluated in this work.
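For reference, the Friedman test and the average ranks can be computed with SciPy, as sketched below; `scores` is assumed to be a matrix with one row per dataset and one column per resampling strategy (e.g., the F1-scores of one learning algorithm).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_with_ranks(scores, higher_is_better=True):
    """scores: array of shape (n_datasets, n_strategies), one column per
    strategy (None, SMT, RO, RU, GN, SG, WERCS). Returns the Friedman
    statistic, its p-value, and the average rank of each strategy."""
    stat, p_value = friedmanchisquare(*scores.T)  # one sample per strategy
    ranks = rankdata(-scores if higher_is_better else scores, axis=1)
    return stat, p_value, ranks.mean(axis=0)      # lower rank is better
```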

Table 10 Average ranking (F1-score)
Table 11 Average ranking (SERA)

To verify which approaches are statistically different, we applied the Nemenyi post-hoc test. Figure 6a–f shows the critical difference diagrams (Demšar 2006) for each learning model, considering the F1-score metric. The horizontal bars connect strategies whose performances are not significantly different; strategies that are not connected differ significantly (\(p < 0.05\)). This test confirms once again that, globally, resampling strategies can significantly improve the regressors' performance. For the F1-score, the Nemenyi test reveals that RO obtained the best results and the most significant differences in relation to None (data without any preprocessing). In most cases, the SMT, SG, and RU techniques achieve the worst results. Figure 7a–f considers the SERA metric; in this scenario, most of the best results are obtained using the GN strategy, followed by WERCS, given the number of times each achieves the best position in the critical difference diagrams.
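The pairwise comparisons underlying these diagrams can be obtained, for instance, with the scikit-posthocs package. The sketch below uses placeholder random scores in place of the real results, assumes the package accepts a datasets-by-strategies matrix, and yields a matrix of pairwise p-values rather than the diagram itself.

```python
import numpy as np
import pandas as pd
import scikit_posthocs as sp

strategies = ["None", "SMT", "RO", "RU", "GN", "SG", "WERCS"]
# Placeholder scores: one row per dataset, one column per strategy.
scores = np.random.rand(30, len(strategies))
df = pd.DataFrame(scores, columns=strategies)

# Pairwise p-values of the Nemenyi post-hoc test following Friedman.
p_values = sp.posthoc_nemenyi_friedman(df)
significant = p_values < 0.05  # strategy pairs with a significant difference
```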

Fig. 6
figure 6

Critical difference diagrams for each learning algorithm considering the F1-score metric

Fig. 7
figure 7

Critical difference diagrams for each learning algorithm considering the SERA metric

Another interesting observation is how the different learning algorithms perform when no resampling strategy is applied. For both metrics, the DT model achieved the best results with the original datasets. In addition, the ensemble models Random Forest (RF) and XGBoost (XG) obtained better results than single models, corroborating the analysis conducted in Moniz et al. (2017). Conversely, the SVR and MLP algorithms obtained the worst results, especially when no preprocessing technique was employed. It can thus be concluded that these algorithms are the most affected by an imbalanced target distribution and require special attention when applied in the imbalanced regression context.

As described in Sect. 3, each resampling algorithm uses different heuristics to balance the dataset. Figure 8 illustrates the percentage increase/decrease in the number of training examples for each strategy. As concluded above, the best results were achieved using the RO, GN, and WERCS strategies. The GN and WERCS strategies produce small increases of 1.28% and 2.83%, respectively, whereas RO increases the training set by 1421.1%. Therefore, the influence of the number of examples on the results is unclear, since strategies with very different percentages of increase/decrease obtained good results. Nonetheless, it may be disadvantageous (from a training time point of view) to use a strategy such as RO that considerably increases the training set, as other strategies deliver satisfactory results without this excessive increase. More details about the number of instances after applying the resampling strategies can be found in the supplementary material (Online Appendix C).

Fig. 8
figure 8

Percentage of increase/decrease in the training set for each resampling strategy

Figures 9, 10, 11, 12 and 13 present the F1-score results arranged according to different dataset characteristics, in order to assess their impact on the performance of the models. The following characteristics were assessed: percentage of rare cases, number of rare cases, dataset size, number of attributes, and imbalance ratio. The imbalance ratio is calculated as the ratio between the number of rare cases (\(D_R\)) and the number of normal cases (\(D_N\)), i.e., \(\frac{|D_R|}{|D_N|}\). Each panel corresponds to a regression model (BG, DT, MLP, RF, SVR and XG), each point represents a dataset, and each line represents a resampling strategy (None, SMT, RO, RU, GN, SG and WERCS).

The results presented in Fig. 9 follow the same ordering as Table 4, with the datasets arranged in decreasing order of the percentage of rare cases. Under these conditions, no clear pattern emerges, so it is unclear how this characteristic relates to the models' performance. Figures 10 and 11 are arranged according to the number of rare cases and the dataset size, respectively; they reveal that smaller datasets with fewer rare cases represent the hardest tasks, as also observed in Branco et al. (2019). Figure 12 illustrates the evolution of the F1-score with the number of attributes in each dataset; in some instances, datasets with fewer features exhibit superior performance. Finally, in Fig. 13, the datasets are sorted according to their imbalance ratios. The regression models, with all resampling strategies, face greater challenges on datasets with higher imbalance ratios. This difficulty arises because a higher imbalance ratio means the rare cases are significantly underrepresented compared to the normal cases; as a result, the models may struggle to learn the underlying patterns and become biased toward the normal cases.

For all the evaluated dataset characteristics, the behavior of the resampling strategies is quite similar, resulting in overlapping lines in the graphs. For better clarity, an additional analysis considers only the best F1-score for each dataset: the figures in Online Appendix D present the best F1-score per dataset for each dataset characteristic, allowing us to visualize how these characteristics affect the performance of the top models. The percentage of rare cases does not exhibit a clear pattern, so it is difficult to conclude whether this characteristic affects model performance. Regarding the number of rare cases and the dataset size, models achieve better performance when there are more rare cases and a larger dataset. When considering the number of attributes, a higher number leads to better performance of the top models. As for the imbalance ratio, the higher the imbalance ratio, the worse the models' performance.

Fig. 9
figure 9

Evolution of the F1-score with datasets sorted by percentage of rare cases

Fig. 10
figure 10

Evolution of the F1-score with datasets sorted by number of rare cases

Fig. 11
figure 11

Evolution of the F1-score with datasets sorted by size

Fig. 12
figure 12

Evolution of the F1-score with datasets sorted by number of attributes

Fig. 13
figure 13

Evolution of the F1-score with datasets sorted by imbalance ratio

6 Lessons learned

Different approaches have been proposed to address the imbalance problem in the context of regression, including resampling strategies. Our research presented a review and an experimental study of the main resampling strategies for dealing with imbalanced regression problems. In this section, the research questions are revisited and answered succinctly.

  1. Is it worth using resampling strategies?

     We answer this question by accounting for the number of times each strategy won (Tables 6 and 7). For both metrics, four of the resampling strategies won more often than using no resampling strategy at all. Furthermore, the Nemenyi post-hoc statistical tests performed (Figs. 6 and 7) demonstrate that many resampling strategies are statistically better than not using any strategy. Therefore, it is advantageous to use (some) resampling strategies.

  2. Which resampling strategies influence the predictive performance the most?

     Considering the F1-score metric, the RO and GN strategies positively influenced the results of the learning algorithms the most; for the SERA metric, GN and WERCS are the best strategies. Statistically, the GN, RO, and WERCS strategies yielded the best results overall (Figs. 6 and 7). Conversely, in terms of predictive performance, the SMT, SG, and RU techniques achieved the worst results.

  3. Does the choice of best strategy depend on the dataset, the learning model, and the metrics used?

     Most of the datasets used have distinct preferences regarding the combination of the best regression model and resampling strategy (Tables 8 and 9). For the regression models, different resampling strategies can reach better results. As for the metrics, both agree that GN is a good resampling strategy; nonetheless, there are cases of disagreement between them.

  4. Does the number of training examples resulting from each strategy influence the results?

     Given that the best results were obtained using the GN, RO, and WERCS strategies, which have very different percentages of increase/decrease in the training examples (1.28%, 1421.1%, and 2.83%, respectively; Fig. 8), the influence of the number of examples on the results is not clear. Nonetheless, it may not be advantageous (from a training time point of view) to use a strategy like RO, which considerably increases the training set, as other strategies deliver equivalent results without this excessive increase.

  5. Do the features of the data (percentage of rare cases, number of rare cases, dataset size, number of attributes, and imbalance ratio) impact the predictive performance of the models?

     In the studies performed, the percentage of rare cases did not have a clear impact on the results. On the other hand, considering the dataset size and the number of rare cases, smaller datasets with fewer rare cases correspond to the most difficult tasks. Models demonstrate superior performance on datasets with fewer features. Lastly, concerning the imbalance ratio, regression models encounter greater challenges as the imbalance ratio increases. The results for this question are shown in Figs. 9, 10, 11, 12 and 13. Online Appendix D presents the evolution of the best F1-score for each dataset characteristic, providing a clearer view of the impact of these characteristics on model performance.

7 Conclusion

This work reviews and performs a comparative study of data resampling strategies for handling imbalanced regression problems. We reviewed six state-of-the-art resampling strategies for regression, based on three approaches: (i) under-sampling, (ii) over-sampling, and (iii) a combination of under- and over-sampling, while discussing the advantages and drawbacks of each technique.

Then, we performed an extensive experimental analysis comprising 6 regression algorithms and 7 scenarios (6 resampling strategies and no resampling) that can guide the development of new strategies for the imbalanced regression problem. Our experimental results demonstrate that using a resampling technique is important for most models, as resampling techniques lead to statistically better results. The experimental study also shows that no resampling technique outperforms all others. Furthermore, choosing the best resampling technique depends on three main factors: the learning algorithm, the dataset, and the performance metric used to assess the model's performance.

Further studies should address the recommendation of a suitable combination of resampling strategy and regression model for each specific dataset. Another element worth addressing is the dataset characteristics, which should be investigated through data complexity measures (Lorena et al. 2018) in order to assess the adverse effects of these features on prediction performance. Moreover, an essential point to address is the proposal of new relevance functions, since only one definition currently exists; this would enable further studies and comparisons regarding the definition of an imbalanced regression dataset.