Information Sciences, Volume 195, 15 July 2012, Pages 154-174

Training regression ensembles by sequential target correction and resampling

https://doi.org/10.1016/j.ins.2012.01.035

Abstract

Ensemble methods learn models from examples by generating a set of hypotheses, which are then combined to make a single decision. We propose an algorithm to construct an ensemble for regression estimation. Our proposal generates the hypotheses sequentially using a simple procedure whereby the target map to be learned by the base learner at each step is modified as a function of the previous step error. We state a theorem that relates the overall upper error bound of the composite hypothesis obtained within this procedure to the training errors of the individual hypotheses. We also demonstrate that the proposed procedure results in a learning functional that enforces a weighted form of Negative Correlation with respect to previous hypotheses. Additionally, we incorporate resampling to allow the ensemble to control the impact of highly influential data points, showing that this component significantly improves its ability to generalize from the known examples. We describe experiments performed to evaluate our technique on real and synthetic datasets using neural networks as base learners. These results show that our technique exhibits considerably better prediction errors than the Negative Correlation (NC) method and that its performance is very competitive with that of the Bagging and AdaBoost algorithms for regression estimation.

Introduction

Ensemble methods [31], [14], [27], [5] have been demonstrated to be a powerful and flexible way to improve the performance of a base learning algorithm in a variety of machine learning scenarios, including classification [45], [41], [19], regression [7], [8], novelty detection [32], time series forecasting [46] and clustering [10], [37]. The basic idea consists of extracting a model from data by combining a set of simple models that are constructed and organized to achieve a desired goal. The Bagging [4], AdaBoost [34], Mixture of Experts [17], Stacking [44] and Negative Correlation [22] algorithms, as well as their many variations, are well-known examples of this class of methods.

The concept of diversity is commonly used by the ensemble community to denote the differences among the individual components to be combined. Because the replication of multiple exact copies of the same model does not provide an advantage over the use of a single instance of such a model, much research has been directed to the definition of strategies to measure and generate diversity in a useful way [6], [20]. The most commonly investigated method to promote diversity among the models in an ensemble is probably the manipulation of the training data used to build the individual models. The Bagging [4] and AdaBoost [34] algorithms, for example, build these models by resampling the original dataset in a first step and then training a base learning algorithm with the obtained datasets. Another approach investigated to achieve diversity in the ensemble is to use different learning functionals to construct the component models, changing the individual learning goals in terms of the original learning task and the state of the other components in the system. Negative Correlation (NC) [7], [22], for example, trains each model with a learning functional that explicitly controls the tradeoff between individual accuracy and the contribution to a measure of diversity in the ensemble. In [7], Brown et al. showed that NC is competitive with Bagging and AdaBoost in regression estimation scenarios.
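For concreteness, the following is a minimal NumPy sketch of the NC penalty in the Liu-and-Yao-style formulation referenced above; the member loss, the constant lam, and the array shapes are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def nc_penalty(preds, i):
    """Negative Correlation penalty for ensemble member i (illustrative sketch).

    preds: array of shape (T, n) holding the predictions of the T current
    members on n training inputs. In the NC formulation,
    p_i = (f_i - fbar) * sum_{j != i} (f_j - fbar); because the deviations
    from the simple average sum to zero, this equals -(f_i - fbar)**2.
    """
    fbar = preds.mean(axis=0)        # composite (simple-average) prediction
    dev_i = preds[i] - fbar          # deviation of member i from the ensemble
    return -dev_i ** 2               # per-example NC penalty

def nc_member_loss(preds, i, y, lam=0.5):
    """Member i's training objective: squared error plus lam times the NC penalty."""
    return np.mean(0.5 * (preds[i] - y) ** 2 + lam * nc_penalty(preds, i))
```

The constant lam plays the role of the tradeoff parameter mentioned above: lam = 0 reduces each member to independent least-squares training, while larger values push members away from the composite prediction.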

This paper presents a new algorithm for building ensembles in regression problems, where the composite predictor is obtained by following a stage-wise additive modeling scheme. At each step, the target map to be learned by the base learner is modified as a function of the previous step error by using a very simple analytic rule. We show that this method is equivalent to using a learning functional similar to that employed by NC [7], [22] to measure and generate diversity in the ensemble. However, in NC, the diversity is measured as a kind of correlation between individual predictions. Specifically, the correlations of the differences between the individual and composite predictions are used to encourage diversity. Our functional, in contrast, uses the differences between the individual predictions and the original target map. This way of measuring diversity in the system encourages only those differences among the individual hypotheses that compensate for the current errors of the ensemble. In addition, this functional weights the correlations between the current and past hypotheses such that the correlations with the more recent hypotheses have a greater weight.
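As a rough illustration of this stage-wise idea, and not the paper's exact rule (which is derived in Section 4), the sketch below corrects the target with a hypothetical residual-style update controlled by a parameter kappa and combines the members by simple averaging; MLPRegressor stands in for the neural-network base learners used in the experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the neural-network base learner

def fit_sequential_ensemble(X, y, T=10, kappa=0.5, seed=0):
    """Stage-wise ensemble with target correction (illustrative sketch only).

    The actual correction rule is the one derived in Section 4 of the paper;
    here we use a hypothetical residual-style rule in which the target for
    step t+1 is the original target plus kappa times the error made at step t.
    """
    members, target = [], y.astype(float)
    for t in range(T):
        h = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=seed + t).fit(X, target)
        members.append(h)
        residual = target - h.predict(X)   # error of the current step
        target = y + kappa * residual      # hypothetical corrected target for the next step
    return members

def predict_ensemble(members, X):
    """Combine the members; simple averaging is used here purely for illustration."""
    return np.mean([h.predict(X) for h in members], axis=0)
```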

Our experiments show that, compared with Negative Correlation, the proposed procedure can considerably increase the ability of the ensemble to improve the performance of the base learner. On the theoretical side, we provide two results that explicitly relate the overall upper error bound of the composite hypothesis obtained within this procedure to the training errors of the individual hypotheses, thereby demonstrating how the former can be monotonically improved as the number of predictors in the ensemble increases. We also discuss how the stage-wise scheme to modify the target map to be learned by the base learner could increase the leverage effect of highly influential points in the regression, thereby compromising the generalization ability of the ensemble. To address this issue we propose the use of resampling as a preliminary step to generate the individual components. In [13], experimental evidence is presented to support the hypothesis that Bagging stabilizes prediction by equalizing the influence of training examples. In many situations, highly influential points are outliers, and their down-weighting could help to produce more robust predictions. As with Bagging, we show that resampling the dataset before each learning step allows the ensemble to control the impact of highly influential points, which, in turn, improves its ability to generalize from examples.
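A minimal sketch of the resampling component, assuming standard bootstrap sampling with replacement as in Bagging; in the earlier sketch, the resampled pair would replace (X, target) before fitting each member.

```python
import numpy as np

def bootstrap_resample(X, y, rng=None):
    """Bootstrap sample drawn before each learning round (as in Bagging).

    Each example is included in a given sample with probability roughly
    1 - 1/e (about 0.63), so a highly influential point is absent from
    about a third of the rounds, which limits its leverage on the ensemble.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]
```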

This article is organized as follows. In Section 2 we state some basic definitions about supervised learning from examples. We then present some background on ensemble learning in Section 3, discussing the key ideas behind diversity generation, Negative Correlation and resampling. In Section 4, we introduce the model proposed for ensemble learning and show that it has the above properties. In Section 5 we present the experimental results. Finally, Section 6 provides the conclusions of this work.

Section snippets

Supervised learning

In supervised learning [24], [42], [29], [15], we are given a set of examples S = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ X models an input and y_i ∈ Y corresponds to the desired response or output of x_i. Our goal is to implement a hypothesis f : X → Y to predict the output associated with a new input x ∈ X. The performance is measured by a loss function ℓ : Y × Y → ℝ that penalizes prediction errors. If the input/output pairs are outcomes of a distribution function P(x, y), a reasonable criterion is to choose the
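The snippet is cut off here by the publisher's preview. The standard criterion it alludes to, stated for completeness and not verbatim from the paper, is to choose the hypothesis minimizing the expected risk, which in practice is approximated by the empirical risk over S:

```latex
% Expected risk of a hypothesis f under the data distribution P(x, y)
R[f] = \int_{X \times Y} \ell\bigl(f(x),\, y\bigr)\, \mathrm{d}P(x, y)

% Empirical risk over the sample S = {(x_1, y_1), \ldots, (x_n, y_n)}
R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr)
```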

Ensemble methods

Ensemble algorithms address the problem of learning the unknown function f_0 : X → Y by combining a set of simple estimators f_1, f_2, …, f_T, with f_i : X → Y, instead of directly designing a more complex hypothesis in one single step. Two components are therefore required: a method to build each hypothesis and a method to aggregate them. The set of hypotheses can be obtained, for example, using a common learner L which, provided with the correct examples and implementing a given loss function, returns a target
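This snippet is also truncated by the preview. Regarding the aggregation component it mentions, a weighted average is one standard combination rule for regression ensembles (uniform weights recover the simple average); the helper below is a sketch under that assumption.

```python
import numpy as np

def aggregate_predictions(hypotheses, X, weights=None):
    """Aggregate T simple estimators into one composite prediction.

    A (possibly weighted) average is a standard aggregation rule for
    regression ensembles; uniform weights give the simple average used,
    for example, by Bagging.
    """
    preds = np.stack([f.predict(X) for f in hypotheses])   # shape (T, n)
    if weights is None:
        weights = np.full(len(hypotheses), 1.0 / len(hypotheses))
    return np.asarray(weights) @ preds                      # convex combination over members
```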

The proposed approach

We address the problem of constructing an ensemble as a sequence of learning rounds, where each one attempts to correct the errors incurred by the previous step. This is, for example, the paradigm of the AdaBoost algorithm [35], which was designed for classification and extended for regression in [8]. In AdaBoost, the weights of the training examples are modified at each step to focus the new components of the ensemble on the examples that were incorrectly learned by the previous components.
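For reference, here is a sketch of the example re-weighting step in the spirit of Drucker's AdaBoost.R2 regression extension mentioned above; the linear loss and variable names are illustrative, and variants differ in how the per-example loss is normalized.

```python
import numpy as np

def adaboost_r2_weight_update(w, errors):
    """One re-weighting step in the spirit of AdaBoost.R2 (illustrative sketch).

    w:      current example weights (summing to 1)
    errors: absolute prediction errors of the newly trained member

    Examples with small loss get their weight multiplied by beta < 1, so the
    next member concentrates on the examples that were learned poorly.
    """
    loss = errors / errors.max()            # linear loss, normalized to [0, 1]
    avg_loss = np.sum(w * loss)             # weighted average loss of the new member
    beta = avg_loss / (1.0 - avg_loss)      # confidence measure of the new member
    w = w * beta ** (1.0 - loss)            # down-weight well-learned examples
    return w / w.sum(), beta                # renormalize; beta also weights the member's vote
```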

Experimental results and comparisons

In this section, we present different experiments designed to evaluate the characteristics of the proposed method. Section 5.1 is devoted to studying the behavior of the smoothing strategy introduced in Section 4.2. We investigate the effect of the parameter κ on the training and expected performance of the algorithm. The values of κ that provide the lowest cross-validation errors are selected for further experimentation. The purpose of Section 5.2 is to contrast the training and testing error
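A minimal sketch of how such a cross-validation selection of κ could be organized; build_and_eval is a hypothetical callback, assumed to train the target-corrected ensemble with a given κ and return its held-out mean squared error.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_kappa(build_and_eval, X, y, kappa_grid=(0.1, 0.25, 0.5, 0.75, 1.0),
                 n_splits=5, seed=0):
    """Choose kappa by k-fold cross-validation (illustrative sketch).

    build_and_eval(kappa, X_tr, y_tr, X_te, y_te) is a user-supplied function
    that trains the ensemble with smoothing parameter kappa and returns the
    mean squared error on the held-out fold.
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for kappa in kappa_grid:
        fold_errs = [build_and_eval(kappa, X[tr], y[tr], X[te], y[te])
                     for tr, te in cv.split(X)]
        scores[kappa] = float(np.mean(fold_errs))   # average CV error for this kappa
    best = min(scores, key=scores.get)               # kappa with the lowest CV error
    return best, scores
```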

Conclusions and final remarks

We have studied a new method to construct ensembles in regression scenarios following a stage-wise additive modeling scheme with two main components. The first component is a method to correct the target map to be learnt at each step as a function of the previous step error. The second component is resampling.

It has been shown that the target correction scheme can monotonically improve a bound for the training error of the ensemble, provided that the base learner cannot incur arbitrarily large

Acknowledgements

This work was supported by the following research grants: Fondecyt 1110854, FB0821 Centro Científico Tecnológico de Valparaíso, and the Foundation for the Advancement of Soft Computing (Mieres, Spain). Partial support was also received from CONICYT (Chile) Ph.D. Grant 21080414. The authors thank the reviewers for their comments, which helped to improve the manuscript.

References (46)

  • G. Brown et al., Managing diversity in regression ensembles, Journal of Machine Learning Research (2005).
  • H. Drucker, Improving regressors using boosting techniques, in: Fourteenth International Conference on Machine...
  • S.E. Fahlman, C. Lebiere, The cascade-correlation learning architecture, in: D.S. Touretzky (Ed.), Advances in Neural...
  • X.Z. Fern et al., Solving cluster ensemble problems by bipartite graph partitioning.
  • J.H. Friedman, Multivariate adaptive regression splines, Annals of Statistics (1991).
  • Y. Grandvalet, Bagging down-weights leverage points, in: IJCNN, vol. IV, 2000, pp....
  • Y. Grandvalet, Bagging equalizes influence, Machine Learning (2004).
  • M. Haindl, J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, 7th International Workshop, MCS 2007, Prague, Czech...
  • P. Huber, Robust Statistics, Wiley Series in Probability and Mathematical Statistics,...
  • R. Jacobs et al., Adaptive mixtures of local experts, Neural Computation (1991).
  • A. Krogh et al., Neural network ensembles, cross-validation and active learning, Neural Information Processing Systems (1995).
  • L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (2004).