Information Sciences, Volume 195, 15 July 2012, Pages 154-174

Training regression ensembles by sequential target correction and resampling

https://doi.org/10.1016/j.ins.2012.01.035

Abstract

Ensemble methods learn models from examples by generating a set of hypotheses, which are then combined to make a single decision. We propose an algorithm to construct an ensemble for regression estimation. Our proposal generates the hypotheses sequentially using a simple procedure whereby the target map to be learned by the base learner at each step is modified as a function of the previous step error. We state a theorem that relates the overall upper error bound of the composite hypothesis obtained within this procedure to the training errors of the individual hypotheses. We also demonstrate that the proposed procedure results in a learning functional that enforces a weighted form of Negative Correlation with respect to previous hypotheses. Additionally, we incorporate resampling to allow the ensemble to control the impact of highly influential data points, showing that this component significantly improves its ability to generalize from the known examples. We describe experiments performed to evaluate our technique on real and synthetic datasets using neural networks as base learners. These results show that our technique exhibits considerably better prediction errors than the Negative Correlation (NC) method and that its performance is very competitive with that of the Bagging and AdaBoost algorithms for regression estimation.

Introduction

Ensemble methods [31], [14], [27], [5] have been demonstrated to be a powerful and flexible way to improve the performance of a base learning algorithm in a variety of machine learning scenarios, including classification [45], [41], [19], regression [7], [8], novelty detection [32], time series forecasting [46] and clustering [10], [37]. The basic idea consists of extracting a model from data by combining a set of simple models that are constructed and organized to achieve a desired goal. The Bagging [4], AdaBoost [34], Mixture of Experts [17], Stacking [44] and Negative Correlation [22] algorithms, as well as their many variations, are well-known examples of this class of methods.

The concept of diversity is commonly used by the ensemble community to denote the differences among the individual components to be combined. Because the replication of multiple exact copies of the same model does not provide an advantage over the use of a single instance of such a model, much research has been directed to the definition of strategies to measure and generate diversity in a useful way [6], [20]. The most commonly investigated method to promote diversity among the models in an ensemble is probably the manipulation of the training data used to build the individual models. The Bagging [4] and AdaBoost [34] algorithms, for example, build these models by resampling the original dataset in a first step and then training a base learning algorithm with the obtained datasets. Another approach investigated to achieve diversity in the ensemble is to use different learning functionals to construct the component models, changing the individual learning goals in terms of the original learning task and the state of the other components in the system. Negative Correlation (NC) [7], [22], for example, trains each model with a learning functional that explicitly controls the tradeoff between individual accuracy and the contribution to a measure of diversity in the ensemble. In [7], Brown et al. showed that NC is competitive with Bagging and AdaBoost in regression estimation scenarios.
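For concreteness, the following is a minimal NumPy sketch of the NC penalty in the Liu-and-Yao-style formulation referenced above; the member loss, the constant lam, and the array shapes are illustrative assumptions rather than the paper's exact notation.

```python
import numpy as np

def nc_penalty(preds, i):
    """Negative Correlation penalty for ensemble member i (illustrative sketch).

    preds: array of shape (T, n) holding the predictions of the T current
    members on n training inputs. In the NC formulation,
    p_i = (f_i - fbar) * sum_{j != i} (f_j - fbar); because the deviations
    from the simple average sum to zero, this equals -(f_i - fbar)**2.
    """
    fbar = preds.mean(axis=0)        # composite (simple-average) prediction
    dev_i = preds[i] - fbar          # deviation of member i from the ensemble
    return -dev_i ** 2               # per-example NC penalty

def nc_member_loss(preds, i, y, lam=0.5):
    """Member i's training objective: squared error plus lam times the NC penalty."""
    return np.mean(0.5 * (preds[i] - y) ** 2 + lam * nc_penalty(preds, i))
```

The constant lam plays the role of the tradeoff parameter mentioned above: lam = 0 reduces each member to independent least-squares training, while larger values push members away from the composite prediction.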

This paper presents a new algorithm for building ensembles in regression problems, where the composite predictor is obtained by following a stage-wise additive modeling scheme. At each step, the target map to be learned by the base learner is modified as a function of the previous step error by using a very simple analytic rule. We show that this method is equivalent to using a learning functional similar to that employed by NC [7], [22] to measure and generate diversity in the ensemble. However, in NC, the diversity is measured as a kind of correlation between individual predictions. Specifically, the correlations of the differences between the individual and composite predictions are used to encourage diversity. Our functional, in contrast, uses the differences between the individual predictions and the original target map. This way of measuring diversity in the system encourages only those differences among the individual hypotheses that compensate for the current errors of the ensemble. In addition, this functional weights the correlations between the current and past hypotheses such that the correlations with the more recent hypotheses have a greater weight.
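As a rough illustration of this stage-wise idea, and not the paper's exact rule (which is derived in Section 4), the sketch below corrects the target with a hypothetical residual-style update controlled by a parameter kappa and combines the members by simple averaging; MLPRegressor stands in for the neural-network base learners used in the experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the neural-network base learner

def fit_sequential_ensemble(X, y, T=10, kappa=0.5, seed=0):
    """Stage-wise ensemble with target correction (illustrative sketch only).

    The actual correction rule is the one derived in Section 4 of the paper;
    here we use a hypothetical residual-style rule in which the target for
    step t+1 is the original target plus kappa times the error made at step t.
    """
    members, target = [], y.astype(float)
    for t in range(T):
        h = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000,
                         random_state=seed + t).fit(X, target)
        members.append(h)
        residual = target - h.predict(X)   # error of the current step
        target = y + kappa * residual      # hypothetical corrected target for the next step
    return members

def predict_ensemble(members, X):
    """Combine the members; simple averaging is used here purely for illustration."""
    return np.mean([h.predict(X) for h in members], axis=0)
```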

Our experiments show that, compared with Negative Correlation, the proposed procedure can considerably increase the ability of the ensemble to improve the performance of the base learner. On the theoretical side, we provide two results that explicitly relate the overall upper error bound of the composite hypothesis obtained within this procedure to the training errors of the individual hypotheses, thereby demonstrating how the former can be monotonically improved as the number of predictors in the ensemble increases. We also discuss how the stage-wise scheme to modify the target map to be learned by the base learner could increase the leverage effect of highly influential points in the regression, thereby compromising the generalization ability of the ensemble. To address this issue we propose the use of resampling as a preliminary step to generate the individual components. In [13], experimental evidence is presented to support the hypothesis that Bagging stabilizes prediction by equalizing the influence of training examples. In many situations, highly influential points are outliers, and their down-weighting could help to produce more robust predictions. As with Bagging, we show that resampling the dataset before each learning step allows the ensemble to control the impact of highly influential points, which, in turn, improves its ability to generalize from examples.
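A minimal sketch of the resampling component, assuming standard bootstrap sampling with replacement as in Bagging; in the earlier sketch, the resampled pair would replace (X, target) before fitting each member.

```python
import numpy as np

def bootstrap_resample(X, y, rng=None):
    """Bootstrap sample drawn before each learning round (as in Bagging).

    Each example is included in a given sample with probability roughly
    1 - 1/e (about 0.63), so a highly influential point is absent from
    about a third of the rounds, which limits its leverage on the ensemble.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]
```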

This article is organized as follows. In Section 2 we state some basic definitions about supervised learning from examples. We then present some background on ensemble learning in Section 3, discussing the key ideas behind diversity generation, Negative Correlation and resampling. In Section 4, we introduce the model proposed for ensemble learning and show that it has the above properties. In Section 5 we present the experimental results. Finally, Section 6 provides the conclusions of this work.

Section snippets

Supervised learning

In supervised learning [24], [42], [29], [15], we are given a set of examples S = {(x_1, y_1), …, (x_n, y_n)}, where x_i ∈ X models an input and y_i ∈ Y corresponds to the desired response or output of x_i. Our goal is to implement a hypothesis f : X → Y to predict the output associated with a new input x ∈ X. The performance is measured by a loss function ℓ : Y × Y → ℝ that penalizes prediction errors. If the input/output pairs are outcomes of a distribution function P(x, y), a reasonable criterion is to choose the
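The snippet is cut off here by the publisher's preview. The standard criterion it alludes to, stated for completeness and not verbatim from the paper, is to choose the hypothesis minimizing the expected risk, which in practice is approximated by the empirical risk over S:

```latex
% Expected risk of a hypothesis f under the data distribution P(x, y)
R[f] = \int_{X \times Y} \ell\bigl(f(x),\, y\bigr)\, \mathrm{d}P(x, y)

% Empirical risk over the sample S = {(x_1, y_1), \ldots, (x_n, y_n)}
R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr)
```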

Ensemble methods

Ensemble algorithms address the problem of learning the unknown function f_0 : X → Y by combining a set of simple estimators f_1, f_2, …, f_T, with f_i : X → Y, instead of directly designing a more complex hypothesis in one single step. Two components are therefore required: a method to build each hypothesis and a method to aggregate them. The set of hypotheses can be obtained, for example, using a common learner L which, provided with the correct examples and implementing a given loss function, returns a target
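This snippet is also truncated by the preview. Regarding the aggregation component it mentions, a weighted average is one standard combination rule for regression ensembles (uniform weights recover the simple average); the helper below is a sketch under that assumption.

```python
import numpy as np

def aggregate_predictions(hypotheses, X, weights=None):
    """Aggregate T simple estimators into one composite prediction.

    A (possibly weighted) average is a standard aggregation rule for
    regression ensembles; uniform weights give the simple average used,
    for example, by Bagging.
    """
    preds = np.stack([f.predict(X) for f in hypotheses])   # shape (T, n)
    if weights is None:
        weights = np.full(len(hypotheses), 1.0 / len(hypotheses))
    return np.asarray(weights) @ preds                      # convex combination over members
```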

The proposed approach

We address the problem of constructing an ensemble as a sequence of learning rounds, where each one attempts to correct the errors incurred by the previous step. This is, for example, the paradigm of the AdaBoost algorithm [35], which was designed for classification and extended for regression in [8]. In AdaBoost, the weights of the training examples are modified at each step to focus the new components of the ensemble on the examples that were incorrectly learned by the previous components.
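For reference, here is a sketch of the example re-weighting step in the spirit of Drucker's AdaBoost.R2 regression extension mentioned above; the linear loss and variable names are illustrative, and variants differ in how the per-example loss is normalized.

```python
import numpy as np

def adaboost_r2_weight_update(w, errors):
    """One re-weighting step in the spirit of AdaBoost.R2 (illustrative sketch).

    w:      current example weights (summing to 1)
    errors: absolute prediction errors of the newly trained member

    Examples with small loss get their weight multiplied by beta < 1, so the
    next member concentrates on the examples that were learned poorly.
    """
    loss = errors / errors.max()            # linear loss, normalized to [0, 1]
    avg_loss = np.sum(w * loss)             # weighted average loss of the new member
    beta = avg_loss / (1.0 - avg_loss)      # confidence measure of the new member
    w = w * beta ** (1.0 - loss)            # down-weight well-learned examples
    return w / w.sum(), beta                # renormalize; beta also weights the member's vote
```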

Experimental results and comparisons

In this section, we present different experiments designed to evaluate the characteristics of the proposed method. Section 5.1 is devoted to studying the behavior of the smoothing strategy introduced in Section 4.2. We investigate the effect of the parameter κ on the training and expected performance of the algorithm. The values of κ that provide the lowest cross-validation errors are selected for further experimentation. The purpose of Section 5.2 is to contrast the training and testing error
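A minimal sketch of how such a cross-validation selection of κ could be organized; build_and_eval is a hypothetical callback, assumed to train the target-corrected ensemble with a given κ and return its held-out mean squared error.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_kappa(build_and_eval, X, y, kappa_grid=(0.1, 0.25, 0.5, 0.75, 1.0),
                 n_splits=5, seed=0):
    """Choose kappa by k-fold cross-validation (illustrative sketch).

    build_and_eval(kappa, X_tr, y_tr, X_te, y_te) is a user-supplied function
    that trains the ensemble with smoothing parameter kappa and returns the
    mean squared error on the held-out fold.
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for kappa in kappa_grid:
        fold_errs = [build_and_eval(kappa, X[tr], y[tr], X[te], y[te])
                     for tr, te in cv.split(X)]
        scores[kappa] = float(np.mean(fold_errs))   # average CV error for this kappa
    best = min(scores, key=scores.get)               # kappa with the lowest CV error
    return best, scores
```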

Conclusions and final remarks

We have studied a new method to construct ensembles in regression scenarios following a stage-wise additive modeling scheme with two main components. The first component is a method to correct the target map to be learnt at each step as a function of the previous step error. The second component is resampling.

It has been shown that the target correction scheme can monotonically improve a bound for the training error of the ensemble, provided that the base learner cannot incur arbitrarily large

Acknowledgements

This work was supported by the following research grants: Fondecyt 1110854, FB0821 Centro Científico Tecnológico de Valparaíso, and the Foundation for the Advancement of Soft Computing (Mieres, Spain). Partial support was also received from CONICYT (Chile) Ph.D. Grant 21080414. The authors thank the reviewers for their comments, which helped to improve the manuscript.

References (46)

  • G. Brown et al., Managing diversity in regression ensembles, Journal of Machine Learning Research (2005).
  • H. Drucker, Improving regressors using boosting techniques, in: Fourteenth International Conference on Machine...
  • S.E. Fahlman, C. Lebiere, The cascade-correlation learning architecture, in: D.S. Touretzky (Ed.), Advances in Neural...
  • X.Z. Fern et al., Solving cluster ensemble problems by bipartite graph partitioning.
  • J.H. Friedman, Multivariate adaptive regression splines, Annals of Statistics (1991).
  • Y. Grandvalet, Bagging down-weights leverage points, in: IJCNN, vol. IV, 2000, pp....
  • Y. Grandvalet, Bagging equalizes influence, Machine Learning (2004).
  • M. Haindl, J. Kittler, F. Roli (Eds.), Multiple Classifier Systems, 7th International Workshop, MCS 2007, Prague, Czech...
  • P. Huber, Robust Statistics, Wiley Series in Probability and Mathematical Statistics,...
  • R. Jacobs et al., Adaptive mixtures of local experts, Neural Computation (1991).
  • A. Krogh et al., Neural network ensembles, cross-validation and active learning, Neural Information Processing Systems (1995).
  • L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (2004).