Pattern Recognition

Volume 36, Issue 6, June 2003, Pages 1303-1309

Double-bagging: combining classifiers by bootstrap aggregation

https://doi.org/10.1016/S0031-3203(02)00169-3

Abstract

The combination of classifiers leads to substantial reductions of misclassification error in a wide range of applications and benchmark problems. We suggest using an out-of-bag sample for combining different classifiers. In our setup, a linear discriminant analysis is performed using the observations in the out-of-bag sample, and the corresponding discriminant variables computed for the observations in the bootstrap sample are used as additional predictors for a classification tree. Because the two classifiers are combined directly, neither method nor variable selection bias affects the corresponding estimate of misclassification error, and the need for an additional test sample disappears. Moreover, the procedure performs comparably to the best classifiers used in a number of artificial examples and applications.

Introduction

The construction of a good classifier based on a learning sample can be seen as a three-step procedure: learning different rules, selecting an optimal one, and estimating its misclassification error. Often only a small learning sample is available and all three steps have to be performed using this learning sample alone. It is well known that selecting the classification rule with minimum estimated misclassification error leads to biased estimates of its performance. Even with efficient estimates of misclassification error such as the 0.632+ bootstrap estimator of Efron and Tibshirani [1], the minimum of several estimators of misclassification error is a downward-biased estimate of the true error rate.

However, different rules have to be taken into account. In many applications, simple rules such as naive Bayes, nearest neighbors or linear discriminant analysis (LDA) perform comparably to more advanced classifiers (see Ref. [2] for a discussion). Clearly, the performance of a classifier depends on how well the underlying model represents the data, and consequently the performance of a rule can only be investigated under model assumptions. Ref. [3] suggests a strategy where different classifiers are compared by a simulation model derived from laser scanning image data. Lausen and Schumacher [4] discuss the bias of an optimally selected effect estimator in a simulation of a simple cutpoint model. Here, we suggest a classification procedure that combines two methods directly and needs no method selection.

LDA and classification trees (CTREE, cf. Ref. [5]) are somewhat extreme models. LDA assumes a spherical distribution of the predictors in each class, so that the classes are separable by hyperplanes in the sample space. In contrast, classification trees are non-parametric, i.e. they do not assume a special distribution of the predictors. CTREE searches for partitions of the multivariate sample space, which may be seen as higher-order interactions or homogeneous subgroups defined by some combination of binary splits of the predictors. Consequently, we combine both ideas. Breiman et al. [5, p. 16] noted that it is “… of surprise that [LDA] does as well as it does …” and consequently suggested investigating linear combinations of the predictors in each node.

The combination of different classifiers leads to a substantial reduction of misclassification error in many applications. Bagging [6], [7] and boosting (see, e.g., Ref. [8]) have raised a lot of interest. Kuncheva et al. [9] compare several kinds of aggregation. Saranli and Demirekler [10] study rank-based combination. The dynamic selection of classifiers is investigated by Giacinto and Roli [11]. Although known for a long time, LDA is still being improved; e.g., Du and Chang [12] and Yu and Yang [13] discuss recent proposals for special applications. Bootstrap aggregation of LDA was studied by Skurichina and Duin [14], [15]. Their investigations show that LDA can take advantage of bootstrap aggregation in situations where LDA is unstable. Skurichina and Duin [14] show that one such situation arises when only a small number of observations but a large number of predictors is available. Lausen et al. [16] propose a P-value adjusted method for classification and regression trees and avoid a variable selection bias for different measurement scales.

We suggest “double-bagging” to deal with the problems of variable and method selection bias. Approximately one-third of the observations are not part of a given bootstrap sample in bagging. Breiman [17] calls those observations “out of bag”. In our framework, the out-of-bag sample is used to estimate the coefficients of a linear discriminant function. The corresponding linear discriminant variables computed for the bootstrap sample are used as additional predictors for the classification trees, which thus allow for a linear separation of the classes. This method performs comparably to LDA when the classes are linearly separable and comparably to bagging if the classes can be identified by partitions. Therefore, only the misclassification error of the combined classifier has to be estimated, and this estimate does not suffer from over-optimism due to method selection bias. Moreover, double-bagging improves on the best classifier studied in several experiments.
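A single double-bagging iteration can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the toy data, variable names, and the use of Fisher's two-class discriminant direction are all assumptions made for the sketch, and the final tree-growing step is only indicated by a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy two-class data: 40 observations, 3 predictors.
n, p = 40, 3
X = rng.normal(size=(n, p))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(int)

# One bagging iteration: draw a bootstrap sample with replacement;
# the observations never drawn form the out-of-bag (OOB) sample,
# about one-third of the data on average.
boot = rng.integers(0, n, size=n)
oob = np.setdiff1d(np.arange(n), boot)

# Fisher's linear discriminant fitted on the OOB sample only:
# w = S^{-1} (mean_1 - mean_0), with S the pooled within-class covariance.
X0, X1 = X[oob][y[oob] == 0], X[oob][y[oob] == 1]
S = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
w = np.linalg.solve(S, X1.mean(axis=0) - X0.mean(axis=0))

# The discriminant variable computed for the *bootstrap* observations
# is appended as an additional predictor; a classification tree would
# then be grown on this augmented bootstrap sample.
X_aug = np.column_stack([X[boot], X[boot] @ w])
```

The key point is that the discriminant coefficients are estimated on observations disjoint from the bootstrap sample on which the tree is grown, so no extra test sample is consumed.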

The procedure is introduced in detail in Section 2. We illustrate its performance using a model of laser scanning image data of the eye background for glaucoma classification in Section 3. Additionally, six benchmark problems are used in Section 4 to compare the performance of double-bagging of LDA and classification trees with recent proposals.


Double-bagging

Let L = {(y_i, x_i), i = 1, …, N} denote a learning sample of N independent observations consisting of p-dimensional vectors of predictors x_i = (x_i1, …, x_ip) ∈ R^p and class labels y_i ∈ {1, …, J}. The observations in the learning set are a random sample from some distribution function F:

(y_1, x_1), …, (y_N, x_N) ~ iid F.

A classifier C(x, L) predicts future y-values for a vector of predictors x based on a learning sample L. The aggregated classifier C_A is given by

C_A(x) = E_F C(x, L),

where the expectation is over learning samples L
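In practice the expectation over learning samples is approximated by a majority vote over classifiers grown on B bootstrap samples. The sketch below illustrates this with a deliberately simple base classifier (a decision stump standing in for the classification tree); the data, the stump, and all names are hypothetical choices for this illustration.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# A deliberately simple base classifier: a decision stump that
# chooses the single split minimising training misclassifications.
def fit_stump(X, y):
    best = (np.inf, 0, 0.0, 0)          # (error, feature, threshold, flip)
    for j in range(X.shape[1]):
        for t in X[:, j]:
            pred = (X[:, j] > t).astype(int)
            e1, e2 = np.sum(pred != y), np.sum((1 - pred) != y)
            if min(e1, e2) < best[0]:
                best = (min(e1, e2), j, t, int(e1 > e2))
    _, j, t, flip = best
    return lambda x: int((x[j] > t) != flip)

# Toy learning sample: class determined by the sign of the first predictor.
X = rng.normal(size=(60, 2))
y = (X[:, 0] > 0).astype(int)
x_new = np.array([1.5, 0.0])

# Bagging: approximate the aggregated classifier C_A(x) = E_F C(x, L)
# by a majority vote over stumps grown on B bootstrap samples L_b.
B, votes = 25, []
for _ in range(B):
    b = rng.integers(0, len(y), size=len(y))
    votes.append(fit_stump(X[b], y[b])(x_new))
prediction = Counter(votes).most_common(1)[0][0]
```

Since the toy classes are separable on the first predictor, the vote for x_new = (1.5, 0.0) goes to class 1.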

Classification by laser scanning images

Glaucoma is an ocular disease that causes progressive damage to the optic nerve fibers and leads to visual field loss. Laser scanning images of the eye background are used to detect a loss of retinal nerves. The images taken by the Heidelberg retina tomograph [19] are used to derive measurements for the loss of retinal nerves. Fig. 1 shows mean and topography images of a normal and a glaucomatous eye.

A learning sample of 98 HRT examinations of normal eyes and 98 examinations of glaucomatous eyes

Benchmark experiments

In addition to the simulation study, we compare the performance of our proposal using the glaucoma clinical data and six benchmark classification problems. The artificially generated problems Twonorm, Threenorm and Ringnorm as well as the breast cancer, diabetes and ionosphere data are used. The data sets as well as the generating code for the artificial problems are taken from the R package mlbench, which is a collection of machine learning problems from the UCI repository (//www.ics.uci.edu/~mlearn/
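As a reference point, the Twonorm problem is commonly described (following Breiman's arcing work) as two 20-dimensional normal classes with unit covariance and mean vectors a·(1, …, 1) and −a·(1, …, 1) with a = 2/√20. A sketch of a generator under that description, not the mlbench code itself:

```python
import numpy as np

rng = np.random.default_rng(2)
p, a = 20, 2 / np.sqrt(20)

def twonorm(n):
    # Draw labels, then shift every coordinate of observation i by +a
    # (class 1) or -a (class 0) around standard normal noise.
    y = rng.integers(0, 2, size=n)
    mu = np.where(y[:, None] == 1, a, -a)
    return rng.normal(size=(n, p)) + mu, y

X, y = twonorm(300)
```

Such simulated problems have known class structure, which makes them convenient for comparing aggregation methods.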

Discussion

Aggregating multiple classifiers can not only be used to combine bootstrap replications of classifiers but also for method combination and averaging. LeBlanc and Tibshirani [24] use a weighted average of the estimated class probabilities for the combination of various classifiers. Other proposals for the combination of classifiers are, for example, MultiBoosting (see Ref. [25]) and the use of correspondence analysis as suggested by Merz [26].

The use of the out-of-bag sample for error

Summary

The construction of a good classifier based on a learning sample can be seen as a three-step procedure: learning different rules, selecting an optimal one, and estimating its misclassification error. Often only a small learning sample is available and all three steps have to be performed using this learning sample alone. It is well known that selecting the classification rule with minimum estimated misclassification error leads to biased estimates of its performance. However, different

Acknowledgements

T. Hothorn and B. Lausen gratefully acknowledge support from Deutsche Forschungsgemeinschaft, grant SFB 539-A4/C1.

About the Author—TORSTEN HOTHORN was born in Dresden, Germany in 1975. He received a diploma in Statistics from the University of Dortmund in 2000. Since spring 2000 he has been a statistician in the biostatistics group of the Department of Medical Informatics, Biometry and Epidemiology at the University of Erlangen-Nuremberg. His research interests currently include classification, non-parametric statistics and statistical computing.

References (28)

  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • L. Breiman

    Bagging Predictors

    Mach. Learning

    (1996)
  • L. Breiman

    Arcing classifiers

    Ann. Stat.

    (1998)
  • R.E. Schapire et al.

Boosting the margin: a new explanation for the effectiveness of voting methods

    Ann. Stat.

    (1998)

About the Author—BERTHOLD LAUSEN was born in Solingen, Germany in 1961 and received a diploma in Statistics from the University of Dortmund in 1987. He obtained a Ph.D. in Statistics from the University of Dortmund in 1990. From 1987 to 1988 he was a statistician at the Institute for Medical Biometry and Medical Informatics, University of Freiburg; from 1989 to 1993 a statistician at the chair of Mathematical Statistics and Applications, Department of Statistics, University of Dortmund; from 1993 to 1997 senior biometrician at the Research Institute for Child Nutrition, Dortmund; and from 1997 to 2000 senior lecturer (non-clinical) in medical statistics at the Department of Medical Statistics and Evaluation, Imperial College of Science, Technology and Medicine, London. Since 2000 he has been head of the biostatistics group at the Department of Medical Informatics, Biometry and Epidemiology, University of Erlangen-Nuremberg.
