Double-bagging: combining classifiers by bootstrap aggregation
Introduction
The construction of a good classifier based on a learning sample can be seen as a three-step procedure: learning different rules, selecting an optimal one and estimating its misclassification error. Often only a small learning sample is available and all three parts have to be performed using this learning sample only. It is well known that the selection of a classification rule with minimum estimated misclassification error leads to biased estimates of its performance. Even with efficient estimates of misclassification error like the 0.632+ bootstrap estimator by Efron and Tibshirani [1], the minimum of several estimators of misclassification error is a downward-biased estimate of the true error rate.
However, different rules have to be taken into account. In many applications simple rules like naive Bayes, nearest neighbors or linear discriminant analysis (LDA) perform comparably to more advanced classifiers (see Ref. [2] for a discussion). Clearly, the performance of a classifier depends on how well the underlying model represents the data, and consequently the performance of a rule can only be investigated under model assumptions. Ref. [3] suggests a strategy where different classifiers are compared using a simulation model derived from laser scanning image data. Lausen and Schumacher [4] discuss the bias of an optimally selected effect estimator in a simulation of a simple cutpoint model. Here, we suggest a classification procedure that combines two methods directly and needs no method selection.
LDA and classification trees (CTREE, cf. Ref. [5]) are somewhat extreme models. LDA assumes a spherical distribution of the predictors in each class. The classes are separable by hyperplanes in the sample space. In contrast, classification trees are non-parametric, i.e. they do not assume a special distribution of the predictors. CTREE searches for partitions in the multivariate sample space, which may be seen as higher-order interactions or homogeneous subgroups defined by some combination of binary splits of the predictors. Consequently, we combine both ideas. Breiman et al. [5, p. 16] noted that it is “… of surprise that [LDA] does as well as it does …” and consequently suggested investigating linear combinations of the predictors in each node.
The combination of different classifiers leads to a substantial reduction of misclassification error in many applications. Bagging [6], [7] and boosting (see, e.g., Ref. [8]) have raised a lot of interest. Kuncheva et al. [9] compare several kinds of aggregation. Saranli and Demirekler [10] study rank-based combination. The dynamic selection of classifiers is investigated by Giacinto and Roli [11]. Although known for a long time, LDA is still being improved, e.g. Du and Chang [12] and Yu and Yang [13] discuss recent proposals for special applications. Bootstrap aggregation of LDA was studied by Skurichina and Duin [14], [15]. Their investigations show that LDA can take advantage of bootstrap aggregation in situations where LDA is unstable. Skurichina and Duin [14] show that one example is the situation where only a small number of observations but a large number of predictors is available. Lausen et al. [16] propose a P-value-adjusted method for classification and regression trees and avoid a variable selection bias for different measurement scales.
We suggest “double-bagging” to deal with the problems of variable and method selection bias. Approximately 37% of the observations, a fraction of (1 − 1/N)^N ≈ e^(−1) ≈ 0.368, are not part of a single bootstrap sample in bagging. Breiman [17] calls those observations “out of bag”. In our framework, the out-of-bag sample is used to estimate the coefficients of a linear discriminant function. The corresponding linear discriminant variables computed for the bootstrap sample are used as additional predictors for the classification trees, which allows for a linear separation of the classes. This method performs comparably to LDA when the classes are linearly separable and comparably to bagging if the classes can be identified by partitions. Therefore, only the misclassification error of the combined classifier has to be estimated, and this estimate does not suffer from the over-optimism caused by method selection bias. Moreover, double-bagging improves on the best classifier studied in several experiments.
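As a rough illustration, the procedure just described can be sketched in Python with scikit-learn. This is a minimal re-implementation under our own simplifying assumptions (function names, the ensemble size, and the choice of estimators are illustrative), not the authors' original code:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def double_bagging_fit(X, y, n_bootstrap=25):
    """For each bootstrap sample: estimate an LDA on the out-of-bag
    observations, then grow a tree on the bootstrap sample augmented
    with the LDA discriminant variables as extra predictors."""
    N = len(y)
    ensemble = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, N, size=N)           # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(N), idx)      # out-of-bag observations
        if len(np.unique(y[oob])) < 2:
            continue                               # LDA needs every class present
        lda = LinearDiscriminantAnalysis().fit(X[oob], y[oob])
        # discriminant variables for the bootstrap sample as additional predictors
        Z = lda.decision_function(X[idx]).reshape(N, -1)
        tree = DecisionTreeClassifier().fit(np.hstack([X[idx], Z]), y[idx])
        ensemble.append((lda, tree))
    return ensemble

def double_bagging_predict(ensemble, X):
    """Aggregate the trees by majority vote."""
    votes = []
    for lda, tree in ensemble:
        Z = lda.decision_function(X).reshape(len(X), -1)
        votes.append(tree.predict(np.hstack([X, Z])))
    votes = np.array(votes, dtype=int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

Because the discriminant function is estimated on observations the tree never sees, the linear predictors enter the trees without being overfitted to the bootstrap sample.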
The procedure is introduced in detail in Section 2. We illustrate the performance by a model of laser scanning image data of the eye background for glaucoma classification in Section 3. Additionally, six benchmark problems are used to compare the performance of double-bagging of LDA and classification trees with recent proposals in Section 4.
Double-bagging
Let L = {(x_1, y_1), …, (x_N, y_N)} denote a learning sample of N independent observations consisting of p-dimensional vectors of predictors x_i and class labels y_i ∈ {1, …, J}. The observations in the learning set are a random sample from some distribution function F. A classifier C(x; L) predicts future y-values for a vector of predictors x based on a learning sample L. The aggregated classifier C_A is given by

C_A(x) = E_L C(x; L),

where the expectation is over learning samples L drawn from F.
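In bagging, this expectation is approximated by averaging classifiers fitted to bootstrap samples drawn with replacement from the learning sample. A small NumPy check (illustrative; N and the number of replications are arbitrary) of the out-of-bag fraction (1 − 1/N)^N ≈ e^(−1) mentioned later:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200

# fraction of observations not contained in a bootstrap sample;
# its expectation (1 - 1/N)^N approaches e^{-1} ~ 0.368 for large N
oob_fracs = []
for _ in range(1000):
    idx = rng.integers(0, N, size=N)               # bootstrap indices
    oob_fracs.append(1 - len(np.unique(idx)) / N)  # out-of-bag share
print(round(float(np.mean(oob_fracs)), 3))         # close to 0.368
```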
Classification by laser scanning images
Glaucoma is an ocular disease that causes progressive damage to the optic nerve fibers and leads to visual field loss. Laser scanning images of the eye background are used to detect a loss of retinal nerve fibers. The images taken by the Heidelberg Retina Tomograph (HRT) [19] are used to derive measurements of the loss of retinal nerve fibers. Fig. 1 shows mean and topography images of a normal and a glaucomatous eye.
A learning sample of 98 HRT examinations of normal eyes and 98 examinations of glaucomatous eyes is used.
Benchmark experiments
In addition to the simulation study, we compare the performance of our proposal using the clinical glaucoma data and six benchmark classification problems. The artificially generated problems Twonorm, Threenorm and Ringnorm as well as the breast cancer, diabetes and ionosphere data are used. The data sets, as well as the code generating the artificial problems, are taken from the R package mlbench, which is a collection of machine learning problems from the UCI repository (http://www.ics.uci.edu/~mlearn/).
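For illustration, the Twonorm problem can be generated directly. The following is a NumPy re-implementation following Breiman's usual description (the mlbench version is in R); the dimension p = 20 and the constant a = 2/√p are the conventional choices and are stated here as assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def twonorm(n, p=20):
    """Twonorm problem: both classes are unit-covariance Gaussians
    whose means are +a and -a in every coordinate, with a = 2/sqrt(p)."""
    a = 2 / np.sqrt(p)
    y = rng.integers(0, 2, size=n)          # class labels 0/1
    mu = np.where(y[:, None] == 1, a, -a)   # class-dependent mean vector
    X = rng.normal(0.0, 1.0, size=(n, p)) + mu
    return X, y

X, y = twonorm(1000)
# each coordinate's mean differs in sign between the two classes
print(X[y == 1].mean() > 0 > X[y == 0].mean())  # prints True
```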
Discussion
Aggregating multiple classifiers can not only be used to combine bootstrap replications of classifiers but also for method combination and averaging. LeBlanc and Tibshirani [24] use a weighted average of the estimated class probabilities for the combination of various classifiers. Other proposals for the combination of classifiers are, for example, MultiBoosting (see Ref. [25]) and the use of correspondence analysis as suggested by Merz [26].
The use of the out-of-bag sample for error estimation is a natural complement to its use for estimating the discriminant function.
Summary
The construction of a good classifier based on a learning sample can be seen as a three-step procedure: learning different rules, selecting an optimal one and estimating its misclassification error. Often only a small learning sample is available and all three parts have to be performed using this learning sample only. It is well known that the selection of a classification rule with minimum estimated misclassification error leads to biased estimates of its performance. However, different rules have to be taken into account.
Acknowledgements
T. Hothorn and B. Lausen gratefully acknowledge support from Deutsche Forschungsgemeinschaft, grant SFB 539-A4/C1.
References (28)
- B. Lausen, M. Schumacher, Evaluating the effect of optimized cutoff values in the assessment of prognostic factors, Comput. Stat. Data Anal. (1996)
- L.I. Kuncheva et al., Decision templates for multiple classifier fusion: an experimental comparison, Pattern Recognition (2001)
- A. Saranli, M. Demirekler, A statistical unified framework for rank-based multiple classifier decision combination, Pattern Recognition (2001)
- G. Giacinto, F. Roli, Dynamic classifier selection based on multiple classifier behaviour, Pattern Recognition (2001)
- Q. Du, C.-I. Chang, A linear constrained distance-based discriminant analysis for hyperspectral image classification, Pattern Recognition (2001)
- H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data—with application to face recognition, Pattern Recognition (2001)
- M. Skurichina, R.P.W. Duin, Bagging for linear classifiers, Pattern Recognition (1998)
- B. Efron, R. Tibshirani, Improvements on cross-validation: the 0.632+ bootstrap method, J. Am. Stat. Assoc. (1997)
- J.H. Friedman, On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining Knowledge Discovery (1997)
- T. Hothorn, B. Lausen, Bagging tree classifiers for laser scanning images: data and simulation based strategy, …
- L. Breiman et al., Classification and Regression Trees, Wadsworth (1984)
- L. Breiman, Bagging predictors, Mach. Learning (1996)
- L. Breiman, Arcing classifiers, Ann. Stat. (1998)
- R.E. Schapire et al., Boosting the margin: a new explanation for the effectiveness of voting methods, Ann. Stat. (1998)
About the Author—TORSTEN HOTHORN was born in Dresden, Germany in 1975. He received a diploma in Statistics from the University of Dortmund in 2000. Since spring 2000 he has been a statistician in the biostatistics group of the Department of Medical Informatics, Biometry and Epidemiology at the University of Erlangen-Nuremberg. His research interests currently include classification, non-parametric statistics and statistical computing.
About the Author—BERTHOLD LAUSEN was born in Solingen, Germany in 1961. He received a diploma in Statistics from the University of Dortmund in 1987 and a Ph.D. in Statistics from the same university in 1990. From 1987 to 1988 he was a statistician at the Institute for Medical Biometry and Medical Informatics, University of Freiburg; from 1989 to 1993 a statistician at the chair of Mathematical Statistics and Applications, Department of Statistics, University of Dortmund; from 1993 to 1997 senior biometrician at the Research Institute for Child Nutrition Dortmund; and from 1997 to 2000 senior lecturer (non-clinical) in medical statistics at the Department of Medical Statistics and Evaluation, Imperial College of Science, Technology and Medicine, London. Since 2000 he has been head of the biostatistics group at the Department of Medical Informatics, Biometry and Epidemiology, University of Erlangen-Nuremberg.