Shrinking classification trees for bootstrap aggregation
Introduction
We consider the classification problem: suppose we have observations D={(Xi,Yi):i=1,…,n}, where Xi is a vector of features or attributes and Yi is the corresponding class label. The goal is to learn a classifier from {(Xi,Yi)}, which can then be used to predict the class label Yj of any future observed attributes Xj. One popular tool is the classification or decision tree, such as CART and C4.5 (Breiman et al., 1984; Quinlan, 1993). Since a fully grown classification tree may over-fit the data and hence perform poorly in predicting future observations, pruning and shrinking are adopted to avoid over-fitting (Breiman et al., 1984; Hastie and Pregibon, 1990).
Breiman (1996a) introduced bootstrap aggregation (bagging) as a method to combine multiple versions of an unstable estimator, each constructed from a bootstrap sample, and showed that it improves over the single estimate based on the original whole sample. Classification trees in particular are well known to be unstable, so bagging classification trees has attracted much research attention. Bagging classification trees works as follows. First we draw a bootstrap sample Db from the original sample D; i.e. each of the n observations in Db is drawn independently from D with equal probability and with replacement (Efron and Tibshirani, 1993). Then we construct a base tree Tb using Db, which may or may not be pruned or shrunken. This process is repeated B>1 times. Then, for a future observation x, each trained tree Tb predicts a class yb, and the bagging ensemble estimates the class of x as the one with the plurality of votes in {yb}.
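The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper (whose simulations were run in S): the function names bootstrap_sample, plurality_vote, fit_stump, bag and bagged_predict are hypothetical, and the base learner is a one-split stump on the first feature, standing in for a full CART-style tree.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw n observations from data independently, uniformly, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def plurality_vote(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(sample):
    """Hypothetical stand-in for a base tree Tb: a single split on feature 0.

    Every observed value of feature 0 is tried as a threshold, and the one
    with the fewest training errors is kept.
    """
    best = None
    for x, _ in sample:
        t = x[0]
        left = [y for xi, y in sample if xi[0] <= t]   # never empty: contains x
        right = [y for xi, y in sample if xi[0] > t]
        left_label = plurality_vote(left)
        right_label = plurality_vote(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x[0] <= t else right_label


def bag(data, fit, B, seed=0):
    """Train B base classifiers, each on its own bootstrap sample Db."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(B)]


def bagged_predict(trees, x):
    """Predict the class of x as the plurality of the base predictions yb."""
    return plurality_vote([tree(x) for tree in trees])


# Toy data: (feature vector, class label) pairs.
data = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"),
        ((7.0,), "b"), ((8.0,), "b"), ((9.0,), "b")]
trees = bag(data, fit_stump, B=25)
print(bagged_predict(trees, (0.0,)), bagged_predict(trees, (9.0,)))
```

Any function that maps a training sample to a classifier can be passed as fit, so a pruned, shrunken or fully grown tree learner could replace the stump without changing the aggregation step.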
Some recent work (e.g. Bauer and Kohavi, 1998) shows that bagging pruned trees does not necessarily perform better than bagging unpruned trees. In this paper we explore a similar issue in bagging shrunken and unshrunken trees. Using a well-known data set, we show that the usual method of bagging shrunken trees, analogous to bagging pruned trees, does not work well. This observation motivated us to propose a new method to shrink the base trees, leading to an ensemble with improved accuracy over that obtained with the usual method.
This paper is organized as follows. Section 2 is a brief review of shrinking and bagging classification trees. In Section 3, a modification is proposed for shrinking the base trees of the bagging ensemble.
Shrinking and bagging
We do not expect shrinking to differ technically from pruning in our current setting. However, since pruning is better studied in bagging, it is interesting in its own right to study shrinking in bagging. All of our simulations were conducted in the S environment (Becker et al., 1988).
In this paper, any unshrunken (i.e., full) tree is by default fully grown to fit the data exactly; in other words, all of the training examples in any of its terminal nodes belong to the same class. As an
A proposal for shrinking in bagging
Comparing the ratio of shrunken to unshrunken tree sizes in a bagging ensemble with that in a single tree (Fig. 1), we notice that shrinking indeed has less effect in bagging. This was also observed by Bauer and Kohavi (1998) in the context of bagging pruned trees. From bootstrap theory (Efron and Tibshirani, 1993), we know that on average only about 63% of the original observations appear in a bootstrap sample: a given observation is included with probability 1-(1-1/n)^n, which approaches 1-1/e≈0.632 as n grows. In other words, there are many replicated observations in a bootstrap
Acknowledgments
The author is grateful to the Referees and Editors for helpful suggestions.
References (16)
- et al., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sci.
- et al., 1989. A tree-based statistical language model for natural language speech recognition. IEEE Trans. Acoust. Speech Signal Process.
- Bauer, E., Kohavi, R., 1998. An empirical comparison of voting classification algorithms: Bagging, Boosting, and...
- Becker et al., 1988. The New S Language: A Programming Environment for Data Analysis and Graphics.
- Breiman, L., 1996a. Bagging predictors. Machine Learning.
- Breiman, L., 1996b. Pasting bites together for prediction in large data sets and on-line. Technical Report, Statistics...
- Breiman et al., 1984. Classification and Regression Trees.
- Learning classification trees, 1992. Statistics and Computing.