A note on split selection bias in classification trees
Section snippets
Background
Classification trees are commonly used in searching for patterns. Researchers in the fields of statistics, machine learning, and data mining have developed related methods, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), to name a few. A recent survey of various classification tree methods can be found in Murthy (1998).
A basic element in constructing classification trees is split selection. For a binary classification tree, a univariate split based on a predictor X is of the form X⩽x, where x is a split point.
Splitting criteria
The Pearson chi-square statistic has been used as a splitting criterion in the literature (Kass, 1980; Hawkins, 1991; Shih, 1999). Consider a numerical predictor X in a two-class problem. The best split based on X is chosen to be of the form X⩽x such that it best separates the two classes. Specifically, for real x, a 2×2 table is obtained and it is given in Table 1.
The best split point is the value of x that attains A_X = max_x χ²(x), where χ²(x) denotes the Pearson chi-square statistic computed from the 2×2 table at x; A_X is the maximally selected chi-square statistic studied by Miller and Siegmund (1982).
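The exhaustive search over split points described above can be sketched in a few lines of Python. This is an illustration only; the function names and toy data are ours, not the article's:

```python
import numpy as np

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def max_selected_chi2(x, y):
    """A_X: the largest chi-square statistic over all splits x <= c.

    x is a numerical predictor; y holds 0/1 class labels.
    (Illustrative sketch, not the article's code.)"""
    best = 0.0
    for c in np.unique(x)[:-1]:               # every distinct value but the largest
        left = x <= c
        best = max(best, pearson_chi2_2x2(
            int(np.sum(left & (y == 1))),     # class 1, x <= c
            int(np.sum(~left & (y == 1))),    # class 1, x > c
            int(np.sum(left & (y == 0))),     # class 0, x <= c
            int(np.sum(~left & (y == 0))),    # class 0, x > c
        ))
    return best
```

For a predictor that separates the two classes perfectly, A_X equals the sample size n, the largest value the statistic can attain.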
Selection bias
For another numerical predictor, say W, the best split is selected similarly, with associated maximum statistic A_W = max_w χ²(w). The common scheme of split selection between X and W is to choose the one with the larger maximally selected chi-square statistic; equivalently, this approach compares A_X with A_W. We observe that A_X depends on n1 and n2, the numbers of observations available for classes 1 and 2, respectively. A_X also implicitly depends on the number of distinct values of X.
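The dependence of the maximally selected statistic on the number of candidate split points can be seen in a small null simulation: when the class label is independent of both predictors, an unbiased method should pick each predictor about half the time. Everything below (function names, sample sizes, distributions) is our own illustration, not the article's simulation design:

```python
import numpy as np

def max_chi2(x, y):
    """Maximally selected Pearson chi-square over splits x <= c (0/1 labels)."""
    best = 0.0
    n = len(y)
    for c in np.unique(x)[:-1]:
        left = x <= c
        a = np.sum(left & (y == 1)); b = np.sum(~left & (y == 1))
        cl = np.sum(left & (y == 0)); d = np.sum(~left & (y == 0))
        denom = (a + b) * (cl + d) * (a + cl) * (b + d)
        if denom > 0:
            best = max(best, n * (a * d - b * cl) ** 2 / denom)
    return best

def x_win_rate(n=40, reps=400, seed=1):
    """Fraction of null-data runs in which X (many distinct values) yields a
    larger maximally selected chi-square than W (3 distinct values)."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(reps):
        y = rng.integers(0, 2, n)      # class label independent of both predictors
        x = rng.normal(size=n)         # roughly n - 1 candidate split points
        w = rng.integers(0, 3, n)      # only 2 candidate split points
        if max_chi2(x, y) > max_chi2(w, y):
            wins += 1
    return wins / reps
```

Under this setup the continuous predictor is selected far more than half the time, even though neither predictor carries any class information.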
Simulation studies
In the following sections we study the effects of three split selection schemes: the chi-square, the phi-square, and the exact P-value methods. We first study the case where the class variable is independent of the predictors.
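One way to correct for unequal numbers of split points, in the spirit of the exact P-value method and of the permutation test of Frank and Witten (1998), is to compare predictors by a P-value of the maximally selected statistic rather than by the statistic itself. The Monte Carlo sketch below, with illustrative names and parameters of our own choosing, approximates that P-value by permuting the class labels:

```python
import numpy as np

def max_chi2(x, y):
    """Maximally selected Pearson chi-square over splits x <= c (0/1 labels)."""
    best = 0.0
    n = len(y)
    for c in np.unique(x)[:-1]:
        left = x <= c
        a = np.sum(left & (y == 1)); b = np.sum(~left & (y == 1))
        cl = np.sum(left & (y == 0)); d = np.sum(~left & (y == 0))
        denom = (a + b) * (cl + d) * (a + cl) * (b + d)
        if denom > 0:
            best = max(best, n * (a * d - b * cl) ** 2 / denom)
    return best

def perm_p_value(x, y, B=199, seed=0):
    """Monte Carlo permutation P-value of A_X.

    Permuting y puts every predictor on the same null footing, so predictors
    with many candidate split points lose their automatic advantage.
    (Illustrative sketch, not the article's exact P-value computation.)"""
    rng = np.random.default_rng(seed)
    observed = max_chi2(x, y)
    hits = sum(max_chi2(x, rng.permutation(y)) >= observed for _ in range(B))
    return (hits + 1) / (B + 1)
```

Split selection then picks the predictor with the smallest P-value, so a large A_X obtained merely by searching over many split points no longer wins by default.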
Concluding remarks
One main reason for using a classification tree is its easy interpretation. This insight is provided by the splits, and any selection bias can weaken our confidence in the explanation of a resulting tree. We demonstrate that split selection using the usual exhaustive search approach is biased if numerical predictors have unequal numbers of available split points and the Pearson chi-square statistic is used as the criterion for two-class problems. The selection bias can be corrected by
Acknowledgements
The author is grateful to the referees for their thoughtful comments which improved the presentation of the article. The author thanks Deng, Chia-Zu for her computing assistance. This research is supported by a grant from NSC, Taiwan.
References (23)
Agresti, A., 1984. Analysis of Ordinal Categorical Data.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees.
Tree stability diagnostics and some remedies for instability. Statist. Med., 2000.
Dobra, A., Gehrke, J., 2001. Bias correction in classification tree construction. Proceedings of the Seventh...
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap.
Frank, E., Witten, I.H., 1998. Using a permutation test for attribute selection in decision trees. Proceedings of the...
Minimally selected p and other tests for a single abrupt changepoint in a binary sequence. Biometrics, 1999.
Hawkins, D.M., 1991. FIRM: formal inference-based recursive modeling. Amer. Statist.
Jensen, D.D., Cohen, P.R., 2000. Multiple comparisons in induction algorithms. Mach. Learning.
Kass, G.V., 1980. An exploratory technique for investigating large quantities of categorical data. Appl. Statist.
Kim, H., Loh, W.-Y., 2001. Classification trees with unbiased multiway splits. J. Amer. Statist. Assoc.
Cited by (53)
Splitting criteria for classification problems with multi-valued attributes and large number of classes. Pattern Recognition Letters, 2018. Citation excerpt: "Indeed, it is widely known that many splitting criteria have bias toward attributes with a large number of values. There are some proposals available to cope with this issue [9,11,22]. This topic, though relevant, is not the focus of our paper."
Decision tree for classification and forecasting. Engineering Mathematics and Artificial Intelligence: Foundations, Methods, and Applications, 2023.
BOSS - Biomarker Optimal Segmentation System. arXiv, 2023.
A tree-based modeling approach for matched case-control studies. Statistics in Medicine, 2023.
Dynamic Allocation Optimization in A/B-Tests Using Classification-Based Preprocessing. IEEE Transactions on Knowledge and Data Engineering, 2023.