A note on split selection bias in classification trees

https://doi.org/10.1016/S0167-9473(03)00064-1

Abstract

A common approach to split selection in classification trees is to search through all possible splits generated by the predictor variables. A splitting criterion is then used to evaluate these splits, and the one with the largest criterion value is usually chosen to channel samples into the corresponding subnodes. However, this greedy method is biased in variable selection when the numbers of available split points differ across variables. Such a result can hamper the intuitively appealing nature of classification trees. The problem of split selection bias is examined for two-class tasks with numerical predictors. A statistical explanation of its existence is given, and a solution based on P-values is provided for the case where the Pearson chi-square statistic is used as the splitting criterion.

Section snippets

Background

Classification trees are commonly used in searching for patterns. Researchers in the field of statistics, machine learning and data mining have developed related methods, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), to name a few. A recent survey of various classification tree methods can be found in Murthy (1998).

A basic element in constructing classification trees is split selection. For a binary classification tree, a univariate split based on predictor X is of the form X⩽x,

Splitting criteria

The Pearson chi-square statistic has been used as a splitting criterion in the literature (Kass, 1980; Hawkins, 1991; Shih, 1999). Consider a numerical predictor X in a two-class problem; the best split based on X is chosen to be of the form X⩽x such that it best separates the two classes. Specifically, for real x, a 2×2 table is obtained; it is given in Table 1.

The best split point is x* = argmax_x A_x^2, where A_x^2 denotes the Pearson chi-square statistic

    A_x^2 = N(ad − bc)^2 (n_1 n_2 n_L n_R)^{-1}.

Miller and Siegmund
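
As a concrete illustration of this exhaustive search, a minimal Python sketch is given below. It follows the definitions above (a, b, c, d are the cells of Table 1); the function name max_chisq_split and the candidate-cut convention (midpoints between distinct values) are illustrative assumptions, not the author's code.

    import numpy as np

    def max_chisq_split(x, y):
        # Maximally selected Pearson chi-square statistic A_X = max_x A_x^2
        # over splits of the form X <= c in a two-class problem.
        # x: numeric predictor values; y: class labels coded 0/1.
        # Assumes both classes occur in y and X is not constant.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y)
        N = len(x)
        n1 = int(np.sum(y == 0))              # observations in class 1
        n2 = N - n1                           # observations in class 2
        vals = np.unique(x)
        cuts = (vals[:-1] + vals[1:]) / 2.0   # midpoints between distinct values
        best_cut, best_stat = None, -np.inf
        for cut in cuts:
            left = x <= cut
            a = int(np.sum(left & (y == 0)))  # class 1, left node
            b = n1 - a                        # class 1, right node
            c = int(np.sum(left & (y == 1)))  # class 2, left node
            d = n2 - c                        # class 2, right node
            nL, nR = a + c, b + d
            stat = N * (a * d - b * c) ** 2 / (n1 * n2 * nL * nR)
            if stat > best_stat:
                best_cut, best_stat = cut, stat
        return best_cut, best_stat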

Selection bias

For another numerical predictor, say W, the best split W⩽w is selected if it has the associated maximum statistic A_W = max_w A_w^2. The common scheme of split selection between X and W is to choose the one with the larger maximally selected chi-square statistic. Equivalently, this approach compares A_X with A_W. We observe that A_X depends on n_1 and n_2, which are the numbers of observations available for classes 1 and 2, respectively. A_X also implicitly depends on the number of distinct A_x^2 values.
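
In code, the comparison under this scheme reduces to a few lines, reusing the max_chisq_split sketch above (the arrays x, w, y and the tie-breaking rule are illustrative):

    cut_x, a_x = max_chisq_split(x, y)              # A_X = max_x A_x^2
    cut_w, a_w = max_chisq_split(w, y)              # A_W = max_w A_w^2
    split_variable = "X" if a_x >= a_w else "W"     # larger maximally selected statistic wins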

Simulation studies

In the following sections we study the effect of three split selection schemes: the chi-square, the phi-square, and the exact P-value methods. We first study the case where the class variable is independent of the predictors.
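
A minimal Monte Carlo sketch of this null case is shown below, reusing max_chisq_split from the earlier sketch. The design is an illustrative assumption, not the paper's simulation setup: X is continuous (about n distinct split points), W takes only three values (two split points), and the class labels are generated independently of both. An unbiased rule would select each predictor about half the time; the maximally selected chi-square favours X.

    import numpy as np

    rng = np.random.default_rng(0)

    def selection_frequency(n=100, n_rep=2000):
        # Fraction of replications in which each predictor is chosen
        # when the class variable is independent of both predictors.
        picks = {"X": 0, "W": 0}
        for _ in range(n_rep):
            y = rng.integers(0, 2, size=n)                 # two classes, independent of X and W
            x = rng.normal(size=n)                         # roughly n distinct values
            w = rng.integers(0, 3, size=n).astype(float)   # only three distinct values
            _, a_x = max_chisq_split(x, y)
            _, a_w = max_chisq_split(w, y)
            picks["X" if a_x >= a_w else "W"] += 1
        return {k: v / n_rep for k, v in picks.items()}

    print(selection_frequency())   # the frequency for "X" comes out well above 0.5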

Concluding remarks

One main reason for using a classification tree is its ease of interpretation. This insight is provided by the splits, and any selection bias can weaken our confidence in the explanation of a resulting tree. We demonstrate that the split selection method using the usual exhaustive search approach is biased when numerical predictors have unequal numbers of available split points and the Pearson chi-square statistic is used as the criterion for two-class problems. The selection bias can be corrected by
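
The snippet is cut off here; the abstract states that the correction is based on P-values of the maximally selected statistic. As a rough stand-in for that idea (a Monte Carlo permutation approximation, not the paper's exact P-value computation), predictors can be compared on the P-value scale instead of the raw statistic, again reusing max_chisq_split from the earlier sketch:

    def permutation_p_value(x, y, n_perm=999, seed=1):
        # Permutation approximation to the P-value of A_X = max_x A_x^2;
        # a stand-in for the exact P-value method referred to in the text.
        rng = np.random.default_rng(seed)
        _, observed = max_chisq_split(x, y)
        exceed = sum(
            max_chisq_split(x, rng.permutation(y))[1] >= observed
            for _ in range(n_perm)
        )
        return (exceed + 1) / (n_perm + 1)

    # Selecting the predictor with the smaller P-value puts variables with
    # different numbers of available split points on a comparable footing.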

Acknowledgements

The author is grateful to the referees for their thoughtful comments, which improved the presentation of the article. The author thanks Deng, Chia-Zu for her computing assistance. This research was supported by a grant from NSC, Taiwan.

References (23)

  • A. Agresti, Analysis of Ordinal Categorical Data (1984)
  • L. Breiman et al., Classification and Regression Trees (1984)
  • F. Dannegger, Tree stability diagnostics and some remedies for instability, Statist. Med. (2000)
  • A. Dobra, J. Gehrke, Bias correction in classification tree construction, Proceedings of the Seventh... (2001)
  • B. Efron et al., An Introduction to the Bootstrap (1993)
  • E. Frank, I.H. Witten, Using a permutation test for attribute selection in decision trees, Proceedings of the... (1998)
  • A.L. Halpern, Minimally selected p and other tests for a single abrupt changepoint in a binary sequence, Biometrics (1999)
  • D.M. Hawkins, FIRM: formal inference-based recursive modeling, Amer. Statist. (1991)
  • D. Jensen et al., Multiple comparisons in induction algorithms, Mach. Learning (2000)
  • G.V. Kass, An exploratory technique for investigating large quantities of categorical data, Appl. Statist. (1980)
  • H. Kim et al., Classification trees with unbiased multiway splits, J. Amer. Statist. Assoc. (2001)