A note on split selection bias in classification trees

https://doi.org/10.1016/S0167-9473(03)00064-1

Abstract

A common approach to split selection in classification trees is to search through all possible splits generated by the predictor variables. A splitting criterion is then used to evaluate these splits, and the one with the largest criterion value is usually chosen to channel samples into the corresponding subnodes. However, this greedy method is biased in variable selection when the numbers of available split points differ across variables. Such a result can hamper the intuitively appealing nature of classification trees. The problem of split selection bias is examined for two-class tasks with numerical predictors. A statistical explanation of its existence is given, and a solution based on P-values is provided for the case where the Pearson chi-square statistic is used as the splitting criterion.

Section snippets

Background

Classification trees are commonly used in searching for patterns. Researchers in the field of statistics, machine learning and data mining have developed related methods, such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), to name a few. A recent survey of various classification tree methods can be found in Murthy (1998).

A basic element in constructing classification trees is split selection. For a binary classification tree, a univariate split based on predictor X is of the form X⩽x,

Splitting criteria

The Pearson chi-square statistic has been used as a splitting criterion in the literature (Kass, 1980; Hawkins, 1991; Shih, 1999). Consider a numerical predictor X in a two-class problem; the best split based on X is chosen to be of the form X⩽x such that it best separates the two classes. Specifically, for real x, a 2×2 table is obtained; it is given in Table 1.

The best split point is x* = argmax_x A_x^2, where A_x^2 denotes the Pearson chi-square statistic

    A_x^2 = N(ad − bc)^2 (n_1 n_2 n_L n_R)^{-1}.

Miller and Siegmund
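
As a concrete illustration of this exhaustive search, a minimal Python sketch is given below. It follows the definitions above (a, b, c, d are the cells of Table 1); the function name max_chisq_split and the candidate-cut convention (midpoints between distinct values) are illustrative assumptions, not the author's code.

    import numpy as np

    def max_chisq_split(x, y):
        # Maximally selected Pearson chi-square statistic A_X = max_x A_x^2
        # over splits of the form X <= c in a two-class problem.
        # x: numeric predictor values; y: class labels coded 0/1.
        # Assumes both classes occur in y and X is not constant.
        x = np.asarray(x, dtype=float)
        y = np.asarray(y)
        N = len(x)
        n1 = int(np.sum(y == 0))              # observations in class 1
        n2 = N - n1                           # observations in class 2
        vals = np.unique(x)
        cuts = (vals[:-1] + vals[1:]) / 2.0   # midpoints between distinct values
        best_cut, best_stat = None, -np.inf
        for cut in cuts:
            left = x <= cut
            a = int(np.sum(left & (y == 0)))  # class 1, left node
            b = n1 - a                        # class 1, right node
            c = int(np.sum(left & (y == 1)))  # class 2, left node
            d = n2 - c                        # class 2, right node
            nL, nR = a + c, b + d
            stat = N * (a * d - b * c) ** 2 / (n1 * n2 * nL * nR)
            if stat > best_stat:
                best_cut, best_stat = cut, stat
        return best_cut, best_stat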

Selection bias

For another numerical predictor, say W, the best split W⩽w is selected if it has the associated maximum statistic A_W = max_w A_w^2. The common scheme of split selection between X and W is to choose the one with the larger maximally selected chi-square statistic. Equivalently, this approach compares A_X with A_W. We observe that A_X depends on n_1 and n_2, which are the numbers of observations available for classes 1 and 2, respectively. A_X also implicitly depends on the number of distinct A_x^2 values.
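
In code, the comparison under this scheme reduces to a few lines, reusing the max_chisq_split sketch above (the arrays x, w, y and the tie-breaking rule are illustrative):

    cut_x, a_x = max_chisq_split(x, y)              # A_X = max_x A_x^2
    cut_w, a_w = max_chisq_split(w, y)              # A_W = max_w A_w^2
    split_variable = "X" if a_x >= a_w else "W"     # larger maximally selected statistic wins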

Simulation studies

In the following sections we study the effect of three split selection schemes: the chi-square, the phi-square, and the exact P-value methods. We first study the case where the class variable is independent of the predictors.
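
A minimal Monte Carlo sketch of this null case is shown below, reusing max_chisq_split from the earlier sketch. The design is an illustrative assumption, not the paper's simulation setup: X is continuous (about n distinct split points), W takes only three values (two split points), and the class labels are generated independently of both. An unbiased rule would select each predictor about half the time; the maximally selected chi-square favours X.

    import numpy as np

    rng = np.random.default_rng(0)

    def selection_frequency(n=100, n_rep=2000):
        # Fraction of replications in which each predictor is chosen
        # when the class variable is independent of both predictors.
        picks = {"X": 0, "W": 0}
        for _ in range(n_rep):
            y = rng.integers(0, 2, size=n)                 # two classes, independent of X and W
            x = rng.normal(size=n)                         # roughly n distinct values
            w = rng.integers(0, 3, size=n).astype(float)   # only three distinct values
            _, a_x = max_chisq_split(x, y)
            _, a_w = max_chisq_split(w, y)
            picks["X" if a_x >= a_w else "W"] += 1
        return {k: v / n_rep for k, v in picks.items()}

    print(selection_frequency())   # the frequency for "X" comes out well above 0.5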

Concluding remarks

One main reason for using a classification tree is its ease of interpretation. This insight is provided by the splits, and any selection bias can weaken our confidence in the explanation of a resulting tree. We demonstrate that the split selection method using the usual exhaustive search approach is biased when numerical predictors have unequal numbers of available split points and the Pearson chi-square statistic is used as the criterion for two-class problems. The selection bias can be corrected by
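
The snippet is cut off here; the abstract states that the correction is based on P-values of the maximally selected statistic. As a rough stand-in for that idea (a Monte Carlo permutation approximation, not the paper's exact P-value computation), predictors can be compared on the P-value scale instead of the raw statistic, again reusing max_chisq_split from the earlier sketch:

    def permutation_p_value(x, y, n_perm=999, seed=1):
        # Permutation approximation to the P-value of A_X = max_x A_x^2;
        # a stand-in for the exact P-value method referred to in the text.
        rng = np.random.default_rng(seed)
        _, observed = max_chisq_split(x, y)
        exceed = sum(
            max_chisq_split(x, rng.permutation(y))[1] >= observed
            for _ in range(n_perm)
        )
        return (exceed + 1) / (n_perm + 1)

    # Selecting the predictor with the smaller P-value puts variables with
    # different numbers of available split points on a comparable footing.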

Acknowledgements

The author is grateful to the referees for their thoughtful comments, which improved the presentation of the article. The author thanks Deng, Chia-Zu for her computing assistance. This research was supported by a grant from NSC, Taiwan.

References (23)

  • A. Agresti, Analysis of Ordinal Categorical Data (1984)
  • L. Breiman et al., Classification and Regression Trees (1984)
  • F. Dannegger, Tree stability diagnostics and some remedies for instability, Statist. Med. (2000)
  • A. Dobra, J. Gehrke, Bias correction in classification tree construction, Proceedings of the Seventh... (2001)
  • B. Efron et al., An Introduction to the Bootstrap (1993)
  • E. Frank, I.H. Witten, Using a permutation test for attribute selection in decision trees, Proceedings of the... (1998)
  • A.L. Halpern, Minimally selected p and other tests for a single abrupt changepoint in a binary sequence, Biometrics (1999)
  • D.M. Hawkins, FIRM: formal inference-based recursive modeling, Amer. Statist. (1991)
  • D. Jensen et al., Multiple comparisons in induction algorithms, Mach. Learning (2000)
  • G.V. Kass, An exploratory technique for investigating large quantities of categorical data, Appl. Statist. (1980)
  • H. Kim et al., Classification trees with unbiased multiway splits, J. Amer. Statist. Assoc. (2001)