Combining multiple class distribution modified subsamples in a single tree

https://doi.org/10.1016/j.patrec.2006.08.013

Abstract

This work describes the Consolidated Tree Construction (CTC) algorithm: a single tree is built from a set of subsamples, so the explaining capacity of the classifier is not lost even if many subsamples are used. We show how the CTC algorithm can use undersampling to change the class distribution without loss of information, building more accurate classifiers than C4.5.

Introduction

When machine learning is used to solve real-world problems, where a model has to be built from a data set, it has usually been assumed that the proportion of each class in the training set should match the proportion found in reality. There are situations where the class distribution of the collected data does not match the distribution expected in reality, and efforts have been made to correct this mismatch (Chan and Stolfo, 1998, Zadrozny and Elkan, 2001).

Most efforts to face class distribution problems have been directed at very unbalanced data sets. It is easy to see why classifiers do not behave well when trained with very unbalanced data sets: if 99% of the examples in a data set belong to the same class, a classifier that labels test cases with the majority class will achieve 99% accuracy. Since most classifiers are designed to minimise the error rate, such data sets tend to produce very simple and useless classifiers (Japkowicz, 2000, Chawla, 2003).

However, the class distribution used in the training set is important not only in highly unbalanced data sets but in every data set. Weiss and Provost (2003) have shown that each domain has an optimal class distribution to be used for training. In their work, Weiss and Provost show that, when the class distribution of the training set can be chosen, it is usually preferable to use a distribution other than the one expected in reality.

In general, any machine learning algorithm builds a classifier based on a sample with a particular distribution. If the best distribution (the one that yields better results) is known, the original sample can be modified to match it using two strategies: oversampling or undersampling. Even though the direct consequence of undersampling is that some examples are ignored, undersampling techniques generally obtain better results than oversampling techniques (which repeat some examples) (Drummond and Holte, 2003). However, it is well known that, in general, decreasing the number of examples in the training set increases the error rate of the built classifiers. Moreover, algorithms are even more sensitive to data reduction in the very common case of small training sets (Provost et al., 1999).
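
To make the undersampling strategy concrete, the following is a minimal sketch, not taken from the paper: the function name, the label accessor and the balanced 50/50 target are our own illustrative assumptions. It reduces an original sample to a chosen class distribution without repeating any example.

```python
import random
from collections import defaultdict

def undersample_to_distribution(examples, get_label, target_dist, seed=0):
    """Draw a subsample whose class proportions approximate target_dist."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ex in examples:
        by_class[get_label(ex)].append(ex)

    # The subsample size is limited by the scarcest class relative to
    # its target share: no class may be repeated (pure undersampling).
    max_size = min(len(by_class[c]) / p for c, p in target_dist.items() if p > 0)

    subsample = []
    for c, p in target_dist.items():
        subsample.extend(rng.sample(by_class[c], int(max_size * p)))
    rng.shuffle(subsample)
    return subsample

# Hypothetical 10%/90% data, undersampled to a balanced 50/50 distribution.
data = [(i, "pos" if i % 10 == 0 else "neg") for i in range(1000)]
balanced = undersample_to_distribution(data, lambda ex: ex[1],
                                        {"pos": 0.5, "neg": 0.5})
```

With this strategy the achievable subsample size is bounded by the minority class, which is exactly why undersampling discards examples and, as noted above, tends to increase the error rate of a classifier built from a single subsample.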

There are many real domains, such as illness diagnosis, fraud detection in different fields, customer behaviour analysis (marketing), customer retention, …, where obtaining high classification accuracy is not enough: the comprehensibility of the classifier is also necessary (Domingos, 1997). This kind of problem needs classifying paradigms that are able to give an explanation, for example classification or decision trees.

As mentioned before, in some cases it would be useful to change the class distribution by undersampling the original data set, but we would like to do it in such a way that information loss is avoided. How could this be done? An easy way would be to build multiple classifiers: several subsamples with the changed distribution are created by undersampling the original data set, and a classifier is built from each of them. New examples can then be classified by a voting process, similar to bagging (sketched below). This can be a good option in some cases, but not in areas where explanation is important. As Domingos (1997) stated, it is clear that “while a single decision tree can be easily understood by a human as long as it is not too large, fifty such trees, even if individually simple, exceed the capacity of even the most patient”.
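
For contrast with the single-tree approach introduced next, the sketch below illustrates this bagging-like alternative under stated assumptions: scikit-learn's DecisionTreeClassifier merely stands in for C4.5, and the function names are ours. Each distribution-modified subsample yields one tree, and new examples are classified by majority vote, which is precisely where the explanation gets lost.

```python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

def fit_voting_ensemble(subsamples):
    """Train one decision tree per class-distribution-modified subsample."""
    trees = []
    for X_sub, y_sub in subsamples:
        tree = DecisionTreeClassifier()
        tree.fit(X_sub, y_sub)
        trees.append(tree)
    return trees

def predict_by_vote(trees, x):
    """Majority vote over the individual trees: no single tree explains
    the combined decision."""
    votes = Counter(tree.predict([x])[0] for tree in trees)
    return votes.most_common(1)[0][0]
```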

The algorithm we have developed, Consolidated Tree Construction (CTC), solves this problem. The CTC algorithm creates several subsamples with the desired class distribution from the original training set. Unlike other algorithms that use subsamples to build multiple trees, such as bagging or boosting, CTC induces a single tree; therefore it does not lose the comprehensibility of the base classifier.

This work describes the CTC algorithm and shows how it can be combined with changes in class distribution to achieve improvements in accuracy. The C4.5 algorithm (Quinlan, 1993) has been used with two aims: as the base algorithm for building Consolidated Trees and for evaluating the CTC algorithm's behaviour. The evaluation has been done with 10 databases from the UCI repository benchmark (Newman et al., 1998).

The paper proceeds as follows. Section 2 describes how a single tree can be built from several subsamples (the CTC algorithm). Section 3 describes the characteristics of the databases used in the experimentation and the details of the experimental methodology. Section 4 analyses the behaviour of the CTC algorithm when the changes in class distribution proposed by Weiss and Provost (2003) are applied, and compares it to C4.5. Finally, Section 5 is devoted to conclusions and further work.


Consolidated Tree Construction algorithm

The Consolidated Tree Construction (CTC) algorithm uses several subsamples to build a single tree (Pérez et al., 2004). This technique is radically different from bagging, boosting, etc.: consensus is reached at each step of the tree-building process and only one tree is built.

The different subsamples are used to make proposals about the feature that should be used to split the current node. The split function used in this work is the gain ratio criterion (the same used by Quinlan (1993)).
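
The snippet above only outlines the consolidation idea, so the following Python sketch is speculative rather than the authors' implementation: each subsample proposes a split attribute (e.g. by gain ratio), the most voted proposal is adopted, and that same split is then applied to every subsample before recursing. The helper callables best_split and partition are hypothetical placeholders.

```python
from collections import Counter

def consolidate_node(subsamples, best_split, partition):
    """Build one consolidated node shared by all subsamples.

    subsamples : list of (X, y) pairs, one per subsample
    best_split : (X, y) -> proposed split attribute (e.g. by gain ratio),
                 or None if that subsample would stop splitting
    partition  : (X, y, attribute) -> list of child (X, y) pairs
    """
    # Each subsample proposes the attribute it would split on.
    proposals = [best_split(X, y) for X, y in subsamples]

    # Consensus step: the most voted proposal wins; if the most voted
    # proposal is to stop splitting, the node becomes a leaf.
    winner, _ = Counter(proposals).most_common(1)[0]
    if winner is None:
        return {"leaf": True}

    # The winning split is applied to *every* subsample, so a single
    # tree structure is grown for all of them.
    split_subsamples = [partition(X, y, winner) for X, y in subsamples]
    n_children = len(split_subsamples[0])
    children = [
        consolidate_node([branches[i] for branches in split_subsamples],
                         best_split, partition)
        for i in range(n_children)
    ]
    return {"leaf": False, "attribute": winner, "children": children}
```

Because every subsample follows the same consolidated structure, the result is one tree whose internal nodes reflect the agreement of all the subsamples, rather than an ensemble of divergent trees that must be combined at prediction time.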

Experimental methodology

In this paper we show how to take advantage of changes in class distribution in order to obtain increased accuracy. This work is based on a previous paper, in which Weiss and Provost (2003) show the effect of class distribution on tree induction. In that work the authors show that, in the majority of cases, a decision tree induced from a sample with a class distribution other than the one expected in reality can achieve lower error rates than a tree induced from a sample with the natural

Results

The experimentation in Section 4.1 has been carried out as closely as possible to that conducted by Weiss and Provost, and so trees have not been pruned. In Section 4.2, the generated trees have been pruned, obtaining better classification performance.

Conclusions and further work

In this paper we have presented the CTC algorithm, which builds a single tree from several subsamples. This way it uses more information than a tree built from a single subsample, but it does not lose the explaining capacity of classification trees. Our work is based on the conclusions presented by Weiss and Provost (2003), who showed that an optimal class distribution for training classifiers exists for each domain, and that this distribution is not necessarily the natural one. Besides,

Acknowledgments

The work described in this paper was partly done under the University of the Basque Country (UPV/EHU) project 1/UPV 00139.226-T-15920/2004. It was also funded by the Diputación Foral de Gipuzkoa and the European Union.

Thanks to the anonymous reviewers for their valuable comments.

References (17)

  • Breiman, L., et al., 1984. Classification and Regression Trees.
  • Chan, P.K., Stolfo, S.J., 1998. Toward scalable learning with non-uniform class and cost distributions: A case study in...
  • Chawla, N.V., 2003. C4.5 and imbalanced data sets: Investigating the effect of sampling method, probabilistic estimate,...
  • Dietterich, T.G., 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput.
  • Domingos, P., 1997. Knowledge acquisition from examples via multiple models. In: Proceedings of the 14th International...
  • Drummond, C., Holte, R.C., 2003. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In:...
  • Fawcett, T., 2004. ROC Graphs: Notes and Practical Considerations for Researchers. HP Labs Tech Report...
  • Hastie, T., et al., 2001. The Elements of Statistical Learning.
There are more references available in the full text version of this article.
