Abstract
When machine learning is used to solve real-world problems, the class distribution of the training set matters, not only in highly imbalanced data sets but in every data set. Weiss and Provost suggested that each domain has an optimal class distribution for training. The aim of this work was to test this hypothesis in the context of decision tree learners. To this end, we searched for the optimal class distribution for 30 databases and two decision tree learners, C4.5 and the Consolidated Tree Construction (CTC) algorithm, considering both pruned and unpruned trees and two measures of discriminating capacity: AUC and error rate. The results confirmed that changing the class distribution of the training samples improves the performance (AUC and error) of the classifiers. The experiments showed that there is an optimal class distribution for each database, and that this distribution depends on the learning algorithm, on whether the trees are pruned, and on the evaluation criterion used. Moreover, the results showed that the CTC algorithm, combined with samples at the optimal class distribution, builds more accurate classifiers than any of the C4.5 variants or CTC trained with the original distribution, with statistically significant differences.
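To make the experimental idea concrete, the following is a minimal sketch, not the authors' exact protocol: it subsamples a training set to a range of class distributions, trains a tree at each distribution, and reports which distribution is optimal under AUC and under error. It assumes scikit-learn's CART-style DecisionTreeClassifier as a stand-in for C4.5 (CTC has no implementation here), synthetic data standing in for the 30 databases, and a hypothetical helper resample_to_distribution with an arbitrary set of swept fractions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def resample_to_distribution(X, y, pos_fraction, rng):
    """Subsample a two-class training set (hypothetical helper) so that
    class 1 makes up `pos_fraction` of the returned examples."""
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    # Largest total size the rarer class allows at this distribution.
    n_total = int(min(len(idx_pos) / pos_fraction,
                      len(idx_neg) / (1.0 - pos_fraction)))
    n_pos = int(round(n_total * pos_fraction))
    n_neg = n_total - n_pos
    sel = np.concatenate([rng.choice(idx_pos, n_pos, replace=False),
                          rng.choice(idx_neg, n_neg, replace=False)])
    rng.shuffle(sel)
    return X[sel], y[sel]

# Synthetic imbalanced data standing in for one of the 30 databases.
X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

results = {}
for frac in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    aucs, errors = [], []
    for seed in range(10):  # average over several subsamples per distribution
        Xs, ys = resample_to_distribution(X_tr, y_tr, frac,
                                          np.random.default_rng(seed))
        # Setting ccp_alpha > 0 would give a pruned tree, mirroring the
        # pruned/unpruned comparison in the paper.
        tree = DecisionTreeClassifier(random_state=seed).fit(Xs, ys)
        aucs.append(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))
        errors.append(np.mean(tree.predict(X_te) != y_te))
    results[frac] = (np.mean(aucs), np.mean(errors))

best_by_auc = max(results, key=lambda f: results[f][0])
best_by_error = min(results, key=lambda f: results[f][1])
print(f"optimal training distribution by AUC:   {best_by_auc:.0%} positives")
print(f"optimal training distribution by error: {best_by_error:.0%} positives")
```

Note that the two criteria can legitimately disagree: the distribution that maximizes AUC need not minimize error, which is why the abstract treats the optimal distribution as dependent on the evaluation criterion.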
References
Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
Chawla, N.V., Bowyer, K.W., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
Chawla, N.V.: C4.5 and Imbalanced Data sets: Investigating the effect of sampling method, probabilistic estimate, and decision tree structure. In: Proc. of the Workshop on Learning from Imbalanced Data Sets, ICML, Washington DC (2003)
Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7, 1–30 (2006)
Drummond, C., Holte, R.C.: Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria. In: Proc. of the 17th Int. Conf. on Machine Learning, pp. 239–246 (2000)
Estabrooks, A., Jo, T.J., Japkowicz, N.: A Multiple Resampling Method for Learning from Imbalanced Data Sets. Computational Intelligence 20(1), 18–36 (2004)
García, S., Herrera, F.: An Extension on Statistical Comparisons of Classifiers over Multiple Data Sets for all Pairwise Comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
Japkowicz, N., Stephen, S.: The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis Journal 6(5) (2002)
Kennedy, R.L., Lee, Y., Van Roy, B., Reed, C.D., Lippmann, R.P.: Solving Data Mining Problems through Pattern Recognition. Prentice-Hall, Englewood Cliffs (1998)
Ling, C.X., Huang, J., Zhang, H.: AUC: a better measure than accuracy in comparing learning algorithms. In: Xiang, Y., Chaib-draa, B. (eds.) Canadian AI 2003. LNCS (LNAI), vol. 2671, pp. 329–341. Springer, Heidelberg (2003)
Marrocco, C., Duin, R.P.W., Tortorella, F.: Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition 41(6), 1961–1974 (2008)
Orriols-Puig, A., Bernadó-Mansilla, E.: Evolutionary rule-based systems for imbalanced data sets. Soft Comput. 13, 213–225 (2009)
Pérez, J.M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I.: A New Algorithm to Build Consolidated Trees: Study of the Error Rate and Steadiness. In: Advances in Soft Computing, Proc. of the International Intelligent Information Processing and Web Mining Conference (IIS: IIPWM'04), Zakopane, Poland, pp. 79–88 (2004)
Pérez, J.M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., Martín, J.I.: Combining multiple class distribution modified subsamples in a single tree. Pattern Recognition Letters 28(4), 414–422 (2007)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
Weiss, G.M., Provost, F.: Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19, 315–354 (2003)
Xu, L., Krzyzak, A., Suen, C.Y.: Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition. IEEE Transactions on Systems, Man and Cybernetics SMC-22(3), 418–435 (1992)
Cite this paper
Albisua, I., et al. (2010). Obtaining Optimal Class Distribution for Decision Trees: Comparative Analysis of CTC and C4.5. In: Meseguer, P., Mandow, L., Gasca, R.M. (eds.) Current Topics in Artificial Intelligence. CAEPIA 2009. Lecture Notes in Computer Science, vol. 5988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14264-2_11