Abstract
In real-world problems solved with data mining techniques, it is very common to find data sets in which the number of examples of one class is much smaller than that of the remaining classes. Much work has been done on these so-called class imbalance problems. Most of it focuses on data resampling techniques that improve the training data, usually by balancing the classes, before a classical learning algorithm is applied. Another option is to modify the learning algorithm itself. As a mixture of these two options, we proposed the consolidation process, based on a prior resampling of the training data combined with a modification of the learning algorithm, in this study C4.5. In this work we experimented with 14 databases and compared the effectiveness of each strategy by the achieved AUC values. The results show that consolidation obtains the best performance compared with five well-known resampling methods, including SMOTE and some of its variants. Thus, the consolidation process combined with subsamples that balance the class distribution is appropriate for class imbalance problems requiring both explanation capacity and high discriminating power.
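The abstract relies on two ingredients that are standard in the imbalance literature: balancing the class distribution by subsampling the majority class, and evaluating classifiers by AUC rather than accuracy. The following is a minimal illustrative sketch of both (not the authors' consolidation algorithm; all function names are ours), using the Mann-Whitney rank formulation of AUC:

```python
import random

def balanced_subsample(examples, labels, seed=0):
    """Randomly undersample every class down to the size of the
    minority class, yielding a balanced training subsample
    (a simple form of the class-balancing subsampling discussed)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(v) for v in by_class.values())
    xs, ys = [], []
    for y, members in by_class.items():
        for x in rng.sample(members, n_min):
            xs.append(x)
            ys.append(y)
    return xs, ys

def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive
    example is scored above a randomly chosen negative one
    (ties count as half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

In the paper's setup, several such balanced subsamples would be drawn and fed to the (modified) learner; here only the single-subsample step is sketched. AUC equals 1.0 when every positive is ranked above every negative and 0.5 for a non-discriminating classifier, which is why it is preferred over accuracy on skewed class distributions.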
References
Albisua, I., Arbelaitz, O., Gurrutxaga, I., Martín, J.I., Muguerza, J., Pérez, J.M., Perona, I.: Obtaining optimal class distribution for decision trees: Comparative analysis of CTC and C4.5. In: Meseguer, P., Mandow, L., Gasca, R.M. (eds.) CAEPIA 2009. LNCS, vol. 5988, pp. 101–110. Springer, Heidelberg (2010)
Artís, M., Ayuso, M., Guillén, M.: Modelling different types of automobile insurance fraud behaviour in the Spanish market. Insurance: Mathematics and Economics 24, 67–81 (1999)
Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine (2007), http://www.ics.uci.edu/~mlearn/MLRepository.html
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6, 20–29 (2004)
Berry, M.J.A., Linoff, G.: Mastering Data Mining: The Art and Science of Customer Relationship Management. Wiley (2000)
Chan, P.K., Stolfo, S.J.: Toward Scalable Learning with Non-uniform Class and Cost Distributions: A Case Study in Credit Card Fraud Detection. In: Proc. of the 4th. Int. Conf. on Knowledge Discovery and Data Mining, pp. 164–168. AAAI Press, Menlo Park (1998)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002)
Demšar, J.: Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7, 1–30 (2006)
Elkan, C.: The Foundations of Cost-Sensitive Learning. In: Proceedings of the 17th. Int. Joint Conf. on Artificial Intelligence, pp. 973–978 (2001)
Estabrooks, A., Jo, T.J., Japkowicz, N.: A Multiple Resampling Method for Learning from Imbalanced Data Sets. Computational Intelligence 20(1), 18–36 (2004)
Wu, G., Chang, E.Y.: Class-Boundary Alignment for Imbalanced Dataset Learning. In: Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC (2003)
García, S., Herrera, F.: An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons. Journal of Machine Learning Research 9, 2677–2694 (2008)
García, S., Fernández, A., Herrera, F.: Enhancing the Effectiveness and Interpretability of Decision Tree and Rule Induction Classifiers with Evolutionary Training Set Selection over Imbalanced Problems. Applied Soft Computing 9, 1304–1314 (2009)
García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental Analysis of Power. Information Sciences 180, 2044–2064 (2010)
Han, H., Wang, W., Mao, B.: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In: Advances in Intelligent Computing, pp. 878–887 (2005)
Japkowicz, N., Stephen, S.: The Class Imbalance Problem: A Systematic Study. Intelligent Data Analysis Journal 6(5), 429–449 (2002)
Joshi, M., Kumar, V., Agarwal, R.: Evaluating Boosting Algorithms to Classify Rare Classes: Comparison and Improvements. In: First IEEE International Conference on Data Mining, San Jose, CA (2001)
Ling, C.X., Huang, J., Zhang, H.: AUC: A better measure than accuracy in comparing learning algorithms. In: Xiang, Y., Chaib-draa, B. (eds.) Canadian AI 2003. LNCS (LNAI), vol. 2671, pp. 329–341. Springer, Heidelberg (2003)
Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. Journal of Machine Learning Research 2, 139–154 (2001)
Marrocco, C., Duin, R.P.W., Tortorella, F.: Maximizing the area under the ROC curve by pairwise feature combination. Pattern Recognition 41(6), 1961–1974 (2008)
Orriols-Puig, A., Bernadó-Mansilla, E.: Evolutionary rule-based systems for imbalanced data sets. Soft Computing 13, 213–225 (2009)
Pérez, J.M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., Martín, J.I.: Consolidated Tree Classifier Learning in a Car Insurance Fraud Detection Domain with Class Imbalance. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686, pp. 381–389. Springer, Heidelberg (2005)
Pérez, J.M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., Martín, J.I.: Combining multiple class distribution modified subsamples in a single tree. Pattern Recognition Letters 28(4), 414–422 (2007)
Quinlan, J.R.: C4.5 Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Mateo (1993)
Weiss, G.M., Provost, F.: Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19, 315–354 (2003)
Wilson, D.R., Martínez, T.R.: Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning 38(3), 257–286 (2000)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J., Steinberg, D.: Top 10 algorithms in data mining. Knowledge and Information Systems 14, 1–37 (2008)
Yang, Q., Wu, X.: 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making 5(4), 597–604 (2006)
Zadrozny, B., Elkan, C.: Learning and Making Decisions When Costs and Probabilities are Both Unknown. In: Proceedings of the 7th. Int. Conf. on Knowledge Discovery and Data Mining, pp. 204–213 (2001)
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Albisua, I., Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M. (2011). C4.5 Consolidation Process: An Alternative to Intelligent Oversampling Methods in Class Imbalance Problems. In: Lozano, J.A., Gámez, J.A., Moreno, J.A. (eds) Advances in Artificial Intelligence. CAEPIA 2011. Lecture Notes in Computer Science(), vol 7023. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25274-7_8
Print ISBN: 978-3-642-25273-0
Online ISBN: 978-3-642-25274-7