Learning with Class Skews and Small Disjuncts

Prati, Ronaldo C.; Batista, Gustavo E. A. P. A.; Monard, Maria Carolina

doi:10.1007/978-3-540-28645-5_30

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3171)

Included in the following conference series:

Brazilian Symposium on Artificial Intelligence

Abstract

One of the main objectives of a Machine Learning (ML) system is to induce a classifier that minimizes classification errors. Two relevant questions in ML are which domain characteristics and which inducer limitations might increase misclassification. In this sense, this work analyzes two issues that might influence the performance of ML systems: class imbalance and error-prone small disjuncts. Our main objective is to investigate how these two aspects are related to each other. Aiming to overcome both problems, we analyzed the behavior of two over-sampling methods we have proposed, namely Smote + Tomek links and Smote + ENN. Our results suggest that these methods are effective for dealing with class imbalance and, in some cases, might help in ruling out some undesirable disjuncts. However, in some cases a simpler method, Random over-sampling, provides comparable results while requiring fewer computational resources.
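
As a rough illustration of two building blocks named in the abstract, the sketch below shows random over-sampling and the detection of Tomek links (pairs of mutually nearest examples with different labels). It is a minimal sketch, not the authors' implementation: the function names `random_oversample` and `tomek_links`, the squared Euclidean distance, and the toy data are all assumptions made for this example, and the SMOTE interpolation step and ENN cleaning are omitted.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are each other's nearest
    neighbour and carry different class labels."""
    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: sq_dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy one-dimensional data: four majority (0) and two minority (1) examples.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (3.4,), (10.0,)]
y = [0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y, seed=1)
print(sorted(Counter(y_bal).items()))  # [(0, 4), (1, 4)]
print(tomek_links(X, y))               # [(3, 4)]
```

In a cleaning step of the kind the abstract alludes to, examples taking part in a Tomek link would be candidates for removal after over-sampling. Which member of each pair is dropped is a design choice this sketch leaves open.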

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science at University of São Paulo, P. O. Box 668, ZIP Code 13560-970, São Carlos, SP, Brazil
Ronaldo C. Prati, Gustavo E. A. P. A. Batista & Maria Carolina Monard

Editor information

Editors and Affiliations

Instituto de Informática, UFRGS, Porto Alegre, RS, Brasil
Ana L. C. Bazzan
Intelligent Systems Laboratory LSI, Center of Technology, Federal University of Maranhão (UFMA), Bacanga Campus, 65080-040, São Luís, MA, Brazil
Sofiane Labidi

About this paper

Cite this paper

Prati, R.C., Batista, G.E.A.P.A., Monard, M.C. (2004). Learning with Class Skews and Small Disjuncts. In: Bazzan, A.L.C., Labidi, S. (eds) Advances in Artificial Intelligence – SBIA 2004. SBIA 2004. Lecture Notes in Computer Science, vol 3171. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28645-5_30

DOI: https://doi.org/10.1007/978-3-540-28645-5_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23237-7
Online ISBN: 978-3-540-28645-5
eBook Packages: Springer Book Archive
