SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory

Ramentol, Enislay; Caballero, Yailé; Bello, Rafael; Herrera, Francisco

doi:10.1007/s10115-011-0465-6

Regular Paper
Published: 04 December 2011

Volume 33, pages 245–265 (2012)

Knowledge and Information Systems

Enislay Ramentol¹,
Yailé Caballero¹,
Rafael Bello² &
…
Francisco Herrera³

Abstract

Imbalanced data is a common problem in classification, and it is growing in importance because it appears in most real-world domains. It is especially relevant for highly imbalanced data-sets, where the ratio between classes is high. Many techniques have been developed to tackle imbalanced training sets in supervised learning, and they fall into two large groups: those at the algorithm level and those at the data level. The most prominent data-level techniques balance the training set either by reducing the larger class through the elimination of samples (undersampling) or by enlarging the smaller class through the construction of new samples (oversampling). This paper proposes a new hybrid preprocessing method for imbalanced data-sets that constructs new samples with the Synthetic Minority Oversampling Technique (SMOTE) and then applies an editing technique based on rough set theory and the lower approximation of a subset. The proposed method has been validated by an experimental study showing good results using C4.5 as the learning algorithm.
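
The two stages named in the abstract, SMOTE oversampling followed by editing with a rough-set lower approximation, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the similarity relation, the threshold, or how many samples are generated, so the similarity measure (1 minus the mean absolute feature difference, with features assumed scaled to [0, 1]), the `threshold` value, and the function names `smote`, `in_lower_approximation` and `smote_then_edit` are all assumptions made for illustration. A synthetic sample is treated as belonging to the lower approximation of the minority class when no majority sample is similar to it.

```python
import numpy as np


def smote(minority, n_new, k=5, seed=None):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    # Pairwise Euclidean distances inside the minority class.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base sample and one of its k neighbours for every new point.
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])


def in_lower_approximation(synthetic, majority, threshold=0.9):
    """Return True where no majority sample is similar to the synthetic one."""
    synthetic = np.asarray(synthetic, dtype=float)
    majority = np.asarray(majority, dtype=float)
    # Assumed similarity: 1 - mean absolute difference, features in [0, 1].
    diff = np.abs(synthetic[:, None, :] - majority[None, :, :])
    similarity = 1.0 - diff.mean(axis=2)
    return ~(similarity >= threshold).any(axis=1)


def smote_then_edit(minority, majority, threshold=0.9, k=5, seed=None):
    """Oversample until the classes balance, then drop synthetic samples
    that fall outside the lower approximation of the minority class."""
    n_new = max(len(majority) - len(minority), 0)
    synthetic = smote(minority, n_new, k=k, seed=seed)
    keep = in_lower_approximation(synthetic, majority, threshold)
    return synthetic[keep]
```

In this sketch, lowering `threshold` makes the editing step stricter, because more majority samples count as similar and more synthetic samples are discarded. Only the retained synthetic samples would be appended to the original training set before training a classifier such as C4.5.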

Author information

Authors and Affiliations

Department of Computer Science, University of Camagüey, Camagüey, Cuba
Enislay Ramentol & Yailé Caballero
Department of Computer Science, Universidad Central de Las Villas, Santa Clara, Cuba
Rafael Bello
Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, 18071, Granada, Spain
Francisco Herrera

Corresponding author

Correspondence to Francisco Herrera.

About this article

Cite this article

Ramentol, E., Caballero, Y., Bello, R. et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33, 245–265 (2012). https://doi.org/10.1007/s10115-011-0465-6

Received: 23 December 2009
Revised: 08 September 2011
Accepted: 17 November 2011
Published: 04 December 2011
Issue Date: November 2012
DOI: https://doi.org/10.1007/s10115-011-0465-6
