
SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory

  • Regular Paper

Knowledge and Information Systems

Abstract

Imbalanced data is a common problem in classification, and it is growing in importance because it appears in most real-world domains. It is especially relevant for highly imbalanced data-sets, where the ratio between class sizes is high. Many techniques have been developed to tackle imbalanced training sets in supervised learning. They fall into two large groups: those at the algorithm level and those at the data level. The data-level techniques that have received the most attention try to balance the training set either by reducing the larger class through the elimination of samples (undersampling) or by enlarging the smaller class through the construction of new samples (oversampling). This paper proposes a new hybrid method for preprocessing imbalanced data-sets: new samples are constructed with the Synthetic Minority Oversampling Technique (SMOTE), and an editing technique based on Rough Set Theory and the lower approximation of a subset is applied. The proposed method has been validated in an experimental study that shows good results with C4.5 as the learning algorithm.
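The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity relation or any thresholds, so the Euclidean ball of size `radius`, the helper names `smote` and `in_lower_approximation`, and the toy data are all assumptions made for illustration. The sketch generates synthetic minority samples by SMOTE-style interpolation and then keeps only those whose similarity class contains no majority sample, which is one way to read "belongs to the lower approximation of the minority class".

```python
import numpy as np


def smote(minority, n_new, k=5, seed=None):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)
    # Pairwise Euclidean distances inside the minority class.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_new)
    mate = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[mate] - minority[base])


def in_lower_approximation(samples, majority, radius):
    """Assumed similarity relation: two objects are similar when their
    Euclidean distance is at most `radius`. A sample is treated as being in
    the lower approximation of the minority class when no majority object
    is similar to it."""
    samples = np.asarray(samples, dtype=float)
    majority = np.asarray(majority, dtype=float)
    dist = np.linalg.norm(samples[:, None, :] - majority[None, :, :], axis=2)
    return (dist > radius).all(axis=1)


# Toy data (hypothetical): four minority points on the unit square and a
# majority class with one point sitting in the middle of the minority region.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
majority = np.array([[0.5, 0.5], [3.0, 3.0], [3.5, 3.0], [3.0, 3.5],
                     [4.0, 4.0], [3.5, 4.0], [4.0, 3.5], [3.2, 3.8]])

synthetic = smote(minority, n_new=6, k=3, seed=0)
keep = in_lower_approximation(synthetic, majority, radius=0.3)
edited = synthetic[keep]  # synthetic samples that survive the editing step
```

In this reading, `edited` would be appended to the minority class before training a classifier such as C4.5. The choice of `radius` controls how aggressively synthetic samples near majority objects are discarded.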

Author information

Corresponding author

Correspondence to Francisco Herrera.

About this article

Cite this article

Ramentol, E., Caballero, Y., Bello, R. et al. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33, 245–265 (2012). https://doi.org/10.1007/s10115-011-0465-6

  • DOI: https://doi.org/10.1007/s10115-011-0465-6
