A memetic approach for training set selection in imbalanced data sets

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Imbalanced data classification is a challenging problem in machine learning. It arises when data samples are unevenly distributed among the classes, so that classical classifiers are unsuitable for such datasets. To overcome this problem, in this paper the best training samples are selected from the data with the goal of improving classifier performance on imbalanced data. To this end, several heuristic methods are presented that use local information to decide whether each training sample should be removed or retained. These methods are then treated as local search algorithms and combined with a global search algorithm in a single framework to form memetic algorithms. The global search used in this paper is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recent metaheuristic for optimizing binary-encoded problems; it is employed because a highly stochastic, randomized search is needed for this problem. Six local search algorithms are proposed: three are application-oriented, designed specifically for this problem, and the rest are general; the best local search is then determined experimentally. Experiments are performed on 45 standard datasets, with G-mean and AUC as evaluation criteria. On these datasets, the best memetic approaches are compared with several popular state-of-the-art algorithms as well as a recently proposed memetic algorithm, and the results show their superiority. Finally, the performance of the proposed algorithm is evaluated with four different classifiers, and the best classifier to use with the method is determined.
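To make the framework concrete, the following is a minimal Python sketch of memetic training-set selection. It is an illustration, not the authors' implementation: the BQIGSA global step is replaced by simple random bit flips, and fitness() is a placeholder for the G-mean of a classifier trained on the selected subset.

```python
import numpy as np

# Minimal sketch of memetic training-set selection (illustration only).
# A binary mask over the training set is evolved by a global search and
# refined by a local search. The BQIGSA global step is replaced here by
# random bit flips, and fitness() is a stand-in for the G-mean of a
# classifier trained on the selected subset.

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Placeholder: a real fitness would train a classifier on X[mask == 1]
    # and return its G-mean on held-out data.
    return mask.mean() if mask.any() else 0.0

def local_search(mask, X, y):
    # Single-bit-flip hill climbing; one of the paper's six local searches
    # would plug in here.
    best = fitness(mask, X, y)
    for i in rng.permutation(mask.size):
        mask[i] ^= 1
        f = fitness(mask, X, y)
        if f > best:
            best = f          # keep the improving flip
        else:
            mask[i] ^= 1      # revert the flip
    return mask

def memetic_select(X, y, pop=10, iters=20, p_flip=0.05):
    masks = rng.integers(0, 2, size=(pop, len(y)))
    for _ in range(iters):
        masks ^= rng.random(masks.shape) < p_flip                 # global step
        masks = np.array([local_search(m, X, y) for m in masks])  # local step
    scores = [fitness(m, X, y) for m in masks]
    return masks[int(np.argmax(scores))]                          # best mask
```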


References

  1. Singh PK (2017) Three-way fuzzy concept lattice representation using neutrosophic set. Int J Mach Learn Cybern 8(1):69–79

  2. Cieslak DA, Chawla NV, Striegel A (2006) Combating imbalance in network intrusion datasets. In: IEEE international conference on granular computing. https://doi.org/10.1109/GRC.2006.1635905

  3. Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Mach Learn 30(2–3):195–215

  4. Zhang D, Islam MM, Lu G (2012) A review on automatic image annotation techniques. Pattern Recogn 45(1):346–362

  5. Pednault EP, Rosen BK, Apte C (2000) Handling imbalanced data sets in insurance risk modeling. IBM TJ Watson Research Center, Yorktown Heights, New York

  6. Yu H, Ni J, Zhao J (2013) ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101:309–318

  7. Nezamabadi-pour H (2015) A quantum-inspired gravitational search algorithm for binary encoded optimization problems. Eng Appl Artif Intell 40:62–75

  8. Moscato P (1999) Memetic algorithms: a short introduction. New ideas in optimization. McGraw-Hill, Washington

  9. García S, Cano JR, Herrera F (2008) A memetic algorithm for evolutionary prototype selection: a scaling up approach. Pattern Recogn 41(8):2693–2709

  10. Ong YS, Lim MH, Zhu N, Wong KW (2006) Classification of adaptive memetic algorithms: a comparative study. IEEE Trans Syst Man Cybern Part B 36(1):141–152

  11. Chen X, Ong YS, Lim MH, Tan KC (2011) A multi-facet survey on memetic computation. IEEE Trans Evol Comput 15(5):591–607

  12. Grzymala-Busse JW, Stefanowski J, Wilk S (2004) A comparison of two approaches to data mining from imbalanced data. In: Negoita MG, Howlett RJ, Jain LC (eds) Knowledge-based intelligent information and engineering systems. KES 2004. Lecture notes in computer science, vol 3213. Springer, Berlin, Heidelberg

  13. Krawczyk B, Woźniak M (2015) Cost-sensitive neural network with ROC-based moving threshold for imbalanced classification. In: International conference on intelligent data engineering and automated learning. Springer, New York

  14. Yang C-Y, Yang J-S, Wang J-J (2009) Margin calibration in SVM class-imbalanced learning. Neurocomputing 73(1–3):397–411. https://doi.org/10.1016/j.neucom.2009.08.006

  15. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: KDD '99 proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, California, USA, 15–18 Aug 1999, pp 155–164. https://doi.org/10.1145/312129.312220

  16. Elkan C (2001) The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence. Lawrence Erlbaum Associates Ltd

  17. Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng 14(3):659–665

  18. Zhou Z-H, Liu X-Y (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77

  19. Saryazdi S, Nikpour B, Nezamabadi-Pour H (2017) NPC: Neighbors’ progressive competition algorithm for classification of imbalanced data sets. In: 2017 3rd Iranian conference on intelligent systems and signal processing (ICSPIS). IEEE, Shahrood, Iran

  20. Gao M et al (2011) A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 74(17):3456–3466

  21. Lin S-C, Yuan-chin IC, Yang W-N (2009) Meta-learning for imbalanced data and classification ensemble in binary classification. Neurocomputing 73(1):484–494

  22. Galar M et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C 42(4):463–484

  23. Jian C, Gao J, Ao Y (2016) A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing 193:115–122. https://doi.org/10.1016/j.neucom.2016.02.006

  24. Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn 45(10):3738–3750

  25. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6:769–772

  26. Hart P (1968) The condensed nearest neighbor rule (Corresp). IEEE Trans Inf Theory 14(3):515–516. https://doi.org/10.1109/TIT.1968.1054155

  27. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML. Nashville, USA

  28. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 2(3):408–421. https://doi.org/10.1109/TSMC.1972.4309137

  29. Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Quaglini S, Barahona P, Andreassen S (eds) Artificial intelligence in medicine. Lecture notes in computer science, vol 2101. Springer, Berlin, pp 63–66

  30. Yoon K, Kwek S (2005) An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: Fifth international conference on hybrid intelligent systems (HIS'05). IEEE, Rio de Janeiro, Brazil

  31. Ghazikhani A, Yazdi HS, Monsefi R (2012) Class imbalance handling using wrapper-based random oversampling. In: 20th Iranian conference on electrical engineering (ICEE 2012). IEEE, Tehran, Iran

  32. Chen S, He H, Garcia EA (2010) RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 21(10):1624–1642. https://doi.org/10.1109/TNN.2010.2066988

  33. He H et al (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence). IEEE, Hong Kong, China

  34. Chawla NV et al (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16(1):321–357

  35. Hu S et al (2009) MSMOTE: improving classification performance when training data is imbalanced. In: 2009 Second international workshop on computer science and engineering. IEEE, Qingdao, China

  36. Barua S et al (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425

  37. Gao M et al (2014) PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138:248–259

  38. Ramentol E et al (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33(2):245–265

  39. Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29. https://doi.org/10.1145/1007730.1007735

  40. Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang DS, Zhang XP, Huang GB (eds) Advances in intelligent computing. ICIC 2005. Lecture notes in computer science, vol 3644. Springer, Berlin, Heidelberg

  41. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho TB (eds) Advances in knowledge discovery and data mining. Lecture notes in computer science, vol 5476. Springer, Berlin

  42. Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41

  43. Vluymans S et al (2016) EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data. Neurocomputing 216:596–610

  44. García S, Herrera F (2009) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306

  45. García S, Fernández A, Herrera F (2009) Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl Soft Comput 9(4):1304–1314

  46. García S et al (2012) Evolutionary-based selection of generalized instances for imbalanced classification. Knowl Based Syst 25(1):3–12

  47. Galar M et al (2013) EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn 46(12):3460–3471

  48. Lim P, Goh CK, Tan KC (2016) Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning. IEEE Trans Cybern 47(9):2850–2861

  49. Li J et al (2016) Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J Supercomput 72(10):3708–3728

  50. Fernández A et al (2017) A Pareto-based ensemble with feature and instance selection for learning from multi-class imbalanced datasets. Int J Neural Syst. https://doi.org/10.1142/S0129065717500289

  51. Nikpour B, Nezamabadi-pour H (2018) HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J Comput Sci 1(2):109–128

  52. Krasnogor N, Smith J (2005) A tutorial for competent memetic algorithms: model, taxonomy, and design issues. IEEE Trans Evol Comput 9(5):474–488

  53. Chen X et al (2011) A multi-facet survey on memetic computation. IEEE Trans Evol Comput 15(5):591–607

  54. Kannan SS, Ramaraj N (2010) A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowl Based Syst 23(6):580–585

  55. Lee J, Kim D-W (2015) Memetic feature selection algorithm for multi-label classification. Inf Sci 293:80–96

  56. Cano A, Zafra A, Ventura S (2013) Weighted data gravitation classification for standard and imbalanced data. IEEE Trans Cybern 43(6):1672–1687

  57. Peng L et al (2014) A new approach for imbalanced data classification based on data gravitation. Inf Sci 288:347–373

  58. Zhu Y, Wang Z, Gao D (2015) Gravitational fixed radius nearest neighbor for imbalanced problem. Knowl Based Syst 90:224–238

  59. Nikpour B, Shabani M, Nezamabadi-pour H (2017) Proposing new method to improve gravitational fixed nearest neighbor algorithm for imbalanced data classification. In: 2nd conference on swarm intelligence and evolutionary computation (CSIEC), Kerman, Iran, 7–9 Mar 2017. IEEE. https://doi.org/10.1109/CSIEC.2017.7940167

  60. Shabani-kordshooli M, Nikpour B, Nezamabadi-pour H (2017) An improvement to gravitational fixed radius nearest neighbor for imbalanced problem. In: Artificial intelligence and signal processing conference (AISP). IEEE

  61. Nezamabadi-pour H (2015) A quantum-inspired gravitational search algorithm for binary encoded optimization problems. Eng Appl Artif Intell 40:62–75

  62. Nielsen MA, Chuang IL (2000) Quantum computation and quantum information. Cambridge University Press, Cambridge

  63. Zhang G (2011) Quantum-inspired evolutionary algorithms: a survey and empirical study. J Heuristics 17(3):303–351

  64. Meng K, Wang HG, Dong ZY, Wong KP (2010) Quantum-inspired particle swarm optimization for valve-point economic load dispatch. IEEE Trans Power Syst 25(1):215–222. https://doi.org/10.1109/TPWRS.2009.2030359

  65. Hoffmeister F, Bäck T (1990) Genetic algorithms and evolution strategies: similarities and differences. In: International conference on parallel problem solving from nature. Springer, New York

  66. Mardani S (2014) A hyper-heuristic algorithm using fuzzy controller for feature selection. Master thesis, Electrical Engineering Department, Shahid Bahonar University of Kerman

  67. Bhowmik P et al (2010) A new differential evolution with improved mutation strategy. In: IEEE congress on evolutionary computation. IEEE, Barcelona, Spain

  68. García S et al (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064

  69. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70

  70. López V et al (2013) An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

  71. Yu D-J et al (2013) Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing 104:180–190

Author information

Correspondence to Hossein Nezamabadi-pour.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The methods selected for our comparison work as follows:

RUS Random under-sampling removes samples of the majority class at random until the classes are balanced [71].
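As a rough sketch (assuming binary labels with 0 = majority and 1 = minority; not code from [71]):

```python
import numpy as np

def random_undersample(X, y, rng=np.random.default_rng(0)):
    """RUS sketch: drop random majority samples until classes balance."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    keep_maj = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep_maj, mino])
    return X[idx], y[idx]
```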

NCL This is an under-sampling method in which the three nearest neighbors of each sample are found. If the sample belongs to the majority class and is misclassified by its three nearest neighbors, it is removed. If the sample belongs to the minority class and its three nearest neighbors classify it wrongly, the neighbors belonging to the majority class are eliminated [29].
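A brute-force sketch of this rule (Euclidean distances, labels 0 = majority and 1 = minority; an illustration, not the reference implementation):

```python
import numpy as np

def ncl(X, y, k=3):
    """Neighborhood Cleaning Rule sketch (0 = majority, 1 = minority)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each sample
    drop = set()
    for i in range(len(y)):
        predicted = round(y[nn[i]].mean()) # majority vote of the k neighbors
        if predicted != y[i]:              # sample i is misclassified
            if y[i] == 0:
                drop.add(i)                # remove the majority sample itself
            else:                          # remove its majority-class neighbors
                drop.update(int(j) for j in nn[i] if y[j] == 0)
    keep = [i for i in range(len(y)) if i not in drop]
    return X[keep], y[keep]
```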

OSS This method is used for under-sampling. First, a set C is created containing all minority class samples and one randomly selected majority class sample. Then, the original training set is classified with the 1-nearest-neighbor rule using C, and all misclassified samples are transferred to C. Finally, majority class samples participating in Tomek links are removed from C, which eliminates borderline and noisy samples [27].
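A compact sketch of these three steps (Euclidean distances, 0 = majority and 1 = minority; illustration only):

```python
import numpy as np

def one_sided_selection(X, y, rng=np.random.default_rng(0)):
    """OSS sketch: 1-NN condensation followed by Tomek-link cleaning."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: C = all minority samples plus one random majority sample.
    C = [int(i) for i in np.flatnonzero(y == 1)]
    C.append(int(rng.choice(np.flatnonzero(y == 0))))
    # Step 2: 1-NN classify the training set with C; move errors into C.
    for i in range(len(y)):
        if i in C:
            continue
        nearest = min(C, key=lambda j: d[i, j])
        if y[nearest] != y[i]:
            C.append(i)
    # Step 3: drop majority samples of C that take part in a Tomek link.
    Cs = set(C)
    keep = []
    for i in C:
        nn_i = min(Cs - {i}, key=lambda j: d[i, j])  # i's nearest neighbor in C
        is_tomek = (y[nn_i] != y[i]
                    and min(Cs - {nn_i}, key=lambda j: d[nn_i, j]) == i)
        if not (is_tomek and y[i] == 0):
            keep.append(i)
    return X[keep], y[keep]
```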

ROS Random over-sampling replicates samples of the minority class at random until the classes are balanced [31].
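The mirror image of RUS, sketched under the same label convention:

```python
import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    """ROS sketch: replicate random minority samples until classes balance."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    extra = rng.choice(mino, size=len(maj) - len(mino), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]
```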

SMOTE This is an over-sampling method which first selects a minority class sample, x, at random and finds its k nearest neighbors using the Euclidean distance. Then, a minority class sample, y, is selected at random from these neighbors. Finally, a new sample, S, is generated using the following equation:

$$S = x + r \times \left( {y - x} \right)$$

where $r$ is a random number in the range [0, 1]. This process is repeated until the desired number of minority samples is reached [34].
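The interpolation above translates almost line for line into code (Euclidean k-NN among minority samples only; a sketch that assumes more than k minority samples):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate n_new synthetic samples from minority data X_min."""
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # random minority sample x
        j = rng.choice(nn[i])              # random neighbor y among its k-NN
        r = rng.random()                   # r drawn uniformly from [0, 1]
        synth.append(X_min[i] + r * (X_min[j] - X_min[i]))  # S = x + r(y - x)
    return np.array(synth)
```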

SMOTE + TL This is a hybrid method in which, after the minority class has been expanded using the SMOTE algorithm, Tomek links are applied to remove redundant samples from both the majority and minority classes in order to avoid overfitting [39].
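A Tomek link is a pair of opposite-class samples that are each other's nearest neighbor; detecting them is short (a sketch, same conventions as above). SMOTE + TL would then drop every index appearing in the returned pairs:

```python
import numpy as np

def tomek_links(X, y):
    """Return pairs (a, b) of opposite-class mutual nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # nearest neighbor of each sample
    return [(a, int(b)) for a, b in enumerate(nn)
            if y[a] != y[b] and nn[b] == a and a < int(b)]
```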

SMOTE + ENN Like SMOTE + TL, this method removes redundant samples after the minority class has been expanded by SMOTE, but it uses ENN instead of Tomek links for the elimination. ENN throws out the samples that are misclassified by their three nearest neighbors [28].
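ENN itself is a one-step filter, sketched here with the 3-NN majority vote described above:

```python
import numpy as np

def enn(X, y, k=3):
    """Edited Nearest Neighbours sketch: drop misclassified samples."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each sample
    keep = [i for i in range(len(y)) if round(y[nn[i]].mean()) == y[i]]
    return X[keep], y[keep]
```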

SMOTE + PSO This method tunes the parameters of the SMOTE algorithm dynamically by giving them as inputs to PSO, which optimizes them. For a fair comparison, the fitness function is set to the one used in our proposed method [49].
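For intuition, a generic PSO loop over two SMOTE parameters might look as follows. The evaluate() function is a placeholder for the actual fitness (the G-mean of a classifier trained on the resampled data), and the bounds and coefficients are illustrative assumptions, not the setup of [49]:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(k, rate):
    # Placeholder: resample with SMOTE(k, rate), train a classifier,
    # and return its G-mean. A toy quadratic stands in here.
    return -((k - 5) ** 2 + (rate - 2.0) ** 2)

def pso_tune_smote(n=20, iters=50, w=0.7, c1=1.5, c2=1.5):
    lo, hi = np.array([1.0, 0.5]), np.array([15.0, 5.0])  # bounds on (k, rate)
    x = rng.uniform(lo, hi, size=(n, 2))                  # particle positions
    v = np.zeros_like(x)                                  # particle velocities
    pbest = x.copy()
    pbest_f = np.array([evaluate(int(p[0]), p[1]) for p in x])
    gbest = pbest[pbest_f.argmax()].copy()                # global best position
    for _ in range(iters):
        r1, r2 = rng.random((n, 2)), rng.random((n, 2))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([evaluate(int(p[0]), p[1]) for p in x])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmax()].copy()
    return int(gbest[0]), float(gbest[1])                 # tuned (k, rate)
```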

Cite this article

Nikpour, B., Nezamabadi-pour, H. A memetic approach for training set selection in imbalanced data sets. Int. J. Mach. Learn. & Cyber. 10, 3043–3070 (2019). https://doi.org/10.1007/s13042-019-01000-w
