A memetic approach for training set selection in imbalanced data sets

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Imbalanced data classification is a challenging problem in machine learning. It arises when data samples are unevenly distributed among the classes, so that classical classifiers are unsuitable for such datasets. To overcome this problem, in this paper the best training samples are selected from the data with the goal of improving classifier performance on imbalanced data. To this end, several heuristic methods are presented that use local information to decide whether each training sample should be removed or retained. These methods are then treated as local search algorithms and combined with a global search algorithm in a single framework to form memetic algorithms. The global search used in this paper is the binary quantum-inspired gravitational search algorithm (BQIGSA), a recent metaheuristic for optimizing binary-encoded problems; it is employed because a highly stochastic, randomized search is needed for this problem. Six local search algorithms are proposed: three are application-oriented, designed specifically for this problem, and the rest are general; the best local search is then determined experimentally. Experiments are performed on 45 standard datasets, with G-mean and AUC as evaluation criteria. On these datasets, the best memetic approaches are compared with several popular state-of-the-art algorithms as well as a recently proposed memetic algorithm, and the results show their superiority. Finally, the performance of the proposed algorithm is evaluated with four different classifiers, and the best classifier to use with the method is determined.
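To make the framework concrete, the following is a minimal Python sketch of memetic training-set selection. It is an illustration, not the authors' implementation: the BQIGSA global step is replaced by simple random bit flips, and fitness() is a placeholder for the G-mean of a classifier trained on the selected subset.

```python
import numpy as np

# Minimal sketch of memetic training-set selection (illustration only).
# A binary mask over the training set is evolved by a global search and
# refined by a local search. The BQIGSA global step is replaced here by
# random bit flips, and fitness() is a stand-in for the G-mean of a
# classifier trained on the selected subset.

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Placeholder: a real fitness would train a classifier on X[mask == 1]
    # and return its G-mean on held-out data.
    return mask.mean() if mask.any() else 0.0

def local_search(mask, X, y):
    # Single-bit-flip hill climbing; one of the paper's six local searches
    # would plug in here.
    best = fitness(mask, X, y)
    for i in rng.permutation(mask.size):
        mask[i] ^= 1
        f = fitness(mask, X, y)
        if f > best:
            best = f          # keep the improving flip
        else:
            mask[i] ^= 1      # revert the flip
    return mask

def memetic_select(X, y, pop=10, iters=20, p_flip=0.05):
    masks = rng.integers(0, 2, size=(pop, len(y)))
    for _ in range(iters):
        masks ^= rng.random(masks.shape) < p_flip                 # global step
        masks = np.array([local_search(m, X, y) for m in masks])  # local step
    scores = [fitness(m, X, y) for m in masks]
    return masks[int(np.argmax(scores))]                          # best mask
```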


References

  1. Singh PK (2017) Three-way fuzzy concept lattice representation using neutrosophic set. Int J Mach Learn Cybern 8(1):69–79

  2. Cieslak DA, Chawla NV, Striegel A (2006) Combating imbalance in network intrusion datasets. In: IEEE international conference on granular computing. https://doi.org/10.1109/GRC.2006.1635905

  3. Kubat M, Holte RC, Matwin S (1998) Machine learning for the detection of oil spills in satellite radar images. Mach Learn 30(2–3):195–215

  4. Zhang D, Islam MM, Lu G (2012) A review on automatic image annotation techniques. Pattern Recogn 45(1):346–362

  5. Pednault EP, Rosen BK, Apte C (2000) Handling imbalanced data sets in insurance risk modeling. IBM TJ Watson Research Center, Yorktown Heights, New York

  6. Yu H, Ni J, Zhao J (2013) ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101:309–318

  7. Nezamabadi-pour H (2015) A quantum-inspired gravitational search algorithm for binary encoded optimization problems. Eng Appl Artif Intell 40:62–75

  8. Moscato P (1999) Memetic algorithms: a short introduction. New ideas in optimization. McGraw-Hill, Washington

  9. García S, Cano JR, Herrera F (2008) A memetic algorithm for evolutionary prototype selection: a scaling up approach. Pattern Recogn 41(8):2693–2709

  10. Ong YS, Lim MH, Zhu N, Wong KW (2006) Classification of adaptive memetic algorithms: a comparative study. IEEE Trans Syst Man Cybern Part B 36(1):141–152

  11. Chen X, Ong YS, Lim MH, Tan KC (2011) A multi-facet survey on memetic computation. IEEE Trans Evol Comput 15(5):591–607

  12. Grzymala-Busse JW, Stefanowski J, Wilk S (2004) A comparison of two approaches to data mining from imbalanced data. In: Negoita MG, Howlett RJ, Jain LC (eds) Knowledge-based intelligent information and engineering systems. KES 2004. Lecture notes in computer science, vol 3213. Springer, Berlin, Heidelberg

  13. Krawczyk B, Woźniak M (2015) Cost-sensitive neural network with ROC-based moving threshold for imbalanced classification. In: International conference on intelligent data engineering and automated learning. Springer, New York

  14. Yang C-Y, Yang J-S, Wang J-J (2009) Margin calibration in SVM class-imbalanced learning. Neurocomputing 73(1–3):397–411. https://doi.org/10.1016/j.neucom.2009.08.006

  15. Domingos P (1999) MetaCost: a general method for making classifiers cost-sensitive. In: KDD '99 proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, California, USA, 15–18 Aug 1999, pp 155–164. https://doi.org/10.1145/312129.312220

  16. Elkan C (2001) The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence. Lawrence Erlbaum Associates Ltd

  17. Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng 14(3):659–665

  18. Zhou Z-H, Liu X-Y (2006) Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng 18(1):63–77

  19. Saryazdi S, Nikpour B, Nezamabadi-Pour H (2017) NPC: Neighbors’ progressive competition algorithm for classification of imbalanced data sets. In: 2017 3rd Iranian conference on intelligent systems and signal processing (ICSPIS). IEEE, Shahrood, Iran

  20. Gao M et al (2011) A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 74(17):3456–3466

  21. Lin S-C, Yuan-chin IC, Yang W-N (2009) Meta-learning for imbalanced data and classification ensemble in binary classification. Neurocomputing 73(1):484–494

  22. Galar M et al (2012) A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C 42(4):463–484

  23. Jian C, Gao J, Ao Y (2016) A new sampling method for classifying imbalanced data based on support vector machine ensemble. Neurocomputing 193:115–122. https://doi.org/10.1016/j.neucom.2016.02.006

  24. Tahir MA, Kittler J, Yan F (2012) Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recogn 45(10):3738–3750

  25. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6:769–772

  26. Hart P (1968) The condensed nearest neighbor rule (Corresp). IEEE Trans Inf Theory 14(3):515–516. https://doi.org/10.1109/TIT.1968.1054155

  27. Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: ICML. Nashville, USA

  28. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 2(3):408–421. https://doi.org/10.1109/TSMC.1972.4309137

  29. Laurikkala J (2001) Improving identification of difficult small classes by balancing class distribution. In: Quaglini S, Barahona P, Andreassen S (eds) Artificial intelligence in medicine. Lecture notes in computer science, vol 2101. Springer, Berlin, pp 63–66

  30. Yoon K, Kwek S (2005) An unsupervised learning approach to resolving the data imbalanced issue in supervised learning problems in functional genomics. In: Fifth international conference on hybrid intelligent systems (HIS'05). IEEE, Rio de Janeiro, Brazil

  31. Ghazikhani A, Yazdi HS, Monsefi R (2012) Class imbalance handling using wrapper-based random oversampling. In: 20th Iranian conference on electrical engineering (ICEE 2012). IEEE, Tehran, Iran

  32. Chen S, He H, Garcia EA (2010) RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw 21(10):1624–1642. https://doi.org/10.1109/TNN.2010.2066988

  33. He H et al (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence). IEEE, Hong Kong, China

  34. Chawla NV et al (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16(1):321–357

  35. Hu S et al (2009) MSMOTE: improving classification performance when training data is imbalanced. In: 2009 Second international workshop on computer science and engineering. IEEE, Qingdao, China

  36. Barua S et al (2014) MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425

  37. Gao M et al (2014) PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 138:248–259

  38. Ramentol E et al (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33(2):245–265

  39. Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29. https://doi.org/10.1145/1007730.1007735

  40. Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang DS, Zhang XP, Huang GB (eds) Advances in intelligent computing. ICIC 2005. Lecture notes in computer science, vol 3644. Springer, Berlin, Heidelberg

  41. Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-level-smote: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho TB (eds) Advances in knowledge discovery and data mining. Lecture notes in computer science, vol 5476. Springer, Berlin

  42. Cateni S, Colla V, Vannucci M (2014) A method for resampling imbalanced datasets in binary classification tasks for real-world problems. Neurocomputing 135:32–41

  43. Vluymans S et al (2016) EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data. Neurocomputing 216:596–610

  44. García S, Herrera F (2009) Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy. Evol Comput 17(3):275–306

  45. García S, Fernández A, Herrera F (2009) Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Appl Soft Comput 9(4):1304–1314

  46. García S et al (2012) Evolutionary-based selection of generalized instances for imbalanced classification. Knowl Based Syst 25(1):3–12

  47. Galar M et al (2013) EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn 46(12):3460–3471

  48. Lim P, Goh CK, Tan KC (2016) Evolutionary cluster-based synthetic oversampling ensemble (eco-ensemble) for imbalance learning. IEEE Trans Cybern 47(9):2850–2861

  49. Li J et al (2016) Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J Supercomput 72(10):3708–3728

  50. Fernández A et al (2017) A Pareto-based ensemble with feature and instance selection for learning from multi-class imbalanced datasets. Int J Neural Syst. https://doi.org/10.1142/S0129065717500289

  51. Nikpour B, Nezamabadi-pour H (2018) HTSS: a hyper-heuristic training set selection method for imbalanced data sets. Iran J Comput Sci 1(2):109–128

  52. Krasnogor N, Smith J (2005) A tutorial for competent memetic algorithms: model, taxonomy, and design issues. IEEE Trans Evol Comput 9(5):474–488

  53. Chen X et al (2011) A multi-facet survey on memetic computation. IEEE Trans Evol Comput 15(5):591–607

  54. Kannan SS, Ramaraj N (2010) A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowl Based Syst 23(6):580–585

  55. Lee J, Kim D-W (2015) Memetic feature selection algorithm for multi-label classification. Inf Sci 293:80–96

  56. Cano A, Zafra A, Ventura S (2013) Weighted data gravitation classification for standard and imbalanced data. IEEE Trans Cybern 43(6):1672–1687

  57. Peng L et al (2014) A new approach for imbalanced data classification based on data gravitation. Inf Sci 288:347–373

  58. Zhu Y, Wang Z, Gao D (2015) Gravitational fixed radius nearest neighbor for imbalanced problem. Knowl Based Syst 90:224–238

  59. Nikpour B, Shabani M, Nezamabadi-pour H (2017) Proposing new method to improve gravitational fixed nearest neighbor algorithm for imbalanced data classification. In: 2nd conference on swarm intelligence and evolutionary computation (CSIEC), Kerman, Iran, 7–9 Mar 2017. IEEE. https://doi.org/10.1109/CSIEC.2017.7940167

  60. Shabani-kordshooli M, Nikpour B, Nezamabadi-pour H (2017) An improvement to gravitational fixed radius nearest neighbor for imbalanced problem. In: Artificial intelligence and signal processing conference (AISP). IEEE

  61. Nezamabadi-pour H (2015) A quantum-inspired gravitational search algorithm for binary encoded optimization problems. Eng Appl Artif Intell 40:62–75

  62. Nielsen MA, Chuang IL (2000) Quantum computation and quantum information. Cambridge University Press, Cambridge

  63. Zhang G (2011) Quantum-inspired evolutionary algorithms: a survey and empirical study. J Heuristics 17(3):303–351

  64. Meng K, Wang HG, Dong ZY, Wong KP (2010) Quantum-inspired particle swarm optimization for valve-point economic load dispatch. IEEE Trans Power Syst 25(1):215–222. https://doi.org/10.1109/TPWRS.2009.2030359

  65. Hoffmeister F, Bäck T (1990) Genetic algorithms and evolution strategies: similarities and differences. In: International conference on parallel problem solving from nature. Springer, New York

  66. Mardani S (2014) A hyper-heuristic algorithm using fuzzy controller for feature selection. Master thesis, Electrical Engineering Department, Shahid Bahonar University of Kerman

  67. Bhowmik P et al (2010) A new differential evolution with improved mutation strategy. In: IEEE congress on evolutionary computation. IEEE, Barcelona, Spain

  68. García S et al (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064

  69. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70

  70. López V et al (2013) An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

  71. Yu D-J et al (2013) Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing 104:180–190

Author information

Correspondence to Hossein Nezamabadi-pour.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

The methods selected for our comparison work as follows:

RUS Random under-sampling removes samples of the majority class at random until the classes are balanced [71].
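As a rough sketch (assuming binary labels with 0 = majority and 1 = minority; not code from [71]):

```python
import numpy as np

def random_undersample(X, y, rng=np.random.default_rng(0)):
    """RUS sketch: drop random majority samples until classes balance."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    keep_maj = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep_maj, mino])
    return X[idx], y[idx]
```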

NCL This is an under-sampling method in which the three nearest neighbors of each sample are found. If the sample belongs to the majority class and is misclassified by its three nearest neighbors, it is removed. If the sample belongs to the minority class and its three nearest neighbors classify it wrongly, the neighbors belonging to the majority class are eliminated [29].
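A brute-force sketch of this rule (Euclidean distances, labels 0 = majority and 1 = minority; an illustration, not the reference implementation):

```python
import numpy as np

def ncl(X, y, k=3):
    """Neighborhood Cleaning Rule sketch (0 = majority, 1 = minority)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each sample
    drop = set()
    for i in range(len(y)):
        predicted = round(y[nn[i]].mean()) # majority vote of the k neighbors
        if predicted != y[i]:              # sample i is misclassified
            if y[i] == 0:
                drop.add(i)                # remove the majority sample itself
            else:                          # remove its majority-class neighbors
                drop.update(int(j) for j in nn[i] if y[j] == 0)
    keep = [i for i in range(len(y)) if i not in drop]
    return X[keep], y[keep]
```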

OSS This method is used for under-sampling. First, a set C is created containing all minority class samples and one randomly selected majority class sample. Then, the original training set is classified with the 1-nearest-neighbor rule using C, and all misclassified samples are transferred to C. Finally, majority class samples participating in Tomek links are removed from C, which eliminates borderline and noisy samples [27].
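A compact sketch of these three steps (Euclidean distances, 0 = majority and 1 = minority; illustration only):

```python
import numpy as np

def one_sided_selection(X, y, rng=np.random.default_rng(0)):
    """OSS sketch: 1-NN condensation followed by Tomek-link cleaning."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: C = all minority samples plus one random majority sample.
    C = [int(i) for i in np.flatnonzero(y == 1)]
    C.append(int(rng.choice(np.flatnonzero(y == 0))))
    # Step 2: 1-NN classify the training set with C; move errors into C.
    for i in range(len(y)):
        if i in C:
            continue
        nearest = min(C, key=lambda j: d[i, j])
        if y[nearest] != y[i]:
            C.append(i)
    # Step 3: drop majority samples of C that take part in a Tomek link.
    Cs = set(C)
    keep = []
    for i in C:
        nn_i = min(Cs - {i}, key=lambda j: d[i, j])  # i's nearest neighbor in C
        is_tomek = (y[nn_i] != y[i]
                    and min(Cs - {nn_i}, key=lambda j: d[nn_i, j]) == i)
        if not (is_tomek and y[i] == 0):
            keep.append(i)
    return X[keep], y[keep]
```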

ROS Random over-sampling replicates samples of the minority class at random until the classes are balanced [31].
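The mirror image of RUS, sketched under the same label convention:

```python
import numpy as np

def random_oversample(X, y, rng=np.random.default_rng(0)):
    """ROS sketch: replicate random minority samples until classes balance."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    extra = rng.choice(mino, size=len(maj) - len(mino), replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]
```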

SMOTE This is an over-sampling method which first selects a minority class sample, x, at random and finds its k nearest neighbors using the Euclidean distance. Then, a minority class sample, y, is selected at random from these neighbors. Finally, a new sample, S, is generated using the following equation:

$$S = x + r \times \left( {y - x} \right)$$

where $r$ is a random number in the range [0, 1]. This process is repeated until the desired number of minority samples is reached [34].
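The interpolation above translates almost line for line into code (Euclidean k-NN among minority samples only; a sketch that assumes more than k minority samples):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate n_new synthetic samples from minority data X_min."""
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # random minority sample x
        j = rng.choice(nn[i])              # random neighbor y among its k-NN
        r = rng.random()                   # r drawn uniformly from [0, 1]
        synth.append(X_min[i] + r * (X_min[j] - X_min[i]))  # S = x + r(y - x)
    return np.array(synth)
```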

SMOTE + TL This is a hybrid method in which, after the minority class has been expanded using the SMOTE algorithm, Tomek links are applied to remove redundant samples from both the majority and minority classes in order to avoid overfitting [39].
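A Tomek link is a pair of opposite-class samples that are each other's nearest neighbor; detecting them is short (a sketch, same conventions as above). SMOTE + TL would then drop every index appearing in the returned pairs:

```python
import numpy as np

def tomek_links(X, y):
    """Return pairs (a, b) of opposite-class mutual nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # nearest neighbor of each sample
    return [(a, int(b)) for a, b in enumerate(nn)
            if y[a] != y[b] and nn[b] == a and a < int(b)]
```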

SMOTE + ENN Like SMOTE + TL, this method removes redundant samples after the minority class has been expanded by SMOTE, but it uses ENN instead of Tomek links for the elimination. ENN throws out the samples that are misclassified by their three nearest neighbors [28].
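ENN itself is a one-step filter, sketched here with the 3-NN majority vote described above:

```python
import numpy as np

def enn(X, y, k=3):
    """Edited Nearest Neighbours sketch: drop misclassified samples."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbors of each sample
    keep = [i for i in range(len(y)) if round(y[nn[i]].mean()) == y[i]]
    return X[keep], y[keep]
```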

SMOTE + PSO This method tunes the parameters of the SMOTE algorithm dynamically by giving them as inputs to PSO, which optimizes them. For a fair comparison, the fitness function is set to the one used in our proposed method [49].
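For intuition, a generic PSO loop over two SMOTE parameters might look as follows. The evaluate() function is a placeholder for the actual fitness (the G-mean of a classifier trained on the resampled data), and the bounds and coefficients are illustrative assumptions, not the setup of [49]:

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(k, rate):
    # Placeholder: resample with SMOTE(k, rate), train a classifier,
    # and return its G-mean. A toy quadratic stands in here.
    return -((k - 5) ** 2 + (rate - 2.0) ** 2)

def pso_tune_smote(n=20, iters=50, w=0.7, c1=1.5, c2=1.5):
    lo, hi = np.array([1.0, 0.5]), np.array([15.0, 5.0])  # bounds on (k, rate)
    x = rng.uniform(lo, hi, size=(n, 2))                  # particle positions
    v = np.zeros_like(x)                                  # particle velocities
    pbest = x.copy()
    pbest_f = np.array([evaluate(int(p[0]), p[1]) for p in x])
    gbest = pbest[pbest_f.argmax()].copy()                # global best position
    for _ in range(iters):
        r1, r2 = rng.random((n, 2)), rng.random((n, 2))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([evaluate(int(p[0]), p[1]) for p in x])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmax()].copy()
    return int(gbest[0]), float(gbest[1])                 # tuned (k, rate)
```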

Cite this article

Nikpour, B., Nezamabadi-pour, H. A memetic approach for training set selection in imbalanced data sets. Int. J. Mach. Learn. & Cyber. 10, 3043–3070 (2019). https://doi.org/10.1007/s13042-019-01000-w
