A subregion division based multi-objective evolutionary algorithm for SVM training set selection
Introduction
SVM (Support Vector Machine), a popular and powerful supervised classifier in machine learning, has been successfully used in a wide variety of applications, ranging from pattern mining [1] and computer vision [2] to medical diagnosis [3] and information retrieval [4]. Despite its strong theoretical foundations and good generalization performance, SVM also has some disadvantages. One of them is that training an SVM requires solving a constrained quadratic programming problem, whose computational complexity is O(n^2) or even O(n^3) [5], where n is the number of instances in the training set. This issue is especially challenging nowadays, since in many real applications of SVM the amount of training data is very large [6], [7]. To tackle this disadvantage, different techniques have been suggested. Among them, training set selection (TSS), a data pre-processing technique, has attracted much attention, as it can not only decrease the amount of training data but also maintain (or even improve) the performance of SVM [8].
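For reference, the quadratic program mentioned above is usually written in its soft-margin dual form. The block below is the standard textbook formulation, not one taken from this paper; the symbols $\alpha_i$ (dual variables), $y_i \in \{-1, +1\}$ (labels), $K$ (kernel function) and $C$ (regularization constant) are introduced here as assumptions for illustration.

```latex
\begin{aligned}
\max_{\alpha \in \mathbb{R}^n} \quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \\
\text{s.t.} \quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, n, \\
& \sum_{i=1}^{n} \alpha_i y_i = 0 .
\end{aligned}
```

The double sum runs over all $n^2$ pairs of training instances, which gives an intuition for why the cost grows at least quadratically with $n$ and why removing instances before training pays off.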
Essentially, TSS for SVM is a kind of instance selection problem, whose goal is to select only the relevant instances in the training set before performing the SVM training task [9]. Due to its importance, plenty of TSS algorithms with different optimization techniques have been developed in the last decade [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. Among them, evolutionary algorithm (EA) based TSS methods have received much attention, as they do not make any assumptions about training set properties and yield more effective solutions than non-evolutionary methods [10]. For example, Kawulok et al. proposed three different genetic algorithm (GA) based instance selection methods for SVM, and experimental results justified their superiority over the classical non-EA based TSS algorithms [15], [16], [17]. Nalepa et al. suggested several memetic algorithm (MA) based methods to perform instance selection for SVM, and empirical studies on public classification data sets have demonstrated the effectiveness of the suggested algorithms [18], [19]. More EA based TSS methods for SVM can be found in [9].
Although the EA based training set selection methods mentioned above have shown their competitiveness in obtaining high-quality instance subsets, most of them use single-objective optimization techniques. In fact, as pointed out in [9], TSS for SVM is a combinatorial optimization problem characterized by two aspects. On the one hand, the accuracy of the SVM trained on the selected subset should be maximized; on the other hand, the number of instances in the selected subset should be minimized. To this end, single-objective EAs often need to introduce a trade-off parameter to balance the accuracy and the size of the selected instance subset. However, setting a suitable value for this trade-off parameter is itself a difficult problem, especially when no prior knowledge is available in real applications. A natural way to tackle the problem is to develop a multi-objective evolutionary algorithm (MOEA) for SVM training set selection. Somewhat surprisingly, although there are many works on designing MOEAs related to SVM optimization [20], [21], [22], only a few focus on developing MOEAs for TSS for SVM. For example, in [23], Pighetti et al. applied the NSGA-II algorithm to produce good training subsets, which were then used to train an SVM. In [24], Rosales-Perez et al. proposed a MOEA/D based TSS algorithm, which can simultaneously perform instance selection and hyper-parameter selection for SVM. More recently, in [25], Acampora et al. suggested a multi-objective training set selection algorithm under the framework of PESA-II. With this algorithm, the number of selected training instances was greatly reduced and the performance of SVM was further improved. Empirical results of these three algorithms have justified the superiority of MOEAs over single-objective EAs and non-EAs for solving TSS for SVM.
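The difference between a weighted-sum (single-objective) formulation and a Pareto-based (multi-objective) one can be illustrated with a small sketch. It is not the method of any cited paper: the candidate values are made up for illustration, both objectives (accuracy and reduction rate) are treated as maximized, and the names `dominates`, `pareto_front` and `weighted_sum_best` are hypothetical.

```python
from typing import List, Tuple

# Each candidate training subset is summarized by two objectives to maximize:
# (accuracy of the SVM trained on it, reduction rate of the training set).
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Keep the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


def weighted_sum_best(points: List[Objectives], w: float) -> Objectives:
    """Single-objective view: one trade-off weight w yields one solution."""
    return max(points, key=lambda p: w * p[0] + (1.0 - w) * p[1])


# Illustrative (made-up) objective values for five candidate subsets.
candidates = [(0.95, 0.20), (0.93, 0.60), (0.90, 0.85), (0.88, 0.50), (0.80, 0.90)]

front = pareto_front(candidates)
for w in (0.9, 0.5, 0.1):
    print(f"w={w}: {weighted_sum_best(candidates, w)}")
print("Pareto front:", front)
```

Each choice of `w` returns a different single subset, so the answer depends on a parameter that is hard to set without prior knowledge, whereas the Pareto front returns all non-dominated trade-offs at once and leaves the choice to the user.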
In this paper, we continue this research line by proposing a novel subregion division based MOEA, termed SDMOEA-TSS, for SVM training set selection, with which the quality and diversity of the selected instance subset can be further improved. Specifically, the main contributions of this paper are summarized as follows.
- A subregion division based multi-objective evolutionary algorithm, named SDMOEA-TSS, is proposed for training set selection for SVM, where the objective space is divided into several subregions to search for better solutions. By using the proposed subregion division search strategy, SDMOEA-TSS is capable of obtaining training subsets with both high quality and good diversity.
- In SDMOEA-TSS, a divided based initialization scheme is first suggested, which divides the objective space into different subregions and initializes the population in each subregion effectively. Then a subregion based evolutionary strategy is designed, which consists of subregion based crossover, mutation and update operations. With this strategy, SDMOEA-TSS can use the individuals in each subregion for local search as well as the whole population for global search.
- The effectiveness of the proposed SDMOEA-TSS is verified by comparing it with several state-of-the-art methods on 21 SVM classification data sets with different characteristics. Experimental results demonstrate the superiority of the proposed method over the compared methods in terms of both quality and diversity of the solutions.
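To make the idea of initializing a population per subregion more concrete, here is a minimal sketch. The text above does not specify how the objective space is divided, so the sketch assumes, purely for illustration, that the reduction-rate axis is split into `k` equal-width intervals and that an individual is a 0/1 mask over the `n` training instances. The function names `subregion_index` and `init_in_subregion` are hypothetical and are not taken from SDMOEA-TSS.

```python
import random
from typing import Dict, List

Mask = List[int]  # 1 = instance kept in the training subset, 0 = removed


def subregion_index(mask: Mask, k: int) -> int:
    """Assumed division: k equal-width intervals of the reduction rate.

    Integer arithmetic avoids floating-point trouble at interval borders.
    """
    n = len(mask)
    removed = n - sum(mask)
    return min(removed * k // n, k - 1)


def init_in_subregion(n: int, k: int, region: int, rng: random.Random) -> Mask:
    """Draw a random mask whose reduction rate falls inside the given subregion."""
    assert 0 <= region < k <= n
    lo = -(-(region * n) // k)            # ceil(region * n / k)
    hi = -(-((region + 1) * n) // k) - 1  # last removed-count still in this region
    removed = rng.randint(lo, hi)         # hi <= n - 1, so one instance always stays
    dropped = set(rng.sample(range(n), removed))
    return [0 if i in dropped else 1 for i in range(n)]


rng = random.Random(0)
n_instances, n_regions, per_region = 100, 4, 5

# One small sub-population per subregion, so every size range is covered.
population: Dict[int, List[Mask]] = {
    r: [init_in_subregion(n_instances, n_regions, r, rng) for _ in range(per_region)]
    for r in range(n_regions)
}

for r, masks in population.items():
    print(r, sorted(sum(m) for m in masks))  # kept-instance counts per subregion
```

Under these assumptions every subregion starts with individuals of a different subset size, which is one simple way to spread the initial population across the objective space instead of clustering it around a single size.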
The remainder of the paper is organized as follows. Section 2 presents related work on training set selection for SVM. Section 3 gives the details of the proposed algorithm, and Section 4 reports empirical results comparing our algorithm with several state-of-the-art methods. Section 5 concludes the paper and discusses future work.
Section snippets
Related work
In this section, we review the related work on training set selection (TSS) for SVM. To be specific, we first discuss preliminaries about instance selection (IS) and the related work on IS, since training set selection is a specific form of instance selection
The proposed algorithm
In this section, we first present the framework of the proposed algorithm, and then describe two important components of SDMOEA-TSS: the divided based initialization strategy and the subregion based evolutionary strategy (including crossover, mutation and update).
Experimental results and analysis
In this section, we empirically verify the performance of the proposed SDMOEA-TSS by comparing it with several representative algorithms for SVM training set selection. Specifically, the experiments are designed as follows.
Conclusion and future work
In this paper, we proposed a subregion division based multi-objective evolutionary method, termed SDMOEA-TSS, for SVM training set selection. The proposed SDMOEA-TSS adopted a framework similar to that of NSGA-II and simultaneously optimized two conflicting objectives, accuracy and reduction rate. To obtain training sets with good quality, a divided based initialization strategy was first suggested, then a subregion based evolutionary (including crossover, mutation and update) strategy was
Declaration of Competing Interest
The authors declare no conflicts of interest regarding this paper.
Acknowledgments
This work is supported by the Natural Science Foundation of China (Grant nos. U1804262, 61976001 and 61876184), the Humanities and Social Sciences Project of the Chinese Ministry of Education (Grant no. 18YJC870004), the Natural Science Foundation of Anhui Province (Grant nos. 1708085MF166 and 1908085MF219), and the Key Program of Natural Science Project of the Educational Commission of Anhui Province (Grant no. KJ2017A013).
References (52)
- et al., Collaborative SVM classification in scale-free peer-to-peer networks, Expert Syst. Appl. (2017)
- et al., Automatic incident classification for large-scale traffic data by adaptive boosting SVM, Inf. Sci. (2018)
- et al., Evolutionary wrapper approaches for training set selection as preprocessing mechanism for support vector machines: experimental evaluation and support vector analysis, Appl. Soft Comput. (2016)
- et al., Data selection based on decision tree for SVM classification on large data sets, Appl. Soft Comput. (2015)
- et al., Adaptive memetic algorithm enhanced with data geometry analysis to select training data for SVMs, Neurocomputing (2016)
- et al., Surrogate-assisted multi-objective model selection for support vector machines, Neurocomputing (2015)
- et al., A multi-objective evolutionary approach to training set selection for support vector machine, Knowl. Based Syst. (2018)
- et al., Improving the combination of results in the ensembles of prototype selectors, Neural Netw. (2019)
- et al., Three new instance selection methods based on local sets: a comparative study with several approaches from a bi-objective perspective, Pattern Recognit. (2015)
- et al., Fast instance selection for speeding up support vector machines, Knowl. Based Syst. (2013)
- Comparison of genetic algorithm based prototype selection schemes, Pattern Recognit.
- An improved multi-objective population-based extremal optimization algorithm with polynomial mutation, Inf. Sci.
- Constrained multi-objective population extremal optimization based economic-emission dispatch incorporating renewable energy resources, Renew. Energy
- Mining sequential patterns for classification, Knowl. Inf. Syst.
- SVM based multi-label learning with missing labels for image annotation, Pattern Recognit.
- Evolving the SVM model based on a hybrid method using swarm optimization techniques in combination with a genetic algorithm for medical diagnosis, Multimed. Tools Appl.
- Libsvm: a library for support vector machines, ACM Trans. Intell. Syst. Technol.
- LS-GKM: a new gkm-SVM for large-scale datasets, Bioinformatics
- Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study, IEEE Trans. Evol. Comput.
- Selecting training sets for support vector machines: a review, Artif. Intell. Rev.
- Cluster-based instance selection for machine classification, Knowl. Inf. Syst.
- Support vector machine active learning for image retrieval, Proceedings of the Ninth ACM International Conference on Multimedia
- A survey on instance selection for active learning, Knowl. Inf. Syst.
- Support vector machines training data selection using a genetic algorithm, Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition and Structural and Syntactic Pattern Recognition
- Adaptive genetic algorithm to select training data for support vector machines, Proceedings of the European Conference on the Applications of Evolutionary Computation
- An alternating genetic algorithm for selecting SVM model and training set, Proceedings of the Mexican Conference on Pattern Recognition
Fan Cheng received the B.Sc. in 2000 and M.Sc. in 2003 both from HeFei University of Technology, China. He received the Ph.D. in 2012 from University of Science and Technology of China, China. Now he is an Associate Professor of School of Computer Science and Technology at Anhui University, China. His main research interests include machine learning, imbalanced classification, multi-objective optimization, and complex network.
Jiabin Chen is a Master's student in the School of Computer Science and Technology at Anhui University, China. He received the B.Sc. in 2017 from Anhui Jianzhu University, China. His research interests are multi-objective optimization and instance selection.
Jianfeng Qiu received the B.Sc. from AnQing Normal University in 2003. He received the M.Sc. in 2006 and Ph.D. in 2014 from Anhui University, China. Currently, he is a lecturer in the School of Computer Science and Technology, Anhui University, China. His main research interests include machine learning, imbalanced classification, multi-objective optimization, and complex network.
Lei Zhang received the B.Sc. from Anhui Agriculture University in 2007, and the Ph.D. in 2014 from University of Science and Technology of China. Currently, he is an Associate Professor in the School of Computer Science and Technology, Anhui University, China. His main research interests include multi-objective optimization and applications, data mining, social network analysis and pattern recommendation. He has published more than 40 papers in refereed conferences and journals, such as ACM SIGKDD, ACM CIKM, IEEE ICDM, IEEE TCYB, ACM TKDD, IEEE CIM and Information Sciences. He is the recipient of the ACM CIKM’12 Best Student Paper Award.