Abstract
Clustering is an unsupervised classification method used to group the objects of an unlabeled data set. High-dimensional data sets generally contain irrelevant and redundant features along with the relevant ones, which deteriorates the clustering result. Feature selection is therefore necessary: selecting a subset of relevant features improves the discrimination ability of the original feature set and thereby the clustering result. Although many metaheuristics have been suggested to select a subset of relevant features in a wrapper framework based on some criterion, most of them are marred by three key issues. First, they require object class information a priori, which is unavailable in unsupervised feature selection. Second, feature subset selection is based on a single validity measure; hence, it produces a single best solution biased toward the cardinality of the feature subset. Third, they have difficulty avoiding local optima owing to a lack of balance between exploration and exploitation in the feature search space. To deal with the first issue, we use an unsupervised feature selection method in which no class information is required. To address the second issue, we follow a Pareto-based approach to obtain diverse trade-off solutions by optimizing two conceptually contradictory validity measures, the silhouette index (Sil) and feature cardinality (d). For the third issue, we introduce a genetic crossover operator to improve diversity in the binary gravitational search algorithm (BGSA), a recent metaheuristic based on the Newtonian law of gravity, in a multi-objective optimization scenario; the resulting method is named improved multi-objective BGSA for feature selection (IMBGSAFS). We use ten real-world data sets to compare the IMBGSAFS results with three multi-objective wrapper methods, MBGSA, MOPSO, and NSGA-II, and with a Pearson's linear correlation coefficient-based multi-objective filter method (FM-CC).
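The Pareto-based step described above keeps only solutions that are not dominated in both objectives: higher silhouette index (Sil, maximized) and lower feature cardinality (d, minimized). A minimal sketch of that dominance filtering is shown below; the candidate (Sil, d) tuples are illustrative values, not results from the paper, and the function names are ours.

```python
# Hedged sketch of Pareto-dominance filtering over the two objectives used
# in the paper: maximize silhouette index (sil), minimize cardinality (d).
# Candidate tuples are hypothetical, chosen only to illustrate the filter.

def dominates(a, b):
    """a = (sil, d). a dominates b if a is no worse in both objectives
    (higher-or-equal sil, lower-or-equal d) and strictly better in one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(candidates):
    """Return the non-dominated subset -- the trade-off solutions an
    external archive would retain."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

candidates = [(0.62, 12), (0.55, 4), (0.62, 9), (0.48, 4), (0.70, 15)]
front = sorted(pareto_front(candidates))
# (0.62, 12) is dominated by (0.62, 9); (0.48, 4) by (0.55, 4).
print(front)
```

Each surviving tuple trades some silhouette quality for a smaller feature subset, which is exactly why a decision maker still has to pick one final solution from the archive.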
We employ four multi-objective quality measures: convergence, diversity, coverage, and ONVG. The obtained results show the superiority of IMBGSAFS over its competitors. An external clustering validity index, the F-measure, also confirms this finding. As the decision maker picks only a single solution from the set of trade-off solutions, we employ the F-measure to select a final solution from the external archive. The final solution achieved by IMBGSAFS is superior to those of its competitors in terms of clustering accuracy and/or smaller subset size.
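The F-measure used for picking the final solution is an external validity index that compares predicted clusters against true class labels. One common class-weighted formulation is sketched below; the paper's exact variant may differ in detail, so treat this as an illustrative definition rather than the authors' implementation.

```python
# Hedged sketch of a class-weighted clustering F-measure: for each true
# class, take the best F1 over all predicted clusters, then weight by
# class size. This is one common variant of the external index.
from collections import Counter

def f_measure(labels_true, labels_pred):
    n = len(labels_true)
    score = 0.0
    for c, n_c in Counter(labels_true).items():
        best = 0.0
        for k in set(labels_pred):
            # true positives: objects of class c placed in cluster k
            tp = sum(1 for t, p in zip(labels_true, labels_pred)
                     if t == c and p == k)
            if tp == 0:
                continue
            size_k = sum(1 for p in labels_pred if p == k)
            prec, rec = tp / size_k, tp / n_c
            best = max(best, 2 * prec * rec / (prec + rec))
        score += (n_c / n) * best  # weight by class size
    return score
```

A perfect clustering scores 1.0 regardless of how cluster labels are numbered, which is what makes the index usable for ranking archive members without a label-matching step.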
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Cite this article
Prakash, J., Singh, P.K. Gravitational search algorithm and K-means for simultaneous feature selection and data clustering: a multi-objective approach. Soft Comput 23, 2083–2100 (2019). https://doi.org/10.1007/s00500-017-2923-x