Abstract
Feature subset selection and/or dimensionality reduction is an essential preprocessing step before performing any data mining task, especially when the problem space contains many features. In this paper, a clustering-based feature subset selection (CFSS) algorithm is proposed for identifying the most relevant features. At each level of agglomeration, it uses a similarity measure between features to merge the two most similar clusters of features. By gathering similar features into clusters and then introducing a representative feature for each cluster, it removes redundant features. To identify the representative features, a criterion based on mutual information is proposed. Since CFSS selects the representatives in a filter manner, it is noticeably fast. As an advantage of hierarchical clustering, it does not need the number of clusters to be determined in advance. In CFSS, the clustering process is repeated until all features are distributed among clusters. To distribute the features among a reasonable number of clusters, however, a recently proposed approach is used to obtain a suitable level at which to cut the clustering tree. To assess the performance of CFSS, we have applied it to several benchmark UCI datasets and compared it with some popular feature selection methods. The experimental results demonstrate the efficiency and speed of the proposed method.
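The abstract's pipeline (agglomeratively cluster features by similarity, then keep one representative per cluster chosen by mutual information with the class) can be sketched in a few lines. This is a rough illustration only, not the authors' CFSS implementation: the choice of average linkage, the use of mutual information as the feature–feature similarity, the fixed cluster count, and the names `mutual_info` and `cfss_sketch` are all assumptions made for the example.

```python
# Sketch of clustering-based feature selection: cluster features by
# mutual-information similarity, keep one representative per cluster.
from collections import Counter
from math import log2

def mutual_info(x, y):
    """Mutual information (in bits) between two discrete variables."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def cfss_sketch(features, labels, n_clusters):
    """features: list of discrete columns; returns representative indices."""
    m = len(features)
    # pairwise feature-feature similarity (here: mutual information)
    sim = {(i, j): mutual_info(features[i], features[j])
           for i in range(m) for j in range(i + 1, m)}
    clusters = [[i] for i in range(m)]
    while len(clusters) > n_clusters:
        # average linkage: merge the two most similar clusters of features
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda ab: sum(
            sim[min(i, j), max(i, j)]
            for i in clusters[ab[0]] for j in clusters[ab[1]])
            / (len(clusters[ab[0]]) * len(clusters[ab[1]])))
        clusters[a] += clusters.pop(b)
    # representative: the feature most informative about the class labels
    return [max(c, key=lambda i: mutual_info(features[i], labels))
            for c in clusters]

# Toy data: f0 and f1 are duplicates (redundant); f2 carries different info.
f0 = [0, 0, 1, 1, 0, 1]
f1 = [0, 0, 1, 1, 0, 1]
f2 = [0, 1, 0, 1, 0, 1]
y  = [0, 0, 1, 1, 0, 1]
print(sorted(cfss_sketch([f0, f1, f2], y, n_clusters=2)))  # one of f0/f1, plus f2
```

With the toy data, the two duplicate features are merged into one cluster, so only one of them survives alongside the independent feature, which is the redundancy-removal behavior the abstract describes.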
Dehghan, Z., Mansoori, E.G. A new feature subset selection using bottom-up clustering. Pattern Anal Applic 21, 57–66 (2018). https://doi.org/10.1007/s10044-016-0565-8