Abstract
Feature selection is a fundamental preprocessing step that precedes the actual learning, especially in the unsupervised setting where the data are unlabeled. Essentially, when a problem has too many features, dimensionality reduction by discarding weak features is highly desirable. In this paper, we present a framework for unsupervised feature selection based on maximizing the dependency between the sample similarity matrices before and after deleting a feature. In this regard, a novel estimator of the Hilbert–Schmidt independence criterion (HSIC), more appropriate for high-dimensional data with a small sample size, is introduced. Its key idea is that eliminating redundant features and/or those with high inter-relevance does not seriously affect the pairwise sample similarities. Also, to handle diagonally dominant matrices, a heuristic trick is used to reduce the dynamic range of the matrix values. To speed up the proposed scheme, the gap statistic and k-means clustering methods are also employed. To assess the performance of our method, experiments on benchmark datasets are conducted. The obtained results confirm the efficiency of our unsupervised feature selection scheme.




References
Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Sharma A, Imoto S, Miyano S (2012) A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinform 9(3):754–764
Sharma A, Imoto S, Miyano S, Sharma V (2012) Null space based feature selection method for gene expression data. Int J Mach Learn Cybern 3(4):269–276
Sharma A, Imoto S, Miyano S (2012) A between-class overlapping filter-based method for transcriptome data analysis. J Bioinform Comput Biol 10(5):1250010
Dy J, Brodley C (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Shang R, Chang J, Jiao L, Xue Y (2017) Unsupervised feature selection based on self-representation sparse regression and local similarity preserving. Int J Mach Learn Cybern 1–14
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28
Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature extraction: foundations and applications, vol 207. Springer, Berlin, pp 89–117
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Brown G, Pocock A, Zhao M, Lujan M (2012) Conditional likelihood maximization: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
Xu I, Cao L, Zhong J, Feng Y (2010) Adapt the mRMR criterion for unsupervised feature selection. Advanced data mining and applications. Springer, Berlin, pp 111–121
Gretton A, Bousquet O, Smola AJ, Scholkopf B (2005) Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain S, Simon HU, Tomita E (eds) Proceedings of the international conference on algorithmic learning theory, Springer, pp 63–77
Zarkoob H (2010) Feature selection for gene expression data based on Hilbert–Schmidt independence criterion. University of Waterloo, Electronic theses and dissertations
Bedo J, Chetty M, Ngom A, Ahmad S (2008) Microarray design using the Hilbert–Schmidt independence criterion. Springer, Berlin, pp 288–298
Song L, Smola A, Gretton A, Bedo J, Borgwardt K (2012) Feature selection via dependence maximization. J Mach Learn Res 13:1393–1434
Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310
Sharma A, Paliwal KK, Imoto S, Miyano S (2014) A feature selection method using improved regularized linear discriminant analysis. Mach Vis Appl 25(3):775–786
Eskandari S, Akbas E (2017) Supervised infinite feature selection. arXiv preprint. http://arxiv.org/abs/1704.02665
Luo M, Nie F, Chang X, Yang Y, Hauptmann AG, Zheng Q (2018) Adaptive unsupervised feature selection with structure regularization. IEEE Trans Neural Netw Learn Syst 29(4):944–956
Weston J, Scholkopf B, Eskin E, Leslie C, Noble W (2003) Dealing with large diagonals in kernel matrices. Ann Inst Stat Math 55(2):391–408
Fischer B, Roth V, Buhmann JM (2003) Clustering with the connectivity kernel. Adv Neural Inf Process Syst 16:89–96
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B 63(2):411–423
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
Somol P, Pudil P, Novovicova J, Paclik P (1999) Adaptive floating search methods in feature selection. Pattern Recognit Lett 20:1157–1163
UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets. Accessed Feb 2017
Mramor M, Leban G, Demsar J, Zupan B (2007) Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16):2147–2154
Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Lin S, Liu Z (2007) Parameter selection of support vector machines based on RBF kernel function. Zhejiang Univ Technol 35:163–167
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, vol 96, pp 226–231
Kreyszig E (1970) Introductory mathematical statistics. Wiley, New York
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Appendix 1
1.1 Applying SP-BAHSIC, SPC-BAHSIC, SP-FOHSIC and SPC-FOHSIC on synthetic data
Herein, we illustrate the steps of our proposed methods on two synthetic datasets. First, we run SP-BAHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, denoted \(G=\left\{ {A,~B,~C,D} \right\}\).
Our aim is to find the \({n^\prime }=2\) most informative features. Using the RBF kernel function [29] as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed.
Similarly, using the RBF kernel for \(\psi (.)\), the sample similarity matrices \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\) are obtained.
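As a minimal sketch of how such a pairwise sample similarity matrix could be computed with an RBF kernel, the NumPy snippet below builds a Gram matrix over the samples of a chosen feature subset; the bandwidth `sigma` and the helper name `rbf_similarity` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Pairwise RBF (Gaussian) similarity matrix over the rows (samples) of X.

    X     : (m, d) array holding the samples restricted to a feature subset.
    sigma : kernel bandwidth (an illustrative default, not a value from the paper).
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# Example: the full-feature matrix and one reduced matrix.
X_G = np.random.rand(4, 4)                           # m = 4 samples, n = 4 features (A, B, C, D)
K_full = rbf_similarity(X_G)                         # plays the role of K^phi_{A,B,C,D}
K_wo_C = rbf_similarity(np.delete(X_G, 2, axis=1))   # plays the role of K^psi_{A,B,D}
```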
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\), and the resulting values are collected in \({H_1}\).
According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ A,B,D\} }}^{\psi }\) is maximum. So, among the 4 features, feature \(C\) is the most suitable for elimination. In the second phase, the sample similarity matrices \(K_{{\{ B,D\} }}^{\psi }\), \(K_{{\{ A,D\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\) are computed.
Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\).
From HSIC2 values in \({H_2}\), the next candidate for elimination is feature \(D\). Thus, features \(A\) and \(B\) are returned by SP-BAHSIC as the most informative features.
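For concreteness, the backward-elimination loop of this example could be sketched as below. The paper's HSIC2 estimator of Eq. (14) is not reproduced here, so the standard biased HSIC estimate \(\mathrm{tr}(KHLH)/(m-1)^2\) of Gretton et al. is substituted as a placeholder; the function names and the `n_keep` argument are illustrative, and `rbf_similarity` is the helper sketched above.

```python
import numpy as np

def hsic_biased(K, L):
    """Standard biased HSIC estimate tr(K H L H) / (m - 1)^2.
    Used here only as a stand-in for the paper's HSIC2 estimator (Eq. 14)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def sp_bahsic(X, n_keep, sigma=1.0):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    leaves the reduced similarity matrix most dependent on the full one."""
    K_full = rbf_similarity(X, sigma)              # fixed reference similarity K^phi
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = []
        for f in remaining:                        # score each candidate deletion
            subset = [g for g in remaining if g != f]
            scores.append(hsic_biased(K_full, rbf_similarity(X[:, subset], sigma)))
        remaining.pop(int(np.argmax(scores)))      # e.g. drops C, then D, in the example
    return remaining                               # indices of the retained features
```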
As another example, SPC-BAHSIC is now run on the dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, denoted \(G=\left\{ {A,~B,~C,D,E,F} \right\}\), in order to find the \(n'=2\) most informative features.
First, we estimate the number of clusters for these \(n=6\) features using the gap statistic method, which yields \(l=4\) clusters. Next, the feature clusters are found using the k-means method. From each cluster, one feature is selected, forming \({G_c}=\left\{ {A,~B,~C,F} \right\}\). So, the dataset \({X_G}\) is represented by \({X_{{G_c}}}\) in the new feature space.
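A sketch of this feature-grouping step is given below, assuming scikit-learn's `KMeans`; the number of clusters `l` is taken as an input (the gap-statistic estimation itself is not shown), and keeping the feature closest to each centroid is one plausible reading of "one feature per cluster".

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_representative_features(X, l, random_state=0):
    """Cluster the n feature columns of X into l groups (k-means on the
    transposed data) and keep the feature closest to each cluster centroid.

    In the paper, l comes from the gap statistic; here it is simply an argument.
    """
    F = X.T                                        # features as points, shape (n, m)
    km = KMeans(n_clusters=l, n_init=10, random_state=random_state).fit(F)
    reps = []
    for c in range(l):
        members = np.where(km.labels_ == c)[0]     # features assigned to cluster c
        dists = np.linalg.norm(F[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)                            # column indices forming G_c
```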
Employing \({X_{{G_c}}}\), the next steps of SPC-BAHSIC are the same as in SP-BAHSIC. Again, using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed.
Also, \(\psi (.)\) is used to compute the sample similarity matrices \(K_{{\{ B,C,F\} }}^{\psi }\), \(K_{{\{ A,C,F\} }}^{\psi }\), \(K_{{\{ A,B,F\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\).
Similar to the first example, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of these matrices, and the values are stored in \({H_1}\).
From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ A,B,F\} }}^{\psi }\) is the highest, so feature \(C\) should be removed from \({G_c}\). In the second phase, the similarity matrices \(K_{{\{ B,F\} }}^{\psi }\), \(K_{{\{ A,F\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\) are computed.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ B,F\} }}^{\psi }\), \(K_{{\{ A,F\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\), and the values are stored in \({H_2}\).
According to \({H_2}\), feature \(A\) is the next feature to be eliminated. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-BAHSIC algorithm.
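Combining the two previous sketches gives a hypothetical end-to-end SPC-BAHSIC that mirrors the steps just described (cluster, keep one representative per cluster, then eliminate backward); the returned indices refer to the original feature set.

```python
def spc_bahsic(X, n_keep, l, sigma=1.0):
    """Clustered backward elimination: pick one representative feature per
    k-means cluster, then run the SP-BAHSIC sketch on those columns only."""
    reps = cluster_representative_features(X, l)   # e.g. {A, B, C, F} in the example
    kept = sp_bahsic(X[:, reps], n_keep, sigma)    # indices within reps
    return [reps[i] for i in kept]                 # map back to original feature indices
```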
In this part, we run SP-FOHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, denoted \(G=\left\{ {A,~B,~C,D} \right\}\), where \(n'=2\) informative features are needed.
Using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed.
Similarly, using the RBF kernel for \(\psi (.)\), the sample similarity matrices \(K_{{\left\{ A \right\}}}^{\psi }\) (samples restricted to feature A), \(K_{{\{ B\} }}^{\psi }\) (restricted to feature B), \(K_{{\{ C\} }}^{\psi }\) (restricted to feature C) and \(K_{{\{ D\} }}^{\psi }\) (restricted to feature D) are obtained.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ D\} }}^{\psi }\), and the resulting values are collected in \({H_1}\).
According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ B\} }}^{\psi }\) is maximum. So, among the 4 features, feature \(B\) is the most suitable for selection. In the second phase, the sample similarity matrices \(K_{{\{ B,A\} }}^{\psi }\), \(K_{{\{ B,C\} }}^{\psi }\) and \(K_{{\{ B,D\} }}^{\psi }\) are computed.
Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\).
From HSIC2 values in \({H_2}\), the next candidate for selection is feature \(A\). Thus, features \(A\) and \(B\) are returned by SP-FOHSIC as the most informative features.
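The forward-selection counterpart could be sketched analogously, again with the biased HSIC estimate standing in for HSIC2 and with illustrative names; it reuses `rbf_similarity` and `hsic_biased` from the sketches above.

```python
def sp_fohsic(X, n_keep, sigma=1.0):
    """Greedy forward selection: repeatedly add the feature that makes the
    similarity matrix of the selected subset most dependent on the full one."""
    K_full = rbf_similarity(X, sigma)              # fixed reference similarity K^phi
    selected, candidates = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        scores = [hsic_biased(K_full, rbf_similarity(X[:, selected + [f]], sigma))
                  for f in candidates]
        selected.append(candidates.pop(int(np.argmax(scores))))  # e.g. B, then A
    return selected                                # indices of the selected features
```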
As the last example, SPC-FOHSIC is run on the dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, denoted \(G=\left\{ {A,~B,~C,D,E,F} \right\}\). As in the previous experiments, our aim is to find the \(n'=2\) most informative features.
Again, using the gap statistic method, these 6 features are grouped into \(l=4\) clusters, which are determined via the k-means method. Using the cluster centroids, the representative features \({G_c}=\left\{ {A,~B,~C,F} \right\}\) establish the new feature space, wherein the dataset \({X_G}\) is represented as \({X_{{G_c}}}\).
The next steps of SPC-FOHSIC are the same as in SP-FOHSIC. Again, using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed.
Also, \(\psi (.)\) is used to compute the sample similarity matrices \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ F\} }}^{\psi }\).
Similarly, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of these matrices, and the values are stored in \({H_1}\).
From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ F\} }}^{\psi }\) is the highest, so feature \(F\) should be selected. In the second phase, the similarity matrices \(K_{{\{ F,A\} }}^{\psi }\), \(K_{{\{ F,B\} }}^{\psi }\) and \(K_{{\{ F,C\} }}^{\psi }\) are computed.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ F,A\} }}^{\psi }\), \(K_{{\{ F,B\} }}^{\psi }\) and \(K_{{\{ F,C\} }}^{\psi }\), and the values are stored in \({H_2}\).
According to \({H_2}\), feature \(B\) is the next selected feature. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-FOHSIC algorithm.
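Analogously to SPC-BAHSIC, a hypothetical SPC-FOHSIC simply composes the clustering sketch with the forward-selection sketch; under the same assumptions, a call such as `spc_fohsic(X_G, n_keep=2, l=4)` would play the role of the run described above.

```python
def spc_fohsic(X, n_keep, l, sigma=1.0):
    """Clustered forward selection: one representative feature per k-means
    cluster, then the SP-FOHSIC sketch restricted to those columns."""
    reps = cluster_representative_features(X, l)   # e.g. {A, B, C, F} in the example
    chosen = sp_fohsic(X[:, reps], n_keep, sigma)  # indices within reps
    return [reps[i] for i in chosen]               # map back to original feature indices
```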