Abstract
Feature selection is a fundamental preprocessing step that precedes the actual learning, especially in the unsupervised setting where the data are unlabeled. Essentially, when a problem has too many features, dimensionality reduction by discarding weak features is highly desirable. In this paper, we present a framework for unsupervised feature selection based on maximizing the dependency between the sample similarity matrices before and after deleting a feature. In this regard, a novel estimator of the Hilbert–Schmidt independence criterion (HSIC), more appropriate for high-dimensional data with a small sample size, is introduced. Its key idea is that eliminating redundant features and/or those with high inter-relevance does not seriously affect the pairwise sample similarities. Also, to handle diagonally dominant matrices, a heuristic trick is used to reduce the dynamic range of the matrix values. To speed up the proposed scheme, the gap statistic and k-means clustering methods are also employed. To assess the performance of our method, experiments on benchmark datasets are conducted. The obtained results confirm the efficiency of our unsupervised feature selection scheme.




References
Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224
Sharma A, Imoto S, Miyano S (2012) A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinform 9(3):754–764
Sharma A, Imoto S, Miyano S, Sharma V (2012) Null space based feature selection method for gene expression data. Int J Mach Learn Cybern 3(4):269–276
Sharma A, Imoto S, Miyano S (2012) A between-class overlapping filter-based method for transcriptome data analysis. J Bioinform Comput Biol 10(5):1250010
Dy J, Brodley C (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Shang R, Chang J, Jiao L, Xue Y (2017) Unsupervised feature selection based on self-representation sparse regression and local similarity preserving. Int J Mach Learn Cybern 1–14
Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28
Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature extraction: foundations and applications, vol 207. Springer, Berlin, pp 89–117
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Brown G, Pocock A, Zhao M, Lujan M (2012) Conditional likelihood maximization: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
Xu I, Cao L, Zhong J, Feng Y (2010) Adapt the mRMR criterion for unsupervised feature selection. Advanced data mining and applications. Springer, Berlin, pp 111–121
Gretton A, Bousquet O, Smola AJ, Scholkopf B (2005) Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain S, Simon HU, Tomita E (eds) Proceedings of the international conference on algorithmic learning theory, Springer, pp 63–77
Zarkoob H (2010) Feature selection for gene expression data based on Hilbert–Schmidt independence criterion. University of Waterloo, Electronic theses and dissertations
Bedo J, Chetty M, Ngom A, Ahmad S (2008) Microarray design using the Hilbert–Schmidt independence criterion. Springer, Berlin, pp 288–298
Song L, Smola A, Gretton A, Bedo J, Borgwardt K (2012) Feature selection via dependence maximization. J Mach Learn Res 13:1393–1434
Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310
Sharma A, Paliwal KK, Imoto S, Miyano S (2014) A feature selection method using improved regularized linear discriminant analysis. Mach Vis Appl 25(3):775–786
Eskandari S, Akbas E (2017) Supervised infinite feature selection. arXiv preprint. http://arxiv.org/abs/1704.02665
Luo M, Nie F, Chang X, Yang Y, Hauptmann AG, Zheng Q (2018) Adaptive unsupervised feature selection with structure regularization. IEEE Trans Neural Netw Learn Syst 29(4):944–956
Weston J, Scholkopf B, Eskin E, Leslie C, Noble W (2003) Dealing with large diagonals in kernel matrices. Ann Inst Stat Math 55(2):391–408
Fischer B, Roth V, Buhmann JM (2003) Clustering with the connectivity kernel. Adv Neural Inf Process Syst 16:89–96
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B 63(2):411–423
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297
Somol P, Pudil P, Novovicova J, Paclik P (1999) Adaptive floating search methods in feature selection. Pattern Recognit Lett 20:1157–1163
UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets. Accessed Feb 2017
Mramor M, Leban G, Demsar J, Zupan B (2007) Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16):2147–2154
Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Lin S, Liu Z (2007) Parameter selection of support vector machines based on RBF kernel function. Zhejiang Univ Technol 35:163–167
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, vol 96, pp 226–231
Kreyszig E (1970) Introductory mathematical statistics. Wiley, New York
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Appendix 1
1.1 Applying SP-BAHSIC, SPC-BAHSIC, SP-FOHSIC and SPC-FOHSIC on synthetic data
Herein, we illustrate the steps of our proposed methods on two synthetic datasets. First, we run SP-BAHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, denoted \(G=\left\{ {A,~B,~C,D} \right\}\).
Our aim is to find the \({n^\prime }=2\) most informative features. Using the RBF kernel function [29] as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed.
Similarly, using the RBF kernel for \(\psi (.)\), the sample similarity matrices \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\) are obtained.
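As a minimal sketch of how such a pairwise sample similarity matrix could be computed with an RBF kernel, the NumPy snippet below builds a Gram matrix over the samples of a chosen feature subset; the bandwidth `sigma` and the helper name `rbf_similarity` are illustrative choices, not taken from the paper.

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """Pairwise RBF (Gaussian) similarity matrix over the rows (samples) of X.

    X     : (m, d) array holding the samples restricted to a feature subset.
    sigma : kernel bandwidth (an illustrative default, not a value from the paper).
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# Example: the full-feature matrix and one reduced matrix.
X_G = np.random.rand(4, 4)                           # m = 4 samples, n = 4 features (A, B, C, D)
K_full = rbf_similarity(X_G)                         # plays the role of K^phi_{A,B,C,D}
K_wo_C = rbf_similarity(np.delete(X_G, 2, axis=1))   # plays the role of K^psi_{A,B,D}
```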
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\), and the resulting values are collected in \({H_1}\).
According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ A,B,D\} }}^{\psi }\) is maximum. So, among the 4 features, feature \(C\) is the most suitable for elimination. In the second phase, the sample similarity matrices \(K_{{\{ B,D\} }}^{\psi }\), \(K_{{\{ A,D\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\) are computed.
Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\).
From HSIC2 values in \({H_2}\), the next candidate for elimination is feature \(D\). Thus, features \(A\) and \(B\) are returned by SP-BAHSIC as the most informative features.
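For concreteness, the backward-elimination loop of this example could be sketched as below. The paper's HSIC2 estimator of Eq. (14) is not reproduced here, so the standard biased HSIC estimate \(\mathrm{tr}(KHLH)/(m-1)^2\) of Gretton et al. is substituted as a placeholder; the function names and the `n_keep` argument are illustrative, and `rbf_similarity` is the helper sketched above.

```python
import numpy as np

def hsic_biased(K, L):
    """Standard biased HSIC estimate tr(K H L H) / (m - 1)^2.
    Used here only as a stand-in for the paper's HSIC2 estimator (Eq. 14)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def sp_bahsic(X, n_keep, sigma=1.0):
    """Greedy backward elimination: repeatedly drop the feature whose removal
    leaves the reduced similarity matrix most dependent on the full one."""
    K_full = rbf_similarity(X, sigma)              # fixed reference similarity K^phi
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = []
        for f in remaining:                        # score each candidate deletion
            subset = [g for g in remaining if g != f]
            scores.append(hsic_biased(K_full, rbf_similarity(X[:, subset], sigma)))
        remaining.pop(int(np.argmax(scores)))      # e.g. drops C, then D, in the example
    return remaining                               # indices of the retained features
```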
As another example, SPC-BAHSIC is now run on the dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, denoted \(G=\left\{ {A,~B,~C,D,E,F} \right\}\), in order to find the \(n'=2\) most informative features.
First, we estimate the number of clusters for these \(n=6\) features using the gap statistic method, which yields \(l=4\) clusters. Next, the feature clusters are found using the k-means method. From each cluster, one feature is selected, forming \({G_c}=\left\{ {A,~B,~C,F} \right\}\). So, the dataset \({X_G}\) is represented by \({X_{{G_c}}}\) in the new feature space.
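A sketch of this feature-grouping step is given below, assuming scikit-learn's `KMeans`; the number of clusters `l` is taken as an input (the gap-statistic estimation itself is not shown), and keeping the feature closest to each centroid is one plausible reading of "one feature per cluster".

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_representative_features(X, l, random_state=0):
    """Cluster the n feature columns of X into l groups (k-means on the
    transposed data) and keep the feature closest to each cluster centroid.

    In the paper, l comes from the gap statistic; here it is simply an argument.
    """
    F = X.T                                        # features as points, shape (n, m)
    km = KMeans(n_clusters=l, n_init=10, random_state=random_state).fit(F)
    reps = []
    for c in range(l):
        members = np.where(km.labels_ == c)[0]     # features assigned to cluster c
        dists = np.linalg.norm(F[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)                            # column indices forming G_c
```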
Employing \({X_{{G_c}}}\), the next steps of SPC-BAHSIC are the same as in SP-BAHSIC. Again, using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed.
Also, \(\psi (.)\) is used to compute the sample similarity matrices \(K_{{\{ B,C,F\} }}^{\psi }\), \(K_{{\{ A,C,F\} }}^{\psi }\), \(K_{{\{ A,B,F\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\).
Similar to the first example, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of these matrices, and the values are stored in \({H_1}\).
From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ A,B,F\} }}^{\psi }\) is the highest, so feature \(C\) should be removed from \({G_c}\). In the second phase, the similarity matrices \(K_{{\{ B,F\} }}^{\psi }\), \(K_{{\{ A,F\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\) are computed.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ B,F\} }}^{\psi }\), \(K_{{\{ A,F\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\), and the values are stored in \({H_2}\).
According to \({H_2}\), feature \(A\) is the next feature to be eliminated. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-BAHSIC algorithm.
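Combining the two previous sketches gives a hypothetical end-to-end SPC-BAHSIC that mirrors the steps just described (cluster, keep one representative per cluster, then eliminate backward); the returned indices refer to the original feature set.

```python
def spc_bahsic(X, n_keep, l, sigma=1.0):
    """Clustered backward elimination: pick one representative feature per
    k-means cluster, then run the SP-BAHSIC sketch on those columns only."""
    reps = cluster_representative_features(X, l)   # e.g. {A, B, C, F} in the example
    kept = sp_bahsic(X[:, reps], n_keep, sigma)    # indices within reps
    return [reps[i] for i in kept]                 # map back to original feature indices
```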
In this part, we run SP-FOHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, denoted \(G=\left\{ {A,~B,~C,D} \right\}\), where \(n'=2\) informative features are needed.
Using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed.
Similarly, using the RBF kernel for \(\psi (.)\), the sample similarity matrices \(K_{{\left\{ A \right\}}}^{\psi }\) (samples restricted to feature A), \(K_{{\{ B\} }}^{\psi }\) (restricted to feature B), \(K_{{\{ C\} }}^{\psi }\) (restricted to feature C) and \(K_{{\{ D\} }}^{\psi }\) (restricted to feature D) are obtained.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ D\} }}^{\psi }\), and the resulting values are collected in \({H_1}\).
According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ B\} }}^{\psi }\) is maximum. So, among the 4 features, feature \(B\) is the most suitable for selection. In the second phase, the sample similarity matrices \(K_{{\{ B,A\} }}^{\psi }\), \(K_{{\{ B,C\} }}^{\psi }\) and \(K_{{\{ B,D\} }}^{\psi }\) are computed.
Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\).
From HSIC2 values in \({H_2}\), the next candidate for selection is feature \(A\). Thus, features \(A\) and \(B\) are returned by SP-FOHSIC as the most informative features.
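The forward-selection counterpart could be sketched analogously, again with the biased HSIC estimate standing in for HSIC2 and with illustrative names; it reuses `rbf_similarity` and `hsic_biased` from the sketches above.

```python
def sp_fohsic(X, n_keep, sigma=1.0):
    """Greedy forward selection: repeatedly add the feature that makes the
    similarity matrix of the selected subset most dependent on the full one."""
    K_full = rbf_similarity(X, sigma)              # fixed reference similarity K^phi
    selected, candidates = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        scores = [hsic_biased(K_full, rbf_similarity(X[:, selected + [f]], sigma))
                  for f in candidates]
        selected.append(candidates.pop(int(np.argmax(scores))))  # e.g. B, then A
    return selected                                # indices of the selected features
```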
As the last example, SPC-FOHSIC is run on the dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, denoted \(G=\left\{ {A,~B,~C,D,E,F} \right\}\). As in the previous experiments, our aim is to find the \(n'=2\) most informative features.
Again, using the gap statistic method, these 6 features are grouped into \(l=4\) clusters, which are determined via the k-means method. Using the cluster centroids, the representative features \({G_c}=\left\{ {A,~B,~C,F} \right\}\) establish the new feature space, wherein the dataset \({X_G}\) is represented as \({X_{{G_c}}}\).
The next steps of SPC-FOHSIC are the same as in SP-FOHSIC. Again, using the RBF kernel function as \(\phi (.)\), the sample similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed.
Also, \(\psi (.)\) is used to compute the sample similarity matrices \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ F\} }}^{\psi }\).
Similarly, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of these matrices, and the values are stored in \({H_1}\).
From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ F\} }}^{\psi }\) is the highest, so feature \(F\) should be selected. In the second phase, the similarity matrices \(K_{{\{ F,A\} }}^{\psi }\), \(K_{{\{ F,B\} }}^{\psi }\) and \(K_{{\{ F,C\} }}^{\psi }\) are computed.
Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ F,A\} }}^{\psi }\), \(K_{{\{ F,B\} }}^{\psi }\) and \(K_{{\{ F,C\} }}^{\psi }\), and the values are stored in \({H_2}\).
According to \({H_2}\), feature \(B\) is the next selected feature. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-FOHSIC algorithm.
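Analogously to SPC-BAHSIC, a hypothetical SPC-FOHSIC simply composes the clustering sketch with the forward-selection sketch; under the same assumptions, a call such as `spc_fohsic(X_G, n_keep=2, l=4)` would play the role of the run described above.

```python
def spc_fohsic(X, n_keep, l, sigma=1.0):
    """Clustered forward selection: one representative feature per k-means
    cluster, then the SP-FOHSIC sketch restricted to those columns."""
    reps = cluster_representative_features(X, l)   # e.g. {A, B, C, F} in the example
    chosen = sp_fohsic(X[:, reps], n_keep, sigma)  # indices within reps
    return [reps[i] for i in chosen]               # map back to original feature indices
```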