Abstract
Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust in the presence of noisy features and/or data sample fluctuations. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, as several issues naturally arise, such as: how “much” stability is enough, or how can stability be effectively associated with intrinsic data properties. In this work we take these issues into account and explore the effect of stability maximization in the continuous (PCA-based) \(k\)-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm’s stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features’ variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features maximizing stability in a greedy fashion. We also analyze several stability-related properties of Sparse PCA that support it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of them have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA-based \(k\)-means. Apart from this qualitative evaluation, we have also assessed our approach as a feature selection method for \(k\)-means clustering using four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
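To make the greedy setup concrete, the following is a minimal illustrative sketch, not the exact algorithm of the paper: it assumes a stability score of the form \(\lambda _1(Cov_I)-\mathbf{Tr}(Cov_I)/\mathbf{card}(I)\), in the spirit of the cluster-separation versus variance tradeoff described above, and greedily adds the feature that improves this score the most; the function name, parameters and seeding rule are hypothetical.

```python
import numpy as np

def greedy_stability_selection(X, n_features):
    """Illustrative greedy sketch (hypothetical objective): repeatedly add the
    feature that maximizes lambda_1(Cov_I) - trace(Cov_I)/|I| over the
    currently selected feature subset I.  X is feature-by-instance
    (rows = features, columns = data samples)."""
    X_fc = X - X.mean(axis=1, keepdims=True)        # center every feature (row)
    # Seed with the highest-variance feature: for a single feature the score
    # above is identically zero, so some seeding rule is needed (arbitrary choice).
    seed = int(np.argmax((X_fc ** 2).sum(axis=1)))
    selected = [seed]
    remaining = [f for f in range(X_fc.shape[0]) if f != seed]
    while len(selected) < n_features and remaining:
        best_score, best_f = -np.inf, None
        for f in remaining:
            idx = selected + [f]
            cov = X_fc[idx] @ X_fc[idx].T            # Gram matrix of the candidate subset
            score = np.linalg.eigvalsh(cov)[-1] - np.trace(cov) / len(idx)
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy usage: 50 features, 30 samples, keep 5 features.
rng = np.random.default_rng(0)
print(greedy_stability_selection(rng.normal(size=(50, 30)), 5))
```

The bounds of Theorems 1 and 2 in the Appendix indicate how such greedy candidate evaluations can be lower-bounded without recomputing a full eigen-decomposition for every candidate feature.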
Notes
With the term continuous \(k\)-means clustering problem we refer to the continuous relaxation approach for approximating \(k\)-means (Ding and He 2004).
This is because \(\mathbf{Trace}(X_{fc}^TX_{fc})=\mathbf{Trace}(X_{fc}X_{fc}^T)\).
We will refer to this property as “structured variance contribution” of a feature.
References
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: ACM SIGKDD
Cho H (2010) Data transformation for sum squared residue. In: Zaki MJ, Yu JX, Ravindran B, Pudi V (eds) PAKDD (1). Lecture notes in computer science, vol 6118. Springer, Berlin, pp 48–55
Chomez P, De Backer O, Bertrand M, De Plaen E, Boon T, Lucas S (2001) An overview of the MAGE gene family with the identification of all human members of the family. Cancer Res 61(14):5544–5551
d’Aspremont A, Bach FR, Ghaoui LE (2007) Full regularization path for sparse principal component analysis. In: ICML
d’Aspremont A, Bach F, Ghaoui LE (2008) Optimal solutions for sparse principal component analysis. J Mach Learn Res 9:1269–1294
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD
Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. In: ACM SIGKDD
Ding CHQ, He X (2004) K-means clustering via principal component analysis. In: ICML
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman Hall, New York
Golub GH, Loan CFV (1996) Matrix computations. The Johns Hopkins University Press, Baltimore
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2–3):107–145
Han Y, Yu L (2010) A variance reduction framework for stable feature selection. In: IEEE ICDM
He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: NIPS
Huang L, Yan D, Jordan MI, Taft N (2008) Spectral clustering with perturbed data. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems 21, Proceedings of the twenty-second annual conference on neural information processing systems, Vancouver, BC, Canada, December 8–11, 2008. MIT Press, Cambridge, pp 705–712
Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116
Loscalzo S, Yu L, Ding CHQ (2009) Consensus group stable feature selection. In: ACM SIGKDD
Mackey L (2008) Deflation methods for sparse PCA. In: NIPS
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Mavroeidis D, Bingham E (2008) Enhancing the stability of spectral ordering with sparsification and partial supervision: application to paleontological data. In: Proceedings of the 2008 eighth IEEE international conference on data mining. IEEE Computer Society, Washington, pp 462–471. doi:10.1109/ICDM.2008.120
Mavroeidis D, Bingham E (2010) Enhancing the stability and efficiency of spectral ordering with partial supervision and feature selection. Knowl Inf Syst 23:243–265
Mavroeidis D, Magdalinos P (2012) A sequential sampling framework for spectral k-means based on efficient bootstrap accuracy estimations: application to distributed clustering. ACM Trans Knowl Discov Data 7(2)
Mavroeidis D, Marchiori E (2011) A novel stability based feature selection framework for k-means clustering. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases—part II, ECML PKDD’11. Springer, Berlin, pp 421–436
Mavroeidis D, Vazirgiannis M (2007) Stability based sparse LSI/PCA: incorporating feature selection in LSI and PCA. In: Proceedings of the 18th European conference on machine learning, ECML ’07. Springer, Berlin, pp 226–237
Munson MA, Caruana R (2009) On feature selection, bias-variance, and bagging. In: ECML/PKDD
Nicolas E, Ramus C, Berthier S, Arlotto M, Bouamrani A, Lefebvre C, Morel F, Garin J, Ifrah N, Berger F, Cahn JY, Mossuz P (2011) Expression of S100A8 in leukemic cells predicts poor survival in de novo AML patients. Leukemia 25:57–65
Saeys Y, Abeel T, de Peer YV (2008) Robust feature selection using ensemble feature selection techniques. In: ECML/PKDD
Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19:1090–1099
Scupoli M, Donadelli M, Cioffi F, Rossi M, Perbellini O, Malpeli G, Corbioli S, Vinante F, Krampera M, Palmieri M, Scarpa A, Ariola C, Foa R, Pizzolo G (2008) Bone marrow stromal cells and the upregulation of interleukin-8 production in human T-cell acute lymphoblastic leukemia through the CXCL12/CXCR4 axis and the NF-kappaB and JNK/AP-1 pathways. Haematologica 93(4):524–532
Shahzad A, Knapp M, Lang I, Kohler G (2010) Interleukin 8 (IL-8)—a universal biomarker? Int Arch Med 3(11)
Stewart GW, Sun JG (1990) Matrix perturbation theory. Computer science and scientific computing. Academic Press, Boston
Waugh D, Wilson C (2008) The interleukin-8 pathway in cancer. Clin Cancer Res 14(21):6735–6741
Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887
Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM, New York, pp 877–886
Yu L, Ding CHQ, Loscalzo S (2008) Stable feature selection via dense feature groups. In: ACM SIGKDD
Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning, ICML ’07. ACM, New York, pp 1151–1157
Additional information
Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.
Appendix
Proof
(Proof of Lemma 1) Based on Ding and He (2004), the continuous solution for the instance clusters is derived from the \(k-1\) dominant eigenvectors of the matrix \(X_{fc}^TX_{fc}\), where \(X_{fc}\) is a feature-instance matrix whose rows (features) are centered. Since \(X\) is double-centered, the sums of its rows and columns are equal to 0, i.e. \(\sum _iX_{ij}=\sum _jX_{ij}=0\), so the feature-centered matrix coincides with \(X\), i.e. \(X_{fc}=X\). Thus, the continuous solution of Spectral \(k\)-means (for instance clustering) is derived from the \(k-1\) dominant eigenvectors of the matrix \(X^TX\).
Analogously, the continuous solution for the feature clusters is derived from the \(k-1\) dominant eigenvectors of the matrix \(X_{ic}X_{ic}^T\), where \(X_{ic}\) is a feature-instance matrix whose columns (instances) are centered. Since \(X\) is double-centered, the instance-centered matrix \(X_{ic}\) is equal to \(X\), i.e. \(X_{ic}=X\). Thus, the continuous feature-cluster solution is derived from the \(k-1\) dominant eigenvectors of the matrix \(XX^T\).
Using basic linear algebra, one can easily verify that the matrices \(XX^T\) and \(X^TX\) have exactly the same nonzero eigenvalues.
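For completeness, the standard argument is the following: if \(X^TXv=\lambda v\) with \(\lambda \ne 0\), then multiplying both sides by \(X\) gives
\[
XX^T(Xv)=\lambda (Xv),
\]
so every nonzero eigenvalue of \(X^TX\) is also an eigenvalue of \(XX^T\); the symmetric argument gives the reverse inclusion.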
Thus \(\lambda _{k-1}(XX^T)-\lambda _{k}(XX^T)=\lambda _{k-1}(X^TX) -\lambda _{k}(X^TX)\), and the stability of the relevant eigenspaces will be equivalent. \(\square \)
Proof
(Proof of Theorem 1) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\). For \(\lambda _{1}(Cov)\) we have:
In the above derivations we have used the following easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA)\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\), the latter holding because \(u\) is a binary feature-indicator vector. For \(\mathbf{Tr}(Cov)\) we have:
Based on the above we can write:
The vector \(v\) that maximizes the expression \(\max _{||v||=1}\sum _{i\in I\cup \{m_s\}}(v^Tx_{fc}(i))^2\) is the dominant eigenvector of the matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is defined by the feature-set \(I\cup \{m_s\}\) (recall that this expression was derived from \(\lambda _1(Cov)\)). Thus, based on the fact that \(\lambda _1(A)\ge v^TAv\) for every \(v\) with \(||v||=1\), we can substitute any other normalized vector and obtain a lower bound on the expression. To derive a tight bound we employ the dominant eigenvector \(v_{(I)}\) of the matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is now defined by the feature-set \(I\) (i.e. without feature \(m_s\)).
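Written out, and following the convention above that \(v_{(I)}\) maximizes the corresponding sum over the feature-set \(I\), this bounding step reads
\[
\max _{||v||=1}\sum _{i\in I\cup \{m_s\}}\big(v^Tx_{fc}(i)\big)^2
\;\ge\;
\sum _{i\in I\cup \{m_s\}}\big(v_{(I)}^Tx_{fc}(i)\big)^2
=\lambda _1\big(Cov_{(I)}\big)+\big(v_{(I)}^Tx_{fc}(m_s)\big)^2 ,
\]
where \(Cov_{(I)}\) denotes the covariance matrix defined by the feature-set \(I\): the bound equals the dominant eigenvalue on \(I\) plus the squared projection of the candidate feature onto \(v_{(I)}\).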
Based on the above we can write:
\(\square \)
Proof
(Proof of Theorem 2) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\).
For \(\lambda _{1}(Cov)\) we have:
In the above derivations we have used the following easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA), \,C_m^u=(C_m^u)^T, \,C_m^u=(C_m^u)^2\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\); the latter three hold because \(C_m^u\) is a symmetric centering projection and \(u\) is a binary feature-indicator vector.
For \(\mathbf{Tr}(Cov)\) we have:
In these derivations we have used the following properties of the matrix Trace: \(\mathbf{Tr}(AA^T)=\mathbf{Tr}(A^TA), \,\mathbf{Tr}(A+B)=\mathbf{Tr}(A)+\mathbf{Tr}(B)\) and \(\mathbf{Tr}(\beta A)=\beta \mathbf{Tr}(A)\).
Now for \(\mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)X_{fc})\) we have:
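An elementary identity that applies to this term: since \(u\) is a binary indicator vector and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\),
\[
\mathbf{Tr}\big(X_{fc}^T\mathbf{diag}(u)X_{fc}\big)
=\mathbf{Tr}\big(\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\big)
=\sum _{i:\,u_i=1}||x_{fc}(i)||^2 ,
\]
i.e. this trace accumulates the (unnormalized) variances of the centered rows of the selected features.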
Finally, for \(\frac{1}{\mathbf{card}(u)}\mathbf{Tr}(X_{fc}^T\mathbf{diag} (u)e_me^T_m\mathbf{diag}(u)X_{fc})\) we have:
Based on the derivations above, we can write the objective function in Eq. 4 as:
where
We can now employ the fact that for all \(v\) such that \(||v||=1\) it holds that \(\lambda _1(A)\ge v^TAv\) and replace the \(v\) vector in equations \(O_1\) and \(O_2\) with a fixed normalized vector. For deriving a tight bound we employ \(v_{(I)}\), the dominant eigenvector of \(C_m^uX_{fc}X_{fc}^TC_m^u\), where \(u\) is defined based on the feature set \(I\) (excluding feature \(m_s\)).
Based on the above (with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)), we can write \(O_1(I\cup \{m_s\})\) as:
Now \(O_2(I\cup \{m_s\})\) can be written as (again with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)):
To complete the proof we must express \(\frac{1}{\mathbf{card}(I)+1}O_2(I)\) in the form \(\frac{1}{\mathbf{card}(I)}O_2(I)+C\). To this end, we use the fact that \(\frac{1}{\mathbf{card}(I)+1}=\frac{1}{\mathbf{card}(I)} -\frac{1}{\mathbf{card}(I)(\mathbf{card}(I)+1)}\) and write \(Obj(I\cup \{m_s\})\) as:
With some simple algebraic manipulations, the final bound follows. \(\square \)
Proof
(Proof of Theorem 3) Recall that in Schur complement deflation, the deflation step is performed as follows:
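The Schur complement deflation update, as introduced by Mackey (2008), takes the form
\[
A_t=A_{t-1}-\frac{A_{t-1}x_tx_t^TA_{t-1}}{x_t^TA_{t-1}x_t},
\]
where \(x_t\) is the direction being deflated at step \(t\).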
Now if we consider that \(A_t=X_{fc}^{(t)}(X_{fc}^{(t)})^T\) and also that \(x_t\) is the dominant eigenvector of matrix \(Cov^{(t-1)}\), we can write:
Recall that \(x_t\) is the dominant eigenvector of \(Cov^{(t-1)}\), which can be written as \(Cov^{(t-1)}=C_m^{u(t-1)}X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^TC_m^{u(t-1)}\) (i.e. it is based on the selected feature subset \(u(t-1)\)).
Since \(Cov^{(t-1)}\) is a double-centered matrix (its rows and columns are centered through the multiplication with \(C_m^{u(t-1)}\)), its dominant eigenvector is also centered, i.e. \((C_m^{u(t-1)})x_t=x_t\); indeed, using \((C_m^{u(t-1)})^2=C_m^{u(t-1)}\), an eigenpair \(Cov^{(t-1)}x_t=\lambda x_t\) with \(\lambda \ne 0\) satisfies \(C_m^{u(t-1)}x_t=\frac{1}{\lambda }C_m^{u(t-1)}Cov^{(t-1)}x_t=\frac{1}{\lambda }Cov^{(t-1)}x_t=x_t\). Based on this property, we can write:
It should be noted that if we are analyzing one-way stable Sparse PCA, we can directly derive that \(x_t^TX_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t=\lambda _{\max }^{(t-1)}\). Now, \((X_{fc}^{(t-1)})^Tx_t\) can be written as:
where \(v_t\) is the dominant eigenvector of \((X_{fc}^{(t-1)})^TC_m^{u(t-1)}X_{fc}^{(t-1)}\) and \(\lambda _{\max }^{(t-1)}\) is the dominant eigenvalue of \(Cov^{(t-1)}\).
We should again note that in the one-way stable case, we can directly derive that \((X_{fc}^{(t-1)})^Tx_t=\sqrt{\lambda _{\max }^{(t-1)}}v_t\).
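This one-way relation is the usual correspondence between the left and right dominant singular vectors: assuming, as the one-way statement above suggests, that \(x_t\) is a unit-norm dominant eigenvector of \(X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^T\),
\[
X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t=\lambda _{\max }^{(t-1)}x_t
\;\Longrightarrow\;
v_t:=\frac{(X_{fc}^{(t-1)})^Tx_t}{\sqrt{\lambda _{\max }^{(t-1)}}}\ \text{is unit-norm and satisfies}\ (X_{fc}^{(t-1)})^TX_{fc}^{(t-1)}v_t=\lambda _{\max }^{(t-1)}v_t .
\]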
Based on Eqs. 11–13 we can write
\(\square \)