Feature selection for k-means clustering stability: theoretical analysis and an algorithm

Abstract

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to the presence of noisy features and/or data sample fluctuations. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, as several issues naturally arise, such as: how “much” stability is enough, or how stability can be effectively associated with intrinsic data properties. In this work we take these issues into account and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm’s stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features’ variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features that maximize stability in a greedy fashion. In our study, we also analyze several properties of Sparse PCA relevant to stability that promote Sparse PCA as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of them have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA based k-means. Apart from the qualitative evaluation, we have also evaluated our approach as a feature selection method for \(k\)-means clustering on four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.

Notes

  1. With the term continuous \(k\)-means clustering problem we refer to the continuous relaxation approach for approximating \(k\)-means (Ding and He 2004); a minimal numerical sketch of this relaxation is given after these notes.

  2. This is because \(\mathbf{Trace}(X_{fc}^TX_{fc})=\mathbf{Trace}(X_{fc}X_{fc}^T)\).

  3. We will refer to this property as “structured variance contribution” of a feature.
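
To make Note 1 concrete, the following is a minimal numpy sketch of the continuous relaxation (our own illustration, not code from the paper; the toy data and variable names are assumptions): for \(k=2\) the relaxed cluster indicator is the dominant eigenvector of \(X_{fc}^TX_{fc}\), and its sign pattern recovers the two clusters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data in the paper's feature-instance orientation: rows are features,
# columns are instances; two well-separated groups of 50 instances each.
m, n_per = 20, 50
X = np.hstack([rng.normal(-2.0, 1.0, (m, n_per)),
               rng.normal(+2.0, 1.0, (m, n_per))])

Xfc = X - X.mean(axis=1, keepdims=True)            # center every feature (row)

# Continuous relaxation (Ding and He 2004): the relaxed k-means indicators are
# the k-1 dominant eigenvectors of X_fc^T X_fc.  For k = 2 a single eigenvector
# suffices and its sign pattern yields the two clusters.
_, vecs = np.linalg.eigh(Xfc.T @ Xfc)
relaxed_indicator = vecs[:, -1]

labels = (relaxed_indicator > 0).astype(int)
print(labels[:n_per].sum(), labels[n_per:].sum())  # one group all 0s, the other all 1s
```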

References

  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140

  • Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: ACM SIGKDD

  • Cho H (2010) Data transformation for sum squared residue. In: Zaki MJ, Yu JX, Ravindran B, Pudi V (eds) PAKDD (1). Lecture notes in computer science, vol 6118. Springer, Berlin, pp 48–55

  • Chomez P, De Backer O, Bertrand M, De Plaen E, Boon T, Lucas S (2001) An overview of the MAGE gene family with the identification of all human members of the family. Cancer Res 61(14):5544–5551

  • d’Aspremont A, Bach FR, Ghaoui LE (2007) Full regularization path for sparse principal component analysis. In: ICML

  • d’Aspremont A, Bach F, Ghaoui LE (2008) Optimal solutions for sparse principal component analysis. J Mach Learn Res 9:1269–1294

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD

  • Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. In: ACM SIGKDD

  • Ding CHQ, He X (2004) K-means clustering via principal component analysis. In: ICML

  • Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889

  • Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman Hall, New York

  • Golub GH, Loan CFV (1996) Matrix computations. The Johns Hopkins University Press, Baltimore

  • Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537

  • Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2–3):107–145

  • Han Y, Yu L (2010) A variance reduction framework for stable feature selection. In: IEEE ICDM

  • He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: NIPS

  • Huang L, Yan D, Jordan MI, Taft N (2008) Spectral clustering with perturbed data. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems 21, Proceedings of the twenty-second annual conference on neural information processing systems, Vancouver, BC, Canada, December 8–11, 2008. MIT Press, Cambridge, pp 705–712

  • Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116

  • Loscalzo S, Yu L, Ding CHQ (2009) Consensus group stable feature selection. In: ACM SIGKDD

  • Mackey L (2008) Deflation methods for sparse PCA. In: NIPS

  • Manning CD, Raghavan P, Schtze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

  • Mavroeidis D, Bingham E (2008) Enhancing the stability of spectral ordering with sparsification and partial supervision: application to paleontological data. In: Proceedings of the 2008 eighth IEEE international conference on data mining. IEEE Computer Society, Washington, pp 462–471. doi:10.1109/ICDM.2008.120

  • Mavroeidis D, Bingham E (2010) Enhancing the stability and efficiency of spectral ordering with partial supervision and feature selection. Knowl Inf Syst 23:243–265

  • Mavroeidis D, Magdalinos P (2012) A sequential sampling framework for spectral k-means based on efficient bootstrap accuracy estimations: application to distributed clustering. ACM Trans Knowl Discov Data 7(2)

  • Mavroeidis D, Marchiori E (2011) A novel stability based feature selection framework for k-means clustering. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases—part II, ECML PKDD’11. Springer, Berlin, pp 421–436

  • Mavroeidis D, Vazirgiannis M (2007) Stability based sparse LSI/PCA: incorporating feature selection in LSI and PCA. In: Proceedings of the 18th European conference on machine learning, ECML ’07. Springer, Berlin, pp 226–237

  • Munson MA, Caruana R (2009) On feature selection, bias-variance, and bagging. In: ECML/PKDD

  • Nicolas E, Ramus C, Berthier S, Arlotto M, Bouamrani A, Lefebvre C, Morel F, Garin J, Ifrah N, Berger F, Cahn JY, Mossuz P (2011) Expression of S100A8 in leukemic cells predicts poor survival in de novo AML patients. Leukemia 25:57–65

  • Saeys Y, Abeel T, de Peer YV (2008) Robust feature selection using ensemble feature selection techniques. In: ECML/PKDD

  • Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19:1090–1099

  • Scupoli M, Donadelli M, Cioffi F, Rossi M, Perbellini O, Malpeli G, Corbioli S, Vinante F, Krampera M, Palmieri M, Scarpa A, Ariola C, Foa R, Pizzolo G (2008) Bone marrow stromal cells and the upregulation of interleukin-8 production in human T-cell acute lymphoblastic leukemia through the cxcl12/cxcr4 axis and the nf-kappab and jnk/ap-1 pathways. Haematologica 93(4):524–532

  • Shahzad A, Knapp M, Lang I, Kohler G (2010) Interleukin 8 (IL-8)—a universal biomarker? Int Arch Med 3(11)

  • Stewart GW, Sun JG (1990) Matrix perturbation theory. Computer science and scientific computing. Academic Press, Boston

  • Waugh D, Wilson C (2008) The interleukin-8 pathway in cancer. Clin Cancer Res 14(21):6735–6741

  • Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887

  • Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM, New York, pp 877–886

  • Yu L, Ding CHQ, Loscalzo S (2008) Stable feature selection via dense feature groups. In: ACM SIGKDD

  • Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning, ICML ’07. ACM, New York, pp 1151–1157

Author information

Corresponding author

Correspondence to Dimitrios Mavroeidis.

Additional information

Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

Appendix

Proof

(Proof of Lemma 1) Based on Ding and He (2004), the continuous solution for the instance clusters is derived by the \(k-1\) dominant eigenvectors of matrix \(X_{fc}^TX_{fc}\), where \(X_{fc}\) is a feature-instance matrix with the rows (features) being centered. Since \(X\) is double-centered, its row and column sums are equal to 0, i.e. \(\sum _iX_{ij}=\sum _jX_{ij}=0\). Thus, the continuous solution of Spectral \(k\)-means (for instance clustering) will be derived by the \(k-1\) dominant eigenvectors of matrix \(X^TX\).

Analogously, the continuous solution for the feature clusters is derived by the \(k-1\) dominant eigenvectors of matrix \(X_{ic}X_{ic}^T\), where \(X_{ic}\) is a feature-instance matrix with the columns (instances) being centered. Since \(X\) is double-centered, we will have that the instance-centered matrix \(X_{ic}\) will be equal to \(X\), i.e. \(X_{ic}=X\). Thus, the continuous cluster solution will be derived by the dominant eigenvectors of matrix \(XX^T\).

Using basic linear algebra one can easily derive that the matrices \(XX^T\) and \(X^TX\) have exactly the same nonzero eigenvalues.

Thus \(\lambda _{k-1}(XX^T)-\lambda _{k}(XX^T)=\lambda _{k-1}(X^TX) -\lambda _{k}(X^TX)\), and the stability of the relevant eigenspaces will be equivalent. \(\square \)
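
As a quick numerical sanity check of the argument above (our own sketch on a small random double-centered matrix, not taken from the paper), \(XX^T\) and \(X^TX\) indeed share their nonzero eigenvalues, and hence the eigen-gap \(\lambda _{k-1}-\lambda _{k}\) that governs stability is identical for instance and feature clustering:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 8, 12, 3

# Double-center a random feature-instance matrix: row and column sums become 0.
A = rng.normal(size=(m, n))
X = A - A.mean(axis=0, keepdims=True) - A.mean(axis=1, keepdims=True) + A.mean()

ev_rows = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # spectrum of XX^T
ev_cols = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]   # spectrum of X^TX

# The nonzero eigenvalues coincide, hence the eigen-gaps coincide as well.
print(np.allclose(ev_rows, ev_cols[:m]))                              # True
print(np.isclose(ev_rows[k - 2] - ev_rows[k - 1],
                 ev_cols[k - 2] - ev_cols[k - 1]))                    # True
```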

Proof

(Proof of Theorem 1) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\). For \(\lambda _{1}(Cov)\) we have:

$$\begin{aligned} \lambda _{1}(Cov)&= \lambda _{1}(\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u))\\&= \lambda _{1}(X_{fc}^T\mathbf{diag}(u)X_{fc})\\&= \max \limits _{||v||=1}v^T(X_{fc}^T\mathbf{diag}(u)X_{fc})v\\&= \max \limits _{||v||=1}\sum \limits _{i=1}^mu_i(v^Tx_{fc}(i))^2 \end{aligned}$$

In the above derivations we have used the following, easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA)\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\). For \(\mathbf{Tr}(Cov)\) we have:

$$\begin{aligned} \mathbf{Tr}(Cov)&= \mathbf{Tr}(\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u))\\&= \sum \limits _{i=1}^m u_i x_{fc}(i)^Tx_{fc}(i) \end{aligned}$$

Based on the above we can write:

$$\begin{aligned} Obj_{(ows)}(I\cup \{m_s\}) =\max \limits _{||v||=1}\sum \limits _{i\in I\cup \{m_s\}}(v^Tx_{fc}(i))^2-\frac{1}{n}\sum \limits _{i\in I\cup \{m_s\}} x_{fc}(i)^Tx_{fc}(i) \end{aligned}$$

The vector \(v\) that maximizes the expression \(\max _{||v||=1}\sum _{i\in I\cup \{m_s\}}(v^Tx_{fc}(i))^2\) is the dominant eigenvector of matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is defined by the feature-set \(I\cup \{m_s\}\) (recall that this expression was derived from \(\lambda _1(Cov)\)). Thus, based on the fact that for all \(v\) such that \(||v||=1\) it holds that \(\lambda _1(A)\ge v^TAv\), we can employ any other normalized vector and compute a lower bound to the expression. In order to derive a useful tight bound we employ here the eigenvector \(v_{(I)}\) of matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is defined by the feature-set \(I\) (i.e. without feature \(\{m_s\}\)).

Based on the above we can write:

$$\begin{aligned} Obj_{(ows)}(I\cup \{m_s\})&\ge \sum \limits _{i\in I\cup \{m_s\}}(v_{(I)}^Tx_{fc}(i))^2-\frac{1}{n}\sum \limits _{i\in I\cup \{m_s\}} x_{fc}(i)^Tx_{fc}(i)\\&=\sum \limits _{i\in I}(v_{(I)}^Tx_{fc}(i))^2-\frac{1}{n}\sum \limits _{i\in I} x_{fc}(i)^Tx_{fc}(i)\\&+(v_{(I)}^Tx_{fc}(m_s))^2-\frac{1}{n} x_{fc}(m_s)^Tx_{fc}(m_s)\\&=Obj_{(ows)}(I)+(v_{(I)}^Tx_{fc}(m_s))^2-\frac{1}{n} x_{fc}(m_s)^Tx_{fc}(m_s) \end{aligned}$$

\(\square \)
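
The bound of Theorem 1 is what makes the greedy step cheap: once \(v_{(I)}\) is available, each candidate feature \(m_s\) can be scored by \((v_{(I)}^Tx_{fc}(m_s))^2-\frac{1}{n}x_{fc}(m_s)^Tx_{fc}(m_s)\) without a new eigendecomposition per candidate. The sketch below is our own numpy illustration of this scoring rule for the one-way stable objective (not the authors' complete algorithm; the data and function names are assumptions) and numerically checks the stated lower bound:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 15, 40
Xfc = rng.normal(size=(m, n))
Xfc -= Xfc.mean(axis=1, keepdims=True)             # row-centered feature-instance matrix

def obj_ows(feats):
    """Obj_(ows): lambda_1(Cov) - Trace(Cov)/n for the feature subset `feats`."""
    S = Xfc[list(feats)]
    cov = S @ S.T
    return np.linalg.eigvalsh(cov)[-1] - np.trace(cov) / n

I = [0, 3, 7]                                      # currently selected feature set
S = Xfc[I]
# Maximizer of sum_{i in I} (v^T x_fc(i))^2 over unit v: dominant eigenvector of S^T S.
v_I = np.linalg.eigh(S.T @ S)[1][:, -1]

best_gain, best_feat = -np.inf, None
for ms in set(range(m)) - set(I):
    x = Xfc[ms]
    gain = (v_I @ x) ** 2 - (x @ x) / n            # cheap score from the Theorem 1 bound
    # Theorem 1: Obj_(ows)(I + [ms]) >= Obj_(ows)(I) + gain
    assert obj_ows(I + [ms]) >= obj_ows(I) + gain - 1e-10
    if gain > best_gain:
        best_gain, best_feat = gain, ms

print(best_feat, best_gain)                        # next feature the greedy step would add
```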

Proof

(Proof of Theorem 2) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\).

For \(\lambda _{1}(Cov)\) we have:

$$\begin{aligned} \lambda _{1}(Cov)&= \lambda _{1}(C_m^uX_{fc}X_{fc}^TC_m^u)\\&= \lambda _{1}(X_{fc}^TC_m^uX_{fc})\\&= \lambda _{1}\left( X_{fc}^T\mathbf{diag}(u)\left( I -\tfrac{1}{\mathbf{card}(u)}e_me^T_m\right) \mathbf{diag}(u)X_{fc}\right) \\&= \max \limits _{||v||=1}v^T\left( X_{fc}^T\mathbf{diag}(u) \left( I-\tfrac{1}{\mathbf{card}(u)}e_me^T_m\right) \mathbf{diag}(u)X_{fc}\right) v\\&= \max \limits _{||v||=1}v^T(X_{fc}^T\mathbf{diag}(u)X_{fc})v\\&\quad -\tfrac{1}{\mathbf{card}(u)}v^T(X_{fc}^T\mathbf{diag}(u)e_me^T_m \mathbf{diag}(u)X_{fc})v\\&= \max \limits _{||v||=1}\sum \limits _{i=1}^mu_i(v^Tx_{fc}(i))^2 -\tfrac{1}{\mathbf{card}(u)}\left( v^T\sum \limits _{i=1}^mu_ix_{fc}(i)\right) ^2 \end{aligned}$$

In the above derivations we have used the following, easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA), \,C_m^u=(C_m^u)^T, \,C_m^u=(C_m^u)^2\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\).

For \(\mathbf{Tr}(Cov)\) we have:

$$\begin{aligned} \mathbf{Tr}(Cov)&= \mathbf{Tr}(C_m^uX_{fc}X_{fc}^TC_m^u)\\&= \mathbf{Tr}(X_{fc}^TC_m^uX_{fc})\\&= \mathbf{Tr}\left( X_{fc}^T\mathbf{diag}(u) \left( I-\tfrac{1}{\mathbf{card}(u)}e_me^T_m\right) \mathbf{diag}(u)X_{fc}\right) \\&= \mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)X_{fc})\\&\quad -\tfrac{1}{\mathbf{card}(u)}\mathbf{Tr}(X_{fc}^T \mathbf{diag}(u)e_me^T_m\mathbf{diag}(u)X_{fc}) \end{aligned}$$

In these derivations we have used the following properties of the matrix Trace: \(\mathbf{Tr}(AA^T)=\mathbf{Tr}(A^TA), \,\mathbf{Tr}(A+B)=\mathbf{Tr}(A)+\mathbf{Tr}(B)\) and \(\mathbf{Tr}(\beta A)=\beta \mathbf{Tr}(A)\).

Now for \(\mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)X_{fc})\) we have:

$$\begin{aligned} \mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)X_{fc})&= \mathbf{Tr}(\mathbf{diag} (u)X_{fc}X_{fc}^T\mathbf{diag}(u))\\&= \sum \limits _{i=1}^m u_i x_{fc}(i)^Tx_{fc}(i) \end{aligned}$$

Finally, for \(\frac{1}{\mathbf{card}(u)}\mathbf{Tr}(X_{fc}^T\mathbf{diag} (u)e_me^T_m\mathbf{diag}(u)X_{fc})\) we have:

$$\begin{aligned}&\tfrac{1}{\mathbf{card}(u)}\mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)e_me^T_m \mathbf{diag}(u)X_{fc})\\&=\tfrac{1}{\mathbf{card}(u)}\mathbf{Tr}(e^T_m\mathbf{diag}(u)X_{fc}X_{fc}^T \mathbf{diag}(u)e_m)\\&=\tfrac{1}{\mathbf{card}(u)}\left( \sum \limits _{i=1}^m u_ix_{fc}(i)\right) ^T\left( \sum \limits _{i=1}^m u_ix_{fc}(i)\right) \end{aligned}$$

Based on the above derivations we can write the objective function in Eq. 4 as:

$$\begin{aligned} \max \limits _{||v||=1}\max \limits _{u\in \{0,1\}^m}\left[ O_1-\frac{1}{\mathbf{card}(u)}O_2\right] \end{aligned}$$
(8)

where

$$\begin{aligned} O_1&= \sum \limits _{i=1}^mu_i\left[ (v^Tx_{fc}(i))^2-\frac{1}{n} x_{fc}(i)^Tx_{fc}(i)\right] \end{aligned}$$
(9)
$$\begin{aligned} O_2&= \left( v^T\sum \limits _{i=1}^mu_ix_{fc}(i)\right) ^2-\frac{1}{n}\left( \sum \limits _{i=1}^m u_ix_{fc}(i)\right) ^T\left( \sum \limits _{i=1}^m u_ix_{fc}(i)\right) \end{aligned}$$
(10)

We can now employ the fact that for all \(v\) such that \(||v||=1\) it holds that \(\lambda _1(A)\ge v^TAv\) and replace the \(v\) vector in equations \(O_1\) and \(O_2\) with a fixed normalized vector. For deriving a tight bound we employ \(v_{(I)}\), the dominant eigenvector of \(C_m^uX_{fc}X_{fc}^TC_m^u\), where \(u\) is defined based on the feature set \(I\) (excluding feature \(m_s\)).

Based on the above (with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)), we can write \(O_1(I\cup \{m_s\})\) as:

$$\begin{aligned} O_1(I\cup \{m_s\})\ge O_1(I)+(v_{(I)}^Tx_{fc}(m_s))^2-\frac{1}{n} x_{fc}(m_s)^Tx_{fc}(m_s) \end{aligned}$$

Now \(O_2(I\cup \{m_s\})\) can be written as (again with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)):

$$\begin{aligned} O_2(I\cup \{m_s\})&\ge \left( v_{(I)}^T\sum \limits _{i\in I\cup \{m_s\}}x_{fc}(i)\right) ^2\!-\!\frac{1}{n}\left( \sum \limits _{i\in I\cup \{m_s\}} x_{fc}(i)\right) ^T\left( \sum \limits _{i\in I\cup \{m_s\}} x_{fc}(i)\right) \\&= O_2(I)+(v_{(I)}^Tx_{fc}(m_s))^2+2\left( \sum \limits _{i\in I}v_{(I)}^Tx_{fc}(i)\right) (v_{(I)}^Tx_{fc}(m_s))\\&-\frac{1}{n}x_{fc}(m_s)^Tx_{fc}(m_s)-\frac{2}{n} \left( \sum \limits _{i\in I}x_{fc}(i)\right) ^T x_{fc}(m_s) \end{aligned}$$

In order to complete the proof we must express \(\frac{1}{\mathbf{card}(I)+1}O_2(I)\) in the form \(\frac{1}{\mathbf{card}(I)}O_2(I)+C\). To this end we use the fact that \(\frac{1}{\mathbf{card}(I)+1}=\frac{1}{\mathbf{card}(I)} -\frac{1}{\mathbf{card}(I)(\mathbf{card}(I)+1)}\) and rewrite \(Obj(I\cup \{m_s\})\) accordingly; the final bound then follows with some simple algebraic manipulations. \(\square \)
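
The ingredients of this proof can also be checked numerically. The sketch below is our own; it assumes \(C_m^u=\mathbf{diag}(u)\left( I-\tfrac{1}{\mathbf{card}(u)}e_me_m^T\right) \mathbf{diag}(u)\), consistently with the derivation above, and verifies the projector identities, the eigenvalue identity \(\lambda _1(C_m^uX_{fc}X_{fc}^TC_m^u)=\lambda _1(X_{fc}^TC_m^uX_{fc})\), and the trace decomposition:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 10, 30
Xfc = rng.normal(size=(m, n))
Xfc -= Xfc.mean(axis=1, keepdims=True)             # row-centered feature-instance matrix

u = np.zeros(m)
u[[1, 4, 5, 8]] = 1.0                              # binary feature-selection vector
card = u.sum()
D = np.diag(u)
C = D @ (np.eye(m) - np.ones((m, m)) / card) @ D   # C_m^u: select and center the selected features

# Identities used in the proof: C_m^u is a symmetric, idempotent (projection) matrix.
print(np.allclose(C, C.T), np.allclose(C, C @ C))                     # True True

Cov = C @ Xfc @ Xfc.T @ C

# lambda_1(Cov) = lambda_1(X_fc^T C_m^u X_fc)
print(np.isclose(np.linalg.eigvalsh(Cov)[-1],
                 np.linalg.eigvalsh(Xfc.T @ C @ Xfc)[-1]))            # True

# Trace decomposition: Tr(Cov) = sum_i u_i ||x_fc(i)||^2 - (1/card(u)) ||sum_i u_i x_fc(i)||^2
s = (u[:, None] * Xfc).sum(axis=0)                 # sum_i u_i x_fc(i)
trace_rhs = (u * (Xfc ** 2).sum(axis=1)).sum() - (s @ s) / card
print(np.isclose(np.trace(Cov), trace_rhs))                           # True
```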

Proof

(Proof of Theorem 3) Recall that in Schur complement deflation, the deflation step is performed as follows:

$$\begin{aligned} A_t=A_{t-1}-\frac{A_{t-1}x_tx_t^TA_{t-1}}{x_t^TA_{t-1}x_t} \end{aligned}$$

Now if we consider that \(A_t=X_{fc}^{(t)}(X_{fc}^{(t)})^T\) and also that \(x_t\) is the dominant eigenvector of matrix \(Cov^{(t-1)}\), we can write:

$$\begin{aligned}&A_t=A_{t-1}-\frac{A_{t-1}x_tx_t^TA_{t-1}}{x_t^TA_{t-1}x_t} \Rightarrow X_{fc}^{(t)}(X_{fc}^{(t)})^T=X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^T\nonumber \\&\quad -\frac{X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_tx_t^TX_{fc}^{(t-1)} (X_{fc}^{(t-1)})^T}{x_t^TX_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t} \end{aligned}$$
(11)

Recall that \(x_t\) is the dominant eigenvector of \(Cov^{(t-1)}\) that can be written as \(Cov^{(t-1)}=C_m^{u(t-1)}X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^TC_m^{u(t-1)}\) (i.e. it is based on the selected feature subset \(u(t-1)\)).

Since \(Cov^{(t-1)}\) is a double-centered matrix (its rows and columns are centered through the multiplication with \(C_m^{u(t-1)}\)), its dominant eigenvector will also be centered, i.e. \((C_m^{u(t-1)})x_t=x_t\). Based on this property, we can write:

$$\begin{aligned} x_t^TX_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t&= x_t^TC_m^{u(t-1)} X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^TC_m^{u(t-1)}x_t\nonumber \\&= x_t^TCov^{(t-1)}x_t=\lambda _{\max }^{(t-1)} \end{aligned}$$
(12)

It should be noted that if we are analyzing one-way stable Sparse PCA, we can directly derive that \(x_t^TX_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t=\lambda _{\max }^{(t-1)}\). Now, \((X_{fc}^{(t-1)})^Tx_t\) can be written as:

$$\begin{aligned} (X_{fc}^{(t-1)})^Tx_t=(X_{fc}^{(t-1)})^TC_m^{u(t-1)}x_t= \sqrt{\lambda _{\max }^{(t-1)}}v_t, \end{aligned}$$
(13)

where \(v_t\) is the dominant eigenvector of \((X_{fc}^{(t-1)})^TC_m^{u(t-1)}X_{fc}^{(t-1)}\) and \(\lambda _{\max }^{(t-1)}\) is the dominant eigenvalue of \(Cov^{(t-1)}\).

We should again note that in the one-way stable case, we can directly derive that \((X_{fc}^{(t-1)})^Tx_t=\sqrt{\lambda _{\max }^{(t-1)}}v_t\).

Based on Eqs. 11–13 we can write

$$\begin{aligned} A_t&= A_{t-1}-\frac{A_{t-1}x_tx_t^TA_{t-1}}{x_t^TA_{t-1}x_t}\\ \Rightarrow X_{fc}^{(t)}(X_{fc}^{(t)})^T&= X_{fc}^{(t-1)} (X_{fc}^{(t-1)})^T-X_{fc}^{(t-1)}v_tv_t^T(X_{fc}^{(t-1)})^T\\ \Rightarrow X_{fc}^{(t)}&= X_{fc}^{(t-1)}(I-v_tv_t^T) \end{aligned}$$

\(\square \)
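
The equivalence of Theorem 3 is easy to confirm numerically. The following sketch is our own check for the simpler one-way stable case, where \(Cov^{(t-1)}=X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^T\): Schur complement deflation of the Gram matrix coincides with the data-matrix update \(X_{fc}^{(t)}=X_{fc}^{(t-1)}(I-v_tv_t^T)\).

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 10, 25
X = rng.normal(size=(m, n))
X -= X.mean(axis=1, keepdims=True)                 # row-centered data (one-way stable case)

A = X @ X.T                                        # A_{t-1} = X_fc^{(t-1)} (X_fc^{(t-1)})^T
lam, vecs = np.linalg.eigh(A)
x_t = vecs[:, -1]                                  # dominant eigenvector of A_{t-1}
lam_max = lam[-1]

# Schur complement deflation of the Gram matrix.
A_deflated = A - (A @ np.outer(x_t, x_t) @ A) / (x_t @ A @ x_t)

# Equivalent update of the data matrix: X^{(t)} = X^{(t-1)} (I - v_t v_t^T),
# with v_t = X^T x_t / sqrt(lambda_max) the dominant right singular direction.
v_t = X.T @ x_t / np.sqrt(lam_max)
X_deflated = X @ (np.eye(n) - np.outer(v_t, v_t))

print(np.allclose(A_deflated, X_deflated @ X_deflated.T))             # True
```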

Cite this article

Mavroeidis, D., Marchiori, E. Feature selection for k-means clustering stability: theoretical analysis and an algorithm. Data Min Knowl Disc 28, 918–960 (2014). https://doi.org/10.1007/s10618-013-0320-3
