Abstract
Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust in the presence of noisy features and/or data sample fluctuations. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, as several issues naturally arise, such as: how “much” stability is enough, or how can stability be effectively associated with intrinsic data properties. In this work we take these issues into account and explore the effect of stability maximization in the continuous (PCA-based) \(k\)-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm’s stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features’ variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features maximizing stability in a greedy fashion. We also analyze several stability-related properties of Sparse PCA that support it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of them have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA-based \(k\)-means. Apart from this qualitative evaluation, we have also assessed our approach as a feature selection method for \(k\)-means clustering using four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
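To make the greedy setup concrete, the following is a minimal illustrative sketch, not the exact algorithm of the paper: it assumes a stability score of the form \(\lambda _1(Cov_I)-\mathbf{Tr}(Cov_I)/\mathbf{card}(I)\), in the spirit of the cluster-separation versus variance tradeoff described above, and greedily adds the feature that improves this score the most; the function name, parameters and seeding rule are hypothetical.

```python
import numpy as np

def greedy_stability_selection(X, n_features):
    """Illustrative greedy sketch (hypothetical objective): repeatedly add the
    feature that maximizes lambda_1(Cov_I) - trace(Cov_I)/|I| over the
    currently selected feature subset I.  X is feature-by-instance
    (rows = features, columns = data samples)."""
    X_fc = X - X.mean(axis=1, keepdims=True)        # center every feature (row)
    # Seed with the highest-variance feature: for a single feature the score
    # above is identically zero, so some seeding rule is needed (arbitrary choice).
    seed = int(np.argmax((X_fc ** 2).sum(axis=1)))
    selected = [seed]
    remaining = [f for f in range(X_fc.shape[0]) if f != seed]
    while len(selected) < n_features and remaining:
        best_score, best_f = -np.inf, None
        for f in remaining:
            idx = selected + [f]
            cov = X_fc[idx] @ X_fc[idx].T            # Gram matrix of the candidate subset
            score = np.linalg.eigvalsh(cov)[-1] - np.trace(cov) / len(idx)
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy usage: 50 features, 30 samples, keep 5 features.
rng = np.random.default_rng(0)
print(greedy_stability_selection(rng.normal(size=(50, 30)), 5))
```

The bounds of Theorems 1 and 2 in the Appendix indicate how such greedy candidate evaluations can be lower-bounded without recomputing a full eigen-decomposition for every candidate feature.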
Notes
With the term continuous \(k\)-means clustering problem we refer to the continuous relaxation approach for approximating \(k\)-means (Ding and He 2004).
This is because \(\mathbf{Trace}(X_{fc}^TX_{fc})=\mathbf{Trace}(X_{fc}X_{fc}^T)\).
We will refer to this property as “structured variance contribution” of a feature.
References
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: ACM SIGKDD
Cho H (2010) Data transformation for sum squared residue. In: Zaki MJ, Yu JX, Ravindran B, Pudi V (eds) PAKDD (1). Lecture notes in computer science, vol 6118. Springer, Berlin, pp 48–55
Chomez P, De Backer O, Bertrand M, De Plaen E, Boon T, Lucas S (2001) An overview of the MAGE gene family with the identification of all human members of the family. Cancer Res 61(14):5544–5551
d’Aspremont A, Bach FR, Ghaoui LE (2007) Full regularization path for sparse principal component analysis. In: ICML
d’Aspremont A, Bach F, Ghaoui LE (2008) Optimal solutions for sparse principal component analysis. J Mach Learn Res 9:1269–1294
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: ACM SIGKDD
Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. In: ACM SIGKDD
Ding CHQ, He X (2004) K-means clustering via principal component analysis. In: ICML
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman Hall, New York
Golub GH, Loan CFV (1996) Matrix computations. The Johns Hopkins University Press, Baltimore
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17(2–3):107–145
Han Y, Yu L (2010) A variance reduction framework for stable feature selection. In: IEEE ICDM
He X, Cai D, Niyogi P (2005) Laplacian score for feature selection. In: NIPS
Huang L, Yan D, Jordan MI, Taft N (2008) Spectral clustering with perturbed data. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems 21, Proceedings of the twenty-second annual conference on neural information processing systems, Vancouver, BC, Canada, December 8–11, 2008. MIT Press, Cambridge, pp 705–712
Kalousis A, Prados J, Hilario M (2007) Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst 12(1):95–116
Loscalzo S, Yu L, Ding CHQ (2009) Consensus group stable feature selection. In: ACM SIGKDD
Mackey L (2008) Deflation methods for sparse PCA. In: NIPS
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge
Mavroeidis D, Bingham E (2008) Enhancing the stability of spectral ordering with sparsification and partial supervision: application to paleontological data. In: Proceedings of the 2008 eighth IEEE international conference on data mining. IEEE Computer Society, Washington, pp 462–471. doi:10.1109/ICDM.2008.120
Mavroeidis D, Bingham E (2010) Enhancing the stability and efficiency of spectral ordering with partial supervision and feature selection. Knowl Inf Syst 23:243–265
Mavroeidis D, Magdalinos P (2012) A sequential sampling framework for spectral k-means based on efficient bootstrap accuracy estimations: application to distributed clustering. ACM Trans Knowl Discov Data 7(2)
Mavroeidis D, Marchiori E (2011) A novel stability based feature selection framework for k-means clustering. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases—part II, ECML PKDD’11. Springer, Berlin, pp 421–436
Mavroeidis D, Vazirgiannis M (2007) Stability based sparse LSI/PCA: incorporating feature selection in LSI and PCA. In: Proceedings of the 18th European conference on machine learning, ECML ’07. Springer, Berlin, pp 226–237
Munson MA, Caruana R (2009) On feature selection, bias-variance, and bagging. In: ECML/PKDD
Nicolas E, Ramus C, Berthier S, Arlotto M, Bouamrani A, Lefebvre C, Morel F, Garin J, Ifrah N, Berger F, Cahn JY, Mossuz P (2011) Expression of S100A8 in leukemic cells predicts poor survival in de novo AML patients. Leukemia 25:57–65
Saeys Y, Abeel T, de Peer YV (2008) Robust feature selection using ensemble feature selection techniques. In: ECML/PKDD
Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19:1090–1099
Scupoli M, Donadelli M, Cioffi F, Rossi M, Perbellini O, Malpeli G, Corbioli S, Vinante F, Krampera M, Palmieri M, Scarpa A, Ariola C, Foa R, Pizzolo G (2008) Bone marrow stromal cells and the upregulation of interleukin-8 production in human T-cell acute lymphoblastic leukemia through the CXCL12/CXCR4 axis and the NF-kappaB and JNK/AP-1 pathways. Haematologica 93(4):524–532
Shahzad A, Knapp M, Lang I, Kohler G (2010) Interleukin 8 (IL-8)—a universal biomarker? Int Arch Med 3(11)
Stewart GW, Sun JG (1990) Matrix perturbation theory. Computer science and scientific computing. Academic Press, Boston
Waugh D, Wilson C (2008) The interleukin-8 pathway in cancer. Clin Cancer Res 14(21):6735–6741
Wolf L, Shashua A (2005) Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J Mach Learn Res 6:1855–1887
Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM, New York, pp 877–886
Yu L, Ding CHQ, Loscalzo S (2008) Stable feature selection via dense feature groups. In: ACM SIGKDD
Zhao Z, Liu H (2007) Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on machine learning, ICML ’07. ACM, New York, pp 1151–1157
Additional information
Responsible editor: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.
Appendix
Proof
(Proof of Lemma 1) Based on Ding and He (2004), the continuous solution for the instance clusters is derived from the \(k-1\) dominant eigenvectors of the matrix \(X_{fc}^TX_{fc}\), where \(X_{fc}\) is a feature-instance matrix whose rows (features) are centered. Since \(X\) is double-centered, the sums of its rows and columns are equal to 0, i.e. \(\sum _iX_{ij}=\sum _jX_{ij}=0\), so the feature-centered matrix coincides with \(X\), i.e. \(X_{fc}=X\). Thus, the continuous solution of Spectral \(k\)-means (for instance clustering) is derived from the \(k-1\) dominant eigenvectors of the matrix \(X^TX\).
Analogously, the continuous solution for the feature clusters is derived from the \(k-1\) dominant eigenvectors of the matrix \(X_{ic}X_{ic}^T\), where \(X_{ic}\) is a feature-instance matrix whose columns (instances) are centered. Since \(X\) is double-centered, the instance-centered matrix \(X_{ic}\) is equal to \(X\), i.e. \(X_{ic}=X\). Thus, the continuous feature-cluster solution is derived from the \(k-1\) dominant eigenvectors of the matrix \(XX^T\).
Using basic linear algebra, one can easily verify that the matrices \(XX^T\) and \(X^TX\) have exactly the same nonzero eigenvalues.
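For completeness, the standard argument is the following: if \(X^TXv=\lambda v\) with \(\lambda \ne 0\), then multiplying both sides by \(X\) gives
\[
XX^T(Xv)=\lambda (Xv),
\]
so every nonzero eigenvalue of \(X^TX\) is also an eigenvalue of \(XX^T\); the symmetric argument gives the reverse inclusion.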
Thus \(\lambda _{k-1}(XX^T)-\lambda _{k}(XX^T)=\lambda _{k-1}(X^TX) -\lambda _{k}(X^TX)\), and the stability of the relevant eigenspaces will be equivalent. \(\square \)
Proof
(Proof of Theorem 1) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\). For \(\lambda _{1}(Cov)\) we have:
In the above derivations we have used the following easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA)\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\), the latter holding because \(u\) is a binary feature-indicator vector. For \(\mathbf{Tr}(Cov)\) we have:
Based on the above we can write:
The vector \(v\) that maximizes the expression \(\max _{||v||=1}\sum _{i\in I\cup \{m_s\}}(v^Tx_{fc}(i))^2\) is the dominant eigenvector of the matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is defined by the feature-set \(I\cup \{m_s\}\) (recall that this expression was derived from \(\lambda _1(Cov)\)). Thus, based on the fact that \(\lambda _1(A)\ge v^TAv\) for every \(v\) with \(||v||=1\), we can substitute any other normalized vector and obtain a lower bound on the expression. To derive a tight bound we employ the dominant eigenvector \(v_{(I)}\) of the matrix \(Cov=\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\), where \(u\) is now defined by the feature-set \(I\) (i.e. without feature \(m_s\)).
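Written out, and following the convention above that \(v_{(I)}\) maximizes the corresponding sum over the feature-set \(I\), this bounding step reads
\[
\max _{||v||=1}\sum _{i\in I\cup \{m_s\}}\big(v^Tx_{fc}(i)\big)^2
\;\ge\;
\sum _{i\in I\cup \{m_s\}}\big(v_{(I)}^Tx_{fc}(i)\big)^2
=\lambda _1\big(Cov_{(I)}\big)+\big(v_{(I)}^Tx_{fc}(m_s)\big)^2 ,
\]
where \(Cov_{(I)}\) denotes the covariance matrix defined by the feature-set \(I\): the bound equals the dominant eigenvalue on \(I\) plus the squared projection of the candidate feature onto \(v_{(I)}\).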
Based on the above we can write:
\(\square \)
Proof
(Proof of Theorem 2) We will start by decomposing the components \(\lambda _{1}(Cov)\) and \(\mathbf{Trace}(Cov)\).
For \(\lambda _{1}(Cov)\) we have:
In the above derivations we have used the following easily verifiable facts: \(\lambda _1(AA^T)=\lambda _1(A^TA), \,C_m^u=(C_m^u)^T, \,C_m^u=(C_m^u)^2\) and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\); the latter three hold because \(C_m^u\) is a symmetric centering projection and \(u\) is a binary feature-indicator vector.
For \(\mathbf{Tr}(Cov)\) we have:
In these derivations we have used the following properties of the matrix Trace: \(\mathbf{Tr}(AA^T)=\mathbf{Tr}(A^TA), \,\mathbf{Tr}(A+B)=\mathbf{Tr}(A)+\mathbf{Tr}(B)\) and \(\mathbf{Tr}(\beta A)=\beta \mathbf{Tr}(A)\).
Now for \(\mathbf{Tr}(X_{fc}^T\mathbf{diag}(u)X_{fc})\) we have:
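An elementary identity that applies to this term: since \(u\) is a binary indicator vector and \(\mathbf{diag}(u)=(\mathbf{diag}(u))^2\),
\[
\mathbf{Tr}\big(X_{fc}^T\mathbf{diag}(u)X_{fc}\big)
=\mathbf{Tr}\big(\mathbf{diag}(u)X_{fc}X_{fc}^T\mathbf{diag}(u)\big)
=\sum _{i:\,u_i=1}||x_{fc}(i)||^2 ,
\]
i.e. this trace accumulates the (unnormalized) variances of the centered rows of the selected features.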
Finally, for \(\frac{1}{\mathbf{card}(u)}\mathbf{Tr}(X_{fc}^T\mathbf{diag} (u)e_me^T_m\mathbf{diag}(u)X_{fc})\) we have:
Based on the derivations above, we can write the objective function in Eq. 4 as:
where
We can now employ the fact that for all \(v\) such that \(||v||=1\) it holds that \(\lambda _1(A)\ge v^TAv\) and replace the \(v\) vector in equations \(O_1\) and \(O_2\) with a fixed normalized vector. For deriving a tight bound we employ \(v_{(I)}\), the dominant eigenvector of \(C_m^uX_{fc}X_{fc}^TC_m^u\), where \(u\) is defined based on the feature set \(I\) (excluding feature \(m_s\)).
Based on the above (with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)), we can write \(O_1(I\cup \{m_s\})\) as:
Now \(O_2(I\cup \{m_s\})\) can be written as (again with a slight abuse of notation, since the global objective is lower bounded using \(v_{(I)}\) and not the individual \(O_1\) and \(O_2\)):
To complete the proof we must express \(\frac{1}{\mathbf{card}(I)+1}O_2(I)\) in the form \(\frac{1}{\mathbf{card}(I)}O_2(I)+C\). To this end, we use the fact that \(\frac{1}{\mathbf{card}(I)+1}=\frac{1}{\mathbf{card}(I)} -\frac{1}{\mathbf{card}(I)(\mathbf{card}(I)+1)}\) and write \(Obj(I\cup \{m_s\})\) as:
With some simple algebraic manipulations, the final bound follows. \(\square \)
Proof
(Proof of Theorem 3) Recall that in Schur complement deflation, the deflation step is performed as follows:
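The Schur complement deflation update, as introduced by Mackey (2008), takes the form
\[
A_t=A_{t-1}-\frac{A_{t-1}x_tx_t^TA_{t-1}}{x_t^TA_{t-1}x_t},
\]
where \(x_t\) is the direction being deflated at step \(t\).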
Now if we consider that \(A_t=X_{fc}^{(t)}(X_{fc}^{(t)})^T\) and also that \(x_t\) is the dominant eigenvector of matrix \(Cov^{(t-1)}\), we can write:
Recall that \(x_t\) is the dominant eigenvector of \(Cov^{(t-1)}\), which can be written as \(Cov^{(t-1)}=C_m^{u(t-1)}X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^TC_m^{u(t-1)}\) (i.e. it is based on the selected feature subset \(u(t-1)\)).
Since \(Cov^{(t-1)}\) is a double-centered matrix (its rows and columns are centered through the multiplication with \(C_m^{u(t-1)}\)), its dominant eigenvector is also centered, i.e. \((C_m^{u(t-1)})x_t=x_t\); indeed, using \((C_m^{u(t-1)})^2=C_m^{u(t-1)}\), an eigenpair \(Cov^{(t-1)}x_t=\lambda x_t\) with \(\lambda \ne 0\) satisfies \(C_m^{u(t-1)}x_t=\frac{1}{\lambda }C_m^{u(t-1)}Cov^{(t-1)}x_t=\frac{1}{\lambda }Cov^{(t-1)}x_t=x_t\). Based on this property, we can write:
It should be noted that if we are analyzing one-way stable Sparse PCA, we can directly derive that \(x_t^TX_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t=\lambda _{\max }^{(t-1)}\). Now, \((X_{fc}^{(t-1)})^Tx_t\) can be written as:
where \(v_t\) is the dominant eigenvector of \((X_{fc}^{(t-1)})^TC_m^{u(t-1)}X_{fc}^{(t-1)}\) and \(\lambda _{\max }^{(t-1)}\) is the dominant eigenvalue of \(Cov^{(t-1)}\).
We should again note that in the one-way stable case, we can directly derive that \((X_{fc}^{(t-1)})^Tx_t=\sqrt{\lambda _{\max }^{(t-1)}}v_t\).
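This one-way relation is the usual correspondence between the left and right dominant singular vectors: assuming, as the one-way statement above suggests, that \(x_t\) is a unit-norm dominant eigenvector of \(X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^T\),
\[
X_{fc}^{(t-1)}(X_{fc}^{(t-1)})^Tx_t=\lambda _{\max }^{(t-1)}x_t
\;\Longrightarrow\;
v_t:=\frac{(X_{fc}^{(t-1)})^Tx_t}{\sqrt{\lambda _{\max }^{(t-1)}}}\ \text{is unit-norm and satisfies}\ (X_{fc}^{(t-1)})^TX_{fc}^{(t-1)}v_t=\lambda _{\max }^{(t-1)}v_t .
\]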
Based on Eqs. 11–13 we can write
\(\square \)