
A novel soft subspace clustering algorithm with noise detection for high dimensional datasets

  • Methodologies and Application
  • Published in: Soft Computing

Abstract

When dealing with high dimensional data, clustering faces the curse of dimensionality: in such data sets, clusters of objects exist in subspaces rather than in the whole feature space. Subspace clustering algorithms have been introduced to tackle this problem. However, noisy data points present in this type of data can have a great impact on the clustering results. Therefore, to overcome both problems simultaneously, fuzzy soft subspace clustering with noise detection (FSSC-ND) is proposed. The presented algorithm is based on entropy-weighting soft subspace clustering and noise clustering. The FSSC-ND algorithm uses a new objective function and update rules to achieve these goals and to produce more interpretable clustering results. Several experiments have been conducted on artificial and UCI benchmark datasets to assess the performance of the proposed algorithm. In addition, a number of cancer gene expression datasets are used to evaluate its performance on high dimensional data. The results of these experiments demonstrate the superiority of the FSSC-ND algorithm over state-of-the-art clustering algorithms developed in earlier research.



Author information

Corresponding author

Correspondence to Elham Chitsaz.

Additional information

Communicated by V. Loia.

Appendix

Proof of Theorem 1

We can rewrite \(J_{\mathrm{FSSC-ND}}\) as follows:

$$\begin{aligned} \hbox {J}_{\mathrm{FSSC-ND}} \left( {U,W,V} \right)= & {} \sum _{i=1}^{C+1} {\sum _{j=1}^N {u_{ij} ^{m}} d_{ij}^2 }\nonumber \\&+\,\rho \sum _{i=1}^C {\sum _{k=1}^D {w_{ik} \ln w_{ik} } }\\ \hbox {subject to }\sum _{k=1}^D {w_{ik} }= & {} 1,\quad \forall i\quad \hbox { and} \quad \sum _{i=1}^{C+1} {u_{ij} } =1,\quad \forall j\nonumber \end{aligned}$$
(15)

where \(d_{ij}^2 =\sum \nolimits _{k=1}^D {w_{ik} \left( {x_{jk} -v_{ik} } \right) ^{2}} \) for \(i=1,\ldots ,C\); \(d_{ij}^{2} = \delta ^{2}\) for \(i=C+1\); and \(u_{C+1,j} =1-\sum \nolimits _{i=1}^C {u_{ij} } \).

Using the Lagrange multiplier technique, the following optimization problem is obtained:

$$\begin{aligned} \Phi _1 \left( {U,\lambda ^{u}} \right) =\sum _{i=1}^{C+1} {\sum _{j=1}^N {u_{ij} ^{m}} d_{ij}^2 } -\sum _{j=1}^N {\lambda _j^u \left( {\sum _{i=1}^{C+1} {u_{ij} } -1} \right) } \end{aligned}$$
(16)

where \(\lambda ^{u}\) is a vector containing the Lagrange multipliers associated with constraint on U.

Setting the gradient of \(\Phi _{1}\) with respect to \(u_{ij}\) and \(\lambda ^{u}_{j}\) to zero yields the optimal value of \(u_{ij}\):

$$\begin{aligned}&\frac{\partial \Phi _1 }{\partial u_{ij} }=mu_{ij}^{m-1} d_{ij}^2 -\lambda _j^u =0, \end{aligned}$$
(17)
$$\begin{aligned}&u_{ij} =\left( {\frac{\lambda _j^u }{md_{ij}^2 }} \right) ^{1/(m-1)} \end{aligned}$$
(18)
$$\begin{aligned}&\frac{\partial \Phi _1 }{\partial \lambda _j^u }=\sum _{i=1}^{C+1} {u_{ij} } -1=0 \end{aligned}$$
(19)

Substituting \(u_{ij}\) from Eq. (18) into (19), we have:

$$\begin{aligned} \sum _{i=1}^{C+1} {\left( {\frac{\lambda _j^u }{md_{ij}^2 }} \right) ^{1/(m-1)}}= & {} \left( {\frac{\lambda _j^u }{m}} \right) ^{1/(m-1)}\sum _{i=1}^{C+1} {\left( {\frac{1}{d_{ij}^2 }} \right) ^{1/(m-1)}} \nonumber \\= & {} 1, \end{aligned}$$
(20)

It follows:

$$\begin{aligned} \left( {\frac{\lambda _j^u }{m}} \right) ^{1/(m-1)}=\frac{1}{\sum \nolimits _{i=1}^{C+1} {\left( {\frac{1}{d_{ij}^2 }} \right) ^{1/(m-1)}} } \end{aligned}$$
(21)

Substituting (21) back into (18), we obtain:

$$\begin{aligned} u_{ij} =\frac{1}{\sum \nolimits _{l=1}^{C+1} {\left( {\frac{d_{ij} ^{2}}{d_{lj} ^{2}}} \right) ^{1/(m-1)}} } \end{aligned}$$
(22)
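For concreteness, the membership update of Eq. (22) can be sketched in NumPy. This is an illustrative sketch rather than the authors' code; the function name and the toy distance values are our own, and the noise cluster is represented simply as an extra row of constant distances \(\delta^2\).

```python
import numpy as np

def membership_update(d2, m=2.0):
    """Fuzzy membership update of Eq. (22).

    d2 : (C+1, N) array of weighted squared distances d_ij^2,
         whose last row holds the constant noise distance delta^2.
    Returns u of shape (C+1, N); each column sums to 1.
    """
    inv = (1.0 / d2) ** (1.0 / (m - 1.0))   # (1 / d_ij^2)^{1/(m-1)}
    return inv / inv.sum(axis=0)            # normalise over all C+1 clusters

# Toy example: two clusters plus the noise cluster (delta^2 = 2), three points.
d2 = np.array([[0.5, 4.0, 9.0],
               [4.0, 0.5, 9.0],
               [2.0, 2.0, 2.0]])
u = membership_update(d2, m=2.0)
```

Note that the third point, far from both cluster centres, receives its largest membership in the noise row, which is exactly the mechanism by which FSSC-ND discounts outliers.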

Proof of Theorem 2

Considering the objective function in Eq. (7) and the constraint \(\sum \nolimits _{k=1}^D {w_{ik} } =1,\forall i\), the following aggregate function should be minimized according to the Lagrange multiplier method:

$$\begin{aligned} \Phi _2 \left( {W,\lambda ^{w}} \right)= & {} \sum _{i=1}^C {\sum _{j=1}^N {u_{ij} ^{m}} \sum _{k=1}^D {w_{ik} \left( {x_{jk} -v_{ik} } \right) ^{2}} } \nonumber \\&+\,\rho \sum _{i=1}^C {\sum _{k=1}^D {w_{ik} \ln w_{ik} } }\nonumber \\&-\, \sum _{i=1}^C {\lambda _i^w \left( {\sum _{k=1}^D {w_{ik} } -1} \right) } \end{aligned}$$
(23)
$$\begin{aligned} \frac{\partial \Phi _2 }{\partial w_{ik} }\!= & {} \!\sum _{j=1}^N {u_{ij} ^{m}\left( {x_{jk} -v_{ik} } \right) ^{2}} \!+\!\rho \ln w_{ik} +\rho \!-\!\lambda _i^w \!=\!0\nonumber \\ \end{aligned}$$
(24)

So it follows that:

$$\begin{aligned} w_{ik} =\exp \left( {\frac{\lambda _i^w -\rho }{\rho }} \right) \exp \left( {\frac{-D_{ik} }{\rho }} \right) \end{aligned}$$
(25)

where,

$$\begin{aligned} D_{ik} =\sum _{j=1}^N {u_{ij} ^{m}\left( {x_{jk} -v_{ik} } \right) ^{2}} \end{aligned}$$
(26)

On the other hand:

$$\begin{aligned} \frac{\partial \Phi _2 }{\partial \lambda _i^w }=\sum _{k=1}^D {w_{ik} } -1=0 \end{aligned}$$
(27)

Substituting (25) into (27):

$$\begin{aligned}&\sum _{k=1}^D {w_{ik} } =\exp \left( {\frac{\lambda _i^w -\rho }{\rho }} \right) \sum _{k=1}^D {\exp \left( {\frac{-D_{ik} }{\rho }} \right) } =1 \end{aligned}$$
(28)
$$\begin{aligned}&\exp \left( {\frac{\lambda _i^w -\rho }{\rho }} \right) =\frac{1}{\sum \nolimits _{k=1}^D {\exp \left( {\frac{-D_{ik} }{\rho }} \right) } } \end{aligned}$$
(29)

Substituting (29) back into (25), the final equation for computing \(w_{ik}\) is obtained:

$$\begin{aligned} w_{ik} =\frac{\exp \left( {\frac{-D_{ik} }{\rho }} \right) }{\sum \nolimits _{k{^{\prime }}=1}^D {\exp \left( {\frac{-D_{ik^{\prime }} }{\rho }} \right) } } \end{aligned}$$
(30)
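Equation (30) is simply a softmax of \(-D_{ik}/\rho\) over the features, so features along which a cluster is compact receive large weights. A minimal NumPy sketch, with our own function name and toy data (the noise row of U is excluded, as in Eqs. (23)–(26)):

```python
import numpy as np

def weight_update(X, u, v, m=2.0, rho=1.0):
    """Feature-weight update of Eqs. (26) and (30).

    X : (N, D) data, u : (C, N) memberships (noise row excluded),
    v : (C, D) cluster centres; rho is the entropy regularisation weight.
    Returns w of shape (C, D); each row sums to 1.
    """
    um = u ** m                                      # u_ij^m, shape (C, N)
    diff2 = (X[None, :, :] - v[:, None, :]) ** 2     # (C, N, D)
    Dik = np.einsum('ij,ijk->ik', um, diff2)         # Eq. (26)
    # Eq. (30): softmax of -D_ik / rho over k.  Subtracting the row minimum
    # leaves the ratio unchanged but avoids numerical underflow.
    e = np.exp(-(Dik - Dik.min(axis=1, keepdims=True)) / rho)
    return e / e.sum(axis=1, keepdims=True)

# Toy example: feature 0 is compact within clusters, feature 1 is noisy.
X = np.array([[0.0, 5.0], [0.1, -5.0], [1.0, 4.0], [1.1, -4.0]])
v = np.array([[0.05, 0.0], [1.05, 0.0]])
u = np.array([[0.9, 0.9, 0.1, 0.1],
              [0.1, 0.1, 0.9, 0.9]])
w = weight_update(X, u, v, m=2.0, rho=1.0)
```

Here the compact feature dominates both weight vectors, illustrating how the entropy term concentrates each cluster's weights on its own relevant subspace.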

Proof of Theorem 3

In order to minimize the objective function, the gradient of \(J_\mathrm{FSSC-ND}\) is set to zero:

$$\begin{aligned} \frac{\partial J_\mathrm{FSSC-ND} }{\partial v_{ik} }=-2w_{ik} \sum _{j=1}^N {u_{ij} ^{m}} \left( {x_{jk} -v_{ik} } \right) =0 \end{aligned}$$
(31)

Hence, the update formula for \(v_{ik}\) is obtained as follows:

$$\begin{aligned} v_{ik} =\frac{\sum \nolimits _{j=1}^N {u_{ij}^m x_{jk} } }{\sum \nolimits _{j=1}^N {u_{ij}^m } } \end{aligned}$$
(32)
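Putting the three theorems together, the FSSC-ND iteration alternates the update rules of Eqs. (22), (32) and (30). The following end-to-end sketch is our own hypothetical implementation, not the authors' code; the initialisation, fixed iteration budget, and all names are illustrative choices:

```python
import numpy as np

def fssc_nd(X, C, m=2.0, rho=1.0, delta2=1.0, iters=50, v0=None):
    """Sketch of the FSSC-ND alternating optimisation: cycle through the
    update rules of Eqs. (22), (32) and (30) for a fixed iteration budget."""
    N, D = X.shape
    v = X[:C].copy() if v0 is None else v0.astype(float).copy()
    w = np.full((C, D), 1.0 / D)                    # uniform feature weights
    for _ in range(iters):
        diff2 = (X[None, :, :] - v[:, None, :]) ** 2            # (C, N, D)
        d2 = np.einsum('ik,ijk->ij', w, diff2)                  # d_ij^2
        d2 = np.vstack([d2, np.full((1, N), delta2)])           # noise row
        inv = (1.0 / np.maximum(d2, 1e-12)) ** (1.0 / (m - 1.0))
        u = inv / inv.sum(axis=0)                               # Eq. (22)
        um = u[:C] ** m
        v = (um @ X) / um.sum(axis=1, keepdims=True)            # Eq. (32)
        Dik = np.einsum('ij,ijk->ik', um,
                        (X[None, :, :] - v[:, None, :]) ** 2)   # Eq. (26)
        e = np.exp(-(Dik - Dik.min(axis=1, keepdims=True)) / rho)
        w = e / e.sum(axis=1, keepdims=True)                    # Eq. (30)
    return u, v, w

# Two compact 2-D clusters plus one distant outlier; one seed centre
# is taken from each cluster.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2)),
               [[50.0, 50.0]]])
u, v, w = fssc_nd(X, C=2, m=2.0, rho=0.5, delta2=4.0, iters=30,
                  v0=X[[0, 20]])
```

On toy data like this, the distant outlier should be absorbed almost entirely by the noise cluster rather than dragging a centre toward it, while the two centres settle near the true cluster means.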

About this article

Cite this article

Chitsaz, E., Zolghadri Jahromi, M. A novel soft subspace clustering algorithm with noise detection for high dimensional datasets. Soft Comput 20, 4463–4472 (2016). https://doi.org/10.1007/s00500-015-1756-8
