Abstract
When dealing with high-dimensional data, clustering faces the curse of dimensionality: in such datasets, clusters of objects exist in subspaces rather than in the whole feature space. Subspace clustering algorithms have been introduced to tackle this problem. However, noisy data points present in this type of data can strongly distort the clustering results. To overcome both problems simultaneously, the fuzzy soft subspace clustering with noise detection (FSSC-ND) algorithm is proposed. The presented algorithm is based on entropy-weighting soft subspace clustering and noise clustering. FSSC-ND uses a new objective function and update rules to achieve these goals and to produce more interpretable clustering results. Several experiments were conducted on artificial and UCI benchmark datasets to assess the performance of the proposed algorithm. In addition, a number of cancer gene expression datasets were used to evaluate its performance on high-dimensional data. The results of these experiments demonstrate the superiority of the FSSC-ND algorithm in comparison with state-of-the-art clustering algorithms developed in earlier research.
Additional information
Communicated by V. Loia.
Appendix
Proof of Theorem 1
We can rewrite \(J_{\mathrm{FSSC-ND}}\) as follows:
where \(d_{ij}^2 =\sum \nolimits _{k=1}^D {w_{ik} \left( {x_{jk} -v_{ik} } \right) ^{2}} \) for \(i = 1,\ldots ,C\) and \(d_{ij}^{2} = \delta ^{2}\) for \(i=C+1\); the membership in the noise cluster is \(u_{(C+1)j} = 1-\sum \nolimits _{i=1}^C {u_{ij} } \).
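Assuming the standard entropy-weighting formulation extended with a noise cluster, with fuzzifier \(m > 1\) and entropy parameter \(\gamma > 0\), the rewritten objective plausibly takes the following form (a reconstruction from the surrounding definitions, not the paper's verbatim equation):

```latex
J_{\mathrm{FSSC-ND}}
  = \sum_{i=1}^{C+1}\sum_{j=1}^{N} u_{ij}^{m}\, d_{ij}^{2}
  + \gamma \sum_{i=1}^{C}\sum_{k=1}^{D} w_{ik}\ln w_{ik}
```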
Using the Lagrange multiplier technique, the following optimization problem is obtained:
where \(\lambda ^{u}\) is a vector containing the Lagrange multipliers associated with constraint on U.
Setting the gradient of \(\Phi _{1}\) with respect to \(u_{ij}\) and \(\lambda ^{u}_{j}\) to zero yields the optimal value of \(u_{ij}\).
Substituting the \(u_{ij}\) derived in Eq. (18) into Eq. (19), we have:
It follows:
Substituting (21) back into (18), we obtain:
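Carrying out this substitution under the objective assumed above, the closed-form membership update would read (a reconstructed sketch; \(d_{(C+1)j}^{2} = \delta ^{2}\) is the noise-cluster distance):

```latex
u_{ij} = \frac{\bigl(d_{ij}^{2}\bigr)^{-1/(m-1)}}
              {\sum_{l=1}^{C+1}\bigl(d_{lj}^{2}\bigr)^{-1/(m-1)}},
\qquad i = 1,\ldots ,C,\; j = 1,\ldots ,N
```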
Proof of Theorem 2
Considering the objective function in Eq. (7) and the constraint \(\sum \nolimits _{k=1}^D {w_{ik} } =1,\forall i\), the following aggregate function should be minimized according to the Lagrange multiplier method:
So it follows that:
where,
On the other hand:
Substituting (29) back into (25), the final equation for computing \(w_{ik}\) is obtained:
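Under the same assumed entropy-regularized objective, the resulting feature-weight update is the usual entropy-weighting softmax over the membership-weighted dispersion of each dimension (a reconstruction, not the verbatim equation):

```latex
w_{ik} = \frac{\exp\!\Bigl(-\tfrac{1}{\gamma}\sum_{j=1}^{N} u_{ij}^{m}\,(x_{jk}-v_{ik})^{2}\Bigr)}
              {\sum_{t=1}^{D}\exp\!\Bigl(-\tfrac{1}{\gamma}\sum_{j=1}^{N} u_{ij}^{m}\,(x_{jt}-v_{it})^{2}\Bigr)}
```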
Proof of Theorem 3
In order to minimize the objective function, the gradient of \(J_\mathrm{FSSC-ND}\) with respect to \(v_{ik}\) is set to zero:
This yields the update formula for \(v_{ik}\):
About this article
Cite this article
Chitsaz, E., Zolghadri Jahromi, M. A novel soft subspace clustering algorithm with noise detection for high dimensional datasets. Soft Comput 20, 4463–4472 (2016). https://doi.org/10.1007/s00500-015-1756-8