
An extended EM algorithm for subspace clustering

  • Research Article
  • Published in: Frontiers of Computer Science in China

Abstract

Clustering high-dimensional data has become a challenge in data mining due to the curse of dimensionality. To address this problem, subspace clustering has been defined as an extension of traditional clustering that seeks clusters in subspaces spanned by different combinations of dimensions within a dataset. This paper presents a new subspace clustering algorithm that calculates local feature weights automatically within an EM-based clustering process. In the algorithm, features are locally weighted by a new unsupervised weighting method that minimizes a proposed clustering criterion accounting for both the average intra-cluster compactness and the average inter-cluster separation of the subspace clusters. To capture accurate subspace information, an additional outlier detection process, embedded between the E-step and the M-step of the algorithm, identifies possible local outliers of subspace clusters. The method has been evaluated on real-world gene expression data and on high-dimensional artificial data with outliers, and the experimental results show its effectiveness.
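The abstract does not reproduce the paper's weighting rule, clustering criterion, or outlier test, so the following is only a minimal sketch of the overall loop it describes: an EM-style assignment step under cluster-specific feature weights, an outlier filter between the E-step and M-step, and a re-estimation of centroids and weights. The function name weighted_em_subspace, the inverse-dispersion weight update, and the quantile-based outlier rule are all assumptions introduced for illustration, not the authors' method.

```python
import numpy as np

def weighted_em_subspace(X, k, n_iter=50, beta=2.0, outlier_q=0.95, seed=0):
    """Illustrative EM-style subspace clustering with per-cluster feature weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full((k, d), 1.0 / d)   # one local weight vector per cluster

    for _ in range(n_iter):
        # E-step: weighted squared distance of every point to every centroid.
        dist = np.stack(
            [((X - centers[j]) ** 2 * weights[j]).sum(axis=1) for j in range(k)],
            axis=1,
        )
        labels = dist.argmin(axis=1)
        d_min = dist[np.arange(n), labels]

        # Outlier step between E- and M-step: points whose weighted distance to
        # their own centroid is unusually large are excluded from the updates.
        # (A simple quantile rule stands in for the paper's local-outlier test.)
        keep = d_min <= np.quantile(d_min, outlier_q)

        # M-step: re-estimate centroids and local feature weights per cluster.
        for j in range(k):
            members = keep & (labels == j)
            if not members.any():
                continue
            centers[j] = X[members].mean(axis=0)
            # Dimensions with small within-cluster dispersion get large weights
            # (inverse-dispersion weighting, an assumed stand-in for the paper's
            # criterion-minimizing weighting scheme).
            disp = ((X[members] - centers[j]) ** 2).mean(axis=0) + 1e-12
            w = disp ** (-1.0 / (beta - 1.0))
            weights[j] = w / w.sum()

    return labels, weights
```

A call such as labels, weights = weighted_em_subspace(X, k=3) returns hard cluster labels and, for each cluster, a weight vector indicating which dimensions form its subspace; filtering outliers before the M-step keeps those weights from being inflated by points that do not belong to any subspace cluster.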

Author information

Corresponding author

Correspondence to Qingshan Jiang.

About this article

Cite this article

Chen, L., Jiang, Q. An extended EM algorithm for subspace clustering. Front. Comput. Sci. China 2, 81–86 (2008). https://doi.org/10.1007/s11704-008-0007-x
