Abstract
Clustering high-dimensional data has become a challenge in data mining due to the curse of dimensionality. To address this problem, subspace clustering has been defined as an extension of traditional clustering that seeks clusters in subspaces spanned by different combinations of dimensions within a dataset. This paper presents a new subspace clustering algorithm that calculates local feature weights automatically in an EM-based clustering process. In the algorithm, features are locally weighted using a new unsupervised weighting method, as a means of minimizing a proposed clustering criterion that takes into account both the average intra-cluster compactness and the average inter-cluster separation for subspace clustering. To capture accurate subspace information, an additional outlier detection process is presented to identify possible local outliers of subspace clusters; it is embedded between the E-step and M-step of the algorithm. The method has been evaluated on clustering real-world gene expression data and high-dimensional artificial data with outliers, and the experimental results show its effectiveness.
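The abstract does not state the weighting formula, the clustering criterion, or the outlier test, so the following is only a rough sketch of the overall loop structure it describes: an assignment step, an outlier step placed between the E-step and the M-step, and an update of per-cluster feature weights. Everything specific here is an assumption for illustration and not the authors' method: the function name `weighted_subspace_em`, hard assignments, weights taken as normalized inverse per-dimension variances, and the mean-plus-`outlier_z`-standard-deviations cutoff.

```python
import numpy as np

def weighted_subspace_em(X, k, n_iter=20, outlier_z=3.0, seed=0):
    """Hard-assignment EM-style loop with per-cluster feature weights.

    Illustrative assumptions (not taken from the paper): weights are
    normalized inverse per-dimension variances, and a point is flagged
    as a local outlier when its weighted distance to its own center
    exceeds mean + outlier_z * std within that cluster.
    X is expected to be a float array of shape (n, d).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    weights = np.full((k, d), 1.0 / d)  # start with uniform weights
    labels = np.zeros(n, dtype=int)
    outlier = np.zeros(n, dtype=bool)

    for _ in range(n_iter):
        # E-step: assign each point to the nearest center, measuring
        # distance with that cluster's own feature weights.
        diff2 = (X[:, None, :] - centers[None, :, :]) ** 2  # (n, k, d)
        dist = (diff2 * weights[None, :, :]).sum(axis=2)     # (n, k)
        labels = dist.argmin(axis=1)
        own = dist[np.arange(n), labels]

        # Outlier step (between E and M): flag points that lie
        # unusually far from their own cluster in its weighted space.
        outlier = np.zeros(n, dtype=bool)
        for j in range(k):
            members = labels == j
            if members.sum() > 2:
                cut = own[members].mean() + outlier_z * own[members].std()
                outlier |= members & (own > cut)

        # M-step: re-estimate centers and weights from non-outliers only.
        for j in range(k):
            keep = (labels == j) & ~outlier
            if keep.sum() < 2:
                continue  # leave an empty or tiny cluster unchanged
            centers[j] = X[keep].mean(axis=0)
            inv_var = 1.0 / (X[keep].var(axis=0) + 1e-8)
            weights[j] = inv_var / inv_var.sum()

    return labels, centers, weights, outlier
```

Under these assumptions, a dimension along which a cluster is tight receives a large weight for that cluster, so each cluster ends up with its own soft subspace; excluding flagged points in the M-step keeps a few distant points from inflating the variances that drive those weights.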
Cite this article
Chen, L., Jiang, Q. An extended EM algorithm for subspace clustering. Front. Comput. Sci. China 2, 81–86 (2008). https://doi.org/10.1007/s11704-008-0007-x
DOI: https://doi.org/10.1007/s11704-008-0007-x