Abstract
This paper proposes a feature grouping method for clustering text data. In the new method, the vector space model is used to represent a set of documents. The latent Dirichlet allocation (LDA) algorithm is applied to the text data to generate groups of features as topics. The topics are treated as group features, which enables the recently published subspace clustering algorithm FG-k-means to cluster high-dimensional text data with two-level features: the word level and the group level. When generating the group-level features with LDA, an entropy-based word filtering method is used to remove words that have low probability in the word distribution of the corresponding topic. Experiments were conducted on three real-life text data sets to compare the new method with three existing clustering algorithms. The experimental results show that the new method improves clustering performance over the compared algorithms.
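The grouping step can be illustrated with a minimal sketch. The abstract does not give the exact filtering rule, so the code below makes an assumption: each topic keeps its floor(exp(H)) most probable words, where H is the Shannon entropy of that topic's word distribution. The function names, the toy vocabulary, and the probabilities are hypothetical and serve only as an illustration; they do not come from the paper or its data sets.

```python
import math


def topic_entropy(dist):
    """Shannon entropy (in nats) of one topic's word distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def group_words_by_topic(topic_word, vocab):
    """Build one feature group per topic from a topic-word matrix.

    topic_word[k][v] is the probability of word v under topic k.
    ASSUMPTION: a topic keeps its floor(exp(H)) most probable words,
    where H is the topic's entropy. The paper's actual rule may differ.
    """
    groups = []
    for dist in topic_word:
        effective = math.floor(math.exp(topic_entropy(dist)))
        keep = max(1, min(len(vocab), effective))
        # Rank word indices by probability, highest first (ties keep input order).
        ranked = sorted(range(len(vocab)), key=lambda v: dist[v], reverse=True)
        groups.append([vocab[v] for v in ranked[:keep]])
    return groups


# Toy, hand-made topic-word probabilities (not from the paper's data sets).
vocab = ["ball", "team", "goal", "vote", "party", "law"]
topic_word = [
    [0.40, 0.30, 0.25, 0.02, 0.02, 0.01],  # a sports-like topic
    [0.01, 0.02, 0.02, 0.35, 0.30, 0.30],  # a politics-like topic
]

groups = group_words_by_topic(topic_word, vocab)
for k, words in enumerate(groups):
    print(f"group {k}: {words}")
```

In a real pipeline the topic-word matrix would come from a fitted LDA model, and the resulting word groups would be supplied to a group-weighting clustering algorithm such as FG-k-means together with the word-level document vectors.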
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Cai, Y., Chen, X., Peng, P.X., Huang, J.Z. (2014). A LDA Feature Grouping Method for Subspace Clustering of Text Data. In: Chau, M., Chen, H., Wang, G.A., Wang, JH. (eds) Intelligence and Security Informatics. PAISI 2014. Lecture Notes in Computer Science, vol 8440. Springer, Cham. https://doi.org/10.1007/978-3-319-06677-6_7
DOI: https://doi.org/10.1007/978-3-319-06677-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06676-9
Online ISBN: 978-3-319-06677-6
eBook Packages: Computer Science, Computer Science (R0)