A LDA Feature Grouping Method for Subspace Clustering of Text Data

Cai, Yeshou; Chen, Xiaojun; Peng, Patrick Xiaogang; Huang, Joshua Zhexue

doi:10.1007/978-3-319-06677-6_7

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 8440)

Included in the following conference series:

Pacific-Asia Workshop on Intelligence and Security Informatics

Abstract

This paper proposes a feature grouping method for clustering text data. In the new method, the vector space model is used to represent a set of documents, and the LDA algorithm is applied to the text data to generate groups of features as topics. The topics are treated as group features, which enables the recently published subspace clustering algorithm FG-k-means to cluster high-dimensional text data with two-level features: the word level and the group level. In generating the group-level features with LDA, an entropy-based word filtering method is proposed to remove words with low probabilities in the word distribution of the corresponding topics. Experiments were conducted on three real-life text data sets to compare the new method with three existing clustering algorithms. The experimental results show that the new method improved clustering performance in comparison with the other methods.
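
The abstract does not spell out the entropy-based filtering rule, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a topic-word matrix `phi` (one row per LDA topic, each row a probability distribution over the vocabulary) and keeps, for each topic, the `ceil(exp(H))` most probable words, where `H` is the Shannon entropy of that topic's distribution. The function names, the toy values in `phi`, and the `exp(H)` cut-off are all illustrative assumptions.

```python
import numpy as np


def topic_entropy(p):
    """Shannon entropy (in nats) of one topic's word distribution."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(nz * np.log(nz)))


def filter_topic_words(phi):
    """Keep, per topic, the ceil(exp(H)) most probable words.

    phi: array of shape (n_topics, n_words); each row is a word distribution.
    Returns one sorted array of kept word indices per topic.
    Assumption: exp(H) is used as the effective number of words in a topic;
    the rule used in the paper may differ.
    """
    phi = np.asarray(phi, dtype=float)
    groups = []
    for row in phi:
        # Small tolerance so a value like 5.0000000001 does not round up to 6.
        n_keep = int(np.ceil(np.exp(topic_entropy(row)) - 1e-9))
        n_keep = max(1, min(n_keep, row.size))
        order = np.argsort(-row, kind="stable")  # most probable words first
        groups.append(np.sort(order[:n_keep]))
    return groups


# Toy topic-word matrix: 2 topics over a 5-word vocabulary (made-up values).
phi = np.array([
    [0.48, 0.48, 0.02, 0.01, 0.01],  # peaked topic: two dominant words
    [0.20, 0.20, 0.20, 0.20, 0.20],  # flat topic: all words equally likely
])
groups = filter_topic_words(phi)
```

Under this reading, each returned index array would form one feature group at the group level, while the individual words remain the word-level features. A peaked topic yields a small group and a flat topic keeps most of the vocabulary.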

Author information

Authors and Affiliations

Shenzhen Key Lab of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
Yeshou Cai
Shenzhen University, China
Xiaojun Chen, Patrick Xiaogang Peng & Joshua Zhexue Huang

Editor information

Editors and Affiliations

The University of Hong Kong, School of Business, Pokfulam Road, Pokfulam, Hong Kong, SAR, China
Michael Chau
MIS Department, University of Arizona, Tucson, USA
Hsinchun Chen
Virginia Tech, Pamplin College of Business, Pamplin Hall, 1007, 24061, Blacksburg, VA, USA
G. Alan Wang
Central Police University, 56 Shujen Rd., Takang Village, 33304, Kueishan Hsiang, Taoyuan County, Taiwan, R.O.C.
Jau-Hwang Wang

Cite this paper

Cai, Y., Chen, X., Peng, P.X., Huang, J.Z. (2014). A LDA Feature Grouping Method for Subspace Clustering of Text Data. In: Chau, M., Chen, H., Wang, G.A., Wang, JH. (eds) Intelligence and Security Informatics. PAISI 2014. Lecture Notes in Computer Science, vol 8440. Springer, Cham. https://doi.org/10.1007/978-3-319-06677-6_7

DOI: https://doi.org/10.1007/978-3-319-06677-6_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06676-9
Online ISBN: 978-3-319-06677-6
eBook Packages: Computer Science (R0)
