A LDA Feature Grouping Method for Subspace Clustering of Text Data

  • Conference paper
Intelligence and Security Informatics (PAISI 2014)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 8440)

Abstract

This paper proposes a feature grouping method for clustering text data. In the new method, a set of documents is represented with the vector space model, and the latent Dirichlet allocation (LDA) algorithm is applied to the text data to generate groups of features as topics. The topics are treated as group features, which enables the recently published subspace clustering algorithm FG-k-means to cluster high-dimensional text data with two-level features: the word level and the group level. For generating the group-level features with LDA, an entropy-based word filtering method is proposed to remove words with low probabilities in the word distribution of the corresponding topic. Experiments were conducted on three real-life text data sets to compare the new method with three existing clustering algorithms. The experimental results show that the new method improved clustering performance relative to the other methods.
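The abstract does not state the filtering rule itself, so the following is only a minimal sketch of how topic word distributions could be turned into feature groups. It assumes a hypothetical rule: for each topic, compute the Shannon entropy H of its word distribution and keep the round(exp(H)) most probable words. The function names, the toy vocabulary, and the toy topic-word matrix `phi` are illustrative assumptions, not the authors' implementation; in practice `phi` would come from a fitted LDA model.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def filter_topic_words(phi, vocab):
    """Return one word group per topic.

    ASSUMED rule (not taken from the paper): keep the n most probable
    words of each topic, where n = round(exp(H)) and H is the entropy
    of that topic's word distribution.
    """
    groups = []
    for row in phi:
        n_keep = max(1, round(math.exp(entropy(row))))
        # Rank word indices by descending probability within this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        groups.append([vocab[j] for j in ranked[:n_keep]])
    return groups


# Toy topic-word distributions (each row sums to 1); hypothetical data.
vocab = ["ball", "goal", "team", "vote", "law", "tax"]
phi = [
    [0.40, 0.30, 0.27, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.35, 0.32, 0.30],
]

groups = filter_topic_words(phi, vocab)
for k, words in enumerate(groups):
    print(f"group {k}: {words}")
```

Each resulting group is the set of word-level features assigned to one group-level feature, which is the two-level structure that a group-weighting algorithm such as FG-k-means expects as input. Under this assumed rule, a peaked topic keeps few words and a flat topic keeps many.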

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Cai, Y., Chen, X., Peng, P.X., Huang, J.Z. (2014). A LDA Feature Grouping Method for Subspace Clustering of Text Data. In: Chau, M., Chen, H., Wang, G.A., Wang, JH. (eds) Intelligence and Security Informatics. PAISI 2014. Lecture Notes in Computer Science, vol 8440. Springer, Cham. https://doi.org/10.1007/978-3-319-06677-6_7

  • DOI: https://doi.org/10.1007/978-3-319-06677-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06676-9

  • Online ISBN: 978-3-319-06677-6

  • eBook Packages: Computer Science, Computer Science (R0)
