
Text clustering using VSM with feature clusters

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Representation of documents is the basis of clustering systems. Moreover, non-contiguous phrases appear increasingly often in text in the Web 2.0 era, and these phrases can affect the results of text clustering. To improve the quality of text clustering, this paper proposes a feature cluster-based vector space model (FC-VSM), which represents each document by a co-occurrence matrix over text feature clusters, and identifies non-contiguous phrases in the text preprocessing stage. Compared with the traditional VSM-based model, our method reduces the dimensionality of the feature space. It identifies non-contiguous phrases, uses distributed representations of features, and builds feature clusters. Despite their simplicity, these methods are surprisingly effective and significantly improve clustering accuracy, as the experimental results show.
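As a rough illustration of the pipeline sketched above, the Python snippet below assembles the same ingredients: distributed feature representations (word2vec), k-means over the feature vectors to form feature clusters, and a reduced document representation over those clusters, which is then clustered. The libraries (gensim, scikit-learn), the toy corpus, and all parameter values are assumptions made for illustration only; this is not the authors' implementation, and the non-contiguous-phrase identification step is represented only by a comment.

```python
# Hypothetical sketch of an FC-VSM-style pipeline: embed features,
# group them into feature clusters, represent documents over the
# clusters, then cluster the documents. Not the authors' code.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy corpus: each document is a list of preprocessed tokens; in the paper,
# non-contiguous phrases would already have been identified and merged here.
docs = [
    ["text", "clustering", "vector", "space", "model"],
    ["word", "embedding", "distributed", "representation"],
    ["document", "clustering", "feature", "cluster"],
]

# 1. Distributed representation of features (word2vec).
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, seed=1)
vocab = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in vocab])

# 2. Group features into k feature clusters.
k = 4
feat_km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(vectors)
word2cluster = dict(zip(vocab, feat_km.labels_))

# 3. Represent each document over the k feature clusters
#    (a k-dimensional vector instead of a |V|-dimensional one).
doc_matrix = np.zeros((len(docs), k))
for i, doc in enumerate(docs):
    for w in doc:
        doc_matrix[i, word2cluster[w]] += 1

# 4. Cluster the documents in the reduced feature-cluster space.
doc_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(doc_matrix)
print(doc_labels)
```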


Author information

Corresponding author: Cao Qimin

Cite this article

Qimin, C., Qiao, G., Yongliang, W. et al. Text clustering using VSM with feature clusters. Neural Comput & Applic 26, 995–1003 (2015). https://doi.org/10.1007/s00521-014-1792-9
