
Text clustering using VSM with feature clusters

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Representation of documents is the basis of clustering systems. Moreover, non-contiguous phrases appear increasingly often in text in the Web 2.0 era, and these phrases can affect the results of text clustering. To improve the quality of text clustering, this paper proposes a feature cluster-based vector space model (FC-VSM), which represents each document by a co-occurrence matrix over text feature clusters, and identifies non-contiguous phrases in the text preprocessing stage. Compared with the traditional VSM-based model, our method reduces the dimensionality of the feature space. It identifies non-contiguous phrases, uses distributed representations of features, and builds feature clusters. Despite their simplicity, these methods are surprisingly effective and significantly improve clustering accuracy, as the experimental results show.
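As a rough illustration of the pipeline sketched above, the Python snippet below assembles the same ingredients: distributed feature representations (word2vec), k-means over the feature vectors to form feature clusters, and a reduced document representation over those clusters, which is then clustered. The libraries (gensim, scikit-learn), the toy corpus, and all parameter values are assumptions made for illustration only; this is not the authors' implementation, and the non-contiguous-phrase identification step is represented only by a comment.

```python
# Hypothetical sketch of an FC-VSM-style pipeline: embed features,
# group them into feature clusters, represent documents over the
# clusters, then cluster the documents. Not the authors' code.

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Toy corpus: each document is a list of preprocessed tokens; in the paper,
# non-contiguous phrases would already have been identified and merged here.
docs = [
    ["text", "clustering", "vector", "space", "model"],
    ["word", "embedding", "distributed", "representation"],
    ["document", "clustering", "feature", "cluster"],
]

# 1. Distributed representation of features (word2vec).
w2v = Word2Vec(sentences=docs, vector_size=50, window=3, min_count=1, seed=1)
vocab = list(w2v.wv.index_to_key)
vectors = np.array([w2v.wv[w] for w in vocab])

# 2. Group features into k feature clusters.
k = 4
feat_km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(vectors)
word2cluster = dict(zip(vocab, feat_km.labels_))

# 3. Represent each document over the k feature clusters
#    (a k-dimensional vector instead of a |V|-dimensional one).
doc_matrix = np.zeros((len(docs), k))
for i, doc in enumerate(docs):
    for w in doc:
        doc_matrix[i, word2cluster[w]] += 1

# 4. Cluster the documents in the reduced feature-cluster space.
doc_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(doc_matrix)
print(doc_labels)
```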


Author information

Corresponding author: Cao Qimin

Cite this article

Qimin, C., Qiao, G., Yongliang, W. et al. Text clustering using VSM with feature clusters. Neural Comput & Applic 26, 995–1003 (2015). https://doi.org/10.1007/s00521-014-1792-9
