Extended Strategies for Document Clustering with Word Co-occurrences

Wei, Yang; Wei, Jinmao; Yang, Zhenglu

doi:10.1007/978-3-319-25255-1_38

Conference paper
First Online: 13 November 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9313)

Abstract

Recent strategies tackle the data sparsity problem of the bag-of-words model in document clustering by enriching each document with the relatedness of every word in the corpus to that document, where the relatedness is estimated as a weighted sum of word co-occurrences. However, because the overlaps between word co-occurrences are not eliminated, the relatedness is overestimated. This paper demonstrates that the weighted sum strategy gives an upper bound on the theoretical degree of relatedness. Two strategies are then proposed to approach the theoretical degree of relatedness. The first strategy is established under the extreme assumption that all the words in a document co-occur with each other. The second strategy takes the specificities of words into account and gives several extended versions of the weighted sum strategy. Extensive experiments verify that document clustering incorporating the extended strategies achieves a significant performance improvement over state-of-the-art techniques.
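
The overestimation argument can be illustrated with a toy sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes document-level co-occurrence counts and uniform weights, and the helper names (`build_postings`, `summed_relatedness`, `overlap_free_relatedness`) and the tiny corpus are hypothetical. Under those assumptions, summing pairwise co-occurrence counts can count the same document more than once, so the sum can never fall below the count taken with overlaps removed.

```python
from collections import defaultdict


def build_postings(corpus):
    """Map each word to the set of indices of the documents that contain it."""
    postings = defaultdict(set)
    for idx, doc in enumerate(corpus):
        for word in doc:
            postings[word].add(idx)
    return postings


def summed_relatedness(word, doc, postings):
    """Sum the pairwise co-occurrence counts of `word` with each distinct word of `doc`.

    Assumption: co-occurrence is counted at document level with uniform weights.
    A document shared by several pairs is counted once per pair.
    """
    word_docs = postings.get(word, set())
    return sum(
        len(word_docs & postings.get(term, set()))
        for term in set(doc)
        if term != word
    )


def overlap_free_relatedness(word, doc, postings):
    """Count the documents where `word` co-occurs with at least one word of `doc`.

    Each such document is counted exactly once, so overlaps are eliminated.
    """
    word_docs = postings.get(word, set())
    covered = set()
    for term in set(doc):
        if term != word:
            covered |= word_docs & postings.get(term, set())
    return len(covered)


# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["apple", "fruit", "juice"],
    ["apple", "fruit", "pie"],
    ["apple", "computer"],
    ["fruit", "juice"],
]
postings = build_postings(corpus)
target_doc = ["fruit", "juice"]

summed = summed_relatedness("apple", target_doc, postings)
overlap_free = overlap_free_relatedness("apple", target_doc, postings)
print(summed, overlap_free)
```

In this corpus, "apple" co-occurs with "fruit" in two documents and with "juice" in one, but the first document is shared by both pairs, so the summed estimate exceeds the overlap-free count.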

Author information

Authors and Affiliations

College of Computer and Control Engineering, Nankai University, Weijin Rd. 94, 300071, Tianjin, China
Yang Wei, Jinmao Wei & Zhenglu Yang
College of Software, Nankai University, Weijin Rd. 94, 300071, Tianjin, China
Yang Wei, Jinmao Wei & Zhenglu Yang

Editor information

Editors and Affiliations

University of Hong Kong, Hong Kong, China
Reynold Cheng
Computer Science, Peking University, Beijing, China
Bin Cui
Advanced Digital Sciences Center (ADSC), Singapore, Singapore
Zhenjie Zhang
University of Technology, Guangzhou, China
Ruichu Cai
Guangxi University, Guangxi, China
Jia Xu

About this paper

Cite this paper

Wei, Y., Wei, J., Yang, Z. (2015). Extended Strategies for Document Clustering with Word Co-occurrences. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds) Web Technologies and Applications. APWeb 2015. Lecture Notes in Computer Science, vol 9313. Springer, Cham. https://doi.org/10.1007/978-3-319-25255-1_38

DOI: https://doi.org/10.1007/978-3-319-25255-1_38
Published: 13 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25254-4
Online ISBN: 978-3-319-25255-1
eBook Packages: Computer Science, Computer Science (R0)
