Context Vector Model for Document Representation: A Computational Study

  • Conference paper
  • In: Natural Language Processing and Chinese Computing (NLPCC 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9362)

Abstract

To tackle the sparse-data problem of the bag-of-words model for document representation, the Context Vector Model (CVM) has been proposed to enrich a document with the relatedness of all the words in a corpus to that document. Since CVM is in essence a combination of word vectors, the method used to represent words is essential to its performance. This paper presents a computational study comparing the effects of recently proposed word representation methods when embedded in CVM. The experimental results demonstrate that some of these methods significantly improve the performance of CVM, because they estimate the relatedness between words more accurately.
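
Since CVM builds a document representation by combining word vectors, the following minimal Python sketch illustrates that general idea under simplified assumptions: a document vector is formed by averaging the vectors of its in-vocabulary words, and documents are compared by cosine similarity. The toy word vectors and function names are hypothetical, and the plain averaging stands in for the paper's actual relatedness-based weighting, which is not reproduced here.

```python
import numpy as np

def document_context_vector(doc_tokens, word_vectors):
    """Represent a document as the average of its words' vectors.

    Illustrative only: the paper's CVM enriches a document with the
    relatedness of all corpus words, which plain averaging does not capture.
    """
    dim = len(next(iter(word_vectors.values())))
    vec = np.zeros(dim)
    count = 0
    for token in doc_tokens:
        if token in word_vectors:        # skip out-of-vocabulary terms
            vec += word_vectors[token]   # repeated terms contribute repeatedly (term frequency)
            count += 1
    return vec / count if count else vec

def cosine_similarity(a, b):
    """Cosine similarity between two document vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy, hypothetical 3-dimensional word vectors.
word_vectors = {
    "document":   np.array([0.2, 0.7, 0.1]),
    "clustering": np.array([0.1, 0.8, 0.3]),
    "vector":     np.array([0.6, 0.2, 0.5]),
}
d1 = document_context_vector(["document", "clustering", "document"], word_vectors)
d2 = document_context_vector(["vector", "document"], word_vectors)
print(cosine_similarity(d1, d2))
```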

Author information

Corresponding author

Correspondence to Jinmao Wei.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wei, Y., Wei, J., Xu, H. (2015). Context Vector Model for Document Representation: A Computational Study. In: Li, J., Ji, H., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2015. Lecture Notes in Computer Science, vol 9362. Springer, Cham. https://doi.org/10.1007/978-3-319-25207-0_17

  • DOI: https://doi.org/10.1007/978-3-319-25207-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25206-3

  • Online ISBN: 978-3-319-25207-0

  • eBook Packages: Computer Science, Computer Science (R0)
