Abstract
Recent strategies reveal the semantic relatedness between documents by enriching each document with the relatedness of every word in the document collection to that document. When the relatedness is restricted to the expected frequency with which each word occurs in the document, the traditional weighted sum of word vectors is proved to give upper bounds of these expected frequencies. The sum of the word vectors usually contains duplicate counts, which weaken the discriminative power of the enriched document vectors. A strategy that gives lower bounds of the expected frequencies is also obtained by keeping the maximum value of the word vectors on each dimension. Using the lower bounds together with the deviations of word co-occurrence frequencies, a novel method is proposed to remove the duplicate counts in the upper bounds. As a result, the proposed method smooths the generated document vectors better than the weighted sum strategy. Extensive experiments verify that document clustering with the proposed method achieves a significant performance improvement over the existing strategies.
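The two bounding strategies named in the abstract can be illustrated with a minimal sketch. It assumes that each word vector holds non-negative relatedness values over the vocabulary and that each vector is weighted by the word's term frequency in the document; the names `upper_bound`, `lower_bound`, `word_vectors`, and `doc_tf`, as well as the toy numbers, are hypothetical and not taken from the paper. The deviation-based correction is not shown because the abstract does not specify it.

```python
# Hypothetical toy data: relatedness of each document word to the
# vocabulary dimensions ["bank", "money", "river"].
word_vectors = {
    "bank": [1.0, 0.5, 0.25],
    "money": [0.5, 1.0, 0.0],
}
# Term frequencies of the words that occur in one document.
doc_tf = {"bank": 2, "money": 1}


def upper_bound(doc_tf, word_vectors):
    """Weighted sum of word vectors (may count shared evidence twice)."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for word, tf in doc_tf.items():
        for j, value in enumerate(word_vectors[word]):
            total[j] += tf * value
    return total


def lower_bound(doc_tf, word_vectors):
    """Per-dimension maximum over the weighted word vectors."""
    dims = len(next(iter(word_vectors.values())))
    best = [0.0] * dims
    for word, tf in doc_tf.items():
        for j, value in enumerate(word_vectors[word]):
            best[j] = max(best[j], tf * value)
    return best


upper = upper_bound(doc_tf, word_vectors)
lower = lower_bound(doc_tf, word_vectors)
print(upper)  # [2.5, 2.0, 0.5]
print(lower)  # [2.0, 1.0, 0.5]
```

For non-negative inputs the per-dimension maximum never exceeds the weighted sum, so the two vectors bracket the enriched value on each dimension; the gap between them is where duplicate counts can arise.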
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grant 61070089, the Science Foundation of Tianjin under grant 14JCYBJC15700, and the National 863 Project of China under grant 2013AA013204.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wei, Y., Wei, J., Yang, Z. (2015). Enriching Document Representation with the Deviations of Word Co-occurrence Frequencies. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_17
DOI: https://doi.org/10.1007/978-3-319-27122-4_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)