Enriching Document Representation with the Deviations of Word Co-occurrence Frequencies

Wei, Yang; Wei, Jinmao; Yang, Zhenglu

doi:10.1007/978-3-319-27122-4_17

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9529)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Recent strategies reveal the semantic relatedness between documents by enriching each document with the relatedness of every word in the document collection to that document. When this relatedness is restricted to the expected frequency with which each word occurs in the document, the traditional weighted sum of word vectors is proven to give upper bounds on the expected frequencies. Summing the word vectors usually introduces duplicate counts, which weaken the discriminative power of the enriched document vectors. A strategy that gives lower bounds on the expected frequencies is also obtained by keeping the maximum value of the word vectors on each dimension. Using the lower bounds together with the deviations of word co-occurrence frequencies, a novel method is proposed to remove the duplicate counts in the upper bounds. As a result, the proposed method smooths the generated document vectors better than the weighted-sum strategy. Extensive experiments verify that document clustering with the proposed method achieves a significant performance improvement over existing strategies.
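
The two bounds named in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the toy word vectors, and the choice to weight each word vector by its term frequency before taking the maximum are assumptions made for illustration. The duplicate-removal step based on co-occurrence deviations is not reproduced, because the abstract does not specify it.

```python
def weighted_sum(doc_tf, word_vectors):
    """Upper-bound sketch: add up the tf-weighted word vectors per dimension."""
    dim = len(next(iter(word_vectors.values())))
    out = [0.0] * dim
    for word, tf in doc_tf.items():
        for j, value in enumerate(word_vectors[word]):
            out[j] += tf * value
    return out


def dimensionwise_max(doc_tf, word_vectors):
    """Lower-bound sketch: keep the largest tf-weighted value per dimension.

    Weighting by term frequency before the max is an assumption; entries
    are assumed non-negative, so 0.0 is a safe starting value.
    """
    dim = len(next(iter(word_vectors.values())))
    out = [0.0] * dim
    for word, tf in doc_tf.items():
        for j, value in enumerate(word_vectors[word]):
            out[j] = max(out[j], tf * value)
    return out


# Hypothetical co-occurrence-based word vectors over a 3-word vocabulary.
word_vectors = {
    "cat": [1.0, 0.5, 0.0],
    "dog": [0.5, 1.0, 0.25],
}
# Hypothetical term frequencies of one document.
doc_tf = {"cat": 2, "dog": 1}

upper = weighted_sum(doc_tf, word_vectors)
lower = dimensionwise_max(doc_tf, word_vectors)
print(upper, lower)
```

With non-negative entries, each component of the maximum-based vector is at most the corresponding component of the weighted sum. The gap between the two is where duplicate counts can sit in this toy setting.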

Notes

1. www.cad.zju.edu.cn/home/dengcai/Data/TextData.html

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61070089, the Science Foundation of Tianjin under Grant 14JCYBJC15700, and the National 863 Project of China under Grant 2013AA013204.

Author information

Authors and Affiliations

College of Computer and Control Engineering, Nankai University, Tianjin, 300071, China
Yang Wei, Jinmao Wei & Zhenglu Yang
College of Software, Nankai University, Tianjin, 300071, China
Yang Wei, Jinmao Wei & Zhenglu Yang

Corresponding author

Correspondence to Jinmao Wei.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Wei, Y., Wei, J., Yang, Z. (2015). Enriching Document Representation with the Deviations of Word Co-occurrence Frequencies. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9529. Springer, Cham. https://doi.org/10.1007/978-3-319-27122-4_17

DOI: https://doi.org/10.1007/978-3-319-27122-4_17
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27121-7
Online ISBN: 978-3-319-27122-4
eBook Packages: Computer Science, Computer Science (R0)
