Generating Bags of Words from the Sums of Their Word Embeddings

  • Conference paper
  • In: Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Abstract

Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process: recovering sentences from their embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from a SOWE representation back to the bag of words (BOW) of the original sentence. This is a partway step towards recovering the whole sentence, and it has useful theoretical and practical applications of its own. The conversion is performed by a greedy algorithm that turns the vector into a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words of sentences from a large corpus, given only their sentence embeddings.
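
To make the greedy conversion step concrete, here is a minimal Python/NumPy sketch of one plausible greedy loop of the kind the abstract describes. It is a hypothetical rendering, not the authors' implementation (their code is linked from footnote 1): at each step it adds whichever vocabulary word most reduces the residual between the target SOWE vector and the running sum, stopping when no word gives a strict improvement. The function name and toy data are illustrative assumptions.

    import numpy as np

    def greedy_bow_from_sowe(target, embeddings, words, max_words=50):
        """Greedily recover a bag of words whose embedding sum approximates target.

        embeddings: (V, d) matrix, one row per vocabulary word.
        Each step adds the word that most shrinks the residual norm;
        the loop stops when no word strictly improves the approximation.
        """
        residual = np.asarray(target, dtype=float).copy()
        bag = []
        for _ in range(max_words):
            # Change in squared residual norm from adding word w:
            # ||r - e_w||^2 - ||r||^2 = ||e_w||^2 - 2 e_w . r
            delta = np.sum(embeddings ** 2, axis=1) - 2.0 * embeddings @ residual
            best = int(np.argmin(delta))
            if delta[best] >= 0.0:  # no word reduces the residual; stop
                break
            bag.append(words[best])
            residual -= embeddings[best]
        return bag

    # Toy usage with random vectors standing in for pretrained embeddings
    # (e.g. the GloVe vectors mentioned in footnote 2):
    rng = np.random.default_rng(0)
    words = ["the", "cat", "sat"]
    E = rng.normal(size=(3, 50))
    sowe = E.sum(axis=0)  # sum of word embeddings for "the cat sat"
    print(greedy_bow_from_sowe(sowe, E, words))  # typically recovers all three words

Note that a pure greedy search of this kind can stall in local minima for large vocabularies; it is shown here only to illustrate the shape of the vector-to-BOW conversion problem.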

Beyond its practical applications, such as allowing classical information retrieval methods to be combined with more recent methods based on sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum-of-embeddings representation. This lends some credence to viewing the SOWE as a dimensionality-reduced, meaning-enhanced data manifold for the bag of words.

Notes

  1. http://white.ucc.asn.au/publications/White2015BOWgen/.

  2. Kindly made available online at http://nlp.stanford.edu/projects/glove/.

  3. The same is true for any number of repetitions of the word buffalo, each of which forms a valid sentence, as noted in Tymoczko et al. (1995).

  4. http://www.cicling.org/2016/data/97.

References

  • Bezanson, J., et al.: Julia: a fresh approach to numerical computing (2014). arXiv preprint arXiv:1411.1607 [cs.MS]

  • Bowman, S.R., et al.: Generating sentences from a continuous space (2015). arXiv preprint arXiv:1511.06349

  • Dinu, G., Baroni, M.: How to make words with vectors: phrase generation in distributional semantics. In: Proceedings of ACL, pp. 624–633 (2014)

  • Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2

  • Francis, W.N., Kucera, H.: Brown corpus manual. Brown University (1979)

  • Iyyer, M., Boyd-Graber, J., Daumé III, H.: Generating sentences from semantic vector space representations. In: NIPS Workshop on Learning Semantics (2014)

  • Kågebäck, M., et al.: Extractive summarization using continuous vector space models. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL, pp. 31–39 (2014)

  • Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9

  • Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1188–1196 (2014)

  • Mikolov, T., et al.: Efficient estimation of word representations in vector space (2013a). arXiv preprint arXiv:1301.3781

  • Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL, pp. 746–751 (2013b)

  • Nation, I.: How large a vocabulary is needed for reading and listening? Can. Mod. Lang. Rev. 63(1), 59–82 (2006)

  • Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1532–1543 (2014)

  • Ritter, S., et al.: Leveraging preposition ambiguity to assess compositional distributional models of semantics. In: The Fourth Joint Conference on Lexical and Computational Semantics (2015)

  • Socher, R.: Recursive deep learning for natural language processing and computer vision. Ph.D. thesis, Stanford University (2014)

  • Socher, R., et al.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)

  • Tymoczko, T., Henle, J., Henle, J.M.: Sweet Reason: A Field Guide to Modern Logic. Textbooks in Mathematical Sciences. Key College (1995). ISBN 9780387989303

  • White, L., et al.: How well sentence embeddings capture meaning. In: Proceedings of the 20th Australasian Document Computing Symposium (ADCS 2015), Parramatta, pp. 9:1–9:8. ACM (2015). https://doi.org/10.1145/2838931.2838932. ISBN 978-1-4503-4040-3

  • Yin, W., Schütze, H.: Learning word meta-embeddings by using ensembles of embedding sets (2015). arXiv preprint arXiv:1508.04257

  • Yogatama, D., Liu, F., Smith, N.A.: Extractive summarization by maximizing semantic volume. In: Conference on Empirical Methods in Natural Language Processing (2015)

  • Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books (2015). arXiv preprint arXiv:1506.06724

Acknowledgements

This research is supported by the Australian Postgraduate Award, and partially funded by Australian Research Council grants DP150102405 and LP110100050. Computational resources were provided by the National eResearch Collaboration Tools and Resources project (Nectar).

Author information

Correspondence to Lyndon White.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

White, L., Togneri, R., Liu, W., Bennamoun, M. (2018). Generating Bags of Words from the Sums of Their Word Embeddings. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol. 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_5

  • DOI: https://doi.org/10.1007/978-3-319-75477-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75476-5

  • Online ISBN: 978-3-319-75477-2

  • eBook Packages: Computer Science (R0)
