Abstract
Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process – recovering sentences from sentence embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from SOWE representations back to the bag of words (BOW) of the original sentences. This is a partway step towards recovering the whole sentence, and it has useful theoretical and practical applications of its own. It is achieved using a greedy algorithm that converts the vector to a bag of words. To our knowledge, this is the first such work. We demonstrate qualitatively the ability to recreate the words of sentences from a large corpus based on their sentence embeddings.
As well as its practical application of allowing classical information retrieval methods to be combined with more recent methods based on the sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum-of-embeddings representation. This lends some credence to considering the SOWE as a dimensionality-reduced, and meaning-enhanced, data manifold for the bag of words.
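The core greedy idea described above can be sketched as follows: repeatedly pick the vocabulary word whose embedding brings the residual (the target sum minus the embeddings selected so far) closest to zero, until no word shrinks it further. This is a minimal illustration under our own assumptions, not the authors' implementation; the function name, stopping rule, and vector setup are ours.

```python
import numpy as np

def greedy_bow_from_sowe(target, vocab_vecs, max_words=30, tol=1e-6):
    """Greedily recover a bag of words whose embedding sum approximates `target`.

    target: (d,) sum-of-word-embeddings vector for a sentence.
    vocab_vecs: (V, d) matrix, one embedding per vocabulary word.
    Returns a list of vocabulary indices (repetition allowed).
    """
    residual = target.astype(float).copy()
    bag = []
    for _ in range(max_words):
        # Distance from the residual to each candidate word's embedding.
        dists = np.linalg.norm(residual[None, :] - vocab_vecs, axis=1)
        best = int(np.argmin(dists))
        # Stop when subtracting the best word would not shrink the residual.
        if dists[best] >= np.linalg.norm(residual) - tol:
            break
        bag.append(best)
        residual -= vocab_vecs[best]
    return bag
```

With high-dimensional embeddings, distinct word vectors are nearly orthogonal, so subtracting a word that is actually in the sentence sharply reduces the residual norm, which is what makes the greedy selection effective in practice.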
Notes
- 1.
- 2. Kindly made available online at http://nlp.stanford.edu/projects/glove/.
- 3. The same is true for any number of repetitions of the word buffalo – each of which forms a valid sentence, as noted in Tymoczko et al. (1995).
- 4.
References
Bezanson, J., et al.: Julia: a fresh approach to numerical computing (2014). arXiv:1411.1607 [cs.MS]
Bowman, S.R., et al.: Generating sentences from a continuous space (2015). arXiv preprint arXiv:1511.06349
Dinu, G., Baroni, M.: How to make words with vectors: phrase generation in distributional semantics. In: Proceedings of ACL, pp. 624–633 (2014)
Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Francis, W.N., Kucera, H.: Brown corpus manual. Brown University (1979)
Iyyer, M., Boyd-Graber, J., Daumé III, H.: Generating sentences from semantic vector space representations. In: NIPS Workshop on Learning Semantics (2014)
Kågebäck, M., et al.: Extractive summarization using continuous vector space models. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL, pp. 31–39 (2014)
Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML-2014), pp. 1188–1196 (2014)
Mikolov, T., et al.: Efficient estimation of word representations in vector space (2013a). arXiv preprint arXiv:1301.3781
Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL, pp. 746–751 (2013b)
Nation, I.: How large a vocabulary is needed for reading and listening? Can. Mod. Lang. Rev. 63(1), 59–82 (2006)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1532–1543 (2014)
Ritter, S., et al.: Leveraging preposition ambiguity to assess compositional distributional models of semantics. In: The Fourth Joint Conference on Lexical and Computational Semantics (2015)
Socher, R.: Recursive deep learning for natural language processing and computer vision. Ph.D. thesis. Stanford University (2014)
Socher, R., et al.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
Tymoczko, T., Henle, J., Henle, J.M.: Sweet reason: a field guide to modern logic. In: Textbooks in Mathematical Sciences. Key College (1995). ISBN 9780387989303
White, L., et al.: How well sentence embeddings capture meaning. In: Proceedings of the 20th Australasian Document Computing Symposium (ADCS 2015), Parramatta, pp. 9:1–9:8. ACM (2015). https://doi.org/10.1145/2838931.2838932. ISBN 978-1-4503-4040-3
Yin, W., Schütze, H.: Learning word meta-embeddings by using ensembles of embedding sets (2015). arXiv preprint arXiv:1508.04257
Yogatama, D., Liu, F., Smith, N.A.: Extractive summarization by maximizing semantic volume. In: Conference on Empirical Methods in Natural Language Processing (2015)
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books (2015). arXiv preprint arXiv:1506.06724
Acknowledgements
This research is supported by the Australian Postgraduate Award, and partially funded by Australian Research Council grants DP150102405 and LP110100050. Computational resources were provided by the National eResearch Collaboration Tools and Resources project (Nectar).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
White, L., Togneri, R., Liu, W., Bennamoun, M. (2018). Generating Bags of Words from the Sums of Their Word Embeddings. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science(), vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_5
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2