Generating Bags of Words from the Sums of Their Word Embeddings

  • Conference paper
  • In: Computational Linguistics and Intelligent Text Processing (CICLing 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9623)

Abstract

Many methods have been proposed to generate sentence vector representations, such as recursive neural networks, latent distributed memory models, and the simple sum of word embeddings (SOWE). However, very few methods demonstrate the ability to reverse the process: recovering sentences from their embeddings. Amongst the many sentence embeddings, SOWE has been shown to maintain semantic meaning, so in this paper we introduce a method for moving from a SOWE representation back to the bag of words (BOW) of the original sentence. This is a partway step towards recovering the whole sentence, and it has useful theoretical and practical applications of its own. The conversion is performed by a greedy algorithm that turns the vector into a bag of words. To our knowledge this is the first such work. It demonstrates qualitatively the ability to recreate the words of sentences from a large corpus, given only their sentence embeddings.
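
To make the greedy conversion step concrete, here is a minimal Python/NumPy sketch of one plausible greedy loop of the kind the abstract describes. It is a hypothetical rendering, not the authors' implementation (their code is linked from footnote 1): at each step it adds whichever vocabulary word most reduces the residual between the target SOWE vector and the running sum, stopping when no word gives a strict improvement. The function name and toy data are illustrative assumptions.

    import numpy as np

    def greedy_bow_from_sowe(target, embeddings, words, max_words=50):
        """Greedily recover a bag of words whose embedding sum approximates target.

        embeddings: (V, d) matrix, one row per vocabulary word.
        Each step adds the word that most shrinks the residual norm;
        the loop stops when no word strictly improves the approximation.
        """
        residual = np.asarray(target, dtype=float).copy()
        bag = []
        for _ in range(max_words):
            # Change in squared residual norm from adding word w:
            # ||r - e_w||^2 - ||r||^2 = ||e_w||^2 - 2 e_w . r
            delta = np.sum(embeddings ** 2, axis=1) - 2.0 * embeddings @ residual
            best = int(np.argmin(delta))
            if delta[best] >= 0.0:  # no word reduces the residual; stop
                break
            bag.append(words[best])
            residual -= embeddings[best]
        return bag

    # Toy usage with random vectors standing in for pretrained embeddings
    # (e.g. the GloVe vectors mentioned in footnote 2):
    rng = np.random.default_rng(0)
    words = ["the", "cat", "sat"]
    E = rng.normal(size=(3, 50))
    sowe = E.sum(axis=0)  # sum of word embeddings for "the cat sat"
    print(greedy_bow_from_sowe(sowe, E, words))  # typically recovers all three words

Note that a pure greedy search of this kind can stall in local minima for large vocabularies; it is shown here only to illustrate the shape of the vector-to-BOW conversion problem.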

Beyond its practical applications, such as allowing classical information retrieval methods to be combined with more recent methods based on sums of word embeddings, the success of this method has theoretical implications for the degree of information maintained by the sum-of-embeddings representation. This lends some credence to viewing the SOWE as a dimensionality-reduced, meaning-enhanced data manifold for the bag of words.

Notes

  1. http://white.ucc.asn.au/publications/White2015BOWgen/.

  2. Kindly made available online at http://nlp.stanford.edu/projects/glove/.

  3. The same is true for any number of repetitions of the word buffalo, each of which forms a valid sentence, as noted in Tymoczko et al. (1995).

  4. http://www.cicling.org/2016/data/97.

References

  • Bezanson, J., et al.: Julia: a fresh approach to numerical computing (2014). arXiv preprint arXiv:1411.1607 [cs.MS]

  • Bowman, S.R., et al.: Generating sentences from a continuous space (2015). arXiv preprint arXiv:1511.06349

  • Dinu, G., Baroni, M.: How to make words with vectors: phrase generation in distributional semantics. In: Proceedings of ACL, pp. 624–633 (2014)

  • Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2

  • Francis, W.N., Kucera, H.: Brown corpus manual. Brown University (1979)

  • Iyyer, M., Boyd-Graber, J., Daumé III, H.: Generating sentences from semantic vector space representations. In: NIPS Workshop on Learning Semantics (2014)

  • Kågebäck, M., et al.: Extractive summarization using continuous vector space models. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) @ EACL, pp. 31–39 (2014)

  • Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. The IBM Research Symposia Series, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9

  • Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1188–1196 (2014)

  • Mikolov, T., et al.: Efficient estimation of word representations in vector space (2013a). arXiv preprint arXiv:1301.3781

  • Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL, pp. 746–751 (2013b)

  • Nation, I.: How large a vocabulary is needed for reading and listening? Can. Mod. Lang. Rev. 63(1), 59–82 (2006)

  • Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pp. 1532–1543 (2014)

  • Ritter, S., et al.: Leveraging preposition ambiguity to assess compositional distributional models of semantics. In: The Fourth Joint Conference on Lexical and Computational Semantics (2015)

  • Socher, R.: Recursive deep learning for natural language processing and computer vision. Ph.D. thesis, Stanford University (2014)

  • Socher, R., et al.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)

  • Tymoczko, T., Henle, J., Henle, J.M.: Sweet Reason: A Field Guide to Modern Logic. Textbooks in Mathematical Sciences. Key College (1995). ISBN 9780387989303

  • White, L., et al.: How well sentence embeddings capture meaning. In: Proceedings of the 20th Australasian Document Computing Symposium (ADCS 2015), Parramatta, pp. 9:1–9:8. ACM (2015). https://doi.org/10.1145/2838931.2838932. ISBN 978-1-4503-4040-3

  • Yin, W., Schütze, H.: Learning word meta-embeddings by using ensembles of embedding sets (2015). arXiv preprint arXiv:1508.04257

  • Yogatama, D., Liu, F., Smith, N.A.: Extractive summarization by maximizing semantic volume. In: Conference on Empirical Methods in Natural Language Processing (2015)

  • Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books (2015). arXiv preprint arXiv:1506.06724

Acknowledgements

This research is supported by the Australian Postgraduate Award, and partially funded by Australian Research Council grants DP150102405 and LP110100050. Computational resources were provided by the National eResearch Collaboration Tools and Resources project (Nectar).

Author information

Correspondence to Lyndon White.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

White, L., Togneri, R., Liu, W., Bennamoun, M. (2018). Generating Bags of Words from the Sums of Their Word Embeddings. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol. 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_5

  • DOI: https://doi.org/10.1007/978-3-319-75477-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75476-5

  • Online ISBN: 978-3-319-75477-2

  • eBook Packages: Computer Science (R0)
