
Learning Sparse Overcomplete Word Vectors Without Intermediate Dense Representations

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10412)


Abstract

Dense word representation models have attracted a lot of interest for their promising performance in various natural language processing (NLP) tasks. However, dense word vectors are uninterpretable, inseparable, and expensive in both time and space. We propose a model that learns sparse word representations directly from plain text, in contrast to most existing methods, which learn sparse vectors from intermediate dense word embeddings. We also design an efficient algorithm based on noise-contrastive estimation (NCE) to train the model, together with a clustering-based adaptive updating scheme for the noise distribution that makes learning with NCE effective. Experimental results show that the resulting sparse word vectors are comparable to dense vectors on word analogy tasks and outperform dense word vectors on word similarity tasks. According to sparse-vector visualization and word-intruder identification experiments, the sparse word vectors are also much more interpretable.
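
To make the training recipe in the abstract concrete, the sketch below pairs an NCE-style binary objective over observed versus noise (word, context) pairs with a soft-thresholding (proximal L1) step that keeps the overcomplete word vectors sparse. It is only a minimal illustration under assumed choices (uniform noise distribution, plain SGD, illustrative sizes), not the paper's SpVec/SpVec-nce algorithm, and it omits the clustering-based adaptive noise-distribution update.

```python
# Minimal sketch, NOT the paper's exact algorithm: NCE-style updates plus a
# soft-thresholding step for sparsity. All sizes and hyperparameters below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 1000, 300, 5             # vocab size, overcomplete dimension, noise samples
W = rng.normal(0.0, 0.01, (V, D))  # word vectors (the ones kept sparse)
C = rng.normal(0.0, 0.01, (V, D))  # context vectors
noise_p = np.full(V, 1.0 / V)      # noise distribution over contexts (uniform here)
lr, lam = 0.05, 0.5                # learning rate; L1 strength (large so sparsity is visible)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_step(w_id, c_id):
    """One update for an observed (word, context) pair plus K noise contexts."""
    noise_ids = rng.choice(V, size=K, p=noise_p)
    for cid, label in [(c_id, 1.0)] + [(n, 0.0) for n in noise_ids]:
        score = W[w_id] @ C[cid] - np.log(K * noise_p[cid])  # NCE classification score
        g = sigmoid(score) - label                           # gradient of the logistic loss
        grad_w, grad_c = g * C[cid], g * W[w_id]
        W[w_id] -= lr * grad_w
        C[cid] -= lr * grad_c
    # Proximal (soft-thresholding) step: this is what produces sparsity in W.
    W[w_id] = np.sign(W[w_id]) * np.maximum(np.abs(W[w_id]) - lr * lam, 0.0)

# Toy usage on a few hand-made (word id, context id) pairs.
for w_id, c_id in [(3, 17), (3, 42), (99, 7)]:
    nce_step(w_id, c_id)
print("nonzeros in W[3]:", np.count_nonzero(W[3]))
```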

Notes

  1. Just as a word is composed of word atoms, we assume that a context is composed of context atoms. In this paper, we use surrounding words as contexts, and thus \(\mathcal {V} = \mathcal {C}\). The numbers of word and context atoms are also set to be equal, i.e., \(n_{b} = n_{c}\).

  2. SpVec-nce's learning algorithm can be derived similarly.

  3. We follow Python's indexing conventions, except that indices start at 1.

  4. We define the product of a matrix \({\varvec{A}} \in \mathbb {R}^{m \times n}\) and a 3-way tensor \({\varvec{B}} \in \mathbb {R}^{n\times p \times q}\) to be the 3-way tensor \({\varvec{C}}\) such that \({\varvec{C}}_{:, i, j} = {\varvec{AB}}_{:,i,j}\) (see the sketch after these notes).

  5. https://github.com/mfaruqui/sparse-coding.

  6. https://github.com/dav/word2vec.

  7. https://github.com/dav/word2vec/blob/master/data/questions-words.txt.

  8. http://research.microsoft.com/en-us/projects/rnn/.
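
The matrix-tensor product in note 4 is just slice-wise matrix multiplication. The short NumPy check below (with arbitrary example shapes, not values from the paper) confirms that contracting over the shared axis matches that definition.

```python
# Illustrative check of the product in note 4:
#   C[:, i, j] = A @ B[:, i, j]  for A in R^{m x n}, B in R^{n x p x q}.
import numpy as np

rng = np.random.default_rng(1)
m, n, p, q = 4, 3, 5, 2
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, p, q))

C = np.einsum('mn,npq->mpq', A, B)  # contract A and B over the shared axis n
# Equivalently: C = np.tensordot(A, B, axes=([1], [0]))

# Verify the slice-wise definition from the note.
for i in range(p):
    for j in range(q):
        assert np.allclose(C[:, i, j], A @ B[:, i, j])
print(C.shape)  # (4, 5, 2)
```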


Acknowledgement

This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant Nos. 61421091, 61232015 and 61502014.

Author information

Correspondence to Ge Li or Zhi Jin.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chen, Y., Li, G., Jin, Z. (2017). Learning Sparse Overcomplete Word Vectors Without Intermediate Dense Representations. In: Li, G., Ge, Y., Zhang, Z., Jin, Z., Blumenstein, M. (eds) Knowledge Science, Engineering and Management. KSEM 2017. Lecture Notes in Computer Science, vol 10412. Springer, Cham. https://doi.org/10.1007/978-3-319-63558-3_1

  • DOI: https://doi.org/10.1007/978-3-319-63558-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63557-6

  • Online ISBN: 978-3-319-63558-3

  • eBook Packages: Computer Science (R0)
