The representational geometry of word meanings acquired by neural machine translation models

Machine Translation

Abstract

This work presents the first comprehensive analysis of the properties of word embeddings learned by neural machine translation (NMT) models trained on bilingual texts. We show that the word representations of NMT models outperform those learned from monolingual text by established algorithms such as Skipgram and CBOW on tasks that require knowledge of semantic similarity and/or lexical–syntactic role. These effects hold when translating from English to French and from English to German, and we argue that the desirable properties of NMT word embeddings should emerge largely independently of the source and target languages. Further, we apply a recently proposed heuristic method for training NMT models with very large vocabularies, and show that this vocabulary expansion method results in minimal degradation of embedding quality. This allows us to make a large vocabulary of NMT embeddings available for future research and applications. Overall, our analyses indicate that NMT embeddings should be used in applications that require word concepts to be organised according to similarity and/or lexical function, while monolingual embeddings are better suited to modelling (nonspecific) inter-word relatedness.
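
To make the evaluation protocol concrete, here is a minimal sketch (not the authors' code) of how word embeddings are typically scored on a similarity benchmark such as SimLex-999: rank word pairs by the cosine similarity of their vectors and report the Spearman correlation with human ratings. The embedding dictionary and benchmark format are assumed, hypothetical inputs.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings, benchmark):
    """Spearman correlation between model and human similarity scores.

    embeddings: dict mapping words to numpy vectors (hypothetical format).
    benchmark: iterable of (word1, word2, human_rating) triples; pairs
    with out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in benchmark:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```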

Notes

  1. Subsequent variants use different algorithms for selecting the \((w,c)\) pairs from the training corpus (Hill and Korhonen 2014; Levy and Goldberg 2014); the standard window-based selection is sketched in code after these notes.

  2. The models of Chandar et al. (2014) and Hermann and Blunsom (2014) both aim to minimise the divergence between source- and target-language sentences represented as sums of word embeddings (see the second sketch after these notes). Given this similarity, we do not compare against both models in this paper.

  3. Lack of access to source code and limited GPU time prevent us from training and evaluating the embeddings of other NMT models, such as those of Kalchbrenner and Blunsom (2013), Devlin et al. (2014) and Sutskever et al. (2014). However, the underlying principles of encoding and decoding also apply to these models, and we expect their embeddings to exhibit properties similar to those analysed here.

  4. These corpora were produced from the WMT14 parallel data after conducting the data-selection procedure described by Cho et al. (2014).

  5. Available from http://www.cs.cmu.edu/~mfaruqui/soft.html. The available embeddings were trained on English–German aligned data, but the authors report similar results for English–French.

  6. For a more detailed discussion of the similarity/relatedness distinction, see Hill et al. (2014).

  7. The most dissimilar pair in SimLex-Assoc-333 is [shrink, grow], with a score of 0.23; the most similar is [vanish, disappear], with 9.80.

  8. To control for different vocabularies, we restricted the effective vocabulary of each model to the intersection of all model vocabularies and excluded all questions containing an answer outside this intersection (sketched in code after these notes).

  9. Available online at http://www.cl.cam.ac.uk/~fh295/.

  10. We did not do the same for our translation models because sentence-aligned bilingual corpora of comparable size do not exist.

  11. The performance of the FD embeddings on this task is higher than that reported by Faruqui and Dyer (2014) because we search for answers over a smaller total candidate vocabulary.

  12. A different solution to the rare-word problem was proposed by Luong et al. (2014). We do not evaluate this method's effect on the resulting embeddings because we lack access to the source code.
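
The following is a minimal sketch, referenced from note 1, of the standard symmetric-window selection of \((w,c)\) training pairs used by Skipgram; the cited variants differ mainly in this selection step. It is an illustrative reconstruction, not code from any of the models above.

```python
def window_pairs(tokens, window=2):
    """Yield (w, c) pairs from a symmetric window around each token."""
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (w, tokens[j])

# Example: list(window_pairs(["the", "cat", "sat"], window=1))
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```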
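
For note 2, an illustrative sketch of the shared objective of Chandar et al. (2014) and Hermann and Blunsom (2014): drive down the divergence between aligned sentences represented as sums of word embeddings. The embedding lookup tables here are hypothetical placeholders.

```python
import numpy as np

def alignment_loss(src_tokens, tgt_tokens, src_emb, tgt_emb):
    """Squared Euclidean divergence between bag-of-embedding sentence sums.

    src_emb and tgt_emb map words to vectors of equal dimension; training
    minimises this quantity for aligned sentence pairs.
    """
    src_vec = np.sum([src_emb[w] for w in src_tokens], axis=0)
    tgt_vec = np.sum([tgt_emb[w] for w in tgt_tokens], axis=0)
    return float(np.sum((src_vec - tgt_vec) ** 2))
```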
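
And for note 8, a sketch of the vocabulary control, assuming each model's vocabulary is available as a set of words and each analogy question as a four-word tuple (both representations are assumptions, not the paper's data format):

```python
from functools import reduce

def filter_questions(questions, model_vocabs):
    """Keep only analogy questions whose four words all lie in the
    intersection of every model's vocabulary."""
    shared = reduce(set.intersection, map(set, model_vocabs))
    return [q for q in questions if all(w in shared for w in q)]
```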

References

  • Agirre E, Alfonseca E, Hall K, Kravalova J, Pasca M, Soroa A (2009) A study on similarity and relatedness using distributional and WordNet-based approaches. In: Proceedings of NAACL-HLT 2009

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR

  • Baroni M, Dinu G, Kruszewski G (2014) Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 1

  • Bengio Y, Sénécal JS (2003) Quick training of probabilistic neural nets by importance sampling. In: Proceedings of AISTATS 2003

  • Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  • Bruni E, Tran NK, Baroni M (2014) Multimodal distributional semantics. J Artif Intell Res (JAIR) 49:1–47

  • Chandar S, Lauly S, Larochelle H, Khapra MM, Ravindran B, Raykar V, Saha A (2014) An autoencoder approach to learning bilingual word representations. In: NIPS

  • Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the empirical methods in natural language processing (EMNLP 2014)

  • Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning, ACM, pp 160–167

  • Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  • Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore

  • Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Proceedings of EACL, vol 2014

  • Firth JR (1957) A synopsis of linguistic theory 1930–1955. Philological Society, Oxford, pp 1–32

  • Haghighi A, Liang P, Berg-Kirkpatrick T, Klein D (2008) Learning bilingual lexicons from monolingual corpora. In: ACL, vol 2008, pp 771–779

  • Hermann KM, Blunsom P (2014) Multilingual distributed representations without word alignment. In: Proceedings of ICLR

  • Hill F, Korhonen A (2014) Learning abstract concepts from multi-modal data: since you probably can’t see what I mean. In: Proceedings of the empirical methods in natural language processing (EMNLP 2014)

  • Hill F, Reichart R, Korhonen A (2014) SimLex-999: evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456

  • Jean S, Cho K, Memisevic R, Bengio Y (2015) On using very large target vocabulary for neural machine translation. In: Proceedings of NAACL

  • Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing, Association for Computational Linguistics, Seattle

  • Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: COLING

  • Kočiský T, Hermann KM, Blunsom P (2014) Learning bilingual word representations by marginalizing alignments. In: Proceedings of ACL

  • Kusner M, Sun Y, Kolkin N, Weinberger KQ (2015) From word embeddings to document distances. In: Proceedings of the 32nd international conference on machine learning (ICML-15), pp 957–966

  • Landauer TK, Dumais ST (1997) A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211

  • Levy O, Goldberg Y (2014) Dependency-based word embeddings. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 2

  • Luong T, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206

  • Mikolov T, Le QV, Sutskever I (2013a) Exploiting similarities among languages for machine translation. CoRR

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

  • Mnih A, Hinton GE (2009) A scalable hierarchical distributed language model. In: Advances in neural information processing systems, pp 1081–1088

  • Morin F, Bengio Y (2005) Hierarchical probabilistic neural network language model. In: Proceedings of AISTATS 2005, pp 246–252

  • Nelson DL, McEvoy CL, Schreiber TA (2004) The University of South Florida free association, rhyme, and word fragment norms. Behav Res Methods Instrum Comput 36(3):402–407

  • Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the empirical methods in natural language processing (EMNLP 2014)

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of NIPS

  • Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37(1):141–188

  • Vulić I, De Smet W, Moens MF (2011) Identifying word translations from comparable corpora using latent topic models. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers, vol 2, Association for Computational Linguistics, pp 479–484

  • Weston J, Bengio S, Usunier N (2010) Large scale image annotation: learning to rank with joint word-image embeddings. Mach Learn 81(1):21–35

Acknowledgements

This work was in part funded by a Google European Doctoral Fellowship and a Google Faculty Award.

Corresponding author

Correspondence to Felix Hill.

Cite this article

Hill, F., Cho, K., Jean, S. et al. The representational geometry of word meanings acquired by neural machine translation models. Machine Translation 31, 3–18 (2017). https://doi.org/10.1007/s10590-017-9194-2
