Abstract
This work presents the first comprehensive analysis of the properties of word embeddings learned by neural machine translation (NMT) models trained on bilingual texts. We show that the word representations of NMT models outperform those learned from monolingual text by established algorithms such as Skipgram and CBOW on tasks that require knowledge of semantic similarity and/or lexical–syntactic role. These effects hold when translating from English to French and from English to German, and we argue that the desirable properties of NMT word embeddings should emerge largely independently of the source and target languages. Further, we apply a recently proposed heuristic method for training NMT models with very large vocabularies, and show that this vocabulary expansion method results in minimal degradation of embedding quality. This allows us to make a large vocabulary of NMT embeddings available for future research and applications. Overall, our analyses indicate that NMT embeddings should be used in applications that require word concepts to be organised according to similarity and/or lexical function, while monolingual embeddings are better suited to modelling (nonspecific) inter-word relatedness.
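For readers who wish to reproduce this kind of comparison, the standard protocol correlates model cosine similarities with human ratings on benchmarks such as SimLex-999. The following is a minimal sketch of that protocol, not the authors' code; it assumes each embedding set is held as a plain word-to-vector dict:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_correlation(embeddings, rated_pairs):
    """Spearman correlation between model and human similarity scores.

    embeddings: dict mapping word -> numpy vector.
    rated_pairs: iterable of (word1, word2, human_rating) triples,
    e.g. parsed from a SimLex-999 file; pairs containing an
    out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```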
Notes
Limited access to source code and GPU time prevents us from training and evaluating the embeddings of other NMT models, such as those of Kalchbrenner and Blunsom (2013), Devlin et al. (2014) and Sutskever et al. (2014). The underlying principles of encoding–decoding also apply to these models, however, and we expect their embeddings would exhibit similar properties to those analysed here.
These corpora were produced from the WMT14 parallel data after conducting the data-selection procedure described by Cho et al. (2014).
Available from http://www.cs.cmu.edu/~mfaruqui/soft.html. The available embeddings were trained on English–German aligned data, but the authors report similar results for English–French.
For a more detailed discussion of the similarity/relatedness distinction, see Hill et al. (2014).
The most dissimilar pair in SimLex-Assoc-333 is [shrink,grow] with a score of 0.23. The highest is [vanish,disappear] with 9.80.
To control for different vocabularies, we restricted the effective vocabulary of each model to the intersection of all model vocabularies, and excluded all questions that contained an answer outside of this intersection.
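A minimal sketch of this control, assuming each model's embeddings are stored as a word-to-vector dict and each analogy question is a (prompt, answer) pair; all names here are illustrative rather than the authors' code:

```python
from functools import reduce

def shared_vocabulary(models):
    """Intersect the vocabularies of all models under comparison."""
    return reduce(set.intersection, (set(m.keys()) for m in models))

def filter_questions(questions, vocab):
    """Drop questions whose answer lies outside the shared vocabulary."""
    return [(prompt, answer) for prompt, answer in questions if answer in vocab]
```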
Available online at http://www.cl.cam.ac.uk/~fh295/.
We did not do the same for our translation models because sentence-aligned bilingual corpora of comparable size do not exist.
The performance of the FD embeddings on this task is higher than that reported by Faruqui and Dyer (2014) because we search for answers over a smaller total candidate vocabulary.
A different solution to the rare-word problem was proposed by Luong et al. (2014). We do not evaluate the effect of this method on the resulting embeddings because we lack access to the source code.
References
Agirre E, Alfonseca E, Hall K, Kravalova J, Pasca M, Soroa A (2009) A study on similarity and relatedness using distributional and wordnet-based approaches. In: Proceedings of NAACL-HLT 2009
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR
Baroni M, Dinu G, Kruszewski G (2014) Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 1
Bengio Y, Sénécal JS (2003) Quick training of probabilistic neural nets by importance sampling. In: Proceedings of AISTATS 2003
Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Bruni E, Tran NK, Baroni M (2014) Multimodal distributional semantics. J Artif Intell Res (JAIR) 49:1–47
Chandar S, Lauly S, Larochelle H, Khapra MM, Ravindran B, Raykar V, Saha A (2014) An autoencoder approach to learning bilingual word representations. In: Proceedings of NIPS
Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014)
Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning, ACM, pp 160–167
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, Baltimore, June
Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Proceedings of EACL 2014
Firth JR (1957) A synopsis of linguistic theory 1930–1955. Philological Society, Oxford, pp 1–32
Haghighi A, Liang P, Berg-Kirkpatrick T, Klein D (2008) Learning bilingual lexicons from monolingual corpora. In: ACL, vol 2008, pp 771–779
Hermann KM, Blunsom P (2014) Multilingual distributed representations without word alignment. In: Proceedings of ICLR
Hill F, Korhonen A (2014) Learning abstract concepts from multi-modal data: since you probably can't see what I mean. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014)
Hill F, Reichart R, Korhonen A (2014) Simlex-999: evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456
Jean S, Cho K, Memisevic R, Bengio Y (2015) On using very large target vocabulary for neural machine translation. In: Proceedings of NAACL
Kalchbrenner N, Blunsom P (2013) Recurrent continuous translation models. In: Proceedings of the 2013 conference on empirical methods in natural language processing, Association for Computational Linguistics, Seattle
Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of COLING
Kočiský T, Hermann KM, Blunsom P (2014) Learning bilingual word representations by marginalizing alignments. In: Proceedings of ACL
Kusner M, Sun Y, Kolkin N, Weinberger KQ (2015) From word embeddings to document distances. In: Proceedings of the 32nd international conference on machine learning (ICML-15), pp 957–966
Landauer TK, Dumais ST (1997) A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211
Levy O, Goldberg Y (2014) Dependency-based word embeddings. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 2
Luong T, Sutskever I, Le QV, Vinyals O, Zaremba W (2014) Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206
Mikolov T, Le QV, Sutskever I (2013a) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mnih A, Hinton GE (2009) A scalable hierarchical distributed language model. In: Advances in neural information processing systems, pp 1081–1088
Morin F, Bengio Y (2005) Hierarchical probabilistic neural network language model. In: Proceedings of AISTATS 2005, pp 246–252
Nelson DL, McEvoy CL, Schreiber TA (2004) The University of South Florida free association, rhyme, and word fragment norms. Behav Res Methods Instrum Comput 36(3):402–407
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014)
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of NIPS
Turney PD, Pantel P (2010) From frequency to meaning: vector space models of semantics. J Artif Intell Res 37(1):141–188
Vulić I, De Smet W, Moens MF (2011) Identifying word translations from comparable corpora using latent topic models. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers, vol 2, Association for Computational Linguistics, pp 479–484
Weston J, Bengio S, Usunier N (2010) Large scale image annotation: learning to rank with joint word-image embeddings. Mach Learn 81(1):21–35
Acknowledgements
This work was in part funded by a Google European Doctoral Fellowship and a Google Faculty Award.
Cite this article
Hill, F., Cho, K., Jean, S. et al. The representational geometry of word meanings acquired by neural machine translation models. Machine Translation 31, 3–18 (2017). https://doi.org/10.1007/s10590-017-9194-2