Toward meaningful notions of similarity in NLP embedding models

Published in: International Journal on Digital Libraries

Abstract

Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we systematically analyze the statistical distribution of similarity values in two series of experiments. The first examines how the distribution of similarity values depends on the embedding algorithm and its parameters. The second starts by showing that intuitive similarity thresholds do not exist; we then propose a method for determining which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we quantify how taking these thresholds into account during evaluation changes the evaluation scores of the models on similarity test sets. In more abstract terms, our insights pave the way for a better understanding of the notion of similarity in embedding models and for more reliable evaluations of such models.
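The core idea can be made concrete with a short sketch: sample random word pairs from a pretrained model, estimate the distribution of their cosine similarities, and read a threshold off that distribution. This is a minimal illustration under assumptions of our own (the gensim downloader model "glove-wiki-gigaword-100", the sample sizes, and a 95th-percentile cutoff), not the procedure evaluated in the paper.

```python
# Minimal sketch (not the paper's actual procedure): estimate the background
# distribution of cosine similarities between random word pairs in a pretrained
# embedding model, then derive a percentile-based threshold. The model name,
# sample sizes, and the 95th-percentile rule are illustrative assumptions.
import random

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # pretrained GloVe vectors via gensim

vocab = model.index_to_key[:50_000]           # restrict to frequent words
rng = random.Random(42)
pairs = [(rng.choice(vocab), rng.choice(vocab)) for _ in range(100_000)]

# Similarities of random pairs approximate what the model assigns "by chance".
sims = np.array([model.similarity(a, b) for a, b in pairs])
print(f"mean={sims.mean():.3f}  std={sims.std():.3f}")

# Illustrative cutoff: only similarity values above the 95th percentile of the
# random-pair distribution are treated as potentially meaningful for this model.
threshold = np.percentile(sims, 95)
print(f"threshold: {threshold:.3f}")
print("car/automobile meaningful?", model.similarity("car", "automobile") > threshold)
```

In this spirit, a word pair's similarity value is only interpreted as meaningful if it clearly exceeds what random pairs attain in the same model.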

Author information

Correspondence to Ábel Elekes.

Cite this article

Elekes, Á., Englhardt, A., Schäler, M. et al. Toward meaningful notions of similarity in NLP embedding models. Int J Digit Libr 21, 109–128 (2020). https://doi.org/10.1007/s00799-018-0237-y
