Toward meaningful notions of similarity in NLP embedding models

Published in: International Journal on Digital Libraries

Abstract

Finding similar words with the help of word embedding models, such as Word2Vec or GloVe, computed on large-scale digital libraries has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we systematically analyze the statistical distribution of similarity values in two series of experiments. The first examines how the distribution of similarity values depends on the embedding algorithm and its parameters. The second starts by showing that intuitive similarity thresholds do not exist; we then propose a method for determining which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we quantify how taking these thresholds into account during evaluation changes the evaluation scores of the models on similarity test sets. In more abstract terms, our insights pave the way for a better understanding of the notion of similarity in embedding models and for more reliable evaluations of such models.
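The core idea can be made concrete with a short sketch: sample random word pairs from a pretrained model, estimate the distribution of their cosine similarities, and read a threshold off that distribution. This is a minimal illustration under assumptions of our own (the gensim downloader model "glove-wiki-gigaword-100", the sample sizes, and a 95th-percentile cutoff), not the procedure evaluated in the paper.

```python
# Minimal sketch (not the paper's actual procedure): estimate the background
# distribution of cosine similarities between random word pairs in a pretrained
# embedding model, then derive a percentile-based threshold. The model name,
# sample sizes, and the 95th-percentile rule are illustrative assumptions.
import random

import numpy as np
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # pretrained GloVe vectors via gensim

vocab = model.index_to_key[:50_000]           # restrict to frequent words
rng = random.Random(42)
pairs = [(rng.choice(vocab), rng.choice(vocab)) for _ in range(100_000)]

# Similarities of random pairs approximate what the model assigns "by chance".
sims = np.array([model.similarity(a, b) for a, b in pairs])
print(f"mean={sims.mean():.3f}  std={sims.std():.3f}")

# Illustrative cutoff: only similarity values above the 95th percentile of the
# random-pair distribution are treated as potentially meaningful for this model.
threshold = np.percentile(sims, 95)
print(f"threshold: {threshold:.3f}")
print("car/automobile meaningful?", model.similarity("car", "automobile") > threshold)
```

In this spirit, a word pair's similarity value is only interpreted as meaningful if it clearly exceeds what random pairs attain in the same model.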

Author information

Correspondence to Ábel Elekes.

Cite this article

Elekes, Á., Englhardt, A., Schäler, M. et al. Toward meaningful notions of similarity in NLP embedding models. Int J Digit Libr 21, 109–128 (2020). https://doi.org/10.1007/s00799-018-0237-y
