Abstract
Finding similar words with word embedding models such as Word2Vec or GloVe, trained on large-scale digital libraries, has yielded meaningful results in many cases. However, the underlying notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To this end, we systematically analyze the statistical distribution of similarity values in two series of experiments. The first examines how this distribution depends on the embedding algorithm and its parameters. The second starts by showing that intuitive similarity thresholds do not exist; we then propose a method that determines which similarity values and thresholds actually are meaningful for a given embedding model. Based on these results, we quantify how taking these thresholds into account during evaluation changes the scores of the models on similarity test sets. In more general terms, our insights lead to a better understanding of the notion of similarity in embedding models and to more reliable evaluations of such models.
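As a minimal illustration of why raw similarity values are hard to interpret (a sketch, not the paper's method): cosine similarities between random high-dimensional vectors concentrate tightly around zero, so what counts as a "high" similarity value depends on the model, its dimensionality, and its training regime rather than on any universal threshold. The dimensionality and sample size below are arbitrary choices for illustration.

```python
import numpy as np

# Sketch: distribution of cosine similarities between random unit vectors.
# In high dimensions these cluster near 0, which is one reason a fixed,
# model-independent similarity threshold is not meaningful.
rng = np.random.default_rng(42)
dim, n = 300, 1000  # typical embedding dimensionality; illustrative sample size

vecs = rng.standard_normal((n, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize to unit length

# cosine similarities of the first vector against all others
sims = vecs[1:] @ vecs[0]

print(f"mean={sims.mean():.3f}  std={sims.std():.3f}  max={sims.max():.3f}")
```

For unit vectors with independent Gaussian components, the standard deviation of these similarities is roughly 1/sqrt(dim), i.e. about 0.058 at 300 dimensions, so even a similarity of 0.3 is far out in the tail of the random baseline.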
Notes
Available at http://download.wikimedia.org/enwiki/.
Note that we find all these word pairs at the end of the model's list, as they have low similarity values.
Cite this article
Elekes, Á., Englhardt, A., Schäler, M. et al. Toward meaningful notions of similarity in NLP embedding models. Int J Digit Libr 21, 109–128 (2020). https://doi.org/10.1007/s00799-018-0237-y