Improving semantic change analysis by combining word embeddings and word frequencies

International Journal on Digital Libraries

Abstract

Language is constantly evolving. As part of diachronic linguistics, semantic change analysis examines how the meanings of words evolve over time. Such semantic awareness is important for retrieving content from digital libraries. Recent research on semantic change analysis relying on word embeddings has yielded significant improvements over previous work. However, one recent observation has so far been somewhat neglected: the rate of semantic shift correlates negatively with word-usage frequency. In this article, we therefore propose SCAF, Semantic Change Analysis with Frequency. It abstracts from the concrete embeddings and includes word frequencies as an orthogonal feature. SCAF allows using different combinations of embedding type, optimization algorithm and alignment method. Additionally, we leverage existing approaches for time series analysis by using change detection methods to identify semantic shifts. In an evaluation with a realistic setup, SCAF achieves better detection rates than prior approaches: 95% instead of 51%. On the Google Books Ngram data set, our approach detects both known and previously unknown shifts for popular words.
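To make the general idea concrete, the following is a minimal sketch of how an embedding-based displacement signal and a word-frequency signal might be combined into one time series and passed to a simple change detector. It is not the SCAF implementation: the function names, the equal-weight averaging of the two standardized features, and the toy data are all assumptions made for illustration, and the detector is a basic cumulative-sum (CUSUM) estimate in the spirit of change-point analysis. The per-period vectors are assumed to be already aligned to a common space.

```python
import math

def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def zscores(xs):
    """Standardize a series to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [0.0 if std == 0 else (x - mean) / std for x in xs]

def combined_series(vectors, freqs):
    """Hypothetical combination (an assumption, not the paper's method):
    average the standardized distance-to-first-period and the
    standardized log frequency for each time period."""
    dist = [cosine_distance(vectors[0], v) for v in vectors]
    logf = [math.log(f) for f in freqs]
    return [(d + f) / 2 for d, f in zip(zscores(dist), zscores(logf))]

def cusum_change_point(xs):
    """Estimate a single mean shift: accumulate deviations from the
    overall mean and return the index of the first period after the
    point where the cumulative sum is farthest from zero."""
    mean = sum(xs) / len(xs)
    s, best_idx, best_abs = 0.0, 0, -1.0
    for i, x in enumerate(xs):
        s += x - mean
        if abs(s) > best_abs:
            best_abs, best_idx = abs(s), i
    return best_idx + 1

# Toy data: one word over six periods, with its (pre-aligned) vector
# and relative frequency both changing between period 2 and period 3.
vectors = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
freqs = [1e-6] * 3 + [5e-6] * 3

series = combined_series(vectors, freqs)
shift_at = cusum_change_point(series)
print("estimated first period after the shift:", shift_at)
```

A real pipeline would additionally need per-period embedding training, an alignment step, and a significance test (for example a bootstrap over the series) before reporting a shift; none of these is shown here.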

Notes

  1. https://dbis.ipd.kit.edu/2601.php.

  2. https://archive.org/details/twitterstream.

  3. https://code.google.com/archive/p/word2vec/.

  4. The most recent dump is always available at https://dumps.wikimedia.org/.

  5. https://en.wikipedia.org/wiki/Help:Wiki_markup.

  6. http://www.etymonline.com/.

Author information

Corresponding author

Correspondence to Adrian Englhardt.

About this article

Cite this article

Englhardt, A., Willkomm, J., Schäler, M. et al. Improving semantic change analysis by combining word embeddings and word frequencies. Int J Digit Libr 21, 247–264 (2020). https://doi.org/10.1007/s00799-019-00271-6

  • DOI: https://doi.org/10.1007/s00799-019-00271-6
