Abstract
Language is constantly evolving. As part of diachronic linguistics, semantic change analysis examines how the meanings of words evolve over time. Such semantic awareness is important for retrieving content from digital libraries. Recent research on semantic change analysis that relies on word embeddings has yielded significant improvements over previous work. However, one recent observation has been somewhat neglected so far: the rate of semantic shift correlates negatively with word-usage frequency. In this article, we therefore propose SCAF, Semantic Change Analysis with Frequency. It abstracts from the concrete embeddings and includes word frequencies as an orthogonal feature. SCAF allows different combinations of embedding type, optimization algorithm and alignment method. Additionally, we leverage existing approaches for time series analysis by using change detection methods to identify semantic shifts. In an evaluation with a realistic setup, SCAF achieves better detection rates than prior approaches: 95% instead of 51%. On the Google Books Ngram data set, our approach detects both known and previously unknown shifts for popular words.
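The general idea of treating embedding drift and word frequency as two time series and running change detection on each can be illustrated with a minimal sketch. It is not the SCAF implementation: the toy vectors and frequencies are invented, the vectors are assumed to be already aligned across time periods, and the CUSUM-with-bootstrap detector is one simple change-point technique chosen for illustration. All function and variable names are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cusum_change_point(series):
    """Estimate a single mean shift with cumulative sums.

    Returns (index, magnitude): index is the position of the first
    value after the estimated shift, magnitude is the CUSUM range.
    """
    mean = sum(series) / len(series)
    cusum = [0.0]
    for x in series:
        cusum.append(cusum[-1] + (x - mean))
    magnitude = max(cusum) - min(cusum)
    index = max(range(len(cusum)), key=lambda i: abs(cusum[i]))
    return index, magnitude


def change_confidence(series, n_boot=1000, seed=0):
    """Fraction of shuffled series whose CUSUM range is below the observed one."""
    rng = random.Random(seed)
    observed = cusum_change_point(series)[1]
    shuffled = list(series)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)
        if cusum_change_point(shuffled)[1] < observed:
            smaller += 1
    return smaller / n_boot


# Invented toy data for one word over eight time periods (assumption):
# 2-D vectors assumed to be aligned, plus relative usage frequencies.
vectors = [
    (1.00, 0.00), (0.99, 0.05), (0.98, 0.10), (0.99, 0.02),
    (0.10, 0.98), (0.05, 0.99), (0.08, 0.97), (0.02, 1.00),
]
frequencies = [2.0e-6, 2.1e-6, 1.9e-6, 2.0e-6, 8.0e-6, 9.0e-6, 8.5e-6, 9.2e-6]

# Feature 1: distance of each period's vector to the first period's vector.
distances = [cosine_distance(vectors[0], v) for v in vectors]
# Feature 2: log frequency, kept as a separate (orthogonal) series.
log_freqs = [math.log(f) for f in frequencies]

for name, series in (("embedding distance", distances), ("log frequency", log_freqs)):
    index, _ = cusum_change_point(series)
    print(name, "-> shift before period", index,
          "confidence", change_confidence(series))
```

Keeping the two series separate, rather than merging them into one score, mirrors the description of frequency as an orthogonal feature. How the two signals are weighed against each other is left open here.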
Cite this article
Englhardt, A., Willkomm, J., Schäler, M. et al. Improving semantic change analysis by combining word embeddings and word frequencies. Int J Digit Libr 21, 247–264 (2020). https://doi.org/10.1007/s00799-019-00271-6
DOI: https://doi.org/10.1007/s00799-019-00271-6