Improving semantic change analysis by combining word embeddings and word frequencies

Englhardt, Adrian; Willkomm, Jens; Schäler, Martin; Böhm, Klemens

doi:10.1007/s00799-019-00271-6

Published: 20 May 2019

International Journal on Digital Libraries, Volume 21, pages 247–264 (2020)

Adrian Englhardt¹ (ORCID: orcid.org/0000-0002-0388-1785), Jens Willkomm¹, Martin Schäler¹ & Klemens Böhm¹

Abstract

Language is constantly evolving. As part of diachronic linguistics, semantic change analysis examines how the meanings of words evolve over time. Such semantic awareness is important for retrieving content from digital libraries. Recent research on semantic change analysis relying on word embeddings has yielded significant improvements over previous work. However, a recent observation that has so far been somewhat neglected is that the rate of semantic shift correlates negatively with word-usage frequency. In this article, we therefore propose SCAF, Semantic Change Analysis with Frequency. It abstracts from the concrete embeddings and includes word frequencies as an orthogonal feature. SCAF allows using different combinations of embedding type, optimization algorithm and alignment method. Additionally, we leverage existing approaches for time series analysis by using change detection methods to identify semantic shifts. In an evaluation with a realistic setup, SCAF achieves a better detection rate than prior approaches: 95% instead of 51%. On the Google Books Ngram data set, our approach detects both known and as yet unknown shifts for popular words.
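As an illustration only, the following Python sketch shows one way the ingredients named in the abstract could fit together: a per-word time series derived from embeddings, a second time series derived from word frequency, and a change detection step on top. It is not the SCAF implementation. The function names (`cosine_distance`, `zscore`, `cusum_change_point`, `detect_shift`), the use of cosine distance to the first time slice, the averaging of z-scored features, the CUSUM-style detector, and the toy data are all assumptions made for this example. The sketch also assumes that the embeddings of the different time slices have already been aligned into a common space.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def zscore(series):
    """Standardize a series to zero mean and unit (population) variance."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std for x in series]


def cusum_change_point(series):
    """Estimate a single mean shift with a cumulative sum of deviations.

    Returns the index of the first element after the estimated change,
    i.e. the position right after the largest absolute CUSUM value.
    """
    mean = sum(series) / len(series)
    cusum, total = [], 0.0
    for x in series:
        total += x - mean
        cusum.append(total)
    peak = max(range(len(cusum)), key=lambda i: abs(cusum[i]))
    return peak + 1


def detect_shift(vectors, frequencies):
    """Hypothetical combination of an embedding feature and a frequency feature.

    vectors:     one (already aligned) embedding of the word per time slice
    frequencies: relative frequency of the word per time slice
    """
    # Embedding feature: how far the word has moved from its first slice.
    displacement = [cosine_distance(vectors[0], v) for v in vectors]
    # Frequency feature: log-scaled usage frequency.
    log_freq = [math.log(f) for f in frequencies]
    # Assumption: give both features equal weight after standardization.
    combined = [
        (d + f) / 2.0 for d, f in zip(zscore(displacement), zscore(log_freq))
    ]
    return cusum_change_point(combined)


if __name__ == "__main__":
    # Toy data: the word changes both its context and its frequency
    # between the third and the fourth time slice.
    toy_vectors = [
        (1.0, 0.0), (0.99, 0.1), (0.98, 0.05),
        (0.1, 1.0), (0.05, 0.99), (0.0, 1.0),
    ]
    toy_frequencies = [10.0, 11.0, 10.0, 30.0, 31.0, 29.0]
    print(detect_shift(toy_vectors, toy_frequencies))
```

On the toy data, the detector places the shift at index 3, the first slice after the change. A real setup would need a significance test for the detected change and a principled way to weight the two features; both are left out here.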

Notes

https://dbis.ipd.kit.edu/2601.php.
https://archive.org/details/twitterstream.
https://code.google.com/archive/p/word2vec/.
The most recent dump is always available at https://dumps.wikimedia.org/.
https://en.wikipedia.org/wiki/Help:Wiki_markup.
http://www.etymonline.com/.

Author information

Authors and Affiliations

Karlsruhe Institute of Technology, Karlsruhe, Germany
Adrian Englhardt, Jens Willkomm, Martin Schäler & Klemens Böhm

Corresponding author

Correspondence to Adrian Englhardt.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Englhardt, A., Willkomm, J., Schäler, M. et al. Improving semantic change analysis by combining word embeddings and word frequencies. Int J Digit Libr 21, 247–264 (2020). https://doi.org/10.1007/s00799-019-00271-6

Received: 04 May 2018
Revised: 22 March 2019
Accepted: 07 May 2019
Published: 20 May 2019
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00799-019-00271-6
