Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution

Hoenen, Armin

doi:10.1007/978-3-319-59569-6_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10260)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

In this paper, word embeddings are used for the task of supervised authorship attribution. While previous methods have looked at, for instance, character n-grams, syntax and, most importantly, token frequencies, the method presented here focusses on the implications of semantic relationships between words. In this way, rather than authors' word choices, the semantic networks of entities as perceived by authors may come into closer focus. We find that these networks can be used reliably for authorship attribution. The method is generally applicable as a tool for comparing different texts and/or authors through word embeddings that have been trained separately. This is achieved not by comparing vectors directly, but by comparing the sets of most similar words for words shared between texts and then aggregating and averaging the similarities per text pair. On two literary corpora (German and English), we compute embeddings for each text separately. The resulting similarities are then used to identify the author of an unknown text.
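
The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how neighbour sets are compared or how many neighbours are used, so the Jaccard overlap, the neighbourhood size `k`, and all function names below (`nearest_neighbours`, `text_similarity`, `attribute`) are assumptions made for illustration. Embeddings are represented as plain dictionaries mapping words to vectors; in practice they would come from a model trained on each text separately. Comparing neighbour sets rather than raw vectors sidesteps the fact that separately trained vector spaces are generally not aligned with each other.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(emb, word, k):
    """Set of the k words most similar to `word` within one embedding space."""
    scored = [(cosine(emb[word], vec), other)
              for other, vec in emb.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}


def text_similarity(emb_a, emb_b, k=2):
    """Average neighbour-set overlap over the words shared by two texts.

    Assumption: overlap is measured with the Jaccard index; the abstract
    does not specify the set comparison that is actually used.
    """
    shared = set(emb_a) & set(emb_b)
    if not shared:
        return 0.0
    overlaps = []
    for word in shared:
        na = nearest_neighbours(emb_a, word, k)
        nb = nearest_neighbours(emb_b, word, k)
        union = na | nb
        overlaps.append(len(na & nb) / len(union) if union else 0.0)
    return sum(overlaps) / len(overlaps)


def attribute(unknown_emb, known, k=2):
    """Return the candidate author whose text is most similar to the unknown one.

    `known` maps an author name to the embeddings of a text by that author.
    """
    return max(known, key=lambda a: text_similarity(unknown_emb, known[a], k))


# Toy, hand-made vectors (hypothetical data, not from the paper's corpora).
text_x = {"sea": [1.0, 0.0], "ship": [0.9, 0.1],
          "tree": [0.0, 1.0], "leaf": [0.1, 0.9]}
text_y = {"sea": [1.0, 0.0], "ship": [0.0, 1.0],
          "tree": [0.9, 0.1], "leaf": [0.1, 0.9]}

print(text_similarity(text_x, text_y, k=1))
print(attribute(text_x, {"author_a": dict(text_x), "author_b": text_y}))
```

In the toy data, `text_x` groups "sea" with "ship", whereas `text_y` groups "sea" with "tree", so their neighbour sets differ even though the vocabulary is identical. This is the kind of difference in semantic association that the method is meant to pick up.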

Notes

1. https://sites.google.com/site/computationalstylistics/.
2. https://code.google.com/archive/p/word2vec/, default settings; corpus lowercased and punctuation marks deleted.
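
The preprocessing mentioned in note 2 (lowercasing and deleting punctuation marks) could look roughly like the sketch below. The function name `preprocess` and the definition of punctuation as any character that is neither a word character nor whitespace are assumptions; the note does not state which characters were removed or how the text was tokenized.

```python
import re


def preprocess(text):
    """Lowercase a text, delete punctuation, and split it on whitespace.

    Assumption: "punctuation" means every character that is neither a
    Unicode word character nor whitespace.
    """
    lowered = text.lower()
    without_punctuation = re.sub(r"[^\w\s]", "", lowered)
    return without_punctuation.split()


print(preprocess("Wer reitet so spät durch Nacht und Wind?"))
```

The resulting token list is the kind of input an embedding trainer such as word2vec expects, one text at a time.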

Author information

Authors and Affiliations

Text Technology Lab/CEDIFOR, Goethe University Frankfurt, Frankfurt, Germany
Armin Hoenen

Corresponding author

Correspondence to Armin Hoenen.

Editor information

Editors and Affiliations

Erasmus University Rotterdam, Rotterdam, The Netherlands
Flavius Frasincar
University of Liège, Liège, Belgium
Ashwin Ittoo
Japan Advanced Institute of Science and Technology, Nomi, Japan
Le Minh Nguyen
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais

About this paper

Cite this paper

Hoenen, A. (2017). Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_33

DOI: https://doi.org/10.1007/978-3-319-59569-6_33
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science, Computer Science (R0)
