Abstract
In this paper, word embeddings are used for supervised authorship attribution. Previous methods have looked at, for instance, character n-grams, syntax and, most importantly, token frequencies; the method presented here focuses instead on semantic relationships between words. Rather than authors' word choices, this brings the semantic networks of entities as perceived by authors into focus. We find that these can be used reliably for authorship attribution. The method is generally applicable as a tool for comparing texts and/or authors through word embeddings that have been trained separately. This is achieved by not comparing vectors directly, but by comparing, for each word shared between two texts, the sets of its most similar words, and then aggregating and averaging these similarities per text pair. On two literary corpora (German and English), we compute embeddings for each text separately. The resulting similarities are then used to detect the author of an unknown text.
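The comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the Jaccard overlap as the set similarity, the neighbourhood size `k`, the function names, and the toy two-dimensional vectors are all assumptions made for illustration. The abstract only states that sets of most similar words are compared for shared words and that the similarities are averaged per text pair.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_words(word, emb, k):
    """Set of the k words closest to `word` within one embedding space."""
    others = [w for w in emb if w != word]
    others.sort(key=lambda w: cosine(emb[word], emb[w]), reverse=True)
    return set(others[:k])

def text_similarity(emb_a, emb_b, k=10):
    """Average neighbour-set overlap over all words shared by two texts.

    Jaccard overlap is an assumed choice of set similarity.
    """
    shared = set(emb_a) & set(emb_b)
    scores = []
    for w in shared:
        na, nb = nearest_words(w, emb_a, k), nearest_words(w, emb_b, k)
        union = na | nb
        scores.append(len(na & nb) / len(union) if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def attribute(unknown_emb, known, k=10):
    """Return the author whose known texts are on average most similar.

    `known` maps an author name to a list of per-text embeddings.
    """
    def mean_sim(author):
        sims = [text_similarity(unknown_emb, e, k) for e in known[author]]
        return sum(sims) / len(sims)
    return max(known, key=mean_sim)

# Hypothetical toy embeddings, one dict per text (word -> vector).
# emb_b has the same neighbourhood structure as emb_a but swapped axes,
# as can happen when embeddings are trained separately.
emb_a = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "sea": [0.0, 1.0], "ship": [0.1, 0.9]}
emb_b = {"king": [0.0, 1.0], "queen": [0.1, 0.9], "sea": [1.0, 0.0], "ship": [0.9, 0.1]}
emb_c = {"king": [1.0, 0.0], "sea": [0.9, 0.1], "queen": [0.0, 1.0], "ship": [0.1, 0.9]}

predicted = attribute(emb_b, {"A": [emb_a], "B": [emb_c]}, k=1)
```

In this toy setup, the vectors of `emb_a` and `emb_b` differ coordinate by coordinate, so a direct vector comparison would be misleading, while the neighbour sets agree. This is the motivation the abstract gives for comparing sets of most similar words rather than the vectors themselves.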
Notes
- 1.
- 2. https://code.google.com/archive/p/word2vec/, default settings, corpus lowercased and punctuation marks deleted.
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Hoenen, A. (2017). Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution. In: Frasincar, F., Ittoo, A., Nguyen, L., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2017. Lecture Notes in Computer Science, vol 10260. Springer, Cham. https://doi.org/10.1007/978-3-319-59569-6_33
DOI: https://doi.org/10.1007/978-3-319-59569-6_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59568-9
Online ISBN: 978-3-319-59569-6
eBook Packages: Computer Science, Computer Science (R0)