Abstract
This paper explores the problem of identifying an author from text passages of varying length, ranging from 100 to 2,000 words. The study builds on previous research on authorship attribution of Polish literary texts and finds that TF-IDF combined with a multilayer perceptron outperforms other techniques. It also investigates whether the issue with BERT in authorship attribution can be mitigated by removing named entities from the input data and by replacing posterior probabilities with logits in sequence classification. The results demonstrate that machine learning methods are capable of almost perfect authorship attribution on short texts, and that the proposed MaxLogit approach significantly improves results. However, except for short passages of up to 400 words, TF-IDF gives better results than BERT. The study concludes with a discussion of the results and suggestions for future research.
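The abstract does not spell out how logits replace posterior probabilities, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes a long passage is split into chunks, that a classifier returns one logit vector per chunk, and that chunk outputs are combined by a simple mean; the function names and the logit values are hypothetical. It shows that aggregating softmax probabilities and aggregating raw logits can select different authors, because softmax saturates and hides how strong the evidence in a chunk is.

```python
import math

def softmax(logits):
    """Convert one logit vector into posterior probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_columns(rows):
    """Average a list of equal-length vectors, column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

def predict_from_probabilities(chunk_logits):
    """Assumed baseline: average per-chunk softmax outputs, then pick the top author."""
    return argmax(mean_columns([softmax(row) for row in chunk_logits]))

def predict_from_logits(chunk_logits):
    """Assumed logit-based variant: average raw logits, then pick the top author."""
    return argmax(mean_columns(chunk_logits))

# Hypothetical logits for three chunks of one passage and two candidate authors.
chunk_logits = [
    [4.0, 0.0],   # moderate evidence for author 0
    [4.0, 0.0],   # moderate evidence for author 0
    [0.0, 12.0],  # very strong evidence for author 1
]

by_probability = predict_from_probabilities(chunk_logits)
by_logit = predict_from_logits(chunk_logits)
print(by_probability, by_logit)
```

In this toy input the probability average favours author 0, since two chunks each contribute a probability near 1, while the logit average favours author 1, since the third chunk's margin is much larger. Which behaviour is preferable on real data is an empirical question that the sketch does not answer.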
Acknowledgements
Financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN - Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Walkowiak, T. (2023). Authorship Attribution of Literary Texts Using Named Entity Masking and MaxLogit-Based Sequence Classification for Varying Text Lengths. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science, vol 14125. Springer, Cham. https://doi.org/10.1007/978-3-031-42505-9_26
DOI: https://doi.org/10.1007/978-3-031-42505-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42504-2
Online ISBN: 978-3-031-42505-9
eBook Packages: Computer Science, Computer Science (R0)