Authorship Attribution of Literary Texts Using Named Entity Masking and MaxLogit-Based Sequence Classification for Varying Text Lengths

Walkowiak, Tomasz

doi:10.1007/978-3-031-42505-9_26

Tomasz Walkowiak (ORCID: orcid.org/0000-0002-7749-4251)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14125)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

This paper explores the problem of identifying an author from text passages of varying length, ranging from 100 to 2,000 words. The study builds on previous research on authorship attribution of Polish literary texts, which found that TF-IDF features with a multilayer perceptron outperform other techniques. It investigates whether the problems observed with BERT in authorship attribution can be mitigated by removing named entities from the input data and by replacing posterior probabilities with logits in sequence classification. The results demonstrate that machine learning methods are capable of almost perfect authorship attribution on short texts, and that the proposed MaxLogit approach significantly improves results. However, except for short passages of up to 400 words, TF-IDF gives better results than BERT. The study concludes with a discussion of the results and suggestions for future research.
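The difference between aggregating posterior probabilities and aggregating logits can be illustrated with a minimal sketch. The abstract does not spell out the aggregation rule, so the code below rests on assumptions: a passage is split into chunks, each chunk yields one logit vector over authors, and the chunk outputs are combined by a simple mean. The function names and the logit values are hypothetical and chosen only for illustration; this is not the chapter's MaxLogit procedure.

```python
import math


def softmax(logits):
    """Convert one logit vector into posterior probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def predict_mean_probability(chunk_logits):
    """Average per-chunk softmax probabilities, then pick the top author."""
    n_authors = len(chunk_logits[0])
    probs = [softmax(row) for row in chunk_logits]
    mean = [sum(p[a] for p in probs) / len(probs) for a in range(n_authors)]
    return argmax(mean)


def predict_mean_logit(chunk_logits):
    """Average raw per-chunk logits (assumed rule), then pick the top author."""
    n_authors = len(chunk_logits[0])
    mean = [
        sum(row[a] for row in chunk_logits) / len(chunk_logits)
        for a in range(n_authors)
    ]
    return argmax(mean)


# Hypothetical logits: 3 chunks of one passage, 3 candidate authors.
chunk_logits = [
    [9.0, 1.0, 0.0],  # very strong evidence for author 0
    [0.0, 3.0, 0.0],  # moderate evidence for author 1
    [0.0, 3.0, 0.0],  # moderate evidence for author 1
]

by_probability = predict_mean_probability(chunk_logits)
by_logit = predict_mean_logit(chunk_logits)
print(by_probability, by_logit)
```

Softmax saturates near 1, so the strength of the first chunk's evidence is largely lost when probabilities are averaged, while raw logits retain it. On these made-up values the two rules therefore select different authors.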

Notes

1.
https://huggingface.co/dkleczek/bert-base-polish-uncased-v1.

Acknowledgements

Financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN - Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Authors and Affiliations

Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
Tomasz Walkowiak

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Krakow, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Walkowiak, T. (2023). Authorship Attribution of Literary Texts Using Named Entity Masking and MaxLogit-Based Sequence Classification for Varying Text Lengths. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science, vol 14125. Springer, Cham. https://doi.org/10.1007/978-3-031-42505-9_26

DOI: https://doi.org/10.1007/978-3-031-42505-9_26
Published: 14 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42504-2
Online ISBN: 978-3-031-42505-9
eBook Packages: Computer Science, Computer Science (R0)
