Authorship Attribution of Literary Texts Using Named Entity Masking and MaxLogit-Based Sequence Classification for Varying Text Lengths

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14125)

Abstract

This paper addresses the problem of identifying an author from text passages of varying length, ranging from 100 to 2,000 words. It builds on previous research on authorship attribution of Polish literary texts, which found that TF-IDF features combined with a multilayer perceptron outperform other techniques. The study investigates whether the shortcomings of BERT in authorship attribution can be mitigated by removing named entities from the input data and by replacing posterior probabilities with logits in sequence classification. The results demonstrate that machine learning methods are capable of almost perfect authorship attribution on short texts, and that the proposed MaxLogit approach significantly improves the results. However, except for short passages of up to 400 words, TF-IDF yields better results than BERT. The study concludes with a discussion of the results and suggestions for future research.
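To illustrate why aggregating logits can behave differently from aggregating posterior probabilities, the following is a minimal sketch and not the paper's implementation. It assumes that a passage is split into several sequences, that a classifier returns one logit vector per sequence, and that the per-sequence outputs are combined by a simple mean; the abstract does not state the aggregation rule, so the mean, the function names, and the example numbers are all hypothetical.

```python
import math


def softmax(logits):
    """Convert one logit vector into posterior probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_columns(rows):
    """Average a list of equal-length vectors element by element."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def argmax(values):
    """Index of the largest value (the predicted author id)."""
    return max(range(len(values)), key=values.__getitem__)


def predict_by_mean_probability(chunk_logits):
    """Assumed baseline: average the softmax outputs of all sequences."""
    return argmax(mean_columns([softmax(row) for row in chunk_logits]))


def predict_by_mean_logit(chunk_logits):
    """Assumed logit-based variant: average raw logits, skip the softmax."""
    return argmax(mean_columns(chunk_logits))


# Hypothetical logits for three sequences of one passage and three authors.
# The first sequence is very confident about author 0; the other two
# lean moderately towards author 1.
example_chunk_logits = [
    [10.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 4.0, 0.0],
]

print(predict_by_mean_probability(example_chunk_logits))
print(predict_by_mean_logit(example_chunk_logits))
```

In this toy input the two rules disagree: softmax caps every sequence's contribution at 1, so two moderately confident sequences outvote one very confident sequence, whereas raw logits preserve the strength of the evidence. Whether this effect explains the improvement reported in the paper is not something the abstract confirms.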

Notes

  1. https://huggingface.co/dkleczek/bert-base-polish-uncased-v1

Acknowledgements

Financed by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme, CLARIN - Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Corresponding author

Correspondence to Tomasz Walkowiak.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Walkowiak, T. (2023). Authorship Attribution of Literary Texts Using Named Entity Masking and MaxLogit-Based Sequence Classification for Varying Text Lengths. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2023. Lecture Notes in Computer Science, vol 14125. Springer, Cham. https://doi.org/10.1007/978-3-031-42505-9_26

  • DOI: https://doi.org/10.1007/978-3-031-42505-9_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42504-2

  • Online ISBN: 978-3-031-42505-9

  • eBook Packages: Computer Science, Computer Science (R0)
