Skip to main content

Author Attribution of Literary Texts in Polish by the Sequence Averaging

  • Conference paper
  • First Online:
Artificial Intelligence and Soft Computing (ICAISC 2022)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 13589))

Included in the following conference series:

Abstract

The paper concerns the authorship recognition in a collection of Polish literary texts from the late 19th and early 20th centuries, consisting of 99 novels from 33 authors. The authors divide the books into smaller parts and analyze the classification based on a book part. To mimic the real task of testing an unknown book, the data set has been divided in such a way that parts of the same book do not appear simultaneously in the training and test set. The authors compare the approaches by working with raw texts, and, to avoid the semantic features of the text, they represent texts in the form of a sequence of grammatical class bigrams. In the case of raw text analysis, classical TF-IDF, supervised fastText, and contemporary transformer-based BERT are analyzed. In the case of grammatical classes, only TF-IDF and fastText are applied. In addition, the authors propose a sequence averaging method that works by dividing the text into smaller parts, classifying each part separately, and making the final classification based on averaging results from each part of the text. The study suggests that the TF-IDF on the raw text outperforms other methods and the sequence averaging improves the classification results for most of the analyzed schemas. Surprisingly, the BERT based method is the worst. This phenomenon is carefully analyzed and explained.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 64.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 84.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://huggingface.co/allegro/herbert-base-cased.

References

  1. Alexandre, L.A., Campilho, A.C., Kamel, M.: On combining classifiers using sum and product rules. Pattern Recogn. Lett. 22(12), 1283–1289 (2001). https://doi.org/10.1016/S0167-8655(01)00073-3

    Article  MATH  Google Scholar 

  2. Barlas, G., Stamatatos, E.: Cross-domain authorship attribution using pre-trained language models. In: Maglogiannis, I., Iliadis, L., Pimenidis, E. (eds.) AIAI 2020. IAICT, vol. 583, pp. 255–266. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49161-1_22

    Chapter  Google Scholar 

  3. Calix, K., Connors, M., Levy, D., Manzar, H., McCabe, G., Westcott, S.: Stylometry for e-mail author identification and authentication (2008)

    Google Scholar 

  4. Can, M.: Authorship attribution using principal component analysis and competitive neural networks. Math. Comput. Appl. 19(1), 21–36 (2014)

    MATH  Google Scholar 

  5. Craig, H., Kinney, A.: Shakespeare, Computers, and the Mystery of Authorship. Cambridge University Press, Cambridge (2009)

    Book  Google Scholar 

  6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  7. Eder, M., Piasecki, M., Walkowiak, T.: Open stylometric system based on multilevel text analysis. Cogn. Stud. | Études cognitives 17 (2017). https://doi.org/10.11649/cs.1430

  8. Eder, M., Rybicki, J.: Late 19th- and early 20th-century polish novels (2015). http://hdl.handle.net/11321/57. CLARIN-PL digital repository

  9. Fabien, M., Villatoro-Tello, E., Motlicek, P., Parida, S.: BertAA: BERT fine-tuning for authorship attribution. In: Proceedings of the 17th International Conference on Natural Language Processing (ICON), pp. 127–137. NLP Association of India (NLPAI), Indian Institute of Technology Patna, Patna, India (2020). https://aclanthology.org/2020.icon-main.16

  10. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), pp. 3483–3487 (2018)

    Google Scholar 

  11. Grivas, A., Krithara, A., Giannakopoulos, G.: Author profiling using stylometric and structural feature groupings. In: Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, 8–11 September 2015. CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org (2015). http://ceur-ws.org/Vol-1391/68-CR.pdf

  12. Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)

    Article  Google Scholar 

  13. Hastie, T.J., Tibshirani, R.J., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7

    Book  MATH  Google Scholar 

  14. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics (2017). http://aclweb.org/anthology/E17-2068

  15. Juola, P.: Authorship attribution. Found. Trends Inf. Retr. 1(3), 233–334 (2006). https://doi.org/10.1561/1500000005

    Article  Google Scholar 

  16. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)

    Google Scholar 

  17. Piasecki, M., Walkowiak, T., Eder, M.: Open stylometric system WebSty: Integrated language processing, analysis and visualisation. Comput. Methods Sci. Technol. 24(1), 43–58 (2018)

    Article  Google Scholar 

  18. Przepiórkowski, A., Bańko, M., Górski, R.L., Lewandowska-Tomaszczyk, B. (eds.): Narodowy Korpus Jezyka Polskiego [Eng.: National Corpus of Polish]. Wydawnictwo Naukowe PWN (2012). http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf

  19. Radziszewski, A.: A tiered CRF tagger for Polish. In: Bembenik, R., Skonieczny, L., Rybinski, H., Kryszkiewicz, M., Niezgodka, M. (eds.) Intelligent Tools for Building a Scientific Information Platform. Studies in Computational Intelligence, vol. 467, pp. 215–230. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-35647-6_16

    Chapter  Google Scholar 

  20. Rybak, P., Mroczkowski, R., Tracz, J., Gawlik, I.: KLEJ: comprehensive benchmark for polish language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1191–1201. Association for Computational Linguistics (2020). https://www.aclweb.org/anthology/2020.acl-main.111

  21. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)

    Article  Google Scholar 

  22. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  23. Walkowiak, T.: Subject classification of texts in polish - from TF-IDF to transformers. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds.) DepCoS-RELCOMEX 2021. AISC, vol. 1389, pp. 457–465. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-76773-0_44

    Chapter  Google Scholar 

  24. Walkowiak, T., Piasecki, M.: Stylometry analysis of literary texts in polish. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) ICAISC 2018. LNCS (LNAI), vol. 10842, pp. 777–787. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91262-2_68

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tomasz Walkowiak .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Walkowiak, T. (2023). Author Attribution of Literary Texts in Polish by the Sequence Averaging. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2022. Lecture Notes in Computer Science(), vol 13589. Springer, Cham. https://doi.org/10.1007/978-3-031-23480-4_31

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-23480-4_31

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23479-8

  • Online ISBN: 978-3-031-23480-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics