Abstract
Authorship attribution is the process of identifying the author of a particular document. This task has traditionally been performed by human experts. However, with advances in natural language processing tools and machine learning techniques, it is now also performed by computer systems. Its applications range from the detection of plagiarism and copyright infringement to the resolution of forensic problems. Many works address this subject for texts in English, but few consider texts in Portuguese. This paper therefore studies authorship attribution for texts of Brazilian literature. We carried out our experiments using the Naïve Bayes and Random Forest classifiers, and for feature extraction we considered Term Frequency-Inverse Document Frequency (TF-IDF) and Part-of-Speech (POS) techniques. The results showed that Random Forest, using as input the textual features extracted by Part-of-Speech tagging, achieved the best cross-validation accuracy, although not the best runtime.
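To make one of the feature/classifier pairings named in the abstract concrete, the following is a minimal pure-Python sketch of TF-IDF weighting feeding a multinomial Naïve Bayes classifier. It is an illustration under stated assumptions, not the authors' implementation: the toy corpus, the author labels, the whitespace tokenizer, the smoothed IDF formula, and all function names are hypothetical choices made for this example. The Part-of-Speech features and the Random Forest classifier are not shown.

```python
import math
from collections import Counter, defaultdict

# Toy (author, text) pairs: hypothetical placeholders, not the paper's corpus.
TRAIN = [
    ("author_a", "o mar e a noite o mar calmo"),
    ("author_a", "a noite cai sobre o mar"),
    ("author_b", "a cidade e o ruido da rua"),
    ("author_b", "o ruido da cidade acorda a rua"),
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def fit_idf(token_docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Map tokens to {term: count * idf}; terms unseen in training are dropped."""
    return {term: count * idf[term]
            for term, count in Counter(tokens).items() if term in idf}

def train_naive_bayes(samples, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    class_docs = Counter()
    class_weights = defaultdict(lambda: defaultdict(float))
    for author, text in samples:
        class_docs[author] += 1
        for term, weight in tfidf(tokenize(text), idf).items():
            class_weights[author][term] += weight
    model = {}
    for author, n_author_docs in class_docs.items():
        total = sum(class_weights[author].values()) + alpha * len(idf)
        log_prior = math.log(n_author_docs / len(samples))
        log_lik = {term: math.log((class_weights[author].get(term, 0.0) + alpha) / total)
                   for term in idf}
        model[author] = (log_prior, log_lik)
    return model

def predict(model, idf, text):
    """Return the author with the highest log-posterior score for the text."""
    features = tfidf(tokenize(text), idf)

    def score(author):
        log_prior, log_lik = model[author]
        return log_prior + sum(w * log_lik[t] for t, w in features.items())

    return max(model, key=score)

idf = fit_idf([tokenize(text) for _, text in TRAIN])
model = train_naive_bayes(TRAIN, idf)
print(predict(model, idf, "o mar a noite"))
```

In practice a library implementation with a proper Portuguese tokenizer and k-fold cross-validation would replace this sketch; it is kept dependency-free here only so the arithmetic behind the weighting and the classifier is visible.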
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
da Rocha Bartolomei, B., Drummond, I.N. (2020). Authorship Attribution of Brazilian Literary Texts Through Machine Learning Techniques. In: Cerri, R., Prati, R.C. (eds) Intelligent Systems. BRACIS 2020. Lecture Notes in Computer Science, vol 12319. Springer, Cham. https://doi.org/10.1007/978-3-030-61377-8_27
DOI: https://doi.org/10.1007/978-3-030-61377-8_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61376-1
Online ISBN: 978-3-030-61377-8
eBook Packages: Computer Science, Computer Science (R0)