An Ensemble Approach to Cross-Domain Authorship Attribution

Custódio, José Eleandro; Paraboni, Ivandré

doi:10.1007/978-3-030-28577-7_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11696)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

This paper presents an ensemble approach to cross-domain authorship attribution that combines the predictions of three independent classifiers, based respectively on standard character n-grams, character n-grams with non-diacritic distortion, and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression, and selects the prediction of highest probability among the three models as the output for the task. The approach is compared against a number of baseline systems, and we report results on both the PAN-CLEF 2018 test data and a new corpus of song lyrics in English and Portuguese.
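
The selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the n-gram length ranges, and the reading of "non-diacritic distortion" as masking ASCII letters and digits with `*` (so that diacritics, punctuation and spacing survive) are all assumptions made for illustration. The per-model author probabilities are taken as given; in the paper they would come from multinomial logistic regression models trained on each feature set.

```python
import re
import unicodedata
from collections import Counter


def distort(text):
    """Mask ASCII letters and digits with '*'.

    Assumption: this is one plausible reading of 'non-diacritic distortion';
    accented characters, punctuation and whitespace are left untouched.
    """
    # Compose characters first so that 'ã' is one code point, not 'a' + tilde.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"[A-Za-z0-9]", "*", text)


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of every length from n_min to n_max.

    The default range is hypothetical, chosen only for the example.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count lowercased, whitespace-tokenised word n-grams of variable length."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def select_most_confident(model_probs):
    """Return (author, model, probability) for the single highest probability.

    model_probs maps a model name to a dict of author -> probability,
    e.g. the class probabilities produced by each of the three classifiers.
    """
    prob, model, author = max(
        (
            (prob, model, author)
            for model, probs in model_probs.items()
            for author, prob in probs.items()
        ),
        key=lambda item: item[0],
    )
    return author, model, prob


# Hypothetical probabilities for two candidate authors from the three models.
example_probs = {
    "char": {"author_a": 0.60, "author_b": 0.40},
    "distorted_char": {"author_a": 0.30, "author_b": 0.70},
    "word": {"author_a": 0.55, "author_b": 0.45},
}
chosen_author, chosen_model, chosen_prob = select_most_confident(example_probs)
```

Here the distorted character model is the most confident of the three, so its prediction is the one returned. In a fuller pipeline, the n-gram counts would first be weighted and passed to a classifier; the sketch only shows how the three feature views differ and how the final prediction is picked.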

Acknowledgements

The second author received financial support from FAPESP (grant no. 2016/14223-0) and from the University of São Paulo. The authors also thank the PAN-CLEF AA shared task organisers and the anonymous reviewers for their valuable input.

Author information

Authors and Affiliations

School of Arts, Sciences and Humanities (EACH), University of São Paulo (USP), São Paulo, Brazil
José Eleandro Custódio & Ivandré Paraboni

Corresponding author

Correspondence to Ivandré Paraboni.

Editor information

Editors and Affiliations

Universita della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
Zurich University of Applied Sciences, Winterthur, Switzerland
Martin Braschler
University of Neuchâtel, Neuchâtel, Switzerland
Jacques Savoy
Technische Universität Wien, Vienna, Austria
Andreas Rauber
HES-SO Valais-Wallis, Sierre, Switzerland
Henning Müller
University of Santiago de Compostela, Santiago de Compostela, Spain
David E. Losada
Swiss Alliance for Data-Intensive Services, Thun, Switzerland
Gundula Heinatz Bürki
University of Padua, Padua, Italy
Linda Cappellato
University of Padua, Padua, Italy
Nicola Ferro

About this paper

Cite this paper

Custódio, J.E., Paraboni, I. (2019). An Ensemble Approach to Cross-Domain Authorship Attribution. In: Crestani, F., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_17

DOI: https://doi.org/10.1007/978-3-030-28577-7_17
Published: 03 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28576-0
Online ISBN: 978-3-030-28577-7
eBook Packages: Computer Science, Computer Science (R0)
