An Ensemble Approach to Cross-Domain Authorship Attribution

  • Conference paper
Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11696)

Abstract

This paper presents an ensemble approach to cross-domain authorship attribution that combines the predictions of three independent classifiers based on standard character n-grams, character n-grams with non-diacritic distortion, and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression, and selects the highest-probability prediction among the three models as the output for the task. The approach is compared against a number of baseline systems, and we report results on both the PAN-CLEF 2018 test data and a new corpus of song lyrics in English and Portuguese.
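The three feature views and the selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the n-gram length ranges, and the exact masking rule (plain letters replaced by '*', while letters with diacritics, digits, punctuation and whitespace are kept) are assumptions made for illustration. The per-author probabilities, which in the paper come from multinomial logistic regression, are taken here as given inputs.

```python
import unicodedata
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Count character n-grams of every length from n_min to n_max.

    The length range is a hypothetical default, not a value from the paper.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count whitespace-token n-grams of every length from n_min to n_max."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def distort(text):
    """Mask letters that carry no diacritic with '*'.

    Assumed reading of "non-diacritic distortion": letters with diacritics,
    digits, punctuation and whitespace are left untouched.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        has_diacritic = any(unicodedata.combining(c) for c in decomposed)
        out.append("*" if ch.isalpha() and not has_diacritic else ch)
    return "".join(out)


def select_prediction(model_probs):
    """Return (author, model, probability) for the single most confident
    prediction across all models.

    model_probs maps a model name to a dict of author -> probability.
    Ties are resolved in favour of the first entry encountered.
    """
    best = None
    for model, probs in model_probs.items():
        for author, p in probs.items():
            if best is None or p > best[2]:
                best = (author, model, p)
    return best
```

In this sketch, the distorted character model would be built by calling char_ngrams(distort(text)), and select_prediction would receive one probability dictionary per model (for example under the hypothetical keys "char", "dist" and "word") and return the author proposed by the most confident of the three.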

Notes

  1. https://pan.webis.de/clef18/pan18-web/author-identification.html

  2. https://www.letras.mus.br/

  3. https://github.com/6/stopwords-json

Acknowledgements

The second author received financial support from FAPESP grant no. 2016/14223-0 and from the University of São Paulo. The authors also thank the PAN-CLEF AA shared task organisers and the anonymous reviewers for their valuable input.

Author information

Corresponding author

Correspondence to Ivandré Paraboni.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Custódio, J.E., Paraboni, I. (2019). An Ensemble Approach to Cross-Domain Authorship Attribution. In: Crestani, F., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_17

  • DOI: https://doi.org/10.1007/978-3-030-28577-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-28576-0

  • Online ISBN: 978-3-030-28577-7

  • eBook Packages: Computer Science, Computer Science (R0)
