Abstract
This paper examines a machine learning approach to authorship attribution of Polish-language articles. The focus is on how data volume, the number of authors, and thematic homogeneity affect attribution quality. We study the impact of feature selection under several criteria, mainly the chi-square and information gain measures, as well as the effect of combining features of different types. Results are reported for the Rzeczpospolita corpus in terms of the \(F_1\) measure.
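The abstract names two feature selection criteria and one evaluation measure without stating their formulas. The following is a minimal sketch of the standard textbook definitions for a single term and a single author class, assuming a 2x2 document-count contingency table. The function names and the count arguments (`n11`, `n10`, `n01`, `n00`) are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents of the class that contain the term
    n10: documents of other classes that contain the term
    n01: documents of the class that lack the term
    n00: documents of other classes that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(n11, n10, n01, n00):
    """Reduction in class entropy from knowing whether the term is present."""
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    h_class = entropy([n11 + n01, n10 + n00])
    with_term = n11 + n10
    without_term = n01 + n00
    h_conditional = (
        (with_term / n) * entropy([n11, n10])
        + (without_term / n) * entropy([n01, n00])
    )
    return h_class - h_conditional


def f1_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, from raw prediction counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    print(chi_square(10, 0, 0, 10))
    print(information_gain(10, 0, 0, 10))
    print(f1_measure(8, 2, 2))
```

In a selection step of this kind, each candidate feature would be scored with one of the two criteria and only the highest-scoring features kept before training a classifier. How scores are aggregated across many authors (for example, by maximum or by average) is a design choice this sketch leaves open.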
Acknowledgments
This research is supported by AGH - University of Science and Technology (AGH-UST) Grant no. 11.11.230.124 (statutory project).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kuta, M., Puto, B., Kitowski, J. (2016). Authorship Attribution of Polish Newspaper Articles. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9693. Springer, Cham. https://doi.org/10.1007/978-3-319-39384-1_41
DOI: https://doi.org/10.1007/978-3-319-39384-1_41
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39383-4
Online ISBN: 978-3-319-39384-1
eBook Packages: Computer Science, Computer Science (R0)