Abstract
The aim of the paper is to compare stylometric methods in the tasks of authorship, author gender and literary period recognition for texts in the Polish language. Different feature selection and classification methods were analysed. The feature sets include word frequencies (the most common words, the rarest words and all words) and grammatical class frequencies, as well as simple statistics of selected characters, words and sentences. Because Polish is a highly inflected language, the word features are calculated as frequencies of lexemes obtained with a morpho-syntactic tagger for Polish. Nine different classifiers were analysed. The proposed methods were tested on a set of Polish novels, with recognition performed both on whole novels and on chunked texts. The experiments showed that the best results are obtained for features based on all words. For ill-defined problems (those with low recognition accuracy) the random forest classifier gave the best results. In the other cases (tasks with medium or high recognition accuracy) the multilayer perceptron and the linear regression learned by stochastic gradient descent gave the best results. The paper also includes an analysis of the statistical importance of the features used.
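As a rough illustration of the lexeme-frequency features and text chunking mentioned in the abstract, the following minimal sketch may help. It is not the authors' implementation: the function names, the toy input, and the choice to drop a trailing short chunk are all assumptions, and the tokens are taken to be already reduced to lexemes (in the paper this step is done by a morpho-syntactic tagger, which is not reproduced here).

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of chunk_size tokens.

    Assumption: a trailing remainder shorter than chunk_size is dropped.
    """
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]


def most_common_lexemes(documents, n):
    """Return the n most frequent lexemes counted over all documents."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    return [lexeme for lexeme, _ in totals.most_common(n)]


def frequency_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary lexeme in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[lexeme] / total if total else 0.0
            for lexeme in vocabulary]


# Toy, already-lemmatized input (hypothetical data, not from the paper).
docs = [
    "być i w być nie i dom".split(),
    "w i i nie kot być".split(),
]
vocab = most_common_lexemes(docs, 2)
vectors = [frequency_vector(d, vocab) for d in docs]
chunks = chunk_tokens(docs[0], 3)
```

Each vector in `vectors` could then be passed to any of the classifiers compared in the paper; the vocabulary size and chunk length here are arbitrary and would need tuning on real data.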
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Baj, M., Walkowiak, T. (2017). Computer Based Stylometric Analysis of Texts in Polish Language. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2017. Lecture Notes in Computer Science, vol 10246. Springer, Cham. https://doi.org/10.1007/978-3-319-59060-8_1
DOI: https://doi.org/10.1007/978-3-319-59060-8_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59059-2
Online ISBN: 978-3-319-59060-8
eBook Packages: Computer Science, Computer Science (R0)