Abstract
Authorship attribution aims to identify the author of an unseen text document based on text samples from different authors. In this paper we focus on authorship attribution of Polish texts using stylometric features based on part-of-speech (POS) tags. Polish is a highly inflected language, and in consequence over 1000 POS tags can be distinguished. This makes it possible to build a sufficiently large feature space by extracting POS information from documents and to classify the documents with machine learning methods. We report the results of experiments conducted with the Weka workbench using combinations of the following features: POS tags, an approximation of their bigrams, and simple document statistics.
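The three feature groups named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the bigram approximation or the document statistics are defined, so the code below assumes plain adjacent-tag pairs as a stand-in for the former and token count plus mean word length for the latter. The function name `pos_features`, the feature-name prefixes, and the example tags are all hypothetical, and the input is assumed to be already tagged.

```python
from collections import Counter


def pos_features(tagged_tokens):
    """Build a sparse feature dict from (word, tag) pairs of one document.

    Assumption: the document has already been POS-tagged by an external tool.
    """
    words = [word for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    n = len(tags)
    if n == 0:
        return {}

    features = {}

    # Group 1: relative frequency of each POS tag.
    for tag, count in Counter(tags).items():
        features["tag=" + tag] = count / n

    # Group 2: relative frequency of adjacent tag pairs. Plain bigrams are
    # used here as a stand-in; the paper's approximation may differ.
    pair_counts = Counter(zip(tags, tags[1:]))
    total_pairs = max(n - 1, 1)
    for (first, second), count in pair_counts.items():
        features["bigram=" + first + "|" + second] = count / total_pairs

    # Group 3: simple document statistics (illustrative choices).
    features["stat=token_count"] = float(n)
    features["stat=mean_word_length"] = sum(len(w) for w in words) / n

    return features


# Toy input; the tag strings are illustrative, not taken from the paper.
doc = [
    ("Ala", "subst:sg:nom:f"),
    ("ma", "fin:sg:ter:imperf"),
    ("kota", "subst:sg:acc:m2"),
]
features = pos_features(doc)
```

Each document then becomes one sparse vector. With over 1000 distinguishable tags the tag and tag-pair groups yield a large feature space, which could be exported to a format such as ARFF for use in Weka.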
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Szwed, P. (2017). Authorship Attribution for Polish Texts Based on Part of Speech Tagging. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_26
DOI: https://doi.org/10.1007/978-3-319-58274-0_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58273-3
Online ISBN: 978-3-319-58274-0
eBook Packages: Computer Science (R0)