Abstract
The large tagset of the IPI PAN Corpus of Polish and the limited size of the learning corpus make the construction of a tagger especially demanding. The goal of this work is to decompose the overall process of tagging Polish into subproblems of partial disambiguation. An architecture of a tagger that facilitates this decomposition is proposed. It enables easy integration of hand-written tagging rules with the rest of the tagger and is open to different types of classifiers. A complete tagger for Polish, called TaKIPI, is also presented. Its configuration, the achieved results (92.55% accuracy for all tokens and 84.75% for ambiguous tokens in a ten-fold test), and the considered variants of the architecture are discussed.
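The idea of decomposing tagging into partial disambiguation steps can be illustrated with a minimal sketch. It is not TaKIPI's implementation: the names `Token`, `Layer`, `apply_layer`, and `tag`, the choice of one layer per tag attribute, the simplified tags, and the toy rule and classifier are all assumptions made for illustration. Each layer first lets hand-written rules discard candidate tags and then asks a pluggable classifier to settle one attribute.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Tag = Dict[str, str]  # simplified tag, e.g. {"pos": "subst", "case": "acc"}


@dataclass
class Token:
    form: str
    candidates: List[Tag]  # tags proposed by a morphological analyser


# A rule returns the candidates of token i that it allows to survive.
Rule = Callable[[List[Token], int], List[Tag]]
# A classifier picks one of the still-possible values of an attribute.
Classifier = Callable[[List[Token], int, List[str]], str]


@dataclass
class Layer:
    attribute: str          # attribute disambiguated by this layer
    rules: List[Rule]       # hand-written rules, applied first
    classifier: Classifier  # any classifier type can be plugged in


def apply_layer(sentence: List[Token], layer: Layer) -> None:
    for i, tok in enumerate(sentence):
        for rule in layer.rules:
            kept = rule(sentence, i)
            if kept:  # a rule is never allowed to delete every candidate
                tok.candidates = kept
        values = sorted({t[layer.attribute] for t in tok.candidates
                         if layer.attribute in t})
        if len(values) > 1:  # still ambiguous: let the classifier decide
            chosen = layer.classifier(sentence, i, values)
            if chosen in values:
                tok.candidates = [t for t in tok.candidates
                                  if t.get(layer.attribute, chosen) == chosen]


def tag(sentence: List[Token], layers: List[Layer]) -> List[Token]:
    for layer in layers:  # each layer is one partial disambiguation step
        apply_layer(sentence, layer)
    return sentence


def after_w_keep_acc_or_loc(sentence: List[Token], i: int) -> List[Tag]:
    # Toy hand-written rule: the preposition "w" governs accusative or locative.
    if i > 0 and sentence[i - 1].form.lower() == "w":
        return [t for t in sentence[i].candidates
                if t.get("case") in ("acc", "loc")]
    return sentence[i].candidates


def prefer(value: str) -> Classifier:
    # Toy stand-in for a trained classifier: a fixed preference.
    return lambda sentence, i, values: value if value in values else values[0]


sentence = [
    Token("Mam", [{"pos": "fin"}, {"pos": "subst", "case": "gen"}]),
    Token("kota", [{"pos": "subst", "case": "gen"},
                   {"pos": "subst", "case": "acc"}]),
    Token("w", [{"pos": "prep"}]),
    Token("domu", [{"pos": "subst", "case": "gen"},
                   {"pos": "subst", "case": "loc"},
                   {"pos": "subst", "case": "voc"}]),
]
layers = [
    Layer("pos", [], prefer("fin")),
    Layer("case", [after_w_keep_acc_or_loc], prefer("acc")),
]
tagged = tag(sentence, layers)
```

In this toy run the first layer resolves the part of speech of "Mam", the rule alone resolves the case of "domu", and the classifier stand-in resolves the case of "kota". Replacing `prefer` with a trained model would leave the layer structure unchanged.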
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Piasecki, M., Godlewski, G. (2006). Effective Architecture of the Polish Tagger. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_27
DOI: https://doi.org/10.1007/11846406_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6
eBook Packages: Computer Science