Abstract
This paper surveys the key results achieved so far in Hungarian POS tagging. The most successful approaches were selected and re-evaluated on a manually annotated corpus of 1.2 million words. Tests were performed in single-domain, multi-domain and cross-domain settings. We also investigate whether the selected POS tagging methods can be improved further by combining them. Our aim is to build a POS tagger that achieves good results on a fine-grained tag set of more than 1000 tags.
Results show that rule-based methods, including Transformation-Based Learning, can be used as effectively as statistical methods for Hungarian POS tagging. Combined methods do increase the tagging accuracy, producing significantly better results than those published earlier. We also show that the optimal combination differs between domain-specific and general-purpose taggers.
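The abstract does not say how the taggers are combined. As a rough illustration only, the sketch below shows one common combination scheme, per-token majority voting over the outputs of several taggers. The function name, the tie-breaking rule (the earliest tagger in the list wins) and the tag labels are assumptions made for this example, not the method used in the paper.

```python
from collections import Counter
from typing import List, Sequence


def combine_by_majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token tags from several taggers by majority vote.

    predictions[i][j] is the tag that tagger i assigned to token j.
    Taggers are assumed to be listed in priority order: when several tags
    are tied for the most votes, the tied tag proposed by the earliest
    tagger in the list wins (a hypothetical tie-breaking rule).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best_count = max(counts.values())
        # First tag, in tagger priority order, that reaches the top vote count.
        best_tag = next(t for t in token_tags if counts[t] == best_count)
        combined.append(best_tag)
    return combined


if __name__ == "__main__":
    # Hypothetical outputs of three taggers for a three-token sentence.
    outputs = [
        ["N", "V", "Adj"],
        ["N", "N", "Adj"],
        ["V", "V", "N"],
    ]
    print(combine_by_majority_vote(outputs))
```

Plain voting treats every tagger equally. Since the abstract reports that the best combination differs between domain-specific and general-purpose taggers, a real system would probably need weights or a learned combiner tuned for each setting.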
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kuba, A., Hócza, A., Csirik, J. (2004). POS Tagging of Hungarian with Combined Statistical and Rule-Based Methods. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_15
DOI: https://doi.org/10.1007/978-3-540-30120-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23049-6
Online ISBN: 978-3-540-30120-2
eBook Packages: Springer Book Archive