POS Tagging of Hungarian with Combined Statistical and Rule-Based Methods

Kuba, András; Hócza, András; Csirik, János

doi:10.1007/978-3-540-30120-2_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3206)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

In this paper we survey the key results achieved so far in Hungarian POS tagging. The most successful approaches have been selected and re-evaluated on a manually annotated corpus containing 1.2 million words. Tests were performed in single-domain, multiple-domain and cross-domain settings. We also investigate whether the selected POS tagging methods can be improved further by combining them. Our aim is to build a POS tagger that achieves good results on a fine-grained tag set of more than 1000 tags.

Results show that rule-based methods, including Transformation-Based Learning, can be used as effectively as statistical methods for Hungarian POS tagging. Combined methods do increase the tagging accuracy, producing significantly better results than those published earlier. We also show that the optimal combination differs between domain-specific and general-purpose taggers.
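The abstract does not say which combination scheme the authors use. As an illustration only, the sketch below shows one common way to combine taggers: a per-token plurality vote over the outputs of several taggers, with ties resolved in favour of a designated tagger. The function names (`majority_vote`, `accuracy`), the `fallback` parameter, and the toy tags are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], fallback: int = 0) -> List[str]:
    """Combine per-token tags from several taggers by plurality vote.

    predictions[i][j] is the tag that tagger i assigns to token j.
    When several tags share the top count, the tag of the tagger at
    index `fallback` wins if it is among them; otherwise the first
    top-scoring tag in tagger order is used.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        winners = {tag for tag, count in counts.items() if count == best}
        # Check the fallback tagger first, then the others in order.
        ordered = (token_tags[fallback],) + token_tags
        combined.append(next(tag for tag in ordered if tag in winners))
    return combined


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy outputs of three hypothetical taggers on a three-token sentence.
tagger_a = ["N", "V", "N"]
tagger_b = ["N", "N", "ADJ"]
tagger_c = ["ADJ", "V", "V"]
gold = ["N", "V", "ADJ"]

combined = majority_vote([tagger_a, tagger_b, tagger_c], fallback=0)
print(combined, accuracy(combined, gold))
```

On the third token all three taggers disagree, so the vote falls back to the first tagger. A weighted vote, in which each tagger's vote counts in proportion to its accuracy on held-out data, is another option that could plausibly behave differently for domain-specific and general-purpose settings.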

Author information

Authors and Affiliations

Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, H-6720, Szeged, Aradi vértanúk tere 1., Hungary
András Kuba & János Csirik
Department of Informatics, University of Szeged, H-6720, Szeged, Árpád tér 2., Hungary
András Hócza

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Karel Pala

Cite this paper

Kuba, A., Hócza, A., Csirik, J. (2004). POS Tagging of Hungarian with Combined Statistical and Rule-Based Methods. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_15

DOI: https://doi.org/10.1007/978-3-540-30120-2_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23049-6
Online ISBN: 978-3-540-30120-2
eBook Packages: Springer Book Archive
