POS Tagging of Hungarian with Combined Statistical and Rule-Based Methods

  • Conference paper
Text, Speech and Dialogue (TSD 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3206)

Abstract

This paper surveys the key results achieved so far in Hungarian POS tagging. The most successful approaches were selected and re-evaluated on a manually annotated corpus of 1.2 million words. Tests were run in single-domain, multiple-domain, and cross-domain settings. We also investigate whether the selected POS tagging methods can be improved further by combining them. Our aim is to build a POS tagger that achieves good results on a fine-grained tag set of more than 1000 tags.

The results show that rule-based methods, including Transformation-Based Learning, can be used as effectively as statistical methods for Hungarian POS tagging. Combined methods do increase tagging accuracy, producing significantly better results than those published earlier. We also show that the optimal combination differs between domain-specific and general-purpose taggers.
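
The abstract does not say which combination scheme is used, so the following is only a minimal sketch of one common option, per-token majority voting over the outputs of several taggers. The function names, the three toy taggers, and the tag labels are all hypothetical and are not taken from the paper or its tag set.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token tags from several taggers by simple majority voting.

    predictions[i][j] is the tag that tagger i assigns to token j.
    Ties are broken in favour of the tagger listed first.
    """
    if not predictions:
        return []
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data (hypothetical): each tagger makes one error on a different token.
gold = ["Det", "N", "V", "N"]
tagger_a = ["Det", "N", "V", "Adj"]
tagger_b = ["Det", "V", "V", "N"]
tagger_c = ["Det", "N", "N", "N"]

combined = majority_vote([tagger_a, tagger_b, tagger_c])
```

In this toy case each tagger is wrong on a different token, so the vote recovers every gold tag. Voting helps only when the taggers' errors are not strongly correlated, which is one reason a combination that works for a domain-specific tagger need not be the best one for a general-purpose tagger.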

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuba, A., Hócza, A., Csirik, J. (2004). POS Tagging of Hungarian with Combined Statistical and Rule-Based Methods. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_15

  • DOI: https://doi.org/10.1007/978-3-540-30120-2_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23049-6

  • Online ISBN: 978-3-540-30120-2

  • eBook Packages: Springer Book Archive
