Abstract
In this study, we assess the performance of purely statistical approaches using supervised machine learning for predicting case in German (nominative, accusative, dative, genitive, n/a). We experiment with two different treebanks containing morphological annotations: TIGER and TUEBA. An evaluation with 10-fold cross-validation serves as the basis for systematic comparisons of the optimal parametrizations of different approaches. We test taggers based on Hidden Markov Models (HMM), Decision Trees, and Conditional Random Fields (CRF). The CRF approach based on our hand-crafted feature model achieves an accuracy of about 94%. This outperforms all other approaches and results in an improvement of 11% compared to a baseline HMM trigram tagger and an improvement of 2% compared to a state-of-the-art tagger for rich morphological tagsets. Moreover, we investigate the effect of additional (morphological) categories (gender, number, person, part of speech) in the internal tagset used for training. Rich internal tagsets improve results for all tested approaches.
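The distinction between a rich internal tagset and the external case labels, together with the 10-fold evaluation, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the dot-separated tag format (for example `NN.Acc.Sg.Fem`), the label names `Nom`, `Acc`, `Dat`, `Gen`, and the helper names `project_to_case`, `case_accuracy`, and `k_fold_splits`. A tagger could be trained on the rich tags, and both its output and the gold standard would then be projected to case before scoring.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical external labels; tokens without case map to "n/a".
CASES = {"Nom", "Acc", "Dat", "Gen"}


def project_to_case(internal_tag: str) -> str:
    """Reduce a rich internal tag (assumed dot-separated) to its case label."""
    for part in internal_tag.split("."):
        if part in CASES:
            return part
    return "n/a"


def case_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Token-level accuracy on case only, ignoring all other categories."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(
        project_to_case(g) == project_to_case(p) for g, p in zip(gold, predicted)
    )
    return correct / len(gold)


def k_fold_splits(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous folds.

    Fold sizes differ by at most one item; every item is tested exactly once.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

In this sketch a prediction such as `NN.Acc.Pl.Fem` against a gold tag `NN.Acc.Sg.Fem` still counts as correct, because only the projected case label is compared; the extra categories serve the tagger internally and do not enter the score.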
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Clematide, S. (2013). A Case Study in Tagging Case in German: An Assessment of Statistical Approaches. In: Mahlow, C., Piotrowski, M. (eds) Systems and Frameworks for Computational Morphology. SFCM 2013. Communications in Computer and Information Science, vol 380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40486-3_2
DOI: https://doi.org/10.1007/978-3-642-40486-3_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40485-6
Online ISBN: 978-3-642-40486-3
eBook Packages: Computer Science, Computer Science (R0)