Abstract
This paper presents the first work on POS tagging of German Twitter data and shows that, despite the noisy and often cryptic nature of the data, a fine-grained analysis of POS tags on Twitter microtext is feasible. Our CRF-based tagger achieves an accuracy of around 89% when trained with LDA word clusters, features from an automatically created dictionary, and additional out-of-domain training data.
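As a rough illustration of how word-cluster and dictionary information might be fed to a CRF tagger, the sketch below builds one feature dictionary per token, in the key/value style that CRF toolkits commonly accept. It is not the paper's feature set: the names WORD_CLUSTERS, AUTO_DICT, token_features and sentence_features, the toy entries, and the choice of surface features are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical toy resources. In practice the cluster IDs would come from
# LDA word clusters induced on unlabelled text, and the tag hints from an
# automatically created dictionary; the real resources are not shown here.
WORD_CLUSTERS: Dict[str, int] = {"ich": 3, "bin": 7, "müde": 12}
AUTO_DICT: Dict[str, str] = {"ich": "PPER", "bin": "VAFIN", "müde": "ADJD"}


def token_features(tokens: List[str], i: int) -> Dict[str, str]:
    """Build a feature dict for the token at position i (illustrative only)."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "lower": lower,
        "suffix3": lower[-3:],
        # Twitter-specific surface cues.
        "is_hashtag": str(word.startswith("#")),
        "is_mention": str(word.startswith("@")),
        "is_url": str(lower.startswith(("http://", "https://"))),
        # Cluster ID of the word, or "UNK" if the word has no cluster.
        "cluster": str(WORD_CLUSTERS.get(lower, "UNK")),
        # Tag hint from the dictionary, or "UNK" if the word is not listed.
        "dict_tag": AUTO_DICT.get(lower, "UNK"),
    }
    # Neighbouring words give the sequence model some local context.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Feature dicts for every token of one tweet."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Calling sentence_features(["@anna", "ich", "bin", "müde", "#montag"]) yields five feature dicts that could be passed, together with gold tags, to a CRF trainer. Cluster and dictionary features let the model generalise to spelling variants and rare words that appear in the unlabelled data but not in the small annotated training set.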
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rehbein, I. (2013). Fine-Grained POS Tagging of German Tweets. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_17
DOI: https://doi.org/10.1007/978-3-642-40722-2_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40721-5
Online ISBN: 978-3-642-40722-2
eBook Packages: Computer Science, Computer Science (R0)