Abstract
Annotated corpora are valuable resources for NLP that are often costly to create. We introduce a method for transferring annotation from a morphologically annotated corpus of a source language to a target language. Our approach assumes only that an unannotated text corpus exists for the target language and that a simple textbook describing the basic morphological properties of that language is available. Our paper describes experiments with Polish, Czech, and Russian; however, the method is not tied in any way to these languages. In all the experiments we use the TnT tagger ([3]), a second-order Markov model. Our approach assumes that the information acquired about one language can be used for processing a related language. We have found that even breathtakingly naive approximations (such as approximating the Russian transitions by Czech and/or Polish transitions, and approximating the Russian emissions by manually or automatically derived Czech cognates) can lead to a significant improvement in the tagger’s performance.
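The idea of borrowing transitions from a related language can be illustrated with a minimal sketch. It is not the authors' implementation and does not use TnT: for brevity it uses a first-order (bigram) Markov model rather than a second-order one, a three-tag toy tagset, and a tiny hand-made lexicon of transliterated Russian words. All function names, tag sequences, and lexicon entries are assumptions made up for the illustration.

```python
import math
from collections import Counter, defaultdict

def estimate_transitions(tag_sequences, smoothing=0.1):
    """Smoothed bigram tag-transition log-probabilities from source-language tags."""
    counts = defaultdict(Counter)
    tagset = set()
    for sequence in tag_sequences:
        prev = "<s>"
        for tag in sequence:
            counts[prev][tag] += 1
            tagset.add(tag)
            prev = tag
    tags = sorted(tagset)
    trans = {}
    for prev in ["<s>"] + tags:
        total = sum(counts[prev].values()) + smoothing * len(tags)
        trans[prev] = {t: math.log((counts[prev][t] + smoothing) / total) for t in tags}
    return trans, tags

def emission_score(word, tag, lexicon, n_tags):
    """Crude emission stand-in: uniform over the tags the lexicon allows.
    Words missing from the lexicon may take any tag."""
    allowed = lexicon.get(word)
    if allowed is None:
        return -math.log(n_tags)
    return -math.log(len(allowed)) if tag in allowed else float("-inf")

def viterbi(words, trans, tags, lexicon):
    """Best tag sequence under source transitions and target-side emissions.
    Assumes every lexicon entry allows at least one tag from `tags`."""
    best = {"<s>": (0.0, [])}  # last tag -> (log score, best path so far)
    for word in words:
        new_best = {}
        for tag in tags:
            emit = emission_score(word, tag, lexicon, len(tags))
            if emit == float("-inf"):
                continue
            score, path = max(
                ((s + trans[prev][tag] + emit, p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical source-language (e.g. Czech) tag sequences: N noun, V verb, A adjective.
source_tag_sequences = [["N", "V", "N"], ["A", "N", "V"], ["N", "V", "A", "N"]]
# Hypothetical target-language lexicon; "stali" is ambiguous between noun and verb.
target_lexicon = {"deti": {"N"}, "stali": {"N", "V"}, "bolshie": {"A"}}

trans, tags = estimate_transitions(source_tag_sequences)
predicted = viterbi(["deti", "stali", "bolshie"], trans, tags, target_lexicon)
print(predicted)
```

In this toy setting the ambiguous word is resolved purely by the borrowed transitions: the source sequences make a verb after a noun far more likely than a second noun, so the verb reading is chosen.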
References
Agirre, E., Atutxa, A., Gojenola, K., Sarasola, K.: Exploring Portability of syntactic information from English to Basque. In: Proceedings of LREC 2004, Lisbon, Portugal (2004)
Bémová, A., Hajič, J., Hladká, B., Panevová, J.: Morphological and Syntactic Tagging of the Prague Dependency Treebank. In: Proceedings of ATALA Workshop, Paris, France, pp. 21–29 (1999)
Brants, T.: TnT: A Statistical Part-of-Speech Tagger. In: Proceedings of ANLP-NAACL, pp. 224–231 (2000)
Hajič, J.: Morphological Tagging: Data vs. Dictionaries. In: Proceedings of ANLP-NAACL Conference, Seattle, WA, USA, pp. 94–101 (2000)
Hana, J.: Knowledge and labor light morphological analysis of Czech and Russian. Ms. Linguistic Department. The Ohio State University (2005)
Hana, J., Feldman, A., Brew, C.: A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 222–229 (2004)
Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., Kolak, O.: Bootstrapping Parsers via Syntactic Projection across Parallel Texts. Natural Language Engineering 1(1), 1–15 (2004)
Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330 (1993)
Przepiórkowski, A.: The IPI PAN Corpus: Preliminary version. IPI PAN, Warszawa (2004)
Yarowsky, D., Wicentowski, R.: Minimally Supervised Morphological Analysis by Multimodal Alignment. In: Proceedings of the 38th Meeting of the Association for Computational Linguistics, pp. 208–216 (2000)
Yarowsky, D., Ngai, G.: Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In: Proceedings of NAACL-2001, pp. 200–207 (2001)
Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In: Proceedings of HLT 2001, First International Conference on Human Language Technology Research (2001)
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feldman, A., Hana, J., Brew, C. (2006). Experiments in Cross-Language Morphological Annotation Transfer. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2006. Lecture Notes in Computer Science, vol 3878. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671299_4
DOI: https://doi.org/10.1007/11671299_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32205-4
Online ISBN: 978-3-540-32206-1
eBook Packages: Computer Science, Computer Science (R0)