Abstract
We implement a variant of the algorithm described by Yarowsky and Ngai in [21] to induce an HMM POS tagger for an arbitrary target language using only an existing POS tagger for a source language and an unannotated parallel corpus between the source and target languages. We extend this work by projecting from multiple source languages onto a single target language. We hypothesize that systematic transfer errors from differing source languages will cancel out, improving the quality of bootstrapped resources in the target language. Our experiments confirm the hypothesis. Each experiment compares three cases: (a) source data comes from a single language A, (b) source data comes from a single language B, and (c) source data comes from both A and B, but half as much from each. Apart from the source language, other conditions are held constant in all three cases – including the total amount of source data used. The null hypothesis is that performance in the mixed case would be an average of performance in the single-language cases, but in fact, mixed-case performance always exceeds the maximum of the single-language cases. We observed this effect in all six experiments we ran, involving three different source-language pairs and two different target languages.
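The core idea of projecting tags across an aligned corpus and pooling evidence from several source languages can be illustrated with a minimal sketch. This is not the authors' implementation: the alignment format, the function names (`project_tags`, `pooled_tag_counts`, `emission_estimates`), and the toy data are all assumptions made for illustration. The sketch also omits HMM transition estimation and any robustness steps used in [21]; it only shows how projected tags from two source languages might be combined into relative-frequency estimates of P(tag | word).

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignment):
    """Copy each source token's tag onto the target token it aligns to.

    source_tags: list of POS tags, one per source token.
    alignment: list of (source_index, target_index) pairs (hypothetical format).
    Returns {target_index: tag}; unaligned target tokens receive no tag.
    """
    projected = {}
    for src_i, tgt_i in alignment:
        # If several source tokens align to one target token, keep the first.
        projected.setdefault(tgt_i, source_tags[src_i])
    return projected


def pooled_tag_counts(examples):
    """Accumulate projected (word, tag) counts over aligned sentence pairs.

    examples: iterable of (target_tokens, source_tags, alignment) triples,
    which may come from any mix of source languages.
    """
    counts = defaultdict(Counter)
    for target_tokens, source_tags, alignment in examples:
        for tgt_i, tag in project_tags(source_tags, alignment).items():
            counts[target_tokens[tgt_i]][tag] += 1
    return counts


def emission_estimates(counts):
    """Relative-frequency estimate of P(tag | word) from pooled counts."""
    return {
        word: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for word, tag_counts in counts.items()
    }


# Toy data (invented): target sentences aligned to tagged source sentences.
from_source_a = [(["x", "y"], ["DET", "NOUN"], [(0, 0), (1, 1)])]
from_source_b = [
    (["x", "y"], ["DET", "VERB"], [(0, 0), (1, 1)]),
    (["y"], ["NOUN"], [(0, 0)]),
]

counts = pooled_tag_counts(from_source_a + from_source_b)
estimates = emission_estimates(counts)
```

In this toy setup, a tag that one source language projects incorrectly onto the target word `y` is outvoted in the pooled counts by projections from the other source, which is the intuition behind the hypothesis that systematic transfer errors from different source languages cancel out.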
References
Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J., Purdy, D., Smith, N., Yarowsky, D.: Statistical machine translation. In: Johns Hopkins University 1999 Summer Workshop on Language Engineering (1999)
Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the 6th Applied NLP Conference, ANLP-2000, Seattle, WA, April 29 – May 3 (2000)
Brill, E.: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4), 543–565 (1995)
Brill, E., Wu, J.: Classifier Combination for Improving Lexical Disambiguation. In: Proceedings of the ACL (1998)
Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V.J., Jelinek, F., Lafferty, J.D., Mercer, R.L., Roossin, P.S.: A Statistical Approach to Machine Translation. Computational Linguistics 16(2), 79–85 (1990)
Clark, S., Curran, J., Osborne, M.: Bootstrapping POS taggers using unlabelled data. In: Daelemans, W., Osborne, M. (eds.) Proceedings of CoNLL-2003, Edmonton, Canada, pp. 49–55 (2003)
Collins, M., Hajic, J., Ramshaw, L., Tillmann, C.: A Statistical Parser for Czech. In: Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland (1999)
Cucerzan, S., Yarowsky, D.: Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL) (2002)
Gimenez, J., Marquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal (2004)
Gollins, T., Sanderson, M.: Improving Cross Language Information Retrieval with Triangulated Translation. In: Proceedings of the 24th annual international ACM SIGIR conference, pp. 90–95 (2001)
French-English Hansards Corpus of Canadian Parliamentary Proceedings
Hajic, J., Hladka, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: COLING-ACL, pp. 483–490 (1998)
Hajic, J., Krbec, P., Kevton, P., Oliva, K., Petkevic, V.: Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In: Proceedings of the ACL (2001)
Henderson, J.C., Brill, E.: Exploiting Diversity in Natural Language Processing: Combining Parsers. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 187–194 (1999)
Hwa, R., Resnik, P., Weinberg, A.: Breaking the Resource Bottleneck for Multilingual Parsing. In: Proceedings of the Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data (2002)
Mann, G., Yarowsky, D.: Multipath translation lexicon induction via bridge languages. In: Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 151–158 (2001)
Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989)
Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994)
van Halteren, H., Zavrel, J., Daelemans, W.: Improving Data Driven Wordclass Tagging by System Combination. In: Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics, pp. 491–497 (1998)
Witten, I., Bell, T.: The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4), 1085–1094 (1991)
Yarowsky, D., Ngai, G.: Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In: Proceedings of NAACL, pp. 200–207 (2001)
Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In: Proceedings of HLT (2001)
Zavrel, J., Daelemans, W.: Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers. In: Proceedings of LREC-2000, Athens (2000)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fossum, V., Abney, S. (2005). Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_75
DOI: https://doi.org/10.1007/11562214_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)