Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora

Fossum, Victoria; Abney, Steven

doi:10.1007/11562214_75

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

We implement a variant of the algorithm described by Yarowsky and Ngai in [21] to induce an HMM POS tagger for an arbitrary target language using only an existing POS tagger for a source language and an unannotated parallel corpus between the source and target languages. We extend this work by projecting from multiple source languages onto a single target language. We hypothesize that systematic transfer errors from differing source languages will cancel out, improving the quality of bootstrapped resources in the target language. Our experiments confirm the hypothesis. Each experiment compares three cases: (a) source data comes from a single language A, (b) source data comes from a single language B, and (c) source data comes from both A and B, but half as much from each. Apart from the source language, other conditions are held constant in all three cases – including the total amount of source data used. The null hypothesis is that performance in the mixed case would be an average of performance in the single-language cases, but in fact, mixed-case performance always exceeds the maximum of the single-language cases. We observed this effect in all six experiments we ran, involving three different source-language pairs and two different target languages.
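The mixed-source condition (c) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-to-one word alignments, copies tags directly across them, and estimates only per-word tag probabilities rather than a full HMM. The function names, the tag labels, and the toy sentences are all hypothetical.

```python
from collections import Counter, defaultdict


def project_tags(source_tags, alignment):
    """Copy each source token's tag onto the target token aligned to it.

    source_tags: list of POS tags, one per source token.
    alignment: list of (source_index, target_index) pairs.
    Returns {target_index: tag}; unaligned target tokens get no tag.
    """
    projected = {}
    for src_i, tgt_i in alignment:
        # If several source tokens align to one target token, keep the first.
        projected.setdefault(tgt_i, source_tags[src_i])
    return projected


def pooled_tag_counts(examples):
    """Count projected tags per target word over sentence pairs.

    Each example is (target_tokens, source_tags, alignment). The examples
    may come from any mix of source languages; the counts are simply pooled.
    """
    counts = defaultdict(Counter)
    for target_tokens, source_tags, alignment in examples:
        for tgt_i, tag in project_tags(source_tags, alignment).items():
            counts[target_tokens[tgt_i]][tag] += 1
    return counts


def tag_probs(counts):
    """Normalize the per-word tag counts into relative frequencies."""
    return {
        word: {tag: c / sum(tag_counts.values()) for tag, c in tag_counts.items()}
        for word, tag_counts in counts.items()
    }


# Hypothetical toy data: target sentences aligned to tagged source sentences.
from_source_a = [
    (["la", "casa"], ["DET", "NOUN"], [(0, 0), (1, 1)]),
]
from_source_b = [
    (["la", "casa"], ["PRON", "NOUN"], [(0, 0), (1, 1)]),  # a transfer error on "la"
    (["la", "mesa"], ["DET", "NOUN"], [(0, 0), (1, 1)]),
]

# Case (c): pool the projections from both source languages.
counts = pooled_tag_counts(from_source_a + from_source_b)
probs = tag_probs(counts)
```

In this toy data the erroneous projection from source B onto "la" is outvoted once the counts from both sources are pooled, which is the intuition behind the hypothesis that source-specific transfer errors cancel out.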

References

1. Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J., Purdy, D., Smith, N., Yarowsky, D.: Statistical machine translation. In: Johns Hopkins University 1999 Summer Workshop on Language Engineering (1999)
2. Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the 6th Applied NLP Conference, ANLP-2000, Seattle, WA, April 29 – May 3 (2000)
3. Brill, E.: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4), 543–565 (1995)
4. Brill, E., Wu, J.: Classifier Combination for Improving Lexical Disambiguation. In: Proceedings of the ACL (1998)
5. Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V.J., Jelinek, F., Lafferty, J.D., Mercer, R.L., Roossin, P.S.: A Statistical Approach to Machine Translation. Computational Linguistics 16(2), 79–85 (1990)
6. Clark, S., Curran, J., Osborne, M.: Bootstrapping POS taggers using unlabelled data. In: Daelemans, W., Osborne, M. (eds.) Proceedings of CoNLL-2003, Edmonton, Canada, pp. 49–55 (2003)
7. Collins, M., Hajic, J., Ramshaw, L., Tillmann, C.: A Statistical Parser for Czech. In: Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland (1999)
8. Cucerzan, S., Yarowsky, D.: Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL) (2002)
9. Gimenez, J., Marquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal (2004)
10. Gollins, T., Sanderson, M.: Improving Cross Language Information Retrieval with Triangulated Translation. In: Proceedings of the 24th annual international ACM SIGIR conference, pp. 90–95 (2001)
11. French-English Hansards Corpus of Canadian Parliamentary Proceedings
12. Hajic, J., Hladka, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: COLING-ACL, pp. 483–490 (1998)
13. Hajic, J., Krbec, P., Kevton, P., Oliva, K., Petkevic, V.: Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In: Proceedings of the ACL (2001)
14. Henderson, J.C., Brill, E.: Exploiting Diversity in Natural Language Processing: Combining Parsers. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 187–194 (1999)
15. Hwa, R., Resnik, P., Weinberg, A.: Breaking the Resource Bottleneck for Multilingual Parsing. In: Proceedings of the Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data (2002)
16. Mann, G., Yarowsky, D.: Multipath translation lexicon induction via bridge languages. In: Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 151–158 (2001)
17. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989)
18. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994)
19. van Halteren, H., Zavrel, J., Daelemans, W.: Improving Data Driven Wordclass Tagging by System Combination. In: Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics, pp. 491–497 (1998)
20. Witten, I., Bell, T.: The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4), 1085–1094 (1991)
21. Yarowsky, D., Ngai, G.: Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In: Proceedings of NAACL, pp. 200–207 (2001)
22. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In: Proceedings of HLT (2001)
23. Zavrel, J., Daelemans, W.: Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers. In: Proceedings of LREC-2000, Athens (2000)

Author information

Authors and Affiliations

Dept. of EECS, University of Michigan, Ann Arbor, MI, 48105
Victoria Fossum
Dept. of Linguistics, University of Michigan, Ann Arbor, MI, 48105
Steven Abney

Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

Cite this paper

Fossum, V., Abney, S. (2005). Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_75

DOI: https://doi.org/10.1007/11562214_75
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)
