Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora

  • Conference paper
Natural Language Processing – IJCNLP 2005 (IJCNLP 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Abstract

We implement a variant of the algorithm described by Yarowsky and Ngai in [21] to induce an HMM POS tagger for an arbitrary target language using only an existing POS tagger for a source language and an unannotated parallel corpus between the source and target languages. We extend this work by projecting from multiple source languages onto a single target language. We hypothesize that systematic transfer errors from differing source languages will cancel out, improving the quality of bootstrapped resources in the target language. Our experiments confirm the hypothesis. Each experiment compares three cases: (a) source data comes from a single language A, (b) source data comes from a single language B, and (c) source data comes from both A and B, but half as much from each. Apart from the source language, other conditions are held constant in all three cases – including the total amount of source data used. The null hypothesis is that performance in the mixed case would be an average of performance in the single-language cases, but in fact, mixed-case performance always exceeds the maximum of the single-language cases. We observed this effect in all six experiments we ran, involving three different source-language pairs and two different target languages.
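
The three-way comparison described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' code: the function names `build_conditions` and `compare`, the representation of source data as plain sequences of sentences, and the choice to take the first sentences from each corpus are all assumptions made for the example. Any accuracy values passed to `compare` are placeholders supplied by the caller, not results from the paper.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_conditions(
    source_a: Sequence[T], source_b: Sequence[T], budget: int
) -> Tuple[List[T], List[T], List[T]]:
    """Build the three training sets with the same total amount of source data.

    (a) `budget` sentences from language A only,
    (b) `budget` sentences from language B only,
    (c) `budget // 2` sentences from each of A and B.
    """
    if budget <= 0 or budget % 2 != 0:
        raise ValueError("budget must be a positive even number")
    if len(source_a) < budget or len(source_b) < budget:
        raise ValueError("each source corpus needs at least `budget` sentences")
    half = budget // 2
    only_a = list(source_a[:budget])
    only_b = list(source_b[:budget])
    mixed = list(source_a[:half]) + list(source_b[:half])
    return only_a, only_b, mixed


def compare(acc_a: float, acc_b: float, acc_mixed: float) -> str:
    """Classify the mixed-case accuracy against the null hypothesis.

    The null hypothesis is that the mixed case performs at the average of
    the two single-language cases.
    """
    null_expectation = (acc_a + acc_b) / 2
    if acc_mixed > max(acc_a, acc_b):
        return "exceeds_max"
    if acc_mixed > null_expectation:
        return "exceeds_average"
    return "consistent_with_null"
```

Holding `budget` fixed across the three conditions is what makes the comparison fair: a gain in the mixed case then cannot be attributed to simply having more source data.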

References

  1. Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J., Purdy, D., Smith, N., Yarowsky, D.: Statistical machine translation. In: Johns Hopkins University 1999 Summer Workshop on Language Engineering (1999)

  2. Brants, T.: TnT – a statistical part-of-speech tagger. In: Proceedings of the 6th Applied NLP Conference, ANLP-2000, Seattle, WA, April 29 – May 3 (2000)

  3. Brill, E.: Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging. Computational Linguistics 21(4), 543–565 (1995)

  4. Brill, E., Wu, J.: Classifier Combination for Improving Lexical Disambiguation. In: Proceedings of the ACL (1998)

  5. Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V.J., Jelinek, F., Lafferty, J.D., Mercer, R.L., Roossin, P.S.: A Statistical Approach to Machine Translation. Computational Linguistics 16(2), 79–85 (1990)

  6. Clark, S., Curran, J., Osborne, M.: Bootstrapping POS taggers using unlabelled data. In: Daelemans, W., Osborne, M. (eds.) Proceedings of CoNLL-2003, Edmonton, Canada, pp. 49–55 (2003)

  7. Collins, M., Hajic, J., Ramshaw, L., Tillmann, C.: A Statistical Parser for Czech. In: Proceedings of the 37th Annual Meeting of the ACL, College Park, Maryland (1999)

  8. Cucerzan, S., Yarowsky, D.: Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day. In: Proceedings of the Sixth Conference on Natural Language Learning (CoNLL) (2002)

  9. Gimenez, J., Marquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal (2004)

  10. Gollins, T., Sanderson, M.: Improving Cross Language Information Retrieval with Triangulated Translation. In: Proceedings of the 24th annual international ACM SIGIR conference, pp. 90–95 (2001)

  11. French-English Hansards Corpus of Canadian Parliamentary Proceedings

  12. Hajic, J., Hladka, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: COLING-ACL, pp. 483–490 (1998)

  13. Hajic, J., Krbec, P., Kevton, P., Oliva, K., Petkevic, V.: Serial Combination of Rules and Statistics: A Case Study in Czech Tagging. In: Proceedings of the ACL (2001)

  14. Henderson, J.C., Brill, E.: Exploiting Diversity in Natural Language Processing: Combining Parsers. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 187–194 (1999)

  15. Hwa, R., Resnik, P., Weinberg, A.: Breaking the Resource Bottleneck for Multilingual Parsing. In: Proceedings of the Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data (2002)

  16. Mann, G., Yarowsky, D.: Multipath translation lexicon induction via bridge languages. In: Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 151–158 (2001)

  17. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989)

  18. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK (1994)

  19. van Halteren, H., Zavrel, J., Daelemans, W.: Improving Data Driven Wordclass Tagging by System Combination. In: Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics, pp. 491–497 (1998)

  20. Witten, I., Bell, T.: The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37(4), 1085–1094 (1991)

  21. Yarowsky, D., Ngai, G.: Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora. In: Proceedings of NAACL, pp. 200–207 (2001)

  22. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora. In: Proceedings of HLT (2001)

  23. Zavrel, J., Daelemans, W.: Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers. In: Proceedings of LREC-2000, Athens (2000)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fossum, V., Abney, S. (2005). Automatically Inducing a Part-of-Speech Tagger by Projecting from Multiple Source Languages Across Aligned Corpora. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_75

  • DOI: https://doi.org/10.1007/11562214_75

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29172-5

  • Online ISBN: 978-3-540-31724-1

  • eBook Packages: Computer Science, Computer Science (R0)
