Abstract
Although corpus-based approaches to machine translation (MT) are attracting growing interest, they cannot be applied to less-resourced language pairs for which no parallel corpora are available; in those cases, the rule-based approach is the only applicable solution. Most rule-based MT systems use part-of-speech (PoS) taggers to resolve PoS ambiguities in the source-language (SL) texts to be translated, and they need accurate PoS taggers to produce reliable translations in the target language (TL). The standard statistical approach to PoS ambiguity resolution (or tagging) uses hidden Markov models (HMMs) trained either in a supervised way from hand-tagged corpora, an expensive resource that is not always available, or in an unsupervised way through the Baum-Welch expectation-maximization algorithm; both methods use information only from the language being tagged. However, when tagging is an intermediate task in the translation procedure, that is, when the PoS tagger is to be embedded as a module within an MT system, information from the TL can be used in the training phase, without supervision, to increase the translation quality of the whole MT system. This paper presents a method for training HMM-based PoS taggers to be used in MT; the new method uses not only information from the SL, as general-purpose methods do, but also information from the TL and from the remaining modules of the MT system in which the PoS tagger is to be embedded. We find that the translation quality of the MT system embedding a PoS tagger trained in an unsupervised manner through this new method is clearly better than that of the same MT system embedding a PoS tagger trained through the Baum-Welch algorithm, and comparable to that obtained by embedding a PoS tagger trained in a supervised way from hand-tagged corpora.
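To make the idea of TL-informed training concrete, the following is a minimal sketch of one way it could work. It is an illustration under stated assumptions, not the procedure of the paper, whose estimation details the abstract does not give. The sketch enumerates every PoS disambiguation of an SL segment, translates each one, scores the translation with a TL model, and uses the normalised scores as fractional counts for the HMM parameters. The names `tl_driven_counts`, `normalise`, `toy_translate`, and `toy_tl_score`, as well as the toy lexicon and scores, are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def tl_driven_counts(segments, translate, tl_score):
    """Collect fractional tag-bigram and emission counts.

    segments: list of segments, each a list of (word, candidate_tags).
    translate: assumed stand-in for the rest of the MT system; maps a
               list of (word, tag) pairs to a TL string.
    tl_score:  assumed stand-in for a TL language model; maps a TL
               string to a non-negative likelihood.
    """
    trans = defaultdict(float)  # (previous_tag, tag) -> fractional count
    emit = defaultdict(float)   # (tag, word) -> fractional count
    for seg in segments:
        words = [w for w, _ in seg]
        # Every possible disambiguation path of the segment.
        paths = list(product(*(tags for _, tags in seg)))
        scores = [tl_score(translate(list(zip(words, p)))) for p in paths]
        total = sum(scores)
        if total == 0:
            continue  # the TL model gives no evidence for this segment
        for path, score in zip(paths, scores):
            weight = score / total  # estimated probability of this path
            for word, tag in zip(words, path):
                emit[(tag, word)] += weight
            for prev, nxt in zip(path, path[1:]):
                trans[(prev, nxt)] += weight
    return trans, emit


def normalise(counts):
    """Turn (condition, outcome) counts into conditional probabilities."""
    totals = defaultdict(float)
    for (cond, _), c in counts.items():
        totals[cond] += c
    return {key: c / totals[key[0]] for key, c in counts.items()}


# Toy stand-ins (hypothetical data): a word-for-word "MT system"
# and a lookup-table "language model".
LEXICON = {("the", "DET"): "la", ("run", "NOUN"): "carrera",
           ("run", "VERB"): "corre"}
TL_SCORES = {"la carrera": 0.9, "la corre": 0.1}


def toy_translate(tagged):
    return " ".join(LEXICON[pair] for pair in tagged)


def toy_tl_score(sentence):
    return TL_SCORES.get(sentence, 0.0)


if __name__ == "__main__":
    segments = [[("the", ["DET"]), ("run", ["NOUN", "VERB"])]]
    trans, emit = tl_driven_counts(segments, toy_translate, toy_tl_score)
    print(normalise(trans))
    print(normalise(emit))
```

In this toy setting, the TL model prefers the translation produced by the noun reading of "run", so the determiner-to-noun transition receives most of the probability mass without any hand-tagged SL data. Enumerating all paths grows exponentially with the number of ambiguous words, so a practical system would need to keep segments short or prune paths.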
Cite this article
Sánchez-Martínez, F., Pérez-Ortiz, J.A. & Forcada, M.L. Using target-language information to train part-of-speech taggers for machine translation. Machine Translation 22, 29–66 (2008). https://doi.org/10.1007/s10590-008-9044-3