Abstract
The availability of machine-readable bilingual linguistic resources is crucial not only for rule-based machine translation but also for other applications such as cross-lingual information retrieval. However, building such resources (bilingual single-word and multi-word correspondences, translation rules) demands extensive manual work and, as a consequence, bilingual resources are usually more difficult to find than “shallow” monolingual resources such as morphological dictionaries or part-of-speech taggers, especially when they involve a less-resourced language. This paper describes a methodology to automatically build both bilingual dictionaries and shallow-transfer rules by extracting knowledge from word-aligned parallel corpora processed with shallow monolingual resources (morphological analysers and part-of-speech taggers). We present experiments for Brazilian Portuguese–Spanish and Brazilian Portuguese–English parallel texts. The results show that the proposed methodology can enable the rapid creation of valuable computational resources (bilingual dictionaries and shallow-transfer rules) for machine translation and other natural language processing tasks.
Cite this article
Caseli, H.M., Nunes, M.d.G.V. & Forcada, M.L. Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation. Machine Translation 20, 227–245 (2006). https://doi.org/10.1007/s10590-007-9027-9