Abstract
Statistical machine translation (SMT) is based on alignment models that learn the word correspondences between source and target language from bilingual corpora. These models are assumed to be capable of learning reorderings. However, the difference in word order between two languages is one of the most important sources of errors in SMT. In this paper, we show that SMT can take advantage of inductive learning to solve reordering problems. Given a word alignment, we identify those pairs of consecutive source blocks (sequences of words) whose translation is swapped, i.e. those blocks which, if swapped, generate a correct monotonic translation. Afterwards, we classify these pairs into groups, recursively following a block co-occurrence criterion, in order to infer reorderings. Within each group, we allow new internal combinations in order to generalize the reordering to unseen pairs of blocks. Then, we identify the pairs of blocks in the source corpora (both training and test) that belong to the same group. We swap them and use the modified source training corpora to realign and to build the final translation system. We have evaluated our reordering approach in terms of both alignment and translation quality. In addition, we have used two state-of-the-art SMT systems: a phrase-based system and an N-gram-based system. Experiments are reported on the EuroParl task, showing improvements of almost 1 point in the standard MT evaluation metrics (mWER and BLEU).
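The first step, identifying pairs of consecutive source blocks whose translation is swapped, can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the alignment is assumed to be a set of (source index, target index) links, a pair of blocks counts as swapped when every target word linked to the second block precedes every target word linked to the first, and the function names, the `max_block_len` limit and the English-Spanish example are all hypothetical. The grouping and generalization steps are not covered.

```python
from typing import List, Optional, Set, Tuple

Alignment = Set[Tuple[int, int]]   # (source index, target index) links
Block = Tuple[int, int]            # inclusive (start, end) source indices


def target_span(alignment: Alignment, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Return (min, max) target index linked to source words start..end, or None."""
    linked = [t for s, t in alignment if start <= s <= end]
    return (min(linked), max(linked)) if linked else None


def swapped_block_pairs(alignment: Alignment, src_len: int,
                        max_block_len: int = 3) -> List[Tuple[Block, Block]]:
    """Find consecutive source blocks (A, B) where B's translation precedes A's.

    Assumed criterion for this sketch: all target words linked to B come
    before all target words linked to A.
    """
    pairs = []
    for i in range(src_len):
        for j in range(i, min(i + max_block_len, src_len)):              # A = i..j
            for k in range(j + 1, min(j + 1 + max_block_len, src_len)):  # B = j+1..k
                span_a = target_span(alignment, i, j)
                span_b = target_span(alignment, j + 1, k)
                if span_a and span_b and span_b[1] < span_a[0]:
                    pairs.append(((i, j), (j + 1, k)))
    return pairs


def swap_blocks(words: List[str], block_a: Block, block_b: Block) -> List[str]:
    """Return a copy of the source sentence with the two adjacent blocks swapped."""
    (i, j), (_, k) = block_a, block_b
    return words[:i] + words[j + 1:k + 1] + words[i:j + 1] + words[k + 1:]


# Hypothetical example: "the European Parliament" / "el Parlamento Europeo"
source = ["the", "European", "Parliament"]
links = {(0, 0), (1, 2), (2, 1)}
pairs = swapped_block_pairs(links, len(source))
reordered = swap_blocks(source, *pairs[0])
```

In this example the only swapped pair is ("European", "Parliament"), and swapping it yields a source order that is monotonic with respect to the target sentence.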
Notes
TC-STAR (Technology and Corpora for Speech to Speech Translation) is a European Community project funded by the Sixth Framework Programme.
Acknowledgments
This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and the BUCEADOR project (TEC2009-14094-C04-01). The authors also want to thank the anonymous reviewers of this paper for their valuable comments. Finally, the authors want to thank Barcelona Media Innovation Center, Universitat Politècnica de Catalunya and TALP Research Center for their support and permission to publish this research.
Cite this article
Costa-jussà, M.R., Fonollosa, J.A.R. & Monte, E. Recursive alignment block classification technique for word reordering in statistical machine translation. Lang Resources & Evaluation 45, 165–179 (2011). https://doi.org/10.1007/s10579-010-9133-9
DOI: https://doi.org/10.1007/s10579-010-9133-9