Recursive alignment block classification technique for word reordering in statistical machine translation

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

Statistical machine translation (SMT) is based on alignment models that learn word correspondences between the source and target languages from bilingual corpora. These models are assumed to be capable of learning reorderings, yet differences in word order between two languages remain one of the most important sources of errors in SMT. In this paper, we show that SMT can take advantage of inductive learning to solve reordering problems. Given a word alignment, we identify those pairs of consecutive source blocks (sequences of words) whose translation is swapped, i.e. those blocks which, if swapped, generate a correct monotonic translation. We then classify these pairs into groups, recursively applying a block co-occurrence criterion, in order to infer reorderings. Within each group, we allow new internal combinations in order to generalize the reordering to unseen pairs of blocks. Next, we identify the pairs of blocks in the source corpora (both training and test) that belong to the same group. We swap them and use the modified source training corpora to realign and to build the final translation system. We have evaluated our reordering approach in terms of both alignment and translation quality, using two state-of-the-art SMT systems: a phrase-based one and an N-gram-based one. Experiments on the EuroParl task show improvements of about 1 point in the standard MT evaluation metrics (mWER and BLEU).
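The first step described in the abstract, finding pairs of consecutive source blocks whose translations appear in swapped order, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`target_span`, `swapped_block_pairs`, `swap_blocks`), the `max_block_len` limit, and the representation of the alignment as a set of (source position, target position) links are all assumptions made for illustration. The recursive co-occurrence classification into groups is not covered here.

```python
from typing import List, Optional, Set, Tuple

Link = Tuple[int, int]                          # (source position, target position)
Block = Tuple[int, int]                         # half-open source range [start, end)


def target_span(links: Set[Link], start: int, end: int) -> Optional[Tuple[int, int]]:
    """Return (min, max) target positions linked to source[start:end], or None."""
    targets = [t for s, t in links if start <= s < end]
    return (min(targets), max(targets)) if targets else None


def swapped_block_pairs(links: Set[Link], src_len: int,
                        max_block_len: int = 3) -> List[Tuple[Block, Block]]:
    """List adjacent source blocks whose target spans occur in inverted order.

    A pair qualifies when every target word linked to the right block
    precedes every target word linked to the left block (assumed criterion).
    """
    pairs = []
    for i in range(src_len):
        # Left block is source[i:j]; leave at least one word for the right block.
        for j in range(i + 1, min(i + max_block_len, src_len - 1) + 1):
            # Right block is source[j:k].
            for k in range(j + 1, min(j + max_block_len, src_len) + 1):
                left = target_span(links, i, j)
                right = target_span(links, j, k)
                if left and right and right[1] < left[0]:
                    pairs.append(((i, j), (j, k)))
    return pairs


def swap_blocks(words: List[str], pair: Tuple[Block, Block]) -> List[str]:
    """Return a copy of the source sentence with the two adjacent blocks swapped."""
    (i, j), (_, k) = pair
    return words[:i] + words[j:k] + words[i:j] + words[k:]


# Hypothetical example: Spanish source, English target "an ambitious programme".
source = ["un", "programa", "ambicioso"]
links = {(0, 0), (1, 2), (2, 1)}
pairs = swapped_block_pairs(links, len(source))
print(pairs)                                    # [((1, 2), (2, 3))]
print(swap_blocks(source, pairs[0]))            # ['un', 'ambicioso', 'programa']
```

After the swap, the source word order matches the target word order, so the alignment between the modified source and the target becomes monotonic, which is the property the abstract relies on before realigning.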

Notes

  1. http://www.europarl.eu.int/.

  2. http://www.tc-star.org/.

  3. TC-STAR (Technology and Corpora for Speech to Speech Translation) is a European Community project funded by the Sixth Framework Programme.

Acknowledgments

This work has been partially funded by the Spanish Department of Science and Innovation through the Juan de la Cierva fellowship program and the BUCEADOR project (TEC2009-14094-C04-01). The authors also want to thank the anonymous reviewers of this paper for their valuable comments. Finally, the authors want to thank Barcelona Media Innovation Center, Universitat Politècnica de Catalunya and TALP Research Center for their support and permission to publish this research.

Author information

Corresponding author

Correspondence to Marta R. Costa-jussà.

About this article

Cite this article

Costa-jussà, M.R., Fonollosa, J.A.R. & Monte, E. Recursive alignment block classification technique for word reordering in statistical machine translation. Lang Resources & Evaluation 45, 165–179 (2011). https://doi.org/10.1007/s10579-010-9133-9

  • DOI: https://doi.org/10.1007/s10579-010-9133-9
