Abstract
In most statistical machine translation (SMT) systems, bilingual segments are extracted via word alignment. However, a systematic study is needed of which alignment characteristics benefit MT under specific experimental settings, such as the type of MT system, the language pair, or the type or size of the corpus. In this paper we perform, in each of these experimental settings, a statistical analysis of the data and study the sample correlation coefficients between a number of alignment or phrase table characteristics and variables such as the phrase table size, the number of untranslated words or the BLEU score. We report results for two different SMT systems (a phrase-based and an n-gram-based system) on Chinese-to-English FBIS and BTEC data, and Spanish-to-English European Parliament data. We find that the alignment characteristics which help in translation depend greatly on the MT system and on the corpus size. We give alignment hints to improve BLEU score, depending on the SMT system used and the type of corpus. For example, for phrase-based SMT, dense alignments are required with larger corpora, especially on the target side, while with smaller corpora, more precise, sparser alignments are better, especially on the source side. Avoiding some long-distance crossing links may also improve BLEU score with small corpora. We take these conclusions into account to modify two types of alignment systems, and obtain 1 to 1.6% relative improvements in BLEU score on two held-out corpora, although the system that improves differs for each corpus.
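As a rough illustration of the kind of statistic the abstract refers to, the sketch below computes a Pearson sample correlation coefficient between one alignment characteristic and BLEU. It is only a minimal sketch: the function name `sample_correlation`, the variable names, and all numeric values are hypothetical and chosen for illustration; they are not taken from the paper, and the paper's own analysis may differ in its details.

```python
from math import sqrt


def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations; the 1/(n-1)
    # factors cancel in the ratio, so they are omitted.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values (NOT results from the paper): one alignment
# characteristic (links per word) and the BLEU score obtained with each
# of five imaginary alignment configurations.
links_per_word = [0.8, 0.9, 1.0, 1.1, 1.2]
bleu = [20.1, 20.4, 20.9, 21.0, 21.3]

r = sample_correlation(links_per_word, bleu)
print(f"sample correlation: {r:.3f}")
```

A coefficient near +1 or -1 would suggest a strong linear association between the characteristic and BLEU in that setting, while a value near 0 would suggest little linear relationship. With only a handful of configurations, such a coefficient should be read with caution.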
Additional information
P. Lambert, Y. Ma and A. Way: work partially done while at CNGL, Dublin City University, Ireland.
About this article
Cite this article
Lambert, P., Petitrenaud, S., Ma, Y. et al. What types of word alignment improve statistical machine translation? Machine Translation 26, 289–323 (2012). https://doi.org/10.1007/s10590-012-9123-3
DOI: https://doi.org/10.1007/s10590-012-9123-3