
Parallel feature weight decay algorithms for fast development of machine translation models

Machine Translation

Abstract

Parallel feature weight decay algorithms, parfwd, are engineered for language- and task-adaptive instance selection to build distinct machine translation (MT) models and enable the fast development of accurate MT using less data and computation. The parfwd algorithms decay the weights of both source and target features to increase their average coverage. At the conference on machine translation (WMT), parfwd achieved the lowest translation error rate from French to English in 2015, and a rate \(11.7\%\) lower than the top phrase-based statistical MT (PBSMT) system in 2017. parfwd also achieved a rate \(5.8\%\) lower than the top system in TweetMT, and the top result from Catalan to English. BLEU upper bounds identify the translation directions that offer the largest room for relative improvement and the MT models that use additional data. Performance trend angles measure the power of MT models to convert data into translation results, i.e. the BLEU gained per unit increase in coverage. Over WMT 2013–2019, the source coverage angle of parfwd reached \(35^{\circ }\) for translation into English, \(+6^{\circ }\) better than the top system, and \(22^{\circ }\) overall, \(+1.4^{\circ }\) better than the top.
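
To make the instance-selection idea concrete, the following is a minimal single-process sketch in the spirit of feature decay algorithms (Biçici and Yuret 2015): features of the test set start with unit weight, and a feature's weight is decayed each time a selected training sentence covers it, so later selections favor features not yet covered. The n-gram order, the geometric decay schedule, and all names here are illustrative assumptions, not the parfwd implementation, which also decays target-side features and parallelizes the selection.

```python
from collections import Counter

def ngrams(tokens, max_n=2):
    """All n-grams of length 1..max_n from a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def fda_select(test_side, candidates, k, decay=0.5):
    """Greedily pick k training sentences that cover the test-set
    features, decaying a feature's weight each time it is covered
    so that later picks favor features not yet covered."""
    weights = {f: 1.0 for sent in test_side for f in ngrams(sent)}
    feats = [set(ngrams(sent)) & weights.keys() for sent in candidates]
    covered = Counter()                   # times each feature was covered
    pool = set(range(len(candidates)))
    selected = []
    for _ in range(min(k, len(pool))):
        # score a candidate by the current (decayed) weights it covers
        best = max(pool, key=lambda i: sum(weights[f] * decay ** covered[f]
                                           for f in feats[i]))
        pool.remove(best)
        selected.append(best)
        for f in feats[best]:
            covered[f] += 1               # decay the features just covered
    return selected

# toy usage: select 2 instances relevant to the test sentence
test = [["the", "cat", "sat"]]
train = [["the", "dog"], ["the", "cat", "ran"], ["a", "cat", "sat"]]
print(fda_select(test, train, k=2))       # e.g. [1, 2]
```

In this naive form the greedy loop rescans the whole pool at every step; combining per-process selections with heap-based structures such as the k-way merge referenced in note 3 is one way to keep selection fast at scale.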



Notes

  1. Building the neural LMs alone took 1000 times longer in WMT19 (Biçici 2019).

  2. We obtain this figure by comparing, in GFLOPs, the desktop we used in WMT19 against the 128-GPU machine used by Ott et al. (2018).

  3. Exercise 6.5-9 in Cormen et al. (2009): merging k sorted lists using a min-heap for k-way merging; a sketch follows these notes.

  4. The code for the optimizer is available (Sect. 3).

  5. For the Moses operation sequence model (Durrani et al. 2011), a sequence of 7 contextual operations (4 on translation and 3 on reordering) and a 4-gram LM are built.

  6. Compared to the size in the previous year.

  7. With the new parfwd model, the average distance to the top PBSMT system improved to 2.252 BLEU points relative to the WMT16 results; the improvement is also partly due to PBSMT systems becoming less common and BPE-NMT models more abundant at WMT17. The average distance to the top BPE-NMT result increased from 5.86 to 6.93 BLEU points in WMT17.

  8. From http://matrix.statmt.org, which also includes results that use neural networks for the LM (Ding et al. 2016). BLEU is computed using mteval-v13a.pl from the Moses toolkit with the cased option -c; a minimal sketch of the BLEU computation follows these notes. We compare parfwd results with the top PBSMT results until WMT18, except for de-en in WMT17. The average BLEU score in WMT17 including all NMT results is 0.2496, which also did not improve, except at WMT16.

  9. German is a morphologically rich language with compounding and some configurability (Fraser et al. 2013), such that some syntactic information can be obtained from a correct constituent parse.

  10. The unix wc command counts each Russian character twice because it counts bytes and Cyrillic characters take two bytes in UTF-8; see the illustration after these notes.
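
As a reference for note 3, here is a textbook min-heap k-way merge (exercise 6.5-9 in Cormen et al. 2009) in Python; this is a standard sketch rather than code from the paper, and Python's built-in heapq.merge implements the same idea.

```python
import heapq

def kway_merge(sorted_lists):
    """Merge k sorted lists in O(n log k) time with a min-heap
    that holds one head element per list (CLRS exercise 6.5-9)."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, i, j = heapq.heappop(heap)    # smallest remaining head
        merged.append(value)
        if j + 1 < len(sorted_lists[i]):     # advance within list i
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return merged

print(kway_merge([[1, 4, 9], [2, 3], [5, 8]]))  # [1, 2, 3, 4, 5, 8, 9]
```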
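
For note 8, this minimal sketch shows how BLEU (Papineni et al. 2002) combines clipped n-gram precisions with a brevity penalty. The official mteval-v13a.pl additionally handles tokenization, corpus-level aggregation, and multiple references, so this single-sentence, single-reference version is illustrative only.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty, for one candidate/reference pair of token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0                            # geometric mean vanishes
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

print(round(bleu("the cat sat on the mat".split(),
                 "the cat sat on a mat".split()), 3))  # 0.537
```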
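
For note 10, a quick illustration of why byte-oriented counts double Cyrillic text: every Cyrillic character occupies two bytes in UTF-8, and wc reports bytes by default.

```python
word = "слово"                       # 5 Cyrillic characters ("word")
print(len(word))                     # 5 characters
print(len(word.encode("utf-8")))     # 10 bytes, which is what wc counts
```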

References

  • Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Conf. on empirical methods in NLP, Edinburgh, Scotland, UK, EMNLP ’11, pp 355–362

  • Beinborn L, Zesch T, Gurevych I (2013) Cognate production using character-based machine translation. In: Sixth int. joint conf. on NLP, Nagoya, Japan, pp 883–891

  • Biçici E (2013) Feature decay algorithms for fast deployment of accurate statistical machine translation systems. In: Eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 78–84

  • Biçici E (2015) Domain adaptation for machine translation with instance selection. Prague Bull Math Linguist 103:5–20. https://doi.org/10.1515/pralin-2015-0001


  • Biçici E (2019) Machine translation with parfda, Moses, kenlm, nplm, and PRO. In: Proc. of the fourth conf. on machine translation (WMT19), Florence, Italy, pp 122–128, https://doi.org/10.18653/v1/W19-5306

  • Biçici E, Yuret D (2015) Optimizing instance selection for statistical machine translation with feature decay algorithms. IEEE/ACM Trans Audio Speech Lang Process 23:339–350. https://doi.org/10.1109/TASLP.2014.2381882


  • Biçici E, Groves D, van Genabith J (2013) Predicting sentence translation quality using extrinsic and language independent features. Mach Transl 27(3–4):171–192. https://doi.org/10.1007/s10590-013-9138-4


  • Bojar O, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Monz C, Müller M, Post M (2019) Findings of the 2019 conf. on machine translation (WMT19). In: Proc. of the fourth conf. on machine translation, Association for Comp. Ling., Florence, Italy, pp 1–61

  • Buck C, Heafield K, van Ooyen B (2014) N-gram counts and language models from the common crawl. In: Language resources and evaluation conf., Reykjavík, Iceland, pp 3579–3584

  • Chen B, Kuhn R, Foster G, Cherry C, Huang F (2016) Bilingual methods for adaptive training data selection for machine translation. In: Proc. AMTA, pp 93–103

  • Chung J, Cho K, Bengio Y (2016) NYU-MILA neural machine translation systems for WMT’16. In: First conf. on machine translation, Berlin, Germany, pp 268–271

  • Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge, MA


  • Costa-jussà RM, Fonollosa RJA (2016) Character-based neural machine translation. In: 54th annual meeting of the assoc. for comp. ling., Berlin, Germany, pp 357–361

  • Ding S, Duh K, Khayrallah H, Koehn P, Post M (2016) The JHU machine translation systems for WMT 2016. In: First conf. on machine translation, Berlin, Germany, pp 272–280

  • Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: 49th annual meeting of the assoc. for comp. ling., Portland, Oregon, USA, pp 1045–1054

  • Fraser A, Schmid H, Farkas R, Wang R, Schütze H (2013) Knowledge sources for constituent parsing of German, a morphologically rich and less-configurational language. Comput Linguist 39(1):57–85


  • Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Software engineering, testing, and quality assurance for natural language processing, Assoc. for Comp. Ling., Columbus, Ohio, pp 49–57

  • Hakkani-Tür DZ, Oflazer K, Tür G (2002) Statistical morphological disambiguation for agglutinative languages. Comput Hum 36(4):381–410


  • Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: 51st annual meeting of the assoc. for comp. ling., Sofia, Bulgaria, pp 690–696

  • Huck M, Riess S, Fraser A (2017) Target-side word segmentation strategies for neural machine translation. In: Bojar O, Buck C, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Yepes AJ, Koehn P, Kreutzer J (eds) Second conf. on machine translation, Copenhagen, Denmark, pp 56–67

  • Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: Conf. on empirical methods in NLP (EMNLP), Doha, Qatar, pp 131–141

  • Koehn P (2010) An experimental management system. Prague Bull Math Ling 94:87–96


  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: 45th annual meeting of the assoc. for Comp. Ling., pp 177–180

  • Landauer TK (2002) On the computational basis of learning and cognition: arguments from LSA. Psychol Learn Motiv 41:43–84


  • Li X, Zhang J, Zong C (2018) One sentence one model for neural machine translation. In: Proc. of the eleventh intl. conf. on language resources and evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, pp 910–917

  • Lin Y, Michel JB, Aiden Lieberman E, Orwant J, Brockman W, Petrov S (2012) Syntactic annotations for the google books ngram corpus. In: ACL 2012 system demonstrations, Jeju Island, Korea, pp 169–174

  • Liu L, Hong Y, Liu H, Wang X, Yao J (2014) Effective selection of translation model training data. In: 52nd annual meeting of the assoc. for comp. ling., Baltimore, Maryland, pp 569–573

  • Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: 2010 conf. short papers, Assoc. for comp. ling. (ACL), Uppsala, Sweden, pp 220–224

  • Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: 50th annual meeting of the assoc. for comp. ling., Jeju Island, Korea, pp 301–305

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comp Ling 29(1):19–51


  • Östling R, Scherrer Y, Tiedemann J, Tang G, Nieminen T (2017) The Helsinki neural machine translation system. In: Bojar O, Buck C, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Yepes AJ, Koehn P, Kreutzer J (eds) Second conf. on machine translation, Copenhagen, Denmark, pp 338–347

  • Ott M, Edunov S, Grangier D, Auli M (2018) Scaling neural machine translation. In: Proc. of the third conf. on machine translation, Volume 1: Research Papers, Belgium, Brussels, pp 1–9

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th annual meeting of the assoc. for comp. ling., Philadelphia, PA, USA, pp 311–318

  • Parcheta Z, Sanchis-Trilles G, Casacuberta F (2018) Data selection for NMT using infrequent n-gram recovery. In: Pérez-Ortiz JA et al. (eds) Proc. of the 21st annual conf. of the European association for machine translation, Alacant, Spain, pp 219–227

  • Peris Á, Chinea-Ríos M, Casacuberta F (2017) Neural networks classifier for data selection in statistical machine translation. Prague Bull Math Ling 108(1):283–294


  • Poncelas A, Way A, Toral A (2017) Extending feature decay algorithms using alignment entropy. In: Quesada JF, Martín Mateos FJ, López Soto T (eds) Future and emerging trends in language technology. Machine learning and big data, Seville, Spain, pp 170–182

  • Poncelas A, de Buy Wenniger GM, Way A (2018a) Data selection with feature decay algorithms using an approximated target side. In: 15th Intl. workshop on spoken language translation (IWSLT 2018), Bruges, Belgium, pp 173–180

  • Poncelas A, de Buy Wenniger GM, Way A (2018b) Feature decay algorithms for neural machine translation. In: 21st conf. of the European assoc. for machine translation (EAMT), Universitat d’Alacant, Alacant, Spain, pp 239–248

  • Sennrich R (2016) Neural machine translation lecture at 11th machine translation marathon

  • Sennrich R, Haddow B, Birch A (2016) Edinburgh neural machine translation systems for WMT 16. In: First conf. on machine translation, Berlin, Germany, pp 371–376, https://doi.org/10.18653/v1/W16-2323

  • Sennrich R, Birch A, Currey A, Germann U, Haddow B, Heafield K, Miceli Barone AV, Williams P (2017) The University of Edinburgh’s Neural MT Systems for WMT17. In: Proc. of the second conf. on machine translation, Copenhagen, Denmark, pp 389–399

  • Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proc. intl. conf. on spoken language processing, Denver, Colorado, USA, pp 901–904

  • Tiedemann J, Nakov P (2013) Analyzing the use of character-level translation with sparse and noisy datasets. In: Int. conf. on recent adv. in NLP RANLP, Hissar, Bulgaria, pp 676–684

  • Toldova S, Lyashevskaya O, Bonch-Osmolovskaya A, Ionov M (2015) Evaluation for morphologically rich language: Russian NLP. In: Int. conf. on artificial intelligence ICAI’2015, Las Vegas, NV, USA, vol 2, pp 300–306

  • Toral A, Wu X, Pirinen T, Qiu Z, Bicici E, Du J (2015) Dublin City University at the TweetMT 2015 shared task. In: La Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), Alicante, Spain, pp 33–39

  • Vaswani A, Zhao Y, Fossum V, Chiang D (2013) Decoding with large-scale neural language models improves translation. In: Proc. of the 2013 conf. on empirical methods in NLP, Association for Comp. Ling., Seattle, Washington, USA, pp 1387–1392

  • Wang L, Wong D, Chao L, Lu Y, Xing J (2014) A systematic comparison of data selection criteria for SMT domain adaptation. Sci World J. https://doi.org/10.1155/2014/745485


  • Wang R, Utiyama M, Finch A, Liu L, Chen K, Sumita E (2018) Sentence selection and weighting for neural machine translation domain adaptation. IEEE/ACM Trans Audio Speech Lang Process 26(10):1727–1741. https://doi.org/10.1109/TASLP.2018.2837223


  • Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd annual meeting of the assoc. for comp. ling., Cambridge, MA, USA, pp 189–196

Download references

Acknowledgements

The research received financial support from TÜBİTAK 2232 project 118C008.

Author information


Corresponding author

Correspondence to Ergun Biçici.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Biçici, E. Parallel feature weight decay algorithms for fast development of machine translation models. Machine Translation 35, 239–263 (2021). https://doi.org/10.1007/s10590-021-09275-z
