Abstract
Parallel feature weight decay algorithms (parfwd) are engineered for language- and task-adaptive instance selection to build distinct machine translation (MT) models and enable the fast development of accurate MT using less data and computation. The parfwd algorithms decay the weights of both source and target features to increase their average coverage. At the conference on machine translation (WMT), parfwd achieved the lowest translation error rate from French to English in 2015, and in 2017 a rate \(11.7\%\) lower than the top phrase-based statistical MT (PBSMT) system. parfwd also achieved a rate \(5.8\%\) lower than the top system in TweetMT, and the top result from Catalan to English. BLEU upper bounds identify the translation directions that offer the largest room for relative improvement, as well as the MT models that use additional data. Performance trend angles quantify the power of MT models to convert unit data into unit translation results, i.e. more BLEU for a given increase in coverage. The source coverage angle of parfwd over the 2013–2019 WMT reached \(35^{\circ}\) for translation into English, \(+6^{\circ}\) better than the top, and \(22^{\circ}\) overall, \(+1.4^{\circ}\) better than the top.
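The feature-decay idea behind the instance selection can be sketched as follows. This is a minimal illustrative sketch, not the published parfwd implementation: the helper names, the unigram-plus-bigram feature set, the initial weight of 1.0, and the decay rate of 0.5 are all assumptions made for the example.

```python
def features(sentence, n=2):
    """Unigram and bigram features of a sentence (illustrative choice)."""
    toks = sentence.split()
    feats = set(toks)
    feats.update(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return feats

def fda_select(pool, test_set, k, decay=0.5):
    """Greedy feature-decay selection: score each candidate by the summed
    weights of the test-set features it covers, pick the best, then decay
    the weights of the features it covered so later picks favor new coverage."""
    weights = {f: 1.0 for sent in test_set for f in features(sent)}
    selected, remaining = [], list(pool)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda s: sum(weights.get(f, 0.0) for f in features(s)))
        selected.append(best)
        remaining.remove(best)
        for f in features(best):
            if f in weights:
                weights[f] *= decay
    return selected
```

After the first pick, the weights of its covered features are halved, so a second sentence covering the same features scores lower than one bringing new coverage, which is what drives the increase in average coverage described above.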
Notes
Building the neural LMs alone took 1000 times longer in WMT19 (Biçici 2019).
We obtain this figure by comparing the desktop we used in WMT19 against the 128 GPU machine used by Ott et al. (2018) in GFLOPs.
Exercise 6.5-9 in Cormen et al. (2009): merging k sorted lists using a min-heap for k-way merging.
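The k-way merge referenced here can be sketched with a straightforward min-heap of size k (this is a generic solution to the exercise, not code from the book or from parfwd):

```python
import heapq

def kway_merge(lists):
    """Merge k sorted lists into one sorted list in O(n log k) time
    by keeping one element per list in a min-heap of size k."""
    # Heap entries are (value, list index, position); the indices
    # also break ties so values never need to be compared further.
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    merged = []
    while heap:
        val, i, j = heapq.heappop(heap)
        merged.append(val)
        if j + 1 < len(lists[i]):  # refill from the list we just consumed
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))
    return merged
```

Each of the n elements passes through the heap once, and every heap operation costs O(log k), giving the O(n log k) bound the exercise asks for.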
The code for the optimizer is available (Sect. 3).
For the Moses operation sequence model (Durrani et al. 2011), a model over a sequence of 7 contextual operations (4 for translation and 3 for reordering) together with a 4-gram LM is built.
Compared to the size in the previous year.
With the new parfwd model, the average distance to the top PBSMT system improved to 2.252 BLEU points compared with the WMT16 results; the gap also narrowed because PBSMT systems became less common and BPE-NMT models more abundant at WMT17. The average distance to the top BPE-NMT result increased from 5.86 to 6.93 BLEU points in WMT17.
From http://matrix.statmt.org, which also includes results that use neural networks for the LM (Ding et al. 2016). BLEU is computed using mteval-v13a.pl from the Moses toolkit with the cased option -c. We compare parfwd results with the top PBSMT results until WMT18, except for de-en in WMT17. The average BLEU score in WMT17 including all NMT results is 0.2496, which also did not improve except for WMT16.
German is a morphologically rich language with compounding and some configurability (Fraser et al. 2013), such that some syntactic information can be obtained from a correct constituent parse.
The wc command in unix counts each Russian character twice, since Cyrillic characters occupy two bytes in UTF-8 and byte-based counting (wc -c) sees each letter as two units.
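The double counting can be checked directly, since it is purely a matter of UTF-8 encoding width (the sample word below is illustrative):

```python
# Six Cyrillic characters: 6 code points, but 12 bytes in UTF-8,
# because each Cyrillic letter is encoded as two bytes.
text = "привет"
char_count = len(text)                  # character count, as with `wc -m`
byte_count = len(text.encode("utf-8"))  # byte count, as with `wc -c`
assert char_count == 6
assert byte_count == 12
```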
References
Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: Conf. on empirical methods in NLP, Edinburgh, Scotland, UK, EMNLP ’11, pp 355–362
Beinborn L, Zesch T, Gurevych I (2013) Cognate production using character-based machine translation. In: sixth int. joint conf. on NLP, Nagoya, Japan, pp 883–891
Biçici E (2013) Feature decay algorithms for fast deployment of accurate statistical machine translation systems. In: Eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 78–84
Biçici E (2015) Domain adaptation for machine translation with instance selection. Prague Bull Math Linguist 103:5–20. https://doi.org/10.1515/pralin-2015-0001
Biçici E (2019) Machine translation with parfda, Moses, kenlm, nplm, and PRO. In: Proc. of the fourth conf. on machine translation (WMT19), Florence, Italy, pp 122–128, https://doi.org/10.18653/v1/W19-5306
Biçici E, Yuret D (2015) Optimizing instance selection for statistical machine translation with feature decay algorithms. IEEE/ACM Trans Audio Speech Lang Process 23:339–350. https://doi.org/10.1109/TASLP.2014.2381882
Biçici E, Groves D, van Genabith J (2013) Predicting sentence translation quality using extrinsic and language independent features. Mach Transl 27(3–4):171–192. https://doi.org/10.1007/s10590-013-9138-4
Bojar O, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Monz C, Müller M, Post M (2019) Findings of the 2019 conf. on machine translation (WMT19). In: Proc. of the fourth conf. on machine translation, Association for Comp. Ling., Florence, Italy, pp 1–61
Buck C, Heafield K, van Ooyen B (2014) N-gram counts and language models from the common crawl. In: Language resources and evaluation conf., Reykjavík, Iceland, pp 3579–3584
Chen B, Kuhn R, Foster G, Cherry C, Huang F (2016) Bilingual methods for adaptive training data selection for machine translation. In: Proc. AMTA, pp 93–103
Chung J, Cho K, Bengio Y (2016) NYU-MILA neural machine translation systems for WMT’16. In: First conf. on machine translation, Berlin, Germany, pp 268–271
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. The MIT Press, Cambridge, MA
Costa-jussà MR, Fonollosa JAR (2016) Character-based neural machine translation. In: 54th annual meeting of the assoc. for comp. ling., Berlin, Germany, pp 357–361
Ding S, Duh K, Khayrallah H, Koehn P, Post M (2016) The JHU machine translation systems for WMT 2016. In: First conf. on machine translation, Berlin, Germany, pp 272–280
Durrani N, Schmid H, Fraser A (2011) A joint sequence translation model with integrated reordering. In: 49th annual meeting of the assoc. for comp. ling., Portland, Oregon, USA, pp 1045–1054
Fraser A, Schmid H, Farkas R, Wang R, Schütze H (2013) Knowledge sources for constituent parsing of German, a morphologically rich and less-configurational language. Comput Linguist 39(1):57–85
Gao Q, Vogel S (2008) Software engineering, testing, and quality assurance for natural language processing, association for comp. ling., Columbus, Ohio, chap Parallel Implementations of Word Alignment Tool, pp 49–57
Hakkani-Tür DZ, Oflazer K, Tür G (2002) Statistical morphological disambiguation for agglutinative languages. Comput Hum 36(4):381–410
Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: 51st annual meeting of the assoc. for comp. ling., Sofia, Bulgaria, pp 690–696
Huck M, Riess S, Fraser A (2017) Target-side word segmentation strategies for neural machine translation. In: Bojar O, Buck C, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Yepes AJ, Koehn P, Kreutzer J (eds) Second conf. on machine translation, Copenhagen, Denmark, pp 56–67
Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: Conf. on empirical methods in NLP (EMNLP), Doha, Qatar, pp 131–141
Koehn P (2010) An experimental management system. Prague Bull Math Ling 94:87–96
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: 45th annual meeting of the assoc. for Comp. Ling., pp 177–180
Landauer TK (2002) On the computational basis of learning and cognition: arguments from LSA. Psychol Learn Motiv 41:43–84
Li X, Zhang J, Zong C (2018) One sentence one model for neural machine translation. In: Proc. of the eleventh intl. conf. on language resources and evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, pp 910–917
Lin Y, Michel JB, Aiden Lieberman E, Orwant J, Brockman W, Petrov S (2012) Syntactic annotations for the google books ngram corpus. In: ACL 2012 system demonstrations, Jeju Island, Korea, pp 169–174
Liu L, Hong Y, Liu H, Wang X, Yao J (2014) Effective selection of translation model training data. In: 52nd annual meeting of the assoc. for comp. ling., Baltimore, Maryland, pp 569–573
Moore RC, Lewis W (2010) Intelligent selection of language model training data. In: 2010 conf. short papers, Assoc. for comp. ling. (ACL), Uppsala, Sweden, pp 220–224
Nakov P, Tiedemann J (2012) Combining word-level and character-level models for machine translation between closely-related languages. In: 50th annual meeting of the assoc. for comp. ling., Jeju Island, Korea, pp 301–305
Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comp Ling 29(1):19–51
Östling R, Scherrer Y, Tiedemann J, Tang G, Nieminen T (2017) The Helsinki neural machine translation system. In: Bojar O, Buck C, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Yepes AJ, Koehn P, Kreutzer J (eds) Second conf. on machine translation, Copenhagen, Denmark, pp 338–347
Ott M, Edunov S, Grangier D, Auli M (2018) Scaling neural machine translation. In: Proc. of the third conf. on machine translation, Volume 1: Research Papers, Belgium, Brussels, pp 1–9
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th annual meeting of the assoc. for comp. ling., Philadelphia, PA, USA, pp 311–318
Parcheta Z, Sanchis-Trilles G, Casacuberta F (2018) Data selection for NMT using infrequent n-gram recovery. In: Pérez-Ortiz JA et al. (eds) Proc. of the 21st annual conf. of the European association for machine translation, Alacant, Spain, pp 219–227
Peris Á, Chinea-Ríos M, Casacuberta F (2017) Neural networks classifier for data selection in statistical machine translation. Prague Bull Math Ling 108(1):283–294
Poncelas A, Way A, Toral A (2017) Extending feature decay algorithms using alignment entropy. In: Quesada JF, Martín Mateos FJ, López Soto T (eds) Future and emerging trends in language technology. Machine learning and big data, Seville, Spain, pp 170–182
Poncelas A, de Buy Wenniger GM, Way A (2018a) Data selection with feature decay algorithms using an approximated target side. In: 15th Intl. workshop on spoken language translation (IWSLT 2018), Bruges, Belgium, pp 173–180
Poncelas A, de Buy Wenniger GM, Way A (2018b) Feature decay algorithms for neural machine translation. In: 21st conf. of the European assoc. for machine translation (EAMT), Universitat d'Alacant, Alacant, Spain, pp 239–248
Sennrich R (2016) Neural machine translation lecture at 11th machine translation marathon
Sennrich R, Haddow B, Birch A (2016) Edinburgh neural machine translation systems for WMT 16. In: First conf. on machine translation, Berlin, Germany, pp 371–376, https://doi.org/10.18653/v1/W16-2323
Sennrich R, Birch A, Currey A, Germann U, Haddow B, Heafield K, Miceli Barone AV, Williams P (2017) The University of Edinburgh’s Neural MT Systems for WMT17. In: Proc. of the second conf. on machine translation, Copenhagen, Denmark, pp 389–399
Stolcke A (2002) SRILM—an extensible language modeling toolkit. In: Proc. intl. conf. on spoken language processing, Denver, Colorado, USA, pp 901–904
Tiedemann J, Nakov P (2013) Analyzing the use of character-level translation with sparse and noisy datasets. In: Int. conf. on recent adv. in NLP RANLP, Hissar, Bulgaria, pp 676–684
Toldova S, Lyashevskaya O, Bonch-Osmolovskaya A, Ionov M (2015) Evaluation for morphologically rich language: Russian NLP. In: Int. conf. on artificial intelligence ICAI'2015, Las Vegas, NV, USA, vol 2, pp 300–306
Toral A, Wu X, Pirinen T, Qiu Z, Bicici E, Du J (2015) Dublin City University at the TweetMT 2015 shared task. In: La Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), Alicante, Spain, pp 33–39
Vaswani A, Zhao Y, Fossum V, Chiang D (2013) Decoding with large-scale neural language models improves translation. In: Proc. of the 2013 conf. on empirical methods in NLP, Association for Comp. Ling., Seattle, Washington, USA, pp 1387–1392
Wang L, Wong D, Chao L, Lu Y, Xing J (2014) A systematic comparison of data selection criteria for SMT domain adaptation. Sci World J. https://doi.org/10.1155/2014/745485
Wang R, Utiyama M, Finch A, Liu L, Chen K, Sumita E (2018) Sentence selection and weighting for neural machine translation domain adaptation. IEEE/ACM Trans Audio Speech Lang Process 26(10):1727–1741. https://doi.org/10.1109/TASLP.2018.2837223
Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: 33rd annual meeting of the assoc. for comp. ling., Cambridge, MA, USA, pp 189–196
Acknowledgements
The research received financial support from TÜBİTAK 2232 project 118C008.
Cite this article
Biçici, E. Parallel feature weight decay algorithms for fast development of machine translation models. Machine Translation 35, 239–263 (2021). https://doi.org/10.1007/s10590-021-09275-z