Experimenting with Different Machine Translation Models in Medium-Resource Settings

Jónsson, Haukur Páll; Símonarson, Haukur Barri; Snæbjarnarson, Vésteinn; Steingrímsson, Steinþór; Loftsson, Hrafn

doi:10.1007/978-3-030-58323-1_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12284)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora containing millions of sentence pairs. For Icelandic, the parallel corpus ParIce consists of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain tens or hundreds of thousands of sentence pairs, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present ongoing experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, when an aggressive filtering approach is used, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.
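
For readers unfamiliar with the BLEU score mentioned in the abstract, the following is a minimal sketch of the standard corpus-level definition: the geometric mean of clipped n-gram precisions for n = 1 to 4, multiplied by a brevity penalty. It is not the evaluation setup used in the paper. The function names `ngrams` and `corpus_bleu` are hypothetical, and the sketch assumes pre-tokenized input, a single reference per hypothesis, and no smoothing; published scores are normally computed with established tooling instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, no smoothing.

    Illustrative only: tokenization is assumed to have been done already.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection clips each count at the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        # Without smoothing, BLEU is zero if any order has no match.
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref = "the cat sat on a mat".split()
    print(round(corpus_bleu([hyp], [ref]), 4))
```

Scores are accumulated over the whole test set before the precisions are combined, which is why BLEU is usually reported at corpus level and not as an average of per-sentence scores.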

H. P. Jónsson, H. B. Símonarson, V. Snæbjarnarson, and S. Steingrímsson contributed equally.

Acknowledgments

This project was funded by the Language Technology Programme for Icelandic 2019–2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

Author information

Authors and Affiliations

Language and Voice Lab, Reykjavik University, Reykjavik, Iceland
Haukur Páll Jónsson, Steinþór Steingrímsson & Hrafn Loftsson
Miðeind ehf., Reykjavik, Iceland
Haukur Barri Símonarson & Vésteinn Snæbjarnarson

Corresponding author

Correspondence to Hrafn Loftsson.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák

Cite this paper

Jónsson, H.P., Símonarson, H.B., Snæbjarnarson, V., Steingrímsson, S., Loftsson, H. (2020). Experimenting with Different Machine Translation Models in Medium-Resource Settings. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_10

DOI: https://doi.org/10.1007/978-3-030-58323-1_10
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)
