Reassessing gApp: Does MWE Discontinuity Always Pose a Challenge to Neural Machine Translation?

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2022)

Abstract

In this paper we present research results with gApp, a text-preprocessing system designed to automatically detect discontinuous multiword expressions (MWEs) and convert them into their continuous forms so as to improve the performance of current neural machine translation (NMT) systems (see Hidalgo-Ternero 2021; Hidalgo-Ternero and Corpas Pastor 2020, 2022a, 2022b and 2022c, among others). To test its effectiveness, an experiment with the NMT systems of Google Translate and DeepL has been carried out in the ES>EN/ZH translation directions for the translation of somatisms, i.e., MWEs containing lexemes referring to human or animal body parts (Mellado Blanco 2004). More specifically, we have analysed “Verb Noun Idiomatic Constructions” (VNICs), such as tocar los cojones, tocar los huevos, tocar las narices, and tocar las pelotas. In this regard, some of the unexpected results yielded by the study of these multiword expressions call into question the widely accepted view that phraseological discontinuity unequivocally entails worse NMT performance.
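To make the idea of converting a discontinuous MWE into its continuous form more concrete, the following is a minimal toy sketch in Python. It is not gApp's implementation: the verb-form pattern, the `NOUN_PHRASES` list, the `max_gap` parameter, the function name `make_continuous`, and the choice to move the intervening words to just after the idiom are all assumptions made purely for illustration.

```python
import re

# Hypothetical mini-lexicon for illustration only (not gApp's resources):
# a handful of inflected forms of "tocar" and the noun phrases of the VNICs
# named in the abstract.
VERB_FORMS = r"toc(?:ar|o|as|a|amos|áis|an|ó|aba|ando)"
NOUN_PHRASES = ["las narices", "los huevos", "las pelotas", "los cojones"]


def make_continuous(sentence: str, max_gap: int = 3) -> str:
    """Move up to `max_gap` words found between verb and noun phrase
    to the position right after the noun phrase (assumed strategy)."""
    np_alt = "|".join(re.escape(np) for np in NOUN_PHRASES)
    pattern = re.compile(
        rf"\b(?P<verb>{VERB_FORMS})\s+"
        rf"(?P<gap>(?:\w+\s+){{1,{max_gap}}}?)"
        rf"(?P<np>{np_alt})\b",
        flags=re.IGNORECASE,
    )

    def relocate(match: re.Match) -> str:
        # Rebuild the span as: verb + noun phrase + former gap material.
        gap = match.group("gap").strip()
        return f"{match.group('verb')} {match.group('np')} {gap}"

    return pattern.sub(relocate, sentence)


if __name__ == "__main__":
    print(make_continuous("Me toca siempre las narices con sus bromas."))
    print(make_continuous("Deja de tocar ya los huevos."))
```

Under these assumptions, a sentence in which the idiom is already continuous is returned unchanged, so the function can be applied to every input sentence before it is sent to an NMT system. A real system would need proper morphological analysis and a much larger MWE lexicon than this regular expression provides.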

Notes

  1. gApp is available through the following link: http://lexytrad.es/gapp/app.php. This application is registered in Safecreative: https://www.safecreative.org/work/2011165898461-gapp.

  2. All the corpora employed in the present study are described in Sect. 3 (“Methodology”).

  3. By user-generated content, we mean ‘content published on an online platform by users. The term social media comprises platforms that contain user-generated content. Users do not need programming skills to publish content on a social media platform.’ (Wyrwoll 2014).

References

  • Koike, K.: Relaciones paradigmáticas y sintagmáticas de las locuciones verbales en español. In: Cuartero Otal, J., Emsel, M. (eds.) Vernetzungen Bedeutung in Wort, Satz und Text. Festschrift für Gerd Wotjak zum 65. Geburtstag, pp. 263–275. Peter Lang, Frankfurt (2007)

  • Bargmann, S., Sailer, M.: The syntactic flexibility of semantically non-decomposable idioms. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-Lingual Perspective, pp. 1–29. Language Science Press (2018)

  • Bentivogli, L., Bisazza, A., Cettolo, M., Federico, M.: Neural versus phrase-based machine translation quality: a case study. arXiv (2018)

  • Colson, J.-P.: Multi-word units in machine translation: why the tip of the iceberg remains problematic – and a tentative corpus-driven solution. In: MUMTT 2019 (2019)

  • Constant, M., et al.: Multiword expression processing: a survey. Comput. Linguist. 43(4), 1–92 (2017)

  • Corpas Pastor, G.: Detección, descripción y contraste de las unidades fraseológicas mediante tecnologías lingüísticas. In: Olza, I., Manero, E. (eds.) Fraseopragmática. Colección Romanistik, pp. 335–373. Frank & Timme (2013)

  • Derczynski, L., Ritter, A., Clark, S., Bontcheva, K.: Twitter part-of-speech tagging for all: overcoming sparse and noisy data. In: Mitkov, R., Angelova, G., Bontcheva, K. (eds.) Proceedings of the International Conference on Recent Advances in Natural Language Processing, pp. 198–206. INCOMA Ltd. (2013)

  • Seco, M., Andrés, O., Ramos, G.: Diccionario fraseológico documentado del español actual, locuciones y modismos españoles, 2ª edición. Aguilar (2017)

  • ELIS – European Language Industry Survey: 2018 Language Industry Survey – Expectations and Concerns of the European Language Industry (2018)

  • ELIS – European Language Industry Survey: 2020 Language Industry Survey – 2020 before & after COVID-19 (2020)

  • ELIS – European Language Industry Survey: 2021 Language Industry Survey (2020)

  • Fazly, A., Cook, P., Stevenson, S.: Unsupervised type and token identification of idiomatic expressions. Comput. Linguist. 35(1), 61–103 (2009)

  • Foufi, V., Nerima, L., Wehrli, E.: Multilingual parsing and MWE detection. In: Parmentier, Y., Waszczuk, J. (eds.) Representation and Parsing of Multiword Expressions: Current Trends, pp. 217–237. Language Science Press (2019)

  • Gui, T., Zhang, Q., Huang, H., Peng, M., Huang, X.: Part-of-speech tagging for Twitter with adversarial neural networks. In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2411–2420. Association for Computational Linguistics (2017)

  • Hidalgo-Ternero, C.M.: Google Translate vs. DeepL: analysing neural machine translation performance under the challenge of phraseological variation. In: Mogorrón Huerta, P. (ed.) Multidisciplinary Analysis of the Phenomenon of Phraseological Variation in Translation and Interpreting. MonTI Special Issue 6, pp. 154–177 (2020)

  • Hidalgo-Ternero, C.M.: El algoritmo ReGap para la mejora de la traducción automática neuronal de expresiones pluriverbales discontinuas (FR>EN/ES). In: Corpas Pastor, G., Bautista Zambrana, M.R., Hidalgo-Ternero, C.M. (eds.) Sistemas fraseológicos en contraste: enfoques computacionales y de corpus, pp. 253–270. Comares (2021)

  • Hidalgo-Ternero, C.M., Corpas Pastor, G.: Bridging the ‘gApp’: improving neural machine translation systems for multiword expression detection. Yearb. Phraseol. 11, 61–80 (2020). https://doi.org/10.1515/phras-2020-0005

  • Hidalgo-Ternero, C.M., Corpas Pastor, G.: Qué se traerá gApp entre manos… O cómo mejorar la traducción automática neuronal de variantes somáticas (ES>EN/DE/FR/IT/PT). In: Seghiri, M., Pérez Carrasco, M. (eds.) Aproximación a la traducción especializada. Peter Lang (2022a, forthcoming)

  • Hidalgo-Ternero, C.M., Corpas Pastor, G.: A la cabeza de la traducción automática neuronal asistida por gApp: somatismos en VIP, DeepL y Google Translate. In: Corpas Pastor, G., Seghiri, M. (eds.) Aplicaciones didácticas de las tecnologías de la interpretación. Comares (2022b, forthcoming)

  • Hidalgo-Ternero, C.M., Corpas Pastor, G.: ReGap: a text preprocessing algorithm to enhance MWE-aware neural machine translation systems. In: Monti, J., Corpas Pastor, G., Mitkov, R. (eds.) Recent Advances in MWU in Machine Translation and Translation Technology. John Benjamins Publishing Company (2022c, forthcoming)

  • Hidalgo-Ternero, C.M., Lista, F., Corpas Pastor, G.: gApp-assisted NMT: how to improve the neural machine translation of discontinuous multiword expressions (IT>EN/DE). Language Resources and Evaluation (2022, under review)

  • Junczys-Dowmunt, M., Dwojak, T., Hoang, H.: Is neural machine translation ready for deployment? A case study on 30 translation directions. arXiv (2016)

  • Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. In: Proceedings of the 11th EURALEX International Congress, pp. 105–116 (2004)

  • Lohar, P., Popović, M., Alfi, H., Way, A.: A systematic comparison between SMT and NMT on translating user-generated content. In: 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019) (2019)

  • Mellado Blanco, C.: Fraseologismos somáticos del alemán. Peter Lang, Frankfurt (2004)

  • Monti, J., Seretan, V., Corpas Pastor, G., Mitkov, R.: Multiword units in machine translation and technology. In: Mitkov, R., Monti, J., Corpas Pastor, G., Seretan, V. (eds.) Multiword Units in Translation and Translation Technology, pp. 1–37. John Benjamins (2018)

  • Neunerdt, M., Trevisan, B., Reyer, M., Mathar, R.: Part-of-speech tagging for social media texts. In: Gurevych, I., Biemann, C., Zesch, T. (eds.) GSCL 2013. LNCS (LNAI), vol. 8105, pp. 139–150. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40722-2_15

  • Parra Escartín, C., Nevado Llopis, A., Sánchez Martínez, E.: Spanish multiword expressions: looking for a taxonomy. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-Lingual Perspective, pp. 271–323. Language Science Press (2018)

  • Ramisch, C.: Multiword Expressions Acquisition: A Generic and Open Framework. Theory and Applications of Natural Language Processing. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09207-2

  • Ramisch, C., Villavicencio, A.: Computational treatment of multiword expressions. In: Mitkov, R. (ed.) Oxford Handbook on Computational Linguistics, 2nd edn. (2018)

  • Rohanian, O., Taslimipoor, S., Kouchaki, S., An Ha, L., Mitkov, R.: Bridging the gap: attending to discontinuity in identification of multiword expressions. In: Burstein, J., Doran, C., Solorio, T. (eds.) 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 2692–2698 (2019)

  • Shterionov, D., Superbo, R., Nagle, P., Casanellas, L., O’Dowd, T., Way, A.: Human versus automatic quality evaluation of NMT and PBSMT. Mach. Transl. 32(3), 217–235 (2018). https://doi.org/10.1007/s10590-018-9220-z

  • Wang, H., Wu, H., He, Z., Huang, L., Church, K.W.: Progress in machine translation. Engineering (2022, forthcoming)

  • Wyrwoll, C.: User-generated content. In: Wyrwoll, C. (ed.) Social Media, pp. 11–45. Springer, Wiesbaden (2014). https://doi.org/10.1007/978-3-658-06984-1_2

  • Zaninello, A., Birch, A.: Multiword expression aware neural machine translation. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 3816–3825 (2020)


Acknowledgements

This research has been carried out within the framework of several research projects (ref. PID2020-112818GB-I00, UMA18-FEDERJA-067, P20-00109, E3/04/21, UMA-CEIATECH-04 and 03/2021-Embassy of France in Spain) at Universidad de Málaga (Spain).

Author information

Corresponding author

Correspondence to Xiaoqing Zhou-Lian.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hidalgo-Ternero, C.M., Zhou-Lian, X. (2022). Reassessing gApp: Does MWE Discontinuity Always Pose a Challenge to Neural Machine Translation? In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2022. Lecture Notes in Computer Science, vol 13528. Springer, Cham. https://doi.org/10.1007/978-3-031-15925-1_9

  • DOI: https://doi.org/10.1007/978-3-031-15925-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15924-4

  • Online ISBN: 978-3-031-15925-1

  • eBook Packages: Computer Science, Computer Science (R0)
