Abstract
This paper presents research results obtained with gApp, a text-preprocessing system designed to automatically detect discontinuous multiword expressions (MWEs) and convert them into their continuous forms in order to improve the performance of current neural machine translation (NMT) systems (see Hidalgo-Ternero 2021; Hidalgo-Ternero and Corpas Pastor 2020, 2022a, 2022b and 2022c, among others). To test its effectiveness, an experiment was carried out with the NMT systems of Google Translate and DeepL in the ES>EN/ZH directionalities for the translation of somatisms, i.e., MWEs containing lexemes that refer to human or animal body parts (Mellado Blanco 2004). More specifically, we analysed “Verb Noun Idiomatic Constructions” (VNICs) such as tocar los cojones, tocar los huevos, tocar las narices, and tocar las pelotas. Some of the unexpected results yielded by the study of these MWEs call into question the widely accepted view that phraseological discontinuity unequivocally entails worse NMT performance.
Notes
- 1. gApp is available through the following link: http://lexytrad.es/gapp/app.php. This application is registered in Safecreative: https://www.safecreative.org/work/2011165898461-gapp.
- 2. All the corpora employed in the present study are described in Sect. 3 (“Methodology”).
- 3. By user-generated content, we mean ‘content published on an online platform by users. The term social media comprises platforms that contain user-generated content. Users do not need programming skills to publish content on a social media platform.’ (Wyrwoll 2014).
References
Koike, K.: Relaciones paradigmáticas y sintagmáticas de las locuciones verbales en español. In: Cuartero Otal, J., Emsel, M. (eds.) Vernetzungen Bedeutung in Wort, Satz und Text. Festschrift für Gerd Wotjak zum 65. Geburtstag, pp. 263–275. Peter Lang, Frankfurt (2007)
Bargmann, S., Sailer, M.: The syntactic flexibility of semantically non-decomposable idioms. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-Lingual Perspective, pp. 1–29. Language Science Press (2018)
Bentivogli, L., Bisazza, A., Cettolo, M., Federico, M.: Neural versus phrase-based machine translation quality: a case study. arXiv (2018)
Colson, J.-P.: Multi-word units in machine translation: why the tip of the iceberg remains problematic – and a tentative corpus-driven solution. In: MUMTT 2019 (2019)
Constant, M., et al.: Multiword expression processing: a survey. Comput. Linguist. 43(4), 1–92 (2017)
Corpas Pastor, G.: Detección, descripción y contraste de las unidades fraseológicas mediante tecnologías lingüísticas. In: Olza, I., Manero, E. (eds.) Fraseopragmática. Colección Romanistik, pp. 335–373. Frank & Timme (2013)
Derczynski, L., Ritter, A., Clark, S., Bontcheva, K.: Twitter part-of-speech tagging for all: overcoming sparse and noisy data. In: Mitkov, R., Angelova, G., Bontcheva, K. (eds.) Proceedings of the International Conference on Recent Advances in Natural Language Processing, pp. 198–206. INCOMA Ltd. (2013)
Seco, M., Andrés, O., Ramos, G.: Diccionario fraseológico documentado del español actual, locuciones y modismos españoles, 2nd edn. Aguilar (2017)
ELIS – European Language Industry Survey: 2018 Language Industry Survey – Expectations and Concerns of the European Language Industry (2018)
ELIS – European Language Industry Survey: 2020 Language Industry Survey – 2020 before & after COVID-19 (2020)
ELIS – European Language Industry Survey: 2021 Language Industry Survey (2020)
Fazly, A., Cook, P., Stevenson, S.: Unsupervised type and token identification of idiomatic expressions. Comput. Linguist. 35(1), 61–103 (2009)
Foufi, V., Nerima, L., Wehrli, E.: Multilingual parsing and MWE detection. In: Parmentier, Y., Waszczuk, J. (eds.) Representation and Parsing of Multiword Expressions: Current Trends, pp. 217–237. Language Science Press (2019)
Gui, T., Zhang, Q., Huang, H., Peng, M., Huang, X.: Part-of-speech tagging for Twitter with adversarial neural networks. In: Palmer, M., Hwa, R., Riedel, S. (eds.) Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2411–2420. Association for Computational Linguistics (2017)
Hidalgo-Ternero, C.M.: Google Translate vs. DeepL: analysing neural machine translation performance under the challenge of phraseological variation. In: Mogorrón Huerta, P. (ed.) Multidisciplinary Analysis of the Phenomenon of Phraseological Variation in Translation and Interpreting. MonTI Special Issue 6, pp. 154–177 (2020)
Hidalgo-Ternero, C.M.: El algoritmo ReGap para la mejora de la traducción automática neuronal de expresiones pluriverbales discontinuas (FR>EN/ES). In: Corpas Pastor, G., Bautista Zambrana, M.R., Hidalgo-Ternero, C.M. (eds.) Sistemas fraseológicos en contraste: enfoques computacionales y de corpus, pp. 253–270. Comares (2021)
Hidalgo-Ternero, C.M., Corpas Pastor, G.: Bridging the ‘gApp’: improving neural machine translation systems for multiword expression detection. Yearb. Phraseol. 11, 61–80 (2020). https://doi.org/10.1515/phras-2020-0005
Hidalgo-Ternero, C.M., Corpas Pastor, G.: Qué se traerá gApp entre manos… O cómo mejorar la traducción automática neuronal de variantes somáticas (ES>EN/DE/FR/IT/PT). In: Seghiri, M., Pérez Carrasco, M. (eds.) Aproximación a la traducción especializada. Peter Lang (2022a, forthcoming)
Hidalgo-Ternero, C.M., Corpas Pastor, G.: A la cabeza de la traducción automática neuronal asistida por gApp: somatismos en VIP, DeepL y Google Translate. In: Corpas Pastor, G., Seghiri, M. (eds.) Aplicaciones didácticas de las tecnologías de la interpretación. Comares (2022b, forthcoming)
Hidalgo-Ternero, C.M., Corpas Pastor, G.: ReGap: a text preprocessing algorithm to enhance MWE-aware neural machine translation systems. In: Monti, J., Corpas Pastor, G., Mitkov, R. (eds.) Recent Advances in MWU in Machine Translation and Translation technology. John Benjamins Publishing Company (2022c, forthcoming)
Hidalgo-Ternero, C.M., Lista, F., Corpas Pastor, G.: gApp-assisted NMT: how to improve the neural machine translation of discontinuous multiword expressions (IT>EN/DE). Language Resources and Evaluation (2022, under review)
Junczys-Dowmunt, M., Dwojak, T., Hoang, H.: Is neural machine translation ready for deployment? A case study on 30 translation directions. arXiv (2016)
Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. In: Proceedings of the 11th EURALEX International Congress, pp. 105–116 (2004)
Lohar, P., Popović, M., Alfi, H., Way, A.: A systematic comparison between SMT and NMT on translating user-generated content. In: 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019) (2019)
Mellado Blanco, C.: Fraseologismos somáticos del alemán. Peter Lang, Frankfurt (2004)
Monti, J., Seretan, V., Corpas Pastor, G., Mitkov, R.: Multiword units in machine translation and technology. In: Mitkov, R., Monti, J., Corpas Pastor, G., Seretan, V. (eds.) Multiword Units in Translation and Translation Technology, pp. 1–37. John Benjamins (2018)
Neunerdt, M., Trevisan, B., Reyer, M., Mathar, R.: Part-of-speech tagging for social media texts. In: Gurevych, I., Biemann, C., Zesch, T. (eds.) GSCL 2013. LNCS (LNAI), vol. 8105, pp. 139–150. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40722-2_15
Parra Escartín, C., Nevado Llopis, A., Sánchez Martínez, E.: Spanish multiword expressions: looking for a taxonomy. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-Lingual Perspective, pp. 271–323. Language Science Press (2018)
Ramisch, C.: Multiword Expressions Acquisition: A Generic and Open Framework. Theory and Applications of Natural Language Processing. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09207-2
Ramisch, C., Villavicencio, A.: Computational treatment of multiword expressions. In: Mitkov, R. (ed.) Oxford Handbook on Computational Linguistics, 2nd edn. (2018)
Rohanian, O., Taslimipoor, S., Kouchaki, S., An Ha, L., Mitkov, R.: Bridging the gap: attending to discontinuity in identification of multiword expressions. In: Burstein, J., Doran, C., Solorio, T. (eds.) 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 2692–2698 (2019)
Shterionov, D., Superbo, R., Nagle, P., Casanellas, L., O’Dowd, T., Way, A.: Human versus automatic quality evaluation of NMT and PBSMT. Mach. Transl. 32(3), 217–235 (2018). https://doi.org/10.1007/s10590-018-9220-z
Wang, H., Wu, H., He, Z., Huang, L., Church, K.W.: Progress in machine translation. Engineering (2022, forthcoming)
Wyrwoll, C.: User-generated content. In: Wyrwoll, C. (ed.) Social Media, pp. 11–45. Springer, Wiesbaden (2014). https://doi.org/10.1007/978-3-658-06984-1_2
Zaninello, A., Birch, A.: Multiword expression aware neural machine translation. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 3816–3825 (2020)
Acknowledgements
This research has been carried out within the framework of several research projects (ref. PID2020-112818GB-I00, UMA18-FEDERJA-067, P20-00109, E3/04/21, UMA-CEIATECH-04 and 03/2021-Embassy of France in Spain) at Universidad de Málaga (Spain).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hidalgo-Ternero, C.M., Zhou-Lian, X. (2022). Reassessing gApp: Does MWE Discontinuity Always Pose a Challenge to Neural Machine Translation? In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2022. Lecture Notes in Computer Science, vol 13528. Springer, Cham. https://doi.org/10.1007/978-3-031-15925-1_9
DOI: https://doi.org/10.1007/978-3-031-15925-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15924-4
Online ISBN: 978-3-031-15925-1
eBook Packages: Computer Science, Computer Science (R0)