ParaDiom – A Parallel Corpus of Idiomatic Texts

Donaj, Gregor; Antloga, Špela

doi:10.1007/978-3-031-40498-6_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14102)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue


Abstract

This paper presents ParaDiom – a parallel corpus with 2000 Slovene and English text segments. The text segments are rich in manually annotated idiomatic expressions, which pose a challenge for machine translation systems. We describe the definition of idiomatic expressions, the sampling of the corpus sentences, the annotation scheme, and the general characteristics of the finished corpus. The corpus is intended as a test set for evaluating how machine translation systems perform on figurative language. In the last part of the paper, we demonstrate an example use of the corpus in a machine translation experiment.


Notes

1. Kočevje is a region in Slovenia.
2. http://hdl.handle.net/11356/1035
3. http://hdl.handle.net/11356/1431
4. https://fedora.clarin-d.uni-saarland.de/clmet/clmet.html
5. https://opus.nlpl.eu/ParaCrawl-v8.php
6. https://opus.nlpl.eu/OpenSubtitles-v2018.php
7. https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-transformer


Acknowledgements

This work was supported by CLARIN.SI and the Slovenian Research Agency (research core funding No. P2-0069, Advanced Methods of Interaction in Telecommunications).

The authors thank the creators of the ParaCrawl project (paracrawl.eu) and OpenSubtitles (www.opensubtitles.org) for their corpora and OPUS (opus.nlpl.eu) for their service. The authors also thank the HPC RIVR consortium (www.hpc-rivr.si) for the use of the HPC system VEGA at the Institute of Information Science (IZUM).

Author information

Authors and Affiliations

University of Maribor, Faculty of Electrical Engineering and Computer Science, Koroška c. 46, 2000, Maribor, Slovenia
Gregor Donaj & Špela Antloga


Corresponding author

Correspondence to Gregor Donaj.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík


About this paper

Cite this paper

Donaj, G., Antloga, Š. (2023). ParaDiom – A Parallel Corpus of Idiomatic Texts. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_7


DOI: https://doi.org/10.1007/978-3-031-40498-6_7
Published: 23 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science, Computer Science (R0)
