ParaDiom – A Parallel Corpus of Idiomatic Texts

  • Conference paper
Text, Speech, and Dialogue (TSD 2023)

Abstract

This paper presents ParaDiom, a parallel corpus of 2000 Slovene and English text segments. The text segments are rich in manually annotated idiomatic expressions, which pose a challenge for machine translation systems. We describe the definition of idiomatic expressions, the sampling of the corpus sentences, the annotation scheme, and the general characteristics of the finished corpus. The corpus is intended as a test set for evaluating how well machine translation systems handle figurative language. In the last part of the paper, we demonstrate an example use of the corpus in a machine translation experiment.

Notes

  1. Kočevje is a region in Slovenia.

  2. http://hdl.handle.net/11356/1035

  3. http://hdl.handle.net/11356/1431

  4. https://fedora.clarin-d.uni-saarland.de/clmet/clmet.html

  5. https://opus.nlpl.eu/ParaCrawl-v8.php

  6. https://opus.nlpl.eu/OpenSubtitles-v2018.php

  7. https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-transformer

Acknowledgements

This work was supported by CLARIN.SI and the Slovenian Research Agency (research core funding No. P2-0069, Advanced Methods of Interaction in Telecommunications).

The authors thank the creators of the ParaCrawl project (paracrawl.eu) and OpenSubtitles (www.opensubtitles.org) for their corpora and OPUS (opus.nlpl.eu) for their service. The authors also thank the HPC RIVR consortium (www.hpc-rivr.si) for the use of the HPC system VEGA at the Institute of Information Science (IZUM).

Author information

Corresponding author

Correspondence to Gregor Donaj.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Donaj, G., Antloga, Š. (2023). ParaDiom – A Parallel Corpus of Idiomatic Texts. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_7

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
