Diversification of Serbian-French-English-Spanish Parallel Corpus ParCoLab with Spoken Language Data

Terzić, Dušica; Marjanović, Saša; Stosic, Dejan; Miletic, Aleksandra

doi:10.1007/978-3-030-58323-1_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12284)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

In this paper we present the efforts to diversify the Serbian-French-English-Spanish parallel corpus ParCoLab. ParCoLab is a project led by the CLLE research unit (UMR 5263 CNRS) at the University of Toulouse, France, and the Romance Department at the University of Belgrade, Serbia. The main goal of the project is to create a freely searchable and widely applicable multilingual resource with Serbian as the pivot language. Initially, the majority of the corpus texts represented written language. Since diversity of text types contributes to the usefulness and applicability of a parallel corpus, considerable effort has gone into including spoken language data in the ParCoLab database. Transcripts and translations of TED talks, films and cartoons have been included so far, along with transcripts of original Serbian films. As a result, the database, which previously held 17.6M words of mainly literary texts, now contains 32.9M words.

Notes

1. http://parcolab.univ-tlse2.fr. All URLs in the paper were last accessed on 20 Apr 2020.
2. Both corpora can be queried via the ParCoLab search engine and are available for download at http://parcolab.univ-tlse2.fr/about/ressources.
3. Available at: http://www.korpus.matf.bg.ac.rs/korpus/login.php. Authorization must be requested to access the interface.
4. The official website of the project is: https://intercorp.korpus.cz.
5. https://wiki.korpus.cz/doku.php/en:cnk:intercorp:verze12.
6. https://www.opensubtitles.org/en/search/subs.
7. http://opus.nlpl.eu/OpenSubtitles2016.php.
8. https://wit3.fbk.eu/#releases.
9. http://oujda-nlp-team.net/en/corpora/multed-corpus.
10. https://www.youtube.com/channel/UCeY4C8Sbx8B4bIyREPSvORQ/videos.
11. https://en.wikipedia.org/wiki/Jetlag_Productions.

Author information

Authors and Affiliations

Faculty of Philology, University of Belgrade, Studentski trg 3, 11000, Belgrade, Serbia
Dušica Terzić & Saša Marjanović
CNRS and University of Toulouse, 5, Allées Antonio Machado, 31058, Toulouse, France
Dejan Stosic & Aleksandra Miletic

Corresponding author

Correspondence to Dušica Terzić.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák

About this paper

Cite this paper

Terzić, D., Marjanović, S., Stosic, D., Miletic, A. (2020). Diversification of Serbian-French-English-Spanish Parallel Corpus ParCoLab with Spoken Language Data. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science (LNAI), vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_6

DOI: https://doi.org/10.1007/978-3-030-58323-1_6
Published: 01 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science, Computer Science (R0)
