Abstract
In this paper we present efforts to diversify the Serbian-French-English-Spanish parallel corpus ParCoLab. ParCoLab is a project led by the CLLE research unit (UMR 5263 CNRS) at the University of Toulouse, France, and the Romance Department at the University of Belgrade, Serbia. The main goal of the project is to create a freely searchable and widely applicable multilingual resource with Serbian as the pivot language. Initially, most of the corpus texts represented written language. Since diversity of text types contributes to the usefulness and applicability of a parallel corpus, considerable effort has been made to include spoken language data in the ParCoLab database. Transcripts and translations of TED talks, films and cartoons have been included so far, along with transcripts of original Serbian films. As a result, the database, which previously held 17.6M words of mainly literary texts, has been extended with spoken language data and now contains 32.9M words.
Notes
- 1.
http://parcolab.univ-tlse2.fr. Last access to URLs in the paper: 20 Apr 2020.
- 2.
Both corpora can be queried via the ParCoLab search engine and are available for download at http://parcolab.univ-tlse2.fr/about/ressources.
- 3.
Available at: http://www.korpus.matf.bg.ac.rs/korpus/login.php. Authorization must be requested to access the interface.
- 4.
The official website of the project is: https://intercorp.korpus.cz.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Terzić, D., Marjanović, S., Stosic, D., Miletic, A. (2020). Diversification of Serbian-French-English-Spanish Parallel Corpus ParCoLab with Spoken Language Data. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds) Text, Speech, and Dialogue. TSD 2020. Lecture Notes in Computer Science, vol 12284. Springer, Cham. https://doi.org/10.1007/978-3-030-58323-1_6
DOI: https://doi.org/10.1007/978-3-030-58323-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58322-4
Online ISBN: 978-3-030-58323-1
eBook Packages: Computer Science (R0)