Abstract
ParCoLab is a trilingual parallel corpus containing texts in Serbian, French and English. It is developed at the CLLE-ERSS research unit (UMR 5263 CNRS) at the University of Toulouse, France, in collaboration with the Department of Romance Studies at the University of Belgrade, Serbia. Since Serbian is one of the less-resourced European languages, the corpus is an important step towards the creation of freely accessible corpora and NLP tools for this language. Our main goal is to provide the scientific community with a high-quality resource that can be used in a wide range of applications, such as contrastive linguistic studies, NLP research, machine and computer-assisted translation, translation studies, second language learning and teaching, and applied lexicography. The corpus currently contains 7.1M tokens, mainly from literary works, and efforts to extend and diversify it are ongoing. ParCoLab can be queried online, and part of it is available for download.
Notes
- 1.
The only two modifications we make are the introduction of the attribute @langOri, used in the <teiHeader> to encode the language of the original text in XML files containing translations, and of the attribute @id, used on the root <TEI> element to indicate the unique ID of the file inside the collection.
- 2.
TED is a platform for short talks on various subjects. See http://www.ted.com/.
- 3.
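As an illustration of the two attributes described in note 1, the following minimal sketch reads them from a TEI file with Python's standard library. The sample document is hypothetical and trimmed (it is not schema-valid TEI): the note only says that @langOri is used in the <teiHeader>, so its placement on the <teiHeader> element itself, the attribute values, and the function name `read_parcolab_attrs` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, trimmed sample: attribute placement and values are assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" id="doc-0001">
  <teiHeader langOri="fr">
    <fileDesc>
      <titleStmt><title>Sample translation</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>...</p></body></text>
</TEI>"""


def read_parcolab_attrs(xml_string):
    """Return (file_id, original_language) from a TEI document string.

    Either value is None when the corresponding attribute is absent.
    """
    root = ET.fromstring(xml_string)
    # @id sits on the root <TEI> element (unprefixed, so no namespace).
    file_id = root.get("id")

    # @langOri is looked up on <teiHeader> or any of its descendants,
    # because its exact position inside the header is not specified.
    lang_ori = None
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is not None:
        for element in header.iter():  # iter() includes the header itself
            if "langOri" in element.attrib:
                lang_ori = element.get("langOri")
                break
    return file_id, lang_ori


print(read_parcolab_attrs(SAMPLE))
```

Searching the whole header rather than a fixed path keeps the sketch independent of where the attribute is actually placed in the corpus files.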
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Miletic, A., Stosic, D., Marjanović, S. (2017). ParCoLab: A Parallel Corpus for Serbian, French and English. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_18
DOI: https://doi.org/10.1007/978-3-319-64206-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64205-5
Online ISBN: 978-3-319-64206-2
eBook Packages: Computer Science (R0)