ParCoLab: A Parallel Corpus for Serbian, French and English

  • Conference paper
Text, Speech, and Dialogue (TSD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Abstract

ParCoLab is a trilingual parallel corpus containing texts in Serbian, French and English. It is developed at the CLLE-ERSS research unit (UMR 5263 CNRS) at the University of Toulouse, France, in collaboration with the Department of Romance Studies at the University of Belgrade, Serbia. Since Serbian is one of the less-resourced European languages, this is an important step towards the creation of freely accessible corpora and NLP tools for this language. Our main goal is to provide the scientific community with a high-quality resource that can be used in a wide range of applications, such as contrastive linguistic studies, NLP research, machine translation and computer-assisted translation, translation studies, second language learning and teaching, and applied lexicography. The corpus currently contains 7.1M tokens, mainly from literary works, but corpus extension and diversification efforts are ongoing. ParCoLab can be queried online, and a part of it is available for download.

Notes

  1. The only two modifications we make are the introduction of the @langOri attribute, used in the <teiHeader> to encode the language of the original text in the XML files containing translations, and of the @id attribute on the root <TEI> element, indicating the unique ID of the file inside the collection.

  2. TED is a platform for short talks on various subjects. See http://www.ted.com/.

  3. See, e.g., [23] for POS-tagging and [4] for parsing of English; [21] for POS-tagging, [3] for parsing, and [22] for lemmatization of French.
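
The two TEI extensions described in note 1 can be illustrated with a minimal sketch. It assumes that @langOri sits somewhere on or inside <teiHeader> (the note does not say on which element) and that both attributes are unprefixed. The sample document, its ID value and the function name read_parcolab_metadata are hypothetical and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample file: the real ParCoLab files may place @langOri
# on a different element inside <teiHeader>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" id="doc-0001">
  <teiHeader langOri="sr">
    <fileDesc>
      <titleStmt><title>Sample translation</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample paragraph.</p></body></text>
</TEI>"""


def read_parcolab_metadata(xml_text):
    """Return the file ID and the original-text language of a TEI document."""
    root = ET.fromstring(xml_text)

    # @id is read from the root <TEI> element.
    file_id = root.get("id")

    # @langOri is looked up on <teiHeader> itself and on all its descendants,
    # because its exact position in the header is an assumption here.
    lang_ori = None
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is not None:
        for element in header.iter():
            if "langOri" in element.attrib:
                lang_ori = element.attrib["langOri"]
                break

    return {"id": file_id, "langOri": lang_ori}


metadata = read_parcolab_metadata(SAMPLE)
print(metadata)
```

A file that lacks either attribute yields None for the corresponding key, which makes it easy to tell original texts from translations under the assumption that only translations carry @langOri.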

References

  1. Agić, Ž., Ljubešić, N., Berović, D.: Lemmatization and morphosyntactic tagging of Croatian and Serbian. In: 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, BSNLP 2013 (2013)

  2. Agić, Ž., Merkler, D., Berović, D.: Parsing Croatian and Serbian by using Croatian dependency treebanks. In: Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages (2013)

  3. Candito, M., Nivre, J., Denis, P., Anguiano, E.H.: Benchmarking of statistical dependency parsers for French. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 108–116. Association for Computational Linguistics (2010)

  4. Carreras, X.: Experiments with a higher-order projective dependency parser. In: EMNLP-CoNLL, pp. 957–961 (2007)

  5. Čermák, F., Rosen, A.: The case of InterCorp, a multilingual parallel corpus. Int. J. Corpus Linguist. 17(3), 411–427 (2012)

  6. Text Encoding Initiative Consortium (eds.): TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium (2008)

  7. Esplá-Gomis, M., Forcada, M.: Bitextor, a free/open-source software to harvest translation memories from multilingual websites. In: Proceedings of MT Summit XII, Ottawa, Canada. Association for Machine Translation in the Americas (2009)

  8. Gesmundo, A., Samardžić, T.: Lemmatising Serbian as category tagging with bidirectional sequence classification. In: LREC, pp. 2103–2106 (2012)

  9. Halácsy, P., Kornai, A., Oravecz, C.: Hunpos: an open source trigram tagger. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 209–212. Association for Computational Linguistics (2007)

  10. Jakovljević, B., Kovačević, A., Sečujski, M., Marković, M.: A dependency treebank for Serbian: initial experiments. In: Ronzhin, A., Potapova, R., Delic, V. (eds.) SPECOM 2014. LNCS, vol. 8773, pp. 42–49. Springer, Cham (2014). doi:10.1007/978-3-319-11581-8_5

  11. Jongejan, B., Dalianis, H.: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 145–153 (2009)

  12. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: MT Summit, vol. 5, pp. 79–86 (2005)

  13. Krstev, C., Vitas, D.: An aligned English-Serbian corpus. In: ELLSIIR Proceedings (English Language and Literature Studies: Image, Identity, Reality), vol. 1, pp. 495–508 (2011)

  14. Krstev, C., Vitas, D., Erjavec, T.: MULTEXT-East resources for Serbian. In: Erjavec, T., Zganec Gros, J. (eds.) Zbornik 7. mednarodne multikonference Informacijska druzba IS 2004, Jezikovne tehnologije, 9–15 Oktober 2004, Ljubljana, Slovenija (2004)

  15. Ljubešić, N., Klubička, F., Agić, Ž., Jazbec, I.P.: New inflectional lexicons and training corpora for improved morphosyntactic annotation of Croatian and Serbian. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Grobelnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016. European Language Resources Association (ELRA), Paris, May 2016

  16. Ljubešić, N., Klubička, F.: {bs,hr,sr}WaC - web corpora of Bosnian, Croatian and Serbian. In: Proceedings of the 9th Web as Corpus Workshop (WaC-9), pp. 29–35 (2014)

  17. Marjanović, S.: « Entrez, s’il vous plaît ! » : De la sélection lexicographique des phrasèmes. In: Repenser le figement: enjeux et perspectives en phraséo-didactique des langues. Université Paris 3 - Sorbonne Nouvelle (2016, forthcoming)

  18. McDonald, R., Lerman, K., Pereira, F.: Multilingual dependency analysis with a two-stage discriminative parser. In: Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 216–220. Association for Computational Linguistics (2006)

  19. Miletic, A.: Annotation morphosyntaxique semi-automatique d’un corpus littéraire serbe. Master’s thesis, Université Charles de Gaulle - Lille 3 (2013)

  20. Miletic, A.: Building a morphosyntactic lexicon for Serbian using Wiktionary. In: 6th Journées d’études Toulousaines, JéTou 2017 (2017, forthcoming)

  21. Sagot, B.: Etiquetage multilingue en parties du discours avec MELT. In: Actes de la conférence conjointe JEP-TALN-RECITAL 2016 (2016)

  22. Seddah, D., Chrupała, G., Çetinoğlu, Ö., Van Genabith, J., Candito, M.: Lemmatization and lexicalized statistical parsing of morphologically rich languages: the case of French. In: Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, pp. 85–93. Association for Computational Linguistics (2010)

  23. Shen, L., Satta, G., Joshi, A.: Guided learning for bidirectional sequence classification. ACL 7, 760–767 (2007)

  24. Stanojević, V., Durić, L.: Sur les indéfinis singuliers génériques en français et en serbe. Travaux de linguistique 1, 121–133 (2016)

  25. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: 5th International Conference on Language Resources and Evaluation, LREC 2006 (2006)

  26. Stosic, D., Fagard, B., Sarda, L., Colin, C.: Does the road go up the mountain? Fictive motion between linguistic conventions and cognitive motivations. Cogn. Process. 16(1), 221–225 (2015)

  27. Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov, N., Bontchev, K., Angelova, G., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, vol. 5, pp. 237–248 (2009)

  28. Tyers, F.M., Alperen, M.S.: South-East European Times: a parallel corpus of Balkan languages. In: Proceedings of the LREC Workshop on Exploitation of Multilingual Resources and Tools for Central and (South-) Eastern European Languages, pp. 49–53 (2010)

  29. Urieli, A.: Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. Ph.D. thesis, Université Toulouse le Mirail-Toulouse II (2013)

  30. Utvić, M.: Annotating the corpus of contemporary Serbian. In: Proceedings of the INFOtheca 2012 Conference (2011)

  31. Vitas, D., Krstev, C.: Literature and aligned texts. Readings in Multilinguality, pp. 148–155 (2006)

  32. von Waldenfels, R.: Compiling a parallel corpus of Slavic languages. Text strategies, tools and the question of lemmatization in alignment. In: Brehmer, B., Zdanova, V., Zimny, R. (eds.) Beiträge der Europäischen Slavistischen Linguistik, vol. 9, pp. 123–138 (2006)

Author information

Corresponding author

Correspondence to Aleksandra Miletic.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Miletic, A., Stosic, D., Marjanović, S. (2017). ParCoLab: A Parallel Corpus for Serbian, French and English. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_18

  • DOI: https://doi.org/10.1007/978-3-319-64206-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64205-5

  • Online ISBN: 978-3-319-64206-2

  • eBook Packages: Computer Science, Computer Science (R0)
