The GV-LEx corpus of tales in French

Doukhan, David; Rosset, Sophie; Rilliard, Albert; d’Alessandro, Christophe; Adda-Decker, Martine

doi:10.1007/s10579-015-9306-7

Text and speech corpora enriched with lexical, discourse, structural, phonemic and prosodic annotations

Original Paper
Published: 20 June 2015

Volume 49, pages 521–547, (2015)
Language Resources and Evaluation

David Doukhan¹,
Sophie Rosset¹,
Albert Rilliard¹,
Christophe d’Alessandro¹ &
Martine Adda-Decker¹,²

Abstract

A corpus of French tales is presented. Its two parts, a text corpus and a speech corpus, were designed for studying the relationships between the textual structures of tales and speech prosody, with the targeted application of an expressive text-to-speech synthesis system embedded in a humanoid robot. The 89-tale text corpus and the 12-tale speech corpus were annotated using a common tale description framework. Lexical-level annotations include extended definitions of enumerations, time, place and person named entities, as well as part-of-speech tags. Supra-lexical-level annotations include the segmentation of tales into a sequence of episodes, the localization and attribution of direct quotations, and coreferences to tale protagonists. Annotation distributions and inter-annotator agreement were analyzed. The largest coverage and strongest agreement were observed for person named entities, characters’ direct quotations, and their associated coreference chains. Speech corpus annotations were extended to allow the analysis of the relations between the linguistic information of tales and the prosodic properties observed in the associated speech. Word and phoneme boundaries were inferred through semi-automatic procedures, resulting in linguistic annotations aligned with the speech signal. Intonation stylization models were used to ease the visual and statistical analysis of tale prosody. Additional meta-information is provided with the speech corpus, allowing tale characters to be described according to their gender, age, size, valence and kind. The corpora described in this article are publicly available through the European Language Resources Association catalog.

Notes

The GV-LEx corpus and annotation tool are distributed through the European Language Resources Association (ELRA) catalog: ELRA-S0373. ISLRN: 433-270-888-230-5. http://www.islrn.org/resources/433-270-888-230-5/.
Eighty-six tales belonging to the public domain were obtained from a collaborative website, http://www.contes.biz; three tales were provided courtesy of their author.

Acknowledgments

This work was funded by the French project GV-LEx (ANR-08-CORD-024, http://www.gvlex.com). The annotation of the corpus and the development of the annotation software were carried out by Syllabs (http://www.syllabs.com). We are deeply indebted to the anonymous reviewers for their constructive comments on earlier versions of this article.

Author information

Authors and Affiliations

LIMSI-CNRS, Rue John von Neumann, Campus Universitaire d’Orsay, Bat. 508, 91405, Orsay Cedex, France
David Doukhan, Sophie Rosset, Albert Rilliard, Christophe d’Alessandro & Martine Adda-Decker
LPP UMR7018, 19 rue des Bernardins, 75005, Paris, France
Martine Adda-Decker

Corresponding author

Correspondence to Albert Rilliard.

About this article

Cite this article

Doukhan, D., Rosset, S., Rilliard, A. et al. The GV-LEx corpus of tales in French. Lang Resources & Evaluation 49, 521–547 (2015). https://doi.org/10.1007/s10579-015-9306-7

Published: 20 June 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10579-015-9306-7
