The GV-LEx corpus of tales in French

Text and speech corpora enriched with lexical, discourse, structural, phonemic and prosodic annotations

  • Original Paper
  • Language Resources and Evaluation

Abstract

A corpus of French tales is presented. Its two parts, a text corpus and a speech corpus, were designed for studying the relationships between the textual structures of tales and speech prosody, with the targeted application of an expressive text-to-speech synthesis system embedded in a humanoid robot. The 89-tale text corpus and the 12-tale speech corpus were annotated using a common tale description framework. Lexical-level annotations include extended definitions of enumerations; time, place and person named entities; and part-of-speech tags. Supra-lexical-level annotations include the segmentation of tales into a sequence of episodes, the localization and attribution of direct quotations, and the coreference chains of tale protagonists. Annotation distributions and inter-annotator agreement were analyzed. The largest coverage and strongest agreement were observed for person named entities, characters’ direct quotations, and their associated coreference chains. Speech corpus annotations were extended to allow the analysis of the relations between the linguistic information of a tale and the prosodic properties observed in the associated speech. Word and phoneme boundaries were inferred through semi-automatic procedures, resulting in linguistic annotations aligned with the speech signal. Intonation stylization models were used to ease the visual and statistical analysis of tale prosody. Additional meta-information is provided with the speech corpus, allowing tale characters to be described according to their gender, age, size, valence and kind. The corpora described in this article are publicly available through the European Language Resources Association catalog.

Notes

  1. The GV-LEx corpus and annotation tool are distributed through the European Language Resources Association (ELRA) catalog: ELRA-S0373. ISLRN: 433-270-888-230-5. http://www.islrn.org/resources/433-270-888-230-5/.

  2. Eighty-six tales belonging to the public domain were obtained from a collaborative website, http://www.contes.biz; three tales were provided courtesy of their author.

  3. ELRA catalog: ELRA-S0373. ISLRN: 433-270-888-230-5, http://www.islrn.org/resources/433-270-888-230-5/.

Acknowledgments

This work was funded by the French project GV-LEx (ANR-08-CORD-024, http://www.gvlex.com). The annotation of the corpus and the development of the annotation software were carried out by Syllabs (http://www.syllabs.com). We are deeply indebted to the anonymous reviewers for their constructive comments on earlier versions of this article.

Author information

Corresponding author

Correspondence to Albert Rilliard.

Cite this article

Doukhan, D., Rosset, S., Rilliard, A. et al. The GV-LEx corpus of tales in French. Lang Resources & Evaluation 49, 521–547 (2015). https://doi.org/10.1007/s10579-015-9306-7

  • DOI: https://doi.org/10.1007/s10579-015-9306-7
