Abstract
A corpus of French tales is presented. Its two parts, a text corpus and a speech corpus, were designed for studying the relationships between the textual structures of tales and speech prosody, with the targeted application of an expressive text-to-speech synthesis system embedded in a humanoid robot. The 89-tale text corpus and the 12-tale speech corpus were annotated using a common tale description framework. Lexical-level annotations include extended definitions of enumerations; time, place and person named entities; and part-of-speech tags. Supra-lexical-level annotations include the segmentation of tales into a sequence of episodes, the localization and attribution of direct quotations, and the coreferences of tale protagonists. Annotation distributions and inter-annotator agreement were analyzed. The largest coverage and strongest agreement were observed for person named entities, characters’ direct quotations, and their associated coreference chains. Speech corpus annotations were extended to allow the analysis of the relations between the linguistic information of tales and the prosodic properties observed in the associated speech. Word and phoneme boundaries were inferred through semi-automatic procedures, resulting in linguistic annotations aligned with the speech signal. Intonation stylization models were used to ease the visual and statistical analysis of tale prosody. Additional meta-information is provided with the speech corpus, allowing tale characters to be described according to their gender, age, size, valence and kind. The corpora described in this article are publicly available through the European Language Resources Association catalog.



Notes
The GV-LEx corpus and annotation tool are distributed through the European Language Resources Association (ELRA) catalog: ELRA-S0373. ISLRN: 433-270-888-230-5. http://www.islrn.org/resources/433-270-888-230-5/.
Eighty-six tales belonging to the public domain were obtained from a collaborative website, http://www.contes.biz; three tales were provided courtesy of their author.
Acknowledgments
This work was funded by the French project GV-LEx (ANR-08-CORD-024, http://www.gvlex.com). The annotation of the corpus and the development of the annotation software were carried out by Syllabs (http://www.syllabs.com). We are deeply indebted to the anonymous reviewers for their constructive comments on earlier versions of this article.
Cite this article
Doukhan, D., Rosset, S., Rilliard, A. et al. The GV-LEx corpus of tales in French. Lang Resources & Evaluation 49, 521–547 (2015). https://doi.org/10.1007/s10579-015-9306-7
DOI: https://doi.org/10.1007/s10579-015-9306-7