Abstract
Emotion-aware computing is one of the key challenges in contemporary research on natural human interaction, where emotional speech is an essential modality in multimodal user interfaces. The speech modality relates mainly to the recognition of emotion and affect in speech, and to near-natural expressive speech synthesis, the latter widely considered one of the next significant milestones in speech synthesis technology. A problem common to both recognizing and generating affective and emotional speech content is the methodology adopted for emotion analysis and modeling. This work proposes a generalized framework for annotating, analyzing, and modeling expressive speech in a data-driven, machine-learning fashion, with the aim of building expressive text-to-speech synthesis systems. The framework and its data-driven methodology are described, comprising the techniques and approaches for acoustic analysis and expression clustering. In addition, the deployment of online experimental tools for speech perception and annotation is presented, along with a description of the speech data used and initial experimental results that demonstrate the potential of the proposed framework and provide encouraging indications for further research.
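To illustrate the kind of data-driven pipeline the abstract outlines, the sketch below extracts a few utterance-level prosodic statistics (pitch, energy, duration) and groups utterances into expression clusters with k-means. This is a minimal illustration of acoustic analysis followed by expression clustering, not the authors' implementation: the choice of librosa and scikit-learn, the feature set, the corpus path, and the number of clusters are all assumptions made for the example.

```python
# Minimal sketch of an acoustic-analysis + expression-clustering pipeline.
# Assumed: per-utterance WAV files and simple prosodic statistics as features;
# the paper does not prescribe these choices.
import glob

import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def utterance_features(path):
    """Compute a small utterance-level prosodic feature vector."""
    y, sr = librosa.load(path, sr=16000)
    # Frame-level pitch via probabilistic YIN; unvoiced frames come back NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[voiced_flag]                  # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]     # frame-level energy
    return np.array([
        np.nanmean(f0), np.nanstd(f0),    # pitch level and variability
        rms.mean(), rms.std(),            # energy level and variability
        len(y) / sr,                      # duration in seconds
    ])


# "corpus/*.wav" is a hypothetical location for per-utterance recordings.
paths = sorted(glob.glob("corpus/*.wav"))
X = np.vstack([utterance_features(p) for p in paths])
X = StandardScaler().fit_transform(X)     # put features on comparable scales

# Group utterances into a fixed number of expression clusters (k=4 assumed).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
for path, label in zip(paths, labels):
    print(label, path)
```

In practice, the feature vector and the number of clusters would be driven by the perceptual annotations the framework collects, rather than fixed a priori as in this sketch.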




Acknowledgments
The research leading to these results has been partially funded by POLYTROPON Project (KRIPIS-GSRT, MIS: 448306).
Cite this article
Raptis, S., Karabetsos, S., Chalamandaris, A. et al. A framework towards expressive speech analysis and synthesis with preliminary results. J Multimodal User Interfaces 9, 387–394 (2015). https://doi.org/10.1007/s12193-015-0186-3