Abstract
Emotion-aware computing is one of the key challenges in contemporary research on natural human interaction, where emotional speech is an essential modality in multimodal user interfaces. The speech modality relates mainly to the recognition of emotion and affect in speech, and to near-natural expressive speech synthesis, the latter widely considered one of the next significant milestones in speech synthesis technology. A problem common to both recognizing and generating affective and emotional speech content is the methodology adopted for emotion analysis and modeling. This work proposes a generalized framework for annotating, analyzing, and modeling expressive speech in a data-driven, machine-learning fashion, with the aim of building expressive text-to-speech synthesis systems. The framework and its data-driven methodology are described, comprising the techniques and approaches for acoustic analysis and expression clustering. In addition, the deployment of online experimental tools for speech perception and annotation is presented, along with a description of the speech data used and initial experimental results that demonstrate the potential of the proposed framework and provide encouraging indications for further research.
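To illustrate the kind of data-driven pipeline the abstract outlines, the sketch below extracts a few utterance-level prosodic statistics (pitch, energy, duration) and groups utterances into expression clusters with k-means. This is a minimal illustration of acoustic analysis followed by expression clustering, not the authors' implementation: the choice of librosa and scikit-learn, the feature set, the corpus path, and the number of clusters are all assumptions made for the example.

```python
# Minimal sketch of an acoustic-analysis + expression-clustering pipeline.
# Assumed: per-utterance WAV files and simple prosodic statistics as features;
# the paper does not prescribe these choices.
import glob

import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def utterance_features(path):
    """Compute a small utterance-level prosodic feature vector."""
    y, sr = librosa.load(path, sr=16000)
    # Frame-level pitch via probabilistic YIN; unvoiced frames come back NaN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[voiced_flag]                  # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]     # frame-level energy
    return np.array([
        np.nanmean(f0), np.nanstd(f0),    # pitch level and variability
        rms.mean(), rms.std(),            # energy level and variability
        len(y) / sr,                      # duration in seconds
    ])


# "corpus/*.wav" is a hypothetical location for per-utterance recordings.
paths = sorted(glob.glob("corpus/*.wav"))
X = np.vstack([utterance_features(p) for p in paths])
X = StandardScaler().fit_transform(X)     # put features on comparable scales

# Group utterances into a fixed number of expression clusters (k=4 assumed).
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
for path, label in zip(paths, labels):
    print(label, path)
```

In practice, the feature vector and the number of clusters would be driven by the perceptual annotations the framework collects, rather than fixed a priori as in this sketch.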




Acknowledgments
The research leading to these results has been partially funded by POLYTROPON Project (KRIPIS-GSRT, MIS: 448306).
Cite this article
Raptis, S., Karabetsos, S., Chalamandaris, A. et al. A framework towards expressive speech analysis and synthesis with preliminary results. J Multimodal User Interfaces 9, 387–394 (2015). https://doi.org/10.1007/s12193-015-0186-3