A framework towards expressive speech analysis and synthesis with preliminary results

  • Original Paper
  • Journal on Multimodal User Interfaces

Abstract

Emotion-aware computing presents one of the key challenges in contemporary research on natural human interaction, in which emotional speech is an essential modality of multimodal user interfaces. The speech modality concerns mainly speech emotion and affect recognition, as well as near-natural expressive speech synthesis, the latter being considered one of the next significant milestones in speech synthesis technology. A problem common to both recognizing and generating affective and emotional speech content is the methodology adopted for emotion analysis and modeling. This work proposes a generalized framework for annotating, analyzing, and modeling expressive speech with a data-driven machine learning approach, towards building expressive text-to-speech synthesis systems. To this end, the framework and the data-driven methodology are described, comprising the techniques and approaches for acoustic analysis and expression clustering. In addition, the deployment of online experimental tools for speech perception and annotation is presented, and the speech data used are described together with initial experimental results, illustrating the potential of the proposed framework and providing encouraging indications for further research.
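
As a concrete illustration of the kind of data-driven pipeline the abstract describes, the minimal sketch below extracts simple per-utterance prosodic features (F0 and energy statistics, here via librosa) and groups utterances into candidate expression clusters with k-means (scikit-learn). The feature set, the clustering algorithm, and the file names are assumptions made for illustration only; they are not the paper's actual method, which is detailed in the full text.

    # Illustrative sketch of acoustic analysis followed by unsupervised
    # expression clustering. This is a toy example under stated assumptions,
    # not the authors' implementation: the features (F0 and energy
    # statistics), the clustering algorithm (k-means), and the file names
    # are all hypothetical choices.
    import numpy as np
    import librosa
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def utterance_features(path, sr=16000):
        """Return a simple prosodic feature vector for one utterance:
        mean/std of F0 (pYIN) and mean/std of frame energy (RMS)."""
        y, sr = librosa.load(path, sr=sr)
        f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                fmax=librosa.note_to_hz("C7"), sr=sr)
        rms = librosa.feature.rms(y=y)[0]
        return np.array([np.nanmean(f0), np.nanstd(f0),  # pitch level / range
                         rms.mean(), rms.std()])         # energy level / range

    # Hypothetical utterance files; in practice, the segmented and
    # annotated recordings of an expressive speech corpus.
    paths = ["utt_0001.wav", "utt_0002.wav", "utt_0003.wav"]
    X = StandardScaler().fit_transform(
        np.vstack([utterance_features(p) for p in paths]))

    # Group utterances into candidate expression clusters.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for path, label in zip(paths, labels):
        print(f"{path}: expression cluster {label}")

In a complete system of this kind, the resulting clusters would then be validated against the perceptual annotations collected with online experimental tools such as those mentioned in the abstract.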

Acknowledgments

The research leading to these results has been partially funded by the POLYTROPON Project (KRIPIS-GSRT, MIS: 448306).

Author information

Correspondence to Sotiris Karabetsos.

About this article

Cite this article

Raptis, S., Karabetsos, S., Chalamandaris, A. et al. A framework towards expressive speech analysis and synthesis with preliminary results. J Multimodal User Interfaces 9, 387–394 (2015). https://doi.org/10.1007/s12193-015-0186-3
