DOI: 10.1145/3003733.3003815

Affective word ratings for concatenative text-to-speech synthesis

Published: 10 November 2016

Abstract

This work explores affective word ratings as an auxiliary target cost for unit-selection-based concatenative speech synthesis. The method requires neither task-specific crafted corpora nor additional annotations, making it well suited to found data. Following the general philosophy of our text-to-speech system, the approach does not enforce an explicit prosodic model; instead, the affect information is modeled implicitly via its contribution to the unit-selection cost function. The auxiliary affective feature vector comprises continuous ratings in three dimensions (valence, arousal, and dominance), extracted at the word level via state-of-the-art sentiment analysis techniques. In this case study, the speech data consist of several professionally produced children's audiobooks totaling about 5 hours of speech. The affective dimensions are shown to correlate well with acoustic/prosodic features extracted from the speech data, highlighting their utility for affective speech synthesis. This is further confirmed via a preference listening test between the baseline and the affective voice.
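
As a concrete illustration of the cost-function formulation described above, the sketch below shows one plausible way to fold word-level valence/arousal/dominance (VAD) ratings into a unit-selection target cost as an auxiliary sub-cost. The function names, dictionary fields, and weights are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

# Minimal, hypothetical sketch of the idea in the abstract: word-level VAD
# ratings contribute an auxiliary sub-cost to the unit-selection target cost.
# All names, fields, and weights below are assumptions for illustration.

def affective_distance(target_vad, unit_vad):
    """Euclidean distance between the target word's VAD ratings and the
    VAD ratings of the word the candidate unit was taken from."""
    return float(np.linalg.norm(np.asarray(target_vad, dtype=float)
                                - np.asarray(unit_vad, dtype=float)))

def base_target_cost(target, unit):
    """Placeholder for the synthesizer's existing target sub-costs
    (phonetic identity, stress, position in phrase, ...)."""
    return 0.0 if target["phone"] == unit["phone"] else 1.0

def target_cost(target, unit, w_base=1.0, w_affect=0.3):
    """Existing target cost plus a weighted affective sub-cost."""
    return (w_base * base_target_cost(target, unit)
            + w_affect * affective_distance(target["vad"], unit["vad"]))

# Example: the affective term penalizes a candidate unit whose source word
# carried a very different emotional rating than the word being synthesized.
target = {"phone": "aa", "vad": (0.8, 0.7, 0.5)}   # aroused, positive word
unit   = {"phone": "aa", "vad": (0.2, 0.1, 0.4)}   # calm, negative source word
print(target_cost(target, unit))                    # base 0.0 + affective penalty
```

In such a formulation the affective weight (here `w_affect`) would typically be tuned empirically, trading off phonetic/prosodic match against affective match.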


    Published In

    PCI '16: Proceedings of the 20th Pan-Hellenic Conference on Informatics
    November 2016
    449 pages
    ISBN:9781450347891
    DOI:10.1145/3003733

    In-Cooperation

    • Greek Com Soc: Greek Computer Society
    • TEI: Technological Educational Institution of Athens

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 November 2016


    Author Tags

    1. affective speech synthesis
    2. sentiment analysis for speech synthesis
    3. text-to-speech synthesis

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    PCI '16
    PCI '16: 20th Pan-Hellenic Conference on Informatics
    November 10 - 12, 2016
    Patras, Greece

    Acceptance Rates

    Overall Acceptance Rate 190 of 390 submissions, 49%
