Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices

Gessinger, Iona; Cohn, Michelle; Zellou, Georgia; Möbius, Bernd

doi:10.21437/Interspeech.2022-146

This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally ‘neutral’ production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, ‘happiness’ manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less ‘excited’ overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.

Cite as: Gessinger, I., Cohn, M., Zellou, G., Möbius, B. (2022) Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices. Proc. Interspeech 2022, 4970-4974, doi: 10.21437/Interspeech.2022-146

@inproceedings{gessinger22_interspeech,
  author={Iona Gessinger and Michelle Cohn and Georgia Zellou and Bernd Möbius},
  title={{Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4970--4974},
  doi={10.21437/Interspeech.2022-146}
}