
Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents’ Speech?

Published: 28 October 2021

Abstract

For conversational agents’ speech, either all possible sentences have to be prerecorded by voice actors or the required utterances can be synthesized. While synthesizing speech is more flexible and economical in production, it can also reduce the perceived naturalness of the agents, among other reasons because of mistakes at various linguistic levels. In this article, we are interested in the impact of adequate and inadequate prosody, particularly in terms of accent placement, on the perceived naturalness and aliveness of the agents. We compare (1) inadequate prosody, as generated by off-the-shelf text-to-speech (TTS) engines with synthetic output; (2) the same inadequate prosody imitated by trained human speakers; and (3) adequate prosody produced by those speakers. The speech was presented either as audio only or spoken by embodied, anthropomorphic agents, to investigate a potential masking effect of a simultaneous visual representation of those virtual agents. To this end, we conducted an online study with 40 participants listening to four different dialogues, each presented in the three Speech levels and the two Embodiment levels. Results confirmed that adequate prosody in human speech is perceived as more natural (and the agents are perceived as more alive) than inadequate prosody in both human (2) and synthetic speech (1). Thus, it is not sufficient to simply use a human voice for an agent’s speech to be perceived as natural; it is decisive whether the prosodic realisation is adequate or not. Furthermore, and surprisingly, we found no masking effect of speaker embodiment: neither a human voice with inadequate prosody nor a synthetic voice was judged as more natural when a virtual agent was visible than in the audio-only condition. On the contrary, the human voice was even judged as less “alive” when accompanied by a virtual agent. In sum, our results emphasize, on the one hand, the importance of adequate prosody for perceived naturalness, especially in terms of accents being placed on important words in the phrase, while showing, on the other hand, that the embodiment of virtual agents plays a minor role in the naturalness ratings of voices.
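To make the reported design concrete: the study is a 3 (Speech: TTS, human-inadequate, human-adequate) × 2 (Embodiment: audio only, embodied agent) within-subjects experiment in which 40 participants each rated four dialogues in every condition. Below is a minimal, hypothetical sketch of how such repeated-measures ratings could be analyzed with a linear mixed-effects model; the file name, column names, and the use of Python/statsmodels (as a stand-in for the R/lme4 toolchain the article cites) are assumptions for illustration, not the authors’ actual analysis script.

```python
# Hypothetical sketch: analyzing naturalness ratings from a
# 3 (Speech) x 2 (Embodiment) within-subjects design with a
# linear mixed-effects model. File and column names are assumed.
import pandas as pd
import statsmodels.formula.api as smf

# Long format: one row per participant x dialogue x condition, with
# columns: participant, dialogue, speech, embodiment, naturalness
data = pd.read_csv("ratings.csv")

# Random intercept per participant. Note: the design also calls for a
# crossed random intercept per dialogue (item), which is direct in R/lme4
# (naturalness ~ speech * embodiment + (1|participant) + (1|dialogue))
# but awkward in statsmodels, so this sketch simplifies to subjects only.
model = smf.mixedlm(
    "naturalness ~ C(speech) * C(embodiment)",  # fixed effects + interaction
    data,
    groups=data["participant"],
)
result = model.fit()
# Fixed-effect estimates test the Speech and Embodiment main effects and
# their interaction (e.g., whether embodiment masks inadequate prosody).
print(result.summary())
```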



Published In

ACM Transactions on Applied Perception, Volume 18, Issue 4
October 2021
74 pages
ISSN:1544-3558
EISSN:1544-3965
DOI:10.1145/3492443

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 October 2021
Accepted: 01 August 2021
Received: 01 August 2021
Published in TAP Volume 18, Issue 4


Author Tags

  1. Embodied conversational agents (ECAs)
  2. virtual acoustics
  3. prosody
  4. accentuation
  5. speech
  6. text-to-speech
  7. audio
  8. embodiment

Qualifiers

  • Research-article
  • Refereed


Cited By

  • Understanding voice naturalness. Trends in Cognitive Sciences (online Feb 2025). https://doi.org/10.1016/j.tics.2025.01.010
  • The Effect of Eye Contact in Multi-Party Conversations with Virtual Humans and Mitigating the Mona Lisa Effect. Electronics 13, 2 (2024), 430. https://doi.org/10.3390/electronics13020430
  • Wayfinding in immersive virtual environments as social activity supported by virtual agents. Frontiers in Virtual Reality 4 (2024). https://doi.org/10.3389/frvir.2023.1334795
  • Text-to-speech and virtual reality agents in primary school classroom environments. Journal of Computer Assisted Learning 40, 6 (2024), 2964–2984. https://doi.org/10.1111/jcal.13046
  • Audiovisual Coherence: Is Embodiment of Background Noise Sources a Necessity? In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 61–67. https://doi.org/10.1109/VRW62533.2024.00017
  • A lecturer’s voice quality and its effect on memory, listening effort, and perception in a VR environment. Scientific Reports 14, 1 (2024). https://doi.org/10.1038/s41598-024-63097-6
  • ERP evidence for Slavic and German word stress cue sensitivity in English. Frontiers in Psychology 14 (2023). https://doi.org/10.3389/fpsyg.2023.1193822
  • Who’s next? In Proceedings of the 23rd ACM International Conference on Intelligent Virtual Agents (2023), 1–8. https://doi.org/10.1145/3570945.3607312
  • Hi robot, it’s not what you say, it’s how you say it. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 307–314. https://doi.org/10.1109/RO-MAN57019.2023.10309427
  • Truedepth Measurements of Facial Expressions: Sensitivity to the Angle Between Camera and Face. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 1–5. https://doi.org/10.1109/ICASSPW59220.2023.10193107
