DOI: 10.1145/3462244.3479934

Digital Speech Makeup: Voice Conversion Based Altered Auditory Feedback for Transforming Self-Representation

Published: 18 October 2021

Abstract

Makeup (i.e., cosmetics) has long been used to transform not only one's appearance but also one's self-representation. Previous studies have demonstrated that visual transformations can induce a variety of effects on self-representation. Herein, we introduce Digital Speech Makeup (DSM), a novel concept that uses voice conversion (VC) based auditory feedback to transform human self-representation. We implemented a proof-of-concept system that leverages a state-of-the-art algorithm for near real-time VC and bone-conduction headphones to resolve the speech disruptions caused by delayed auditory feedback. Our user study confirmed that conversing through the system for a few dozen minutes influenced participants' speech ownership and implicit bias. Furthermore, we reviewed participants' comments about the experience of DSM and gained additional qualitative insight into possible future directions for the concept. Our work represents a first step toward utilizing VC to design interpersonal interactions centered on influencing users' psychological states.
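The abstract's point about delayed auditory feedback comes down to a latency budget: the longer the conversion pipeline delays the speaker's own voice on its way back to their ears, the more it disrupts fluent speech. As an illustration only (the sample rate, hop size, and buffering figures below are assumed for the example, not taken from the paper), a minimal sketch of how one might budget that delay in a frame-based VC loop:

```python
# Illustrative latency-budget sketch for a frame-based, near real-time
# voice-conversion loop. All numeric values are assumptions for this
# example, not values reported in the paper.

SAMPLE_RATE = 16_000   # assumed sampling rate (Hz)
HOP = 256              # assumed samples processed per hop

def hop_ms(hop: int = HOP, sr: int = SAMPLE_RATE) -> float:
    """Duration of one processing hop, in milliseconds."""
    return 1000.0 * hop / sr

def feedback_delay_ms(buffered_hops: int,
                      hop: int = HOP, sr: int = SAMPLE_RATE) -> float:
    """Total algorithmic delay when `buffered_hops` hops sit in the
    pipeline (input buffering + model inference + output buffering)."""
    return buffered_hops * hop_ms(hop, sr)

# One 256-sample hop at 16 kHz is 16 ms; a pipeline that buffers
# three hops end-to-end adds 48 ms of delay to the auditory feedback.
print(hop_ms())               # 16.0
print(feedback_delay_ms(3))   # 48.0
```

Delayed auditory feedback is known to become disruptive as the delay grows, which is why a system like DSM pairs a low-latency conversion algorithm with bone-conduction headphones rather than relying on the converted signal alone.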


Cited By

  • (2024) Investigating Effect of Altered Auditory Feedback on Self-Representation, Subjective Operator Experience, and Task Performance in Teleoperation of a Social Robot. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–18. https://doi.org/10.1145/3613904.3642561
  • (2024) Towards Inclusive Video Commenting: Introducing Signmaku for the Deaf and Hard-of-Hearing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–18. https://doi.org/10.1145/3613904.3642287
  • (2022) Avatar Voice Morphing to Match Subjective and Objective Self Voice Perception. In Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology, 1–2. https://doi.org/10.1145/3562939.3565671


Published In

ICMI '21: Proceedings of the 2021 International Conference on Multimodal Interaction
October 2021
876 pages
ISBN:9781450384810
DOI:10.1145/3462244
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. auditory feedback
  2. self-representation
  3. speech transformation
  4. voice conversion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • JSPS KAKENHI
  • MIC/SCOPE

Conference

ICMI '21
Sponsor:
ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
October 18 - 22, 2021
Montréal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%


