
VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation

Published: 30 November 2022
DOI: 10.1145/3550469.3555408

ABSTRACT

Singing and speaking are two fundamental forms of human communication. From a modeling perspective, however, speaking can be seen as a subset of singing. We present VOCAL, a system that automatically generates expressive, animator-centric lower face animation from singing audio input. Articulatory phonetics and voice instruction ascribe additional roles to vowels (projecting melody and volume) and consonants (lyrical clarity and rhythmic emphasis) in song. Our approach directly uses these insights to define axes for Melodic-accent and Pitch-sensitivity (Ma-Ps), which together provide an abstract space to visually represent various singing styles. In our system, vowels are processed first. A lyrical vowel is often sung tonally as one or more different vowels. We perform any such vowel modifications using a neural network trained on input audio. These vowels are then dilated from their spoken behaviour to bleed into each other based on Melodic-accent (Ma), with Pitch-sensitivity (Ps) modeling visual vibrato. Consonant animation curves are then layered in, with viseme intensity modeling rhythmic emphasis (inverse to Ma). Our evaluation is fourfold: we show the impact of our design parameters; we compare our results to ground truth and prior art; we present compelling results on a variety of voices and singing styles; and we validate these results with professional singers and animators.
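The Python sketch below is meant only to make the abstract's layering idea concrete; it is not the authors' implementation. All names (Phone, vowel_curve, consonant_curve, layered_intensity) and constants are illustrative assumptions. It shows the three ideas stated above: vowels are dilated so they bleed into each other in proportion to Melodic-accent (Ma), Pitch-sensitivity (Ps) scales a small vibrato-like modulation, and consonant viseme intensity is layered in with weight inverse to Ma.

    # Hypothetical sketch of Ma-Ps driven vowel/consonant layering.
    # Not the VOCAL system; names, signatures, and constants are assumed.

    from dataclasses import dataclass
    import math

    @dataclass
    class Phone:
        label: str       # e.g. "AA", "IY", "M"
        start: float     # seconds
        end: float       # seconds
        is_vowel: bool

    def vowel_curve(phone: Phone, ma: float, ps: float, t: float,
                    vibrato_hz: float = 5.5) -> float:
        """Vowel viseme intensity at time t.

        Ma in [0, 1] dilates the vowel so adjacent vowels bleed together;
        Ps in [0, 1] adds a small oscillation standing in for visual vibrato.
        """
        dur = phone.end - phone.start
        pad = 0.5 * ma * dur                      # dilation grows with Ma
        start, end = phone.start - pad, phone.end + pad
        if not (start <= t <= end):
            return 0.0
        u = (t - start) / (end - start)           # normalized position in [0, 1]
        envelope = math.sin(math.pi * u)          # smooth bell-shaped envelope
        vibrato = 1.0 + 0.1 * ps * math.sin(2.0 * math.pi * vibrato_hz * (t - start))
        return envelope * vibrato

    def consonant_curve(phone: Phone, ma: float, t: float) -> float:
        """Consonant viseme intensity; rhythmic emphasis is inverse to Ma."""
        if not (phone.start <= t <= phone.end):
            return 0.0
        u = (t - phone.start) / (phone.end - phone.start)
        return (1.0 - ma) * math.sin(math.pi * u)

    def layered_intensity(phones: list[Phone], ma: float, ps: float, t: float) -> float:
        """Vowel curves are computed first, then consonant curves are layered in."""
        vowels = max((vowel_curve(p, ma, ps, t) for p in phones if p.is_vowel),
                     default=0.0)
        consonants = max((consonant_curve(p, ma, t) for p in phones if not p.is_vowel),
                         default=0.0)
        return max(vowels, consonants)

    if __name__ == "__main__":
        phones = [Phone("S", 0.00, 0.08, False),
                  Phone("IY", 0.08, 0.60, True),
                  Phone("M", 0.60, 0.70, False)]
        for i in range(8):
            t = 0.1 * i
            print(f"t={t:.1f}s  intensity={layered_intensity(phones, ma=0.7, ps=0.4, t=t):.2f}")

A full system would drive per-viseme rig controls from aligned phonemes and the audio's pitch track rather than a single scalar intensity; the sketch only illustrates how the Ma-Ps axes could shape the layered curves.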


Supplemental Material

3550469.3555408.mov (QuickTime video, 487.9 MB)


Published in

SA '22: SIGGRAPH Asia 2022 Conference Papers
November 2022, 482 pages
ISBN: 9781450394703
DOI: 10.1145/3550469
Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions, 20%
