ABSTRACT
Singing and speaking are two fundamental forms of human communication. From a modeling perspective, however, speaking can be seen as a subset of singing. We present VOCAL, a system that automatically generates expressive, animator-centric lower-face animation from singing audio input. Articulatory phonetics and voice instruction ascribe additional roles to vowels (projecting melody and volume) and consonants (lyrical clarity and rhythmic emphasis) in song. Our approach directly uses these insights to define axes for Melodic-accent and Pitch-sensitivity (Ma-Ps), which together provide an abstract space in which to visually represent various singing styles. In our system, vowels are processed first. A lyrical vowel is often sung tonally as one or more different vowels; we perform any such vowel modifications using a neural network trained on input audio. These vowels are then dilated from their spoken behavior to bleed into each other based on Melodic-accent (Ma), with Pitch-sensitivity (Ps) modeling visual vibrato. Consonant animation curves are then layered in, with viseme intensity modeling rhythmic emphasis (varying inversely with Ma). Our evaluation is fourfold: we show the impact of our design parameters; we compare our results to ground truth and to prior art; we present compelling results on a variety of voices and singing styles; and we validate these results with professional singers and animators.
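To make the layering the abstract describes concrete, here is a minimal sketch of how Ma and Ps could act on per-phoneme animation curves: vowels are dilated in time by Ma, vibrato modulation is scaled by Ps, and consonant viseme intensity varies inversely with Ma. All names (`VisemeCurve`, `dilate_vowel`, the triangular activation, and the specific constants) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of Ma-Ps vowel/consonant layering (assumed names and
# constants; not the authors' code).
from dataclasses import dataclass
import math

@dataclass
class VisemeCurve:
    start: float      # onset time (s)
    end: float        # offset time (s)
    intensity: float  # peak activation in [0, 1]

def dilate_vowel(v: VisemeCurve, ma: float, max_bleed: float = 0.15) -> VisemeCurve:
    """Widen a vowel's temporal support so adjacent vowels bleed together.

    ma in [0, 1]: higher Melodic-accent -> longer, more connected vowels.
    """
    bleed = ma * max_bleed
    return VisemeCurve(v.start - bleed, v.end + bleed, v.intensity)

def vibrato_offset(t: float, ps: float, rate_hz: float = 5.5, depth: float = 0.08) -> float:
    """Small periodic jaw/lip modulation; Pitch-sensitivity ps scales its depth."""
    return ps * depth * math.sin(2.0 * math.pi * rate_hz * t)

def layered_activation(t, vowels, consonants, ma, ps):
    """Total lower-face activation at time t: vowel layer plus consonant layer."""
    def ramp(v: VisemeCurve, t: float) -> float:
        # Simple triangular activation inside [start, end].
        if not (v.start <= t <= v.end):
            return 0.0
        mid = 0.5 * (v.start + v.end)
        half = max(0.5 * (v.end - v.start), 1e-6)
        return v.intensity * (1.0 - abs(t - mid) / half)

    vowel_layer = max((ramp(dilate_vowel(v, ma), t) for v in vowels), default=0.0)
    vowel_layer = min(1.0, max(0.0, vowel_layer + vibrato_offset(t, ps)))
    # Consonant emphasis fades as the style becomes more melodic.
    cons_layer = max(((1.0 - ma) * ramp(c, t) for c in consonants), default=0.0)
    return max(vowel_layer, cons_layer)
```

Sampling `layered_activation` over time yields a single activation track; in this toy parameterization, an operatic legato style might sit near (Ma ≈ 0.8, Ps ≈ 0.6), while a percussive, rap-like delivery would sit near (Ma ≈ 0.1, Ps ≈ 0.0).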