ABSTRACT
Singing and speaking are two fundamental forms of human communication. From a modeling perspective, however, speaking can be seen as a subset of singing. We present VOCAL, a system that automatically generates expressive, animator-centric lower-face animation from singing audio input. Articulatory phonetics and voice instruction ascribe additional roles to vowels (projecting melody and volume) and consonants (lyrical clarity and rhythmic emphasis) in song. Our approach directly uses these insights to define axes for Melodic-accent and Pitch-sensitivity (Ma-Ps), which together provide an abstract space in which to visually represent various singing styles. In our system, vowels are processed first. A lyrical vowel is often sung tonally as one or more different vowels; we perform any such vowel modifications using a neural network trained on input audio. These vowels are then dilated from their spoken behavior to bleed into each other based on Melodic-accent (Ma), with Pitch-sensitivity (Ps) modeling visual vibrato. Consonant animation curves are then layered in, with viseme intensity modeling rhythmic emphasis (varying inversely with Ma). Our evaluation is fourfold: we show the impact of our design parameters; we compare our results to ground truth and to prior art; we present compelling results on a variety of voices and singing styles; and we validate these results with professional singers and animators.
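To make the layering the abstract describes concrete, here is a minimal sketch of how Ma and Ps could act on per-phoneme animation curves: vowels are dilated in time by Ma, vibrato modulation is scaled by Ps, and consonant viseme intensity varies inversely with Ma. All names (`VisemeCurve`, `dilate_vowel`, the triangular activation, and the specific constants) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of Ma-Ps vowel/consonant layering (assumed names and
# constants; not the authors' code).
from dataclasses import dataclass
import math

@dataclass
class VisemeCurve:
    start: float      # onset time (s)
    end: float        # offset time (s)
    intensity: float  # peak activation in [0, 1]

def dilate_vowel(v: VisemeCurve, ma: float, max_bleed: float = 0.15) -> VisemeCurve:
    """Widen a vowel's temporal support so adjacent vowels bleed together.

    ma in [0, 1]: higher Melodic-accent -> longer, more connected vowels.
    """
    bleed = ma * max_bleed
    return VisemeCurve(v.start - bleed, v.end + bleed, v.intensity)

def vibrato_offset(t: float, ps: float, rate_hz: float = 5.5, depth: float = 0.08) -> float:
    """Small periodic jaw/lip modulation; Pitch-sensitivity ps scales its depth."""
    return ps * depth * math.sin(2.0 * math.pi * rate_hz * t)

def layered_activation(t, vowels, consonants, ma, ps):
    """Total lower-face activation at time t: vowel layer plus consonant layer."""
    def ramp(v: VisemeCurve, t: float) -> float:
        # Simple triangular activation inside [start, end].
        if not (v.start <= t <= v.end):
            return 0.0
        mid = 0.5 * (v.start + v.end)
        half = max(0.5 * (v.end - v.start), 1e-6)
        return v.intensity * (1.0 - abs(t - mid) / half)

    vowel_layer = max((ramp(dilate_vowel(v, ma), t) for v in vowels), default=0.0)
    vowel_layer = min(1.0, max(0.0, vowel_layer + vibrato_offset(t, ps)))
    # Consonant emphasis fades as the style becomes more melodic.
    cons_layer = max(((1.0 - ma) * ramp(c, t) for c in consonants), default=0.0)
    return max(vowel_layer, cons_layer)
```

Sampling `layered_activation` over time yields a single activation track; in this toy parameterization, an operatic legato style might sit near (Ma ≈ 0.8, Ps ≈ 0.6), while a percussive, rap-like delivery would sit near (Ma ≈ 0.1, Ps ≈ 0.0).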