ABSTRACT
We propose a new method for generating realistic facial animation in real time, using face mesh data corresponding to the fifty-six C+V (Consonant and Vowel) type morae that form the basis of Japanese speech. The method produces facial expressions by weighted addition of fifty-three face meshes, driven by a real-time mapping of the streamed voice to the registered morae. Both photogrammetric models and existing off-the-shelf head models can serve as face meshes, and the full pipeline, from modeling to live animation, takes less than two hours. A user study showed that the facial expressions our method produces during Japanese speech were rated more natural than those of two popular real-time facial animation methods: the English-based Oculus Lipsync and animation driven by volume intensity.
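To make the weighted-addition idea concrete, the following minimal sketch shows a blendshape-style mixer in which each registered mora contributes an offset from the neutral face, scaled by a per-frame weight. It is an illustration of the general technique, not the authors' implementation: the MoraBlender class, the classify_mora stub, the smoothing constant, and the toy meshes are all assumptions introduced here.

```python
import numpy as np

class MoraBlender:
    """Weighted-sum (blendshape-style) mixer over per-mora face meshes."""

    def __init__(self, base_mesh, mora_meshes, smoothing=0.25):
        # base_mesh: (V, 3) neutral-face vertex positions.
        # mora_meshes: mora label -> (V, 3) vertex positions for that mora.
        self.base = base_mesh
        # Store each mora shape as an offset (delta) from the neutral face,
        # so the blend is base + sum_i w_i * delta_i.
        self.deltas = {m: mesh - base_mesh for m, mesh in mora_meshes.items()}
        self.weights = {m: 0.0 for m in mora_meshes}
        self.smoothing = smoothing  # assumed ease-in/ease-out constant

    def update(self, target_weights):
        # Ease the current weights toward the targets to avoid popping
        # between audio frames, then apply the weighted addition of deltas.
        out = self.base.copy()
        for mora, delta in self.deltas.items():
            target = target_weights.get(mora, 0.0)
            self.weights[mora] += self.smoothing * (target - self.weights[mora])
            out += self.weights[mora] * delta
        return out

def classify_mora(audio_frame):
    # Hypothetical stand-in for the streaming audio -> mora mapping; a real
    # system would run a mora/phoneme recognizer on the audio frame here.
    return {"ka": 1.0}

# Per-frame driving loop on toy data (a 4-vertex "face"):
base = np.zeros((4, 3))
blender = MoraBlender(base, {"a": np.ones((4, 3)), "ka": np.full((4, 3), 0.5)})
frame_vertices = blender.update(classify_mora(None))
```

Storing meshes as deltas keeps the per-frame cost linear in the number of active morae, which is what makes a weighted sum of fifty-three meshes feasible at interactive rates.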
REFERENCES
- Visage Technologies AB. 2012. MPEG-4 Face and Body Animation (MPEG-4 FBA). https://www.visagetechnologies.com/uploads/2012/08/MPEG-4FBAOverview.pdf. (Accessed on 12/01/2022).
- Apple. 2020. ARFaceAnchor.BlendShapeLocation | Apple Developer Documentation. https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation. (Accessed on 12/01/2022).
- Gérard Bailly. 1997. Learning to speak. Sensori-motor control of speech movements. Speech Communication 22, 2 (1997), 251–267. https://doi.org/10.1016/S0167-6393(97)00025-3
- Gérard Bailly, Pascal Perrier, and Eric Vatikiotis-Bateson (Eds.). 2012. Audiovisual Speech Processing. Cambridge University Press, Cambridge, England.
- Preston Blair. 2012. Animation: Learn How to Draw Animated Cartoons. Literary Licensing, USA.
- Keith Brown and Sarah Ogilvie. 2008. Concise Encyclopedia of Languages of the World. Elsevier Science, London, England.
- AHS Co. 2020. VOICEROID2 Yukari Yuzuki. https://www.ah-soft.com/voiceroid/yukari/. (Accessed on 12/01/2022).
- DevelopW Corporation. 2020. iFacialMocap. https://www.ifacialmocap.com/tutorial/unity/. (Accessed on 12/01/2022).
- Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J. Black. 2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, CA, USA, 10093–10103. https://doi.org/10.1109/CVPR.2019.01034
- Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: An Animator-Centric Viseme Model for Expressive Lip Synchronization. ACM Trans. Graph. 35, 4, Article 127 (July 2016), 11 pages. https://doi.org/10.1145/2897824.2925984
- Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System. Environmental Psychology & Nonverbal Behavior (1978).
- T. Ezzat and T. Poggio. 1998. MikeTalk: A Talking Facial Display Based on Morphing Visemes. In Proceedings Computer Animation ’98 (Cat. No.98EX169). IEEE Computer Society, Philadelphia, USA, 96–102. https://doi.org/10.1109/CA.1998.681913
- Cletus G. Fisher. 1968. Confusions among visually perceived consonants. Journal of Speech and Hearing Research 11, 4 (1968), 796–804.
- hecomi. 2021. hecomi/uLipSync: A MFCC-based LipSync plugin for Unity using Burst Compiler. https://github.com/hecomi/uLipSync. (Accessed on 12/01/2022).
- Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion. ACM Trans. Graph. 36, 4, Article 94 (July 2017), 12 pages. https://doi.org/10.1145/3072959.3073658
- J. P. Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Fred Pighin, and Zhigang Deng. 2014. Practice and Theory of Blendshape Facial Models. In Eurographics 2014 - State of the Art Reports, Sylvain Lefebvre and Michela Spagnuolo (Eds.). The Eurographics Association, Strasbourg, France, 199–218. https://doi.org/10.2312/egst.20141042
- Meta. 2018. Tech Note: Enhancing Oculus Lipsync with Deep Learning. https://developer.oculus.com/blog/tech-note-enhancing-oculus-lipsync-with-deep-learning/. (Accessed on 12/01/2022).
- Masahiro Mori, Karl F. MacDorman, and Norri Kageki. 2012. The Uncanny Valley [From the Field]. IEEE Robotics & Automation Magazine 19, 2 (2012), 98–100. https://doi.org/10.1109/MRA.2012.2192811
- Jason Osipa. 2010. Stop Staring: Facial Modeling and Animation Done Right (3 ed.). John Wiley & Sons, Chichester, England.
- T. Otake, G. Hatano, A. Cutler, and J. Mehler. 1993. Mora or Syllable? Speech Segmentation in Japanese. Journal of Memory and Language 32, 2 (1993), 258–278. https://doi.org/10.1006/jmla.1993.1014
- Frederick I. Parke. 1972. Computer Generated Animation of Faces. In Proceedings of the ACM Annual Conference - Volume 1 (Boston, Massachusetts, USA) (ACM ’72). Association for Computing Machinery, New York, NY, USA, 451–457. https://doi.org/10.1145/800193.569955
- R3DS. 2020. Wrapping — R3DS Wrap documentation. https://www.russian3dscanner.com/docs/Wrap3/Nodes/Wrapping/Wrapping.html. (Accessed on 12/01/2022).
- Alexander Richard, Michael Zollhoefer, Yandong Wen, Fernando de la Torre, and Yaser Sheikh. 2021. MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement. arXiv:2104.08223. https://doi.org/10.48550/ARXIV.2104.08223
- Tim Riney and Janet Anderson-Hsieh. 1993. Japanese pronunciation of English. JALT Journal 15, 1 (1993), 21–36.
- Keith Brown and Ronald E. Asher. 2006. Encyclopedia of Language and Linguistics, 14-Volume Set (2 ed.). Elsevier Science & Technology, Amsterdam, Netherlands, 149–156.
- Sarah L. Taylor, Moshe Mahler, Barry-John Theobald, and Iain Matthews. 2012. Dynamic Units of Visual Speech. In Eurographics/ACM SIGGRAPH Symposium on Computer Animation, Jehee Lee and Paul Kry (Eds.). The Eurographics Association, Lausanne, Switzerland, 275–284. https://doi.org/10.2312/SCA/SCA12/275-284
- Lance Williams. 1990. Performance-Driven Facial Animation. SIGGRAPH Comput. Graph. 24, 4 (September 1990), 235–242. https://doi.org/10.1145/97880.97906
- Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. Visemenet: Audio-Driven Animator-Centric Speech Animation. ACM Trans. Graph. 37, 4, Article 161 (July 2018), 10 pages. https://doi.org/10.1145/3197517.3201292