Abstract
In this work, we investigate the problem of synthesizing a talking-face video that is synchronized with a target speech segment. Although there has been significant progress on this task, the most successful approaches remain identity-agnostic: the results they generate cannot capture the unique speaking characteristics of the target individual. In this paper, we propose an effective framework that enables personalized and high-fidelity lip synchronization. The emphasis of our method lies in the data preprocessing and inference procedures. Specifically, we find that convolutional neural networks (CNNs) are more inclined to learn lip motions from facial and throat muscle movements than from speech features. Only by masking the regions related to lip-shape changes as thoroughly as possible can a CNN learn to associate speech features with lip motions. Given the original head-pose information, our model relies solely on the features of the input audio signal to generate realistic images. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results. The code will be made publicly available.
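To make the masking idea concrete, the following is a minimal sketch of lower-face masking driven by facial landmarks: the mouth, jaw, chin, and upper-throat region is blacked out so a network cannot read lip shape from muscle movements and must rely on the audio features instead. The landmark indices, the downward margin, and the constant fill are illustrative assumptions, not the paper's exact preprocessing.

```python
import numpy as np

def mask_lower_face(frame: np.ndarray, landmarks: np.ndarray,
                    margin: float = 0.15) -> np.ndarray:
    """frame: HxWx3 uint8 image; landmarks: 68x2 array of (x, y) points
    (e.g. from a dlib-style 68-point detector). Returns a copy of the
    frame with the lower-face region blacked out."""
    h, w = frame.shape[:2]
    # Assumed 68-point convention: jaw contour is indices 0..16,
    # nose tip is index 30; everything below the nose tip is hidden.
    nose_y = landmarks[30, 1]
    jaw = landmarks[0:17]
    x_min = max(int(jaw[:, 0].min()), 0)
    x_max = min(int(jaw[:, 0].max()), w)
    # Extend the mask downward past the chin to also cover throat muscles.
    y_min = max(int(nose_y), 0)
    y_max = min(int(jaw[:, 1].max() + margin * h), h)

    masked = frame.copy()
    masked[y_min:y_max, x_min:x_max] = 0  # constant fill removes lip-shape cues
    return masked
```

In such a setup the masked frame supplies pose and identity context, while the speech features alone have to explain the lip region that the generator fills back in.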
About this paper
Cite this paper
Bai, X., Zhou, J., Zhang, P., Hao, R. (2024). Make Audio Solely Drive Lip in Talking Face Video Synthesis. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15018. Springer, Cham. https://doi.org/10.1007/978-3-031-72338-4_24
DOI: https://doi.org/10.1007/978-3-031-72338-4_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72337-7
Online ISBN: 978-3-031-72338-4