Sem-Avatar: Semantic Controlled Neural Field for High-Fidelity Audio Driven Avatar

Zhou, Xiang; Zhang, Weichen; Ding, Yikang; Zhou, Fan; Zhang, Kai

doi:10.1007/978-981-99-8432-9_6

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14426))

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

In this paper, we tackle the audio-driven avatar challenge by fitting a semantic controlled neural field to a talking-head video. While existing methods struggle with realism and head-torso inconsistency, our novel end-to-end framework, the semantic controlled neural field (Sem-Avatar), successfully overcomes these problems and delivers a high-fidelity avatar. Specifically, we devise a one-stage audio-driven forward deformation approach to ensure head-torso alignment. We further propose using a semantic mask as a control signal for eye opening, which makes the avatar look more natural. We train our framework by comparing the rendered avatar to the original video, and we add a semantic loss that leverages a human face prior to stabilize training. Extensive experiments on public datasets demonstrate Sem-Avatar’s superior rendering quality and lip synchronization, establishing a new state of the art for audio-driven avatars.
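The training objective described in the abstract (a reconstruction term comparing rendered frames to the video, plus a semantic term driven by a face prior) can be illustrated with a minimal sketch. The abstract does not give the actual loss formulas, so everything here is an assumption for illustration only: the L1 photometric term, the per-pixel cross-entropy against a face-parsing label map, the weight `sem_weight`, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def photometric_loss(rendered, target):
    """Mean absolute error between a rendered frame and the video frame.
    (Assumed form; the abstract only says the two are compared.)"""
    return float(np.mean(np.abs(rendered - target)))

def semantic_loss(pred_probs, mask_labels, eps=1e-8):
    """Per-pixel cross-entropy between predicted class probabilities
    of shape (H, W, C) and integer labels of shape (H, W), standing in
    for a semantic mask produced by a face-parsing prior (assumption)."""
    h, w, _ = pred_probs.shape
    rows, cols = np.indices((h, w))
    # Probability assigned to the labelled class at every pixel.
    picked = pred_probs[rows, cols, mask_labels]
    return float(-np.mean(np.log(picked + eps)))

def total_loss(rendered, target, pred_probs, mask_labels, sem_weight=0.1):
    """Reconstruction term plus a weighted semantic term.
    sem_weight is a hypothetical hyperparameter."""
    return (photometric_loss(rendered, target)
            + sem_weight * semantic_loss(pred_probs, mask_labels))

# Toy data: a 4x4 RGB frame and a 3-class semantic map.
rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
rendered = target + 0.05                 # constant rendering error
labels = rng.integers(0, 3, size=(4, 4))
probs = np.full((4, 4, 3), 1.0 / 3.0)    # uninformative prediction
loss = total_loss(rendered, target, probs, labels)
```

In this toy setup the photometric term equals the constant offset and the semantic term equals the entropy of a uniform three-class prediction; a real system would replace both inputs with outputs of the renderer and a segmentation network.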

X. Zhou and W. Zhang—Equal Contribution

Acknowledgments

This work was supported by the Key-Area Research and Development Program of Guangdong Province, under Grant 2020B0909050003.

Author information

Authors and Affiliations

Shenzhen International Graduate School, Tsinghua University, Beijing, China
Xiang Zhou, Weichen Zhang, Yikang Ding, Fan Zhou & Kai Zhang
Research Institute of Tsinghua, Pearl River Delta, Beijing, China
Kai Zhang

Corresponding author

Correspondence to Kai Zhang.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

About this paper

Cite this paper

Zhou, X., Zhang, W., Ding, Y., Zhou, F., Zhang, K. (2024). Sem-Avatar: Semantic Controlled Neural Field for High-Fidelity Audio Driven Avatar. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14426. Springer, Singapore. https://doi.org/10.1007/978-981-99-8432-9_6

DOI: https://doi.org/10.1007/978-981-99-8432-9_6
Published: 24 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8431-2
Online ISBN: 978-981-99-8432-9
eBook Packages: Computer Science, Computer Science (R0)
