DOI: 10.1145/3664647.3681017
Research article, MM '24 Conference Proceedings

Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation

Published: 28 October 2024

Abstract

Speech-Preserving Facial Expression Manipulation (SPFEM) aims to alter facial emotions in video content while preserving the facial movements associated with speech. Current works often fall short due to inadequate emotion representations and the absence of time-aligned paired data: two corresponding frames from the same speaker that share the same speech content but differ in emotional expression. In this work, we introduce Self-Supervised Emotion Representation Disentanglement (SSERD), a novel framework that disentangles emotion representations for accurate emotion transfer and incorporates a paired-data construction module to enable automated, photorealistic facial animation. Specifically, we develop a module that learns emotion latent codes in StyleGAN's latent space, employing a cross-attention mechanism to extract and predict emotion editing codes, with contrastive learning to differentiate emotions. To overcome the lack of strictly paired data in the SPFEM task, we exploit a pretrained StyleGAN to generate paired data, focusing on expression vectors unrelated to mouth shape. Additionally, we employ a hybrid training strategy that uses both synthetic paired and real unpaired data to enhance the realism of the SPFEM model's generated images. Extensive experiments on benchmark datasets, including MEAD and RAVDESS, validate the effectiveness of our framework and demonstrate its superior capability in generating photorealistic and expressive facial animations.
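To make the emotion-editing-code idea concrete, below is a minimal PyTorch sketch of the kind of module the abstract describes: learnable queries cross-attend to inverted StyleGAN W+ latents to predict an emotion editing code, and a supervised contrastive loss pulls same-emotion codes together while pushing different emotions apart. All class names, dimensions, query counts, and loss details here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionCodePredictor(nn.Module):
    """Hypothetical cross-attention module: predicts an emotion editing code
    (a delta in StyleGAN's W+ space) from a batch of inverted latents."""

    def __init__(self, num_layers=18, latent_dim=512, num_queries=4, num_heads=8):
        super().__init__()
        # Learnable emotion queries attend to the per-layer W+ latents.
        self.queries = nn.Parameter(torch.randn(num_queries, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.head = nn.Linear(num_queries * latent_dim, num_layers * latent_dim)
        self.num_layers = num_layers
        self.latent_dim = latent_dim

    def forward(self, w_plus):  # w_plus: (B, num_layers, latent_dim)
        q = self.queries.unsqueeze(0).expand(w_plus.size(0), -1, -1)
        attended, _ = self.cross_attn(q, w_plus, w_plus)  # queries read the latents
        delta = self.head(attended.flatten(1))            # emotion editing code
        return delta.view(-1, self.num_layers, self.latent_dim)

def emotion_contrastive_loss(codes, labels, temperature=0.1):
    """Supervised contrastive loss: codes sharing an emotion label are
    positives; all other codes in the batch act as negatives."""
    z = F.normalize(codes.flatten(1), dim=1)
    sim = z @ z.t() / temperature                              # (B, B) similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)        # keep positives only
    loss = -pos_log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

# Toy usage with hypothetical labels (e.g. MEAD's 8 emotion classes).
predictor = EmotionCodePredictor()
w = torch.randn(8, 18, 512)          # W+ latents from a GAN-inversion encoder
labels = torch.randint(0, 8, (8,))
delta = predictor(w)                 # predicted emotion editing codes
loss = emotion_contrastive_loss(delta, labels)
edited_w = w + delta                 # edited latents, to be decoded by StyleGAN
```

In the same spirit, the paper's paired-data construction could apply such a delta to a neutral frame's latent while holding mouth-related latent directions fixed, yielding a synthetic emotional counterpart with identical speech content; the sketch above covers only the emotion-code side.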


Cited By

  • (2025) Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation. International Journal of Computer Vision. DOI: 10.1007/s11263-025-02358-x (online 4 February 2025)

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. emotion representation disentanglement
    2. expression manipulation
    3. lip synchronization
    4. self-supervision

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
