DOI: 10.1145/3664647.3681017
Research article, MM '24 Conference Proceedings

Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation

Published: 28 October 2024

Abstract

Speech-Preserving Facial Expression Manipulation (SPFEM) aims to alter facial emotions in video content while preserving the facial movements associated with speech. Current works often fall short due to inadequate emotion representations and the absence of time-aligned paired data: two corresponding frames from the same speaker that share the same speech content but differ in emotional expression. In this work, we introduce Self-Supervised Emotion Representation Disentanglement (SSERD), a novel framework that disentangles emotion representations for accurate emotion transfer and incorporates a paired-data construction module to enable automated, photorealistic facial animation. Specifically, we develop a module that learns emotion latent codes in StyleGAN's latent space, employing a cross-attention mechanism to extract and predict emotion editing codes, with contrastive learning to differentiate emotions. To overcome the lack of strictly paired data in the SPFEM task, we exploit a pretrained StyleGAN to generate paired data, focusing on expression vectors unrelated to mouth shape. Additionally, we employ a hybrid training strategy that uses both synthetic paired and real unpaired data to enhance the realism of the SPFEM model's generated images. Extensive experiments on benchmark datasets, including MEAD and RAVDESS, validate the effectiveness of our framework and demonstrate its superior capability in generating photorealistic and expressive facial animations.
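To make the emotion-editing-code idea concrete, below is a minimal PyTorch sketch of the kind of module the abstract describes: learnable queries cross-attend to inverted StyleGAN W+ latents to predict an emotion editing code, and a supervised contrastive loss pulls same-emotion codes together while pushing different emotions apart. All class names, dimensions, query counts, and loss details here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionCodePredictor(nn.Module):
    """Hypothetical cross-attention module: predicts an emotion editing code
    (a delta in StyleGAN's W+ space) from a batch of inverted latents."""

    def __init__(self, num_layers=18, latent_dim=512, num_queries=4, num_heads=8):
        super().__init__()
        # Learnable emotion queries attend to the per-layer W+ latents.
        self.queries = nn.Parameter(torch.randn(num_queries, latent_dim))
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads, batch_first=True)
        self.head = nn.Linear(num_queries * latent_dim, num_layers * latent_dim)
        self.num_layers = num_layers
        self.latent_dim = latent_dim

    def forward(self, w_plus):  # w_plus: (B, num_layers, latent_dim)
        q = self.queries.unsqueeze(0).expand(w_plus.size(0), -1, -1)
        attended, _ = self.cross_attn(q, w_plus, w_plus)  # queries read the latents
        delta = self.head(attended.flatten(1))            # emotion editing code
        return delta.view(-1, self.num_layers, self.latent_dim)

def emotion_contrastive_loss(codes, labels, temperature=0.1):
    """Supervised contrastive loss: codes sharing an emotion label are
    positives; all other codes in the batch act as negatives."""
    z = F.normalize(codes.flatten(1), dim=1)
    sim = z @ z.t() / temperature                              # (B, B) similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float('-inf'))            # exclude self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)        # keep positives only
    loss = -pos_log_prob.sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

# Toy usage with hypothetical labels (e.g. MEAD's 8 emotion classes).
predictor = EmotionCodePredictor()
w = torch.randn(8, 18, 512)          # W+ latents from a GAN-inversion encoder
labels = torch.randint(0, 8, (8,))
delta = predictor(w)                 # predicted emotion editing codes
loss = emotion_contrastive_loss(delta, labels)
edited_w = w + delta                 # edited latents, to be decoded by StyleGAN
```

In the same spirit, the paper's paired-data construction could apply such a delta to a neutral frame's latent while holding mouth-related latent directions fixed, yielding a synthetic emotional counterpart with identical speech content; the sketch above covers only the emotion-code side.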


Cited By

  • (2025) Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation. International Journal of Computer Vision. DOI: 10.1007/s11263-025-02358-x (online 4 February 2025)

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. emotion representation disentanglement
    2. expression manipulation
    3. lip synchronization
    4. self-supervision

    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
