ABSTRACT
We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies facial expressions of emotion in videos of speakers. Previous work modifies emotion in images, animates a single image into a video with a chosen emotion, or puppets facial expressions in videos using landmarks from a reference video. However, many use cases, such as modifying an actor's performance in post-production, coaching individuals to be more animated speakers, or touching up emotion in a teleconference, require a video-to-video translation approach. We explore a method that maintains speakers' identity and pose while translating their expressed emotion. Our approach extends an existing multi-modal lip-synchronization architecture to modify the speaker's emotion using L1 reconstruction and pre-trained emotion objectives. We also propose a novel automated emotion evaluation approach and corroborate it with a user study. Both find that we succeed in modifying emotion while maintaining lip synchronization. Visual quality is somewhat diminished, with a trade-off between greater emotion modification and visual quality across model variants. Nevertheless, we demonstrate (1) that facial expressions of emotion can be modified with nothing other than L1 reconstruction and pre-trained emotion objectives and (2) that our automated emotion evaluation approach aligns with human judgements.
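The core technical claim is that the two losses alone suffice: an L1 reconstruction term that keeps the output close to the source video, plus an emotion term scored by a frozen, pre-trained classifier. The sketch below illustrates one way such a combined objective could look in PyTorch; `generator`, `emotion_net`, and the weights `w_recon` and `w_emo` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def emotion_translation_loss(generator, emotion_net, frames, audio,
                             target_emotion, w_recon=1.0, w_emo=0.1):
    """Sketch of a combined objective: L1 reconstruction plus a
    cross-entropy emotion term from a frozen, pre-trained classifier.
    Module names and loss weights are hypothetical placeholders."""
    generated = generator(frames, audio)

    # L1 reconstruction anchors identity, pose, and lip movements
    # to the input frames.
    recon = F.l1_loss(generated, frames)

    # The pre-trained classifier scores the generated frames. Its
    # parameters are assumed frozen (requires_grad_(False)), so the
    # gradient flows only through the frames back into the generator.
    logits = emotion_net(generated)
    emo = F.cross_entropy(logits, target_emotion)

    return w_recon * recon + w_emo * emo
```

Under this framing, the frozen classifier acts as a perceptual guide that pushes expressions toward the target label, while the L1 term limits how far the output can drift from the source, preserving identity and pose.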