ABSTRACT
We present Wav2Lip-Emotion, a video-to-video translation architecture that modifies facial expressions of emotion in videos of speakers. Previous work modifies emotion in images, animates a single image into a video with a chosen emotion, or puppets facial expressions in videos using landmarks from a reference video. However, many use cases, such as modifying an actor's performance in post-production, coaching individuals to be more animated speakers, or touching up emotion in a teleconference, require a video-to-video translation approach. We explore a method that maintains speakers' identity and pose while translating their expressed emotion. Our approach extends an existing multi-modal lip-synchronization architecture to modify the speaker's emotion using L1 reconstruction and pre-trained emotion objectives. We also propose a novel automated emotion evaluation approach and corroborate it with a user study. Both find that we succeed in modifying emotion while maintaining lip synchronization. Visual quality is somewhat diminished, with a trade-off between greater emotion modification and visual quality across model variants. Nevertheless, we demonstrate (1) that facial expressions of emotion can be modified with nothing other than L1 reconstruction and pre-trained emotion objectives and (2) that our automated emotion evaluation approach aligns with human judgements.
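The core technical claim is that the two losses alone suffice: an L1 reconstruction term that keeps the output close to the source video, plus an emotion term scored by a frozen, pre-trained classifier. The sketch below illustrates one way such a combined objective could look in PyTorch; `generator`, `emotion_net`, and the weights `w_recon` and `w_emo` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def emotion_translation_loss(generator, emotion_net, frames, audio,
                             target_emotion, w_recon=1.0, w_emo=0.1):
    """Sketch of a combined objective: L1 reconstruction plus a
    cross-entropy emotion term from a frozen, pre-trained classifier.
    Module names and loss weights are hypothetical placeholders."""
    generated = generator(frames, audio)

    # L1 reconstruction anchors identity, pose, and lip movements
    # to the input frames.
    recon = F.l1_loss(generated, frames)

    # The pre-trained classifier scores the generated frames. Its
    # parameters are assumed frozen (requires_grad_(False)), so the
    # gradient flows only through the frames back into the generator.
    logits = emotion_net(generated)
    emo = F.cross_entropy(logits, target_emotion)

    return w_recon * recon + w_emo * emo
```

Under this framing, the frozen classifier acts as a perceptual guide that pushes expressions toward the target label, while the L1 term limits how far the output can drift from the source, preserving identity and pose.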