DOI: 10.1145/3581783.3612854
Research article

UniFaRN: Unified Transformer for Facial Reaction Generation

Published: 27 October 2023

ABSTRACT

We propose the Unified Transformer for Facial Reaction GeneratioN (UniFaRN) framework for facial reaction prediction in dyadic interactions. Given the video and audio of one side, the task is to generate the facial reactions of the other side. The challenge of the task lies in fusing multi-modal inputs and balancing appropriateness against diversity. We adopt the Transformer architecture to tackle this challenge, leveraging its flexibility in handling multi-modal data and its ability to control the generation process. By capturing the correlations between multi-modal inputs and outputs with unified layers and balancing appropriateness against diversity with sampling methods, we won first place in the REACT2023 challenge.
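The abstract does not give implementation details, but the trade-off it describes between appropriateness and diversity is typically controlled at decoding time through sampling hyperparameters. A minimal sketch (hypothetical illustration, not the authors' code) of temperature plus top-k sampling over a decoder's output logits:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token id from raw decoder logits.

    Lower temperature / smaller top_k push the distribution toward its mode
    (more conservative, "appropriate" reactions); higher values flatten it
    (more diverse reactions).
    """
    rng = rng if rng is not None else np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    if top_k is not None:
        # Mask everything outside the k highest-scoring tokens.
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With `top_k=1` this reduces to greedy decoding (maximally "appropriate"); relaxing either knob trades appropriateness for diversity, which matches the balancing act described above.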


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Copyright © 2023 ACM
Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)
