DOI: 10.1145/3581783.3612840

Semi-Supervised Multimodal Emotion Recognition with Expression MAE

Published: 27 October 2023

Abstract

The Multimodal Emotion Recognition (MER 2023) challenge aims to recognize emotion from audio, language, and visual signals, facilitating innovative affective-computing technologies. This paper presents our submission to the Semi-Supervised Learning Sub-Challenge (MER-SEMI). First, using large-scale unlabeled emotional videos, we train both image-based and video-based Masked Autoencoders to extract visual features; we term these models expression MAE (expMAE) for simplicity. The expMAE features prove largely complementary to the official baseline features. Second, because only a small amount of labeled data is available, we use a classifier to generate pseudo labels for unlabeled videos that are assigned to a category with high confidence. In addition, we explore several advanced large models, such as CLIP, for cross-modal feature extraction, and apply factorized bilinear pooling (FBP) for multimodal feature fusion. Our method achieved an F1 score of 88.55% on MER-SEMI, ranking second among all participating teams.
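Two of the components described above lend themselves to a brief illustration: factorized bilinear pooling (FBP) for fusing per-modality features, and confidence-based pseudo-labeling of unlabeled videos. The PyTorch sketch below is a minimal, hypothetical rendering of both ideas; the module names, feature dimensions, factor size, and confidence threshold are illustrative assumptions rather than the authors' actual configuration.

```python
# Minimal sketch (not the authors' implementation): MFB-style factorized
# bilinear pooling for two modality features, plus confidence-thresholded
# pseudo-label selection. All dimensions and thresholds are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedBilinearPooling(nn.Module):
    """Fuse two modality feature vectors with factorized bilinear pooling."""

    def __init__(self, dim_a: int, dim_b: int, out_dim: int = 256, factor: int = 4):
        super().__init__()
        self.factor = factor
        self.proj_a = nn.Linear(dim_a, out_dim * factor)
        self.proj_b = nn.Linear(dim_b, out_dim * factor)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        joint = self.proj_a(a) * self.proj_b(b)                      # low-rank bilinear interaction
        joint = joint.view(a.size(0), -1, self.factor).sum(-1)       # sum-pool over the factor dimension
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-8)   # signed square root
        return F.normalize(joint, dim=-1)                            # L2 normalization


def select_pseudo_labels(logits: torch.Tensor, threshold: float = 0.9):
    """Return a mask over unlabeled samples whose top-class probability
    exceeds `threshold`, together with the corresponding pseudo labels."""
    probs = logits.softmax(dim=-1)
    confidence, labels = probs.max(dim=-1)
    mask = confidence >= threshold
    return mask, labels[mask]


if __name__ == "__main__":
    # Hypothetical dimensions: 768-d visual (e.g. expMAE) and 1024-d audio features.
    fbp = FactorizedBilinearPooling(dim_a=768, dim_b=1024, out_dim=256)
    visual = torch.randn(8, 768)
    audio = torch.randn(8, 1024)
    fused = fbp(visual, audio)                   # (8, 256) fused representation
    logits = torch.randn(8, 6)                   # e.g. 6 emotion categories
    mask, pseudo = select_pseudo_labels(logits)  # confident samples and their labels
    print(fused.shape, int(mask.sum()), pseudo)
```

In the usual semi-supervised loop the abstract alludes to, videos selected by such a confidence threshold would be added to the labeled pool with their pseudo labels and the classifier retrained.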




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. masked autoencoder
  2. multimodal emotion recognition
  3. semi-supervised learning

Qualifiers

  • Research-article


Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada


Article Metrics

  • Downloads (Last 12 months)321
  • Downloads (Last 6 weeks)12
Reflects downloads up to 05 Mar 2025


Cited By

  • SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition. Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing (2024), 78-87. https://doi.org/10.1145/3689092.3689404
  • Multimodal Consistency-Based Teacher for Semi-Supervised Multimodal Sentiment Analysis. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 32 (2024), 3669-3683. https://doi.org/10.1109/TASLP.2024.3430543
