DOI: 10.1145/3606039.3613111

Temporal-aware Multimodal Feature Fusion for Sentiment Analysis

Published: 29 October 2023

Abstract

In this paper, we present our solution to the MuSe-Personalisation sub-challenge of the Multimodal Sentiment Analysis Challenge (MuSe) 2023. The MuSe-Personalisation task aims to predict time-continuous emotion values (i.e., arousal and valence) from multimodal data. The sub-challenge is complicated by individual variation, which leads to poor generalisation on unseen test subjects. To address this problem, we first extract several informative visual features, and then propose a framework comprising feature selection, feature learning, and a fusion strategy to discover the best combination of features for sentiment analysis. Our method achieved the top-1 result in the MuSe-Personalisation sub-challenge, with a combined Concordance Correlation Coefficient (CCC) of 0.8681 for physiological arousal and valence on the test set, outperforming the baseline system by a large margin (10.42%).
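The CCC used for evaluation can be computed as sketched below. This is a minimal illustration of the standard formula, not the organisers' evaluation script; the combined score is assumed here to be the simple mean of the arousal and valence CCCs, and the arrays are made-up example data.

```python
import numpy as np

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    """Concordance Correlation Coefficient between two 1-D sequences."""
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()  # population variance (ddof=0)
    cov = ((preds - mean_p) * (labels - mean_l)).mean()
    # CCC penalises both scale and location shifts, unlike Pearson's r.
    return 2 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)

# Perfect agreement yields CCC = 1.0
x = np.array([0.1, 0.4, 0.35, 0.8])
print(ccc(x, x))  # -> 1.0

# Combined score as the mean of per-target CCCs (illustrative values)
arousal_ccc = ccc(np.array([0.2, 0.5, 0.7]), np.array([0.25, 0.45, 0.8]))
valence_ccc = ccc(np.array([-0.1, 0.3, 0.6]), np.array([0.0, 0.35, 0.5]))
print((arousal_ccc + valence_ccc) / 2)
```

Note that a constant offset between predictions and labels lowers the CCC even when the Pearson correlation is 1, which is why the metric is preferred for time-continuous emotion regression.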


Cited By

  • (2024) The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition. In Proceedings of the 5th Multimodal Sentiment Analysis Challenge and Workshop: Social Perception and Humor, 1-9. DOI: 10.1145/3689062.3689088. Online publication date: 28 October 2024.
  • (2023) MuSe 2023 Challenge: Multimodal Prediction of Mimicked Emotions, Cross-Cultural Humour, and Personalised Recognition of Affects. In Proceedings of the 31st ACM International Conference on Multimedia, 9723-9725. DOI: 10.1145/3581783.3610943. Online publication date: 26 October 2023.


Published In

MuSe '23: Proceedings of the 4th on Multimodal Sentiment Analysis Challenge and Workshop: Mimicked Emotions, Humour and Personalisation
November 2023
113 pages
ISBN:9798400702709
DOI:10.1145/3606039
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. arousal and valence
  2. multimodal fusion
  3. multimodal sentiment analysis

Qualifiers

  • Research-article

Funding Sources

  • Major Project of Anhui Province
  • National Key R&D Programme of China

Conference

MM '23

Acceptance Rates

Overall acceptance rate: 14 of 17 submissions (82%)

Article Metrics

  • Downloads (Last 12 months)107
  • Downloads (Last 6 weeks)1
Reflects downloads up to 08 Mar 2025

