
Multimodal Fusion Strategies for Physiological-emotion Analysis

Published: 20 October 2021

Abstract

Physiological-emotion analysis is a novel aspect of automatic emotion analysis. It can help reveal a subject's emotional state even when he or she consciously suppresses its outward expression. In this paper, we present our solutions for the MuSe-Physio sub-challenge of the Multimodal Sentiment Analysis (MuSe) 2021 challenge. The task is to predict the level of psycho-physiological arousal from combined audio-visual signals and the galvanic skin response (also known as electrodermal activity, EDA) of subjects in a highly stress-inducing free-speech scenario. In this scenario, the speaker's emotion is conveyed through several modalities: acoustic, visual, textual, and physiological signals. Because these modalities are complementary, how they are fused has a large impact on emotion analysis. We highlight two aspects of our solutions: 1) we explore a variety of effective low-level and high-level features from the different modalities, and 2) we propose two multimodal fusion strategies that make full use of these modalities. Our solutions achieve a best Concordance Correlation Coefficient (CCC) of 0.5728 on the challenge test set, significantly outperforming the baseline system's CCC of 0.4908. The experimental results show that the proposed features and fusion strategies generalize well and yield more robust performance.
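The challenge metric reported above is Lin's Concordance Correlation Coefficient (CCC), which penalizes both low correlation and systematic bias between predictions and gold annotations. As a minimal sketch, the Python snippet below implements CCC together with a generic weighted late-fusion step of the kind often used in MuSe-style challenges; the `weighted_late_fusion` function, its modality names, and the toy data are hypothetical illustrations of the general idea, not the specific fusion strategies proposed in this paper.

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Lin's Concordance Correlation Coefficient between two sequences."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # Biased covariance, matching np.var's default normalization.
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return float(2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))

def weighted_late_fusion(preds: dict, weights: dict) -> np.ndarray:
    """Hypothetical late fusion: weighted average of per-modality predictions.

    `preds` maps a modality name (e.g. "audio", "text", "eda") to an
    arousal prediction sequence; the weights could be tuned on the
    validation set, e.g. in proportion to each modality's CCC there.
    """
    total = sum(weights[m] for m in preds)
    return sum(weights[m] * preds[m] for m in preds) / total

# Toy usage: fuse two modality predictions and score against gold labels.
gold = np.array([0.10, 0.40, 0.35, 0.80])
preds = {"audio": np.array([0.20, 0.50, 0.30, 0.70]),
         "eda": np.array([0.05, 0.30, 0.40, 0.90])}
print(ccc(gold, weighted_late_fusion(preds, {"audio": 0.6, "eda": 0.4})))
```

A simple weighted average is only the most basic fusion baseline; the attention-based and model-level fusion approaches common in this literature learn the combination instead of fixing it.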




Published In

MuSe '21: Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge
October 2021
88 pages
ISBN: 9781450386784
DOI: 10.1145/3475957


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. emotion prediction
  2. multimodal
  3. multimodal fusion


Conference

MM '21: ACM Multimedia Conference
October 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate: 14 of 17 submissions, 82%
