
Multimodal Fusion Strategies for Physiological-emotion Analysis

Published: 20 October 2021

Abstract

Physiological-emotion analysis is a novel aspect of automatic emotion analysis. It can help reveal a subject's emotional state even when he or she consciously suppresses its outward expression. In this paper, we present our solutions for the MuSe-Physio sub-challenge of the Multimodal Sentiment Analysis (MuSe) 2021 challenge. The task is to predict the level of psycho-physiological arousal from combined audio-visual signals and the galvanic skin response (also known as electrodermal activity, EDA) of subjects in a highly stress-inducing free-speech scenario. In this scenario, the speaker's emotion is conveyed through several modalities: acoustic, visual, textual, and physiological signals. Because these modalities are complementary, how they are fused has a large impact on emotion analysis. We highlight two aspects of our solutions: 1) we explore a variety of effective low-level and high-level features from the different modalities, and 2) we propose two multimodal fusion strategies that make full use of these modalities. Our solutions achieve a best Concordance Correlation Coefficient (CCC) of 0.5728 on the challenge test set, significantly outperforming the baseline system's CCC of 0.4908. The experimental results show that the proposed features and fusion strategies generalize well and yield more robust performance.
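The challenge metric reported above is Lin's Concordance Correlation Coefficient (CCC), which penalizes both low correlation and systematic bias between predictions and gold annotations. As a minimal sketch, the Python snippet below implements CCC together with a generic weighted late-fusion step of the kind often used in MuSe-style challenges; the `weighted_late_fusion` function, its modality names, and the toy data are hypothetical illustrations of the general idea, not the specific fusion strategies proposed in this paper.

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Lin's Concordance Correlation Coefficient between two sequences."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # Biased covariance, matching np.var's default normalization.
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return float(2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2))

def weighted_late_fusion(preds: dict, weights: dict) -> np.ndarray:
    """Hypothetical late fusion: weighted average of per-modality predictions.

    `preds` maps a modality name (e.g. "audio", "text", "eda") to an
    arousal prediction sequence; the weights could be tuned on the
    validation set, e.g. in proportion to each modality's CCC there.
    """
    total = sum(weights[m] for m in preds)
    return sum(weights[m] * preds[m] for m in preds) / total

# Toy usage: fuse two modality predictions and score against gold labels.
gold = np.array([0.10, 0.40, 0.35, 0.80])
preds = {"audio": np.array([0.20, 0.50, 0.30, 0.70]),
         "eda": np.array([0.05, 0.30, 0.40, 0.90])}
print(ccc(gold, weighted_late_fusion(preds, {"audio": 0.6, "eda": 0.4})))
```

A simple weighted average is only the most basic fusion baseline; the attention-based and model-level fusion approaches common in this literature learn the combination instead of fixing it.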




Published In

MuSe '21: Proceedings of the 2nd on Multimodal Sentiment Analysis Challenge
October 2021
88 pages
ISBN: 9781450386784
DOI: 10.1145/3475957


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. emotion prediction
  2. multimodal
  3. multimodal fusion


Conference

MM '21: ACM Multimedia Conference
October 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate: 14 of 17 submissions, 82%
