Abstract
Suitably combining the data in a multimodal emotion recognition model allows the information of each channel to be conveyed and fused, yielding better recognition of the encoded emotion than a single modality or channel could achieve alone. In this paper, we focus on combining speech and physiological signals to predict the arousal and valence levels of a person's emotional state. We designed a neural network that uses information from raw audio signals, electrocardiograms, heart rate variability, electro-dermal activity, and skin conductance levels to predict emotional states. The proposed deep neural network architecture works end-to-end, meaning that neither pre-processing of the input data nor post-processing of the network's predictions was applied. Using the data of the modalities available in the publicly accessible part of the RECOLA database, we achieved results comparable to other state-of-the-art approaches.
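Work in this research area (e.g., the AVEC challenges built on the RECOLA corpus) typically reports performance as the concordance correlation coefficient (CCC) from the Lin (1989) reference below, which penalizes both low correlation and systematic offset between predicted and gold-standard arousal/valence traces. As a minimal sketch (the function name and NumPy implementation are our own illustration, not code from the paper):

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989):
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Equals 1 only for perfect agreement; drops for scale or offset errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()          # population variance
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()    # population covariance
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```

Unlike Pearson correlation, a constant bias in the predictions lowers the CCC: a prediction shifted by +1 from the target correlates perfectly but scores well below 1.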
References
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP 2014: Conference on Empirical Methods in Natural Language Processing (2014)
Egger, M., Ley, M., Hanke, S.: Emotion recognition from physiological signal analysis: a review. Electron. Notes Theoret. Comput. Sci. 343, 35–55 (2019)
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Han, J., Zhang, Z., Cummins, N., Ringeval, F., Schuller, B.: Strength modelling for real-world automatic continuous affect recognition from audiovisual signals. Image Vis. Comput. 65, 76–86 (2017)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Hinton, G.E., Zemel, R.S.: Autoencoders, minimum description length and Helmholtz free energy. In: Advances in Neural Information Processing Systems, pp. 3–10 (1994)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Huang, Z., et al.: Staircase regression in OA RVM, data selection and gender dependency in AVEC 2016. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 19–26 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lalitha, S., Tripathi, S., Gupta, D.: Enhanced speech emotion detection using deep neural networks. Int. J. Speech Technol. 22(3), 497–510 (2018). https://doi.org/10.1007/s10772-018-09572-8
Lin, L.I.-K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45, 255–268 (1989)
LeCun, Y.: Generalization and network design strategies. Connect. Pers. 19, 143–155 (1989)
Li, C., Bao, Z., Li, L., Zhao, Z.: Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf. Process. Manage. 57(3), 102185 (2020)
Matsuda, Y., Fedotov, D., Takahashi, Y., Arakawa, Y., Yasumoto, K., Minker, W.: EmoTour: multimodal emotion recognition using physiological and audio-visual features. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 946–951 (2018)
Ranganathan, H., Chakraborty, S., Panchanathan, S.: Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. IEEE (2016)
Ringeval, F., et al.: Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recogn. Lett. 66, 22–30 (2015)
Ringeval, F., et al.: AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pp. 3–8. ACM (2015)
Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013)
Shu, L., et al.: A review of emotion recognition using physiological signals. Sensors 18(7), 2074 (2018)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Trigeorgis, G., et al.: Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204. IEEE (2016)
Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Sig. Process. 11(8), 1301–1309 (2017)
Tzirakis, P., Zhang, J., Schuller, B.W.: End-to-end speech emotion recognition using deep neural networks. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5089–5093. IEEE (2018)
Yang, Z., Hirschberg, J.: Predicting arousal and valence from waveforms and spectrograms using deep neural networks. In: INTERSPEECH, pp. 3092–3096 (2018)
Yin, Z., Zhao, M., Wang, Y., Yang, J., Zhang, J.: Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Comput. Methods Programs Biomed. 140, 93–110 (2017)
Zhang, S., Zhang, S., Huang, T., Gao, W.: Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimedia 20(6), 1576–1590 (2017)
Zhao, J., Mao, X., Chen, L.: Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 47, 312–323 (2019)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Bakhshi, A., Chalup, S. (2021). Multimodal Emotion Recognition Based on Speech and Physiological Signals Using Deep Neural Networks. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science(), vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_25
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9