Abstract
Suitably combining the data in a multimodal emotion recognition model allows the information of each channel to be conveyed and fused, yielding better recognition of the encoded emotion than a single modality or channel could achieve alone. In this paper, we focus on combining speech and physiological signals to predict the arousal and valence levels of a person's emotional state. We designed a neural network that uses information from raw audio signals, electrocardiograms, heart rate variability, electro-dermal activity, and skin conductance levels to predict emotional states. The proposed deep neural network architecture works end-to-end, meaning that neither pre-processing of the input data nor post-processing of the network's predictions was applied. Using the data of the modalities available in the publicly accessible part of the RECOLA database, we achieved results comparable to other state-of-the-art approaches.
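Work in this research area (e.g., the AVEC challenges built on the RECOLA corpus) typically reports performance as the concordance correlation coefficient (CCC) from the Lin (1989) reference below, which penalizes both low correlation and systematic offset between predicted and gold-standard arousal/valence traces. As a minimal sketch (the function name and NumPy implementation are our own illustration, not code from the paper):

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (Lin, 1989):
    rho_c = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).
    Equals 1 only for perfect agreement; drops for scale or offset errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()          # population variance
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()    # population covariance
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```

Unlike Pearson correlation, a constant bias in the predictions lowers the CCC: a prediction shifted by +1 from the target correlates perfectly but scores well below 1.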
References
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP 2014: Conference on Empirical Methods in Natural Language Processing (2014)
Egger, M., Ley, M., Hanke, S.: Emotion recognition from physiological signal analysis: a review. Electron. Notes Theoret. Comput. Sci. 343, 35–55 (2019)
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Han, J., Zhang, Z., Cummins, N., Ringeval, F., Schuller, B.: Strength modelling for real-world automatic continuous affect recognition from audiovisual signals. Image Vis. Comput. 65, 76–86 (2017)
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
Hinton, G.E., Zemel, R.S.: Autoencoders, minimum description length and Helmholtz free energy. In: Advances in Neural Information Processing Systems, pp. 3–10 (1994)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Huang, Z., et al.: Staircase regression in OA RVM, data selection and gender dependency in AVEC 2016. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 19–26 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lalitha, S., Tripathi, S., Gupta, D.: Enhanced speech emotion detection using deep neural networks. Int. J. Speech Technol. 22(3), 497–510 (2018). https://doi.org/10.1007/s10772-018-09572-8
Lin, L.I.-K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45, 255–268 (1989)
LeCun, Y.: Generalization and network design strategies. Connect. Pers. 19, 143–155 (1989)
Li, C., Bao, Z., Li, L., Zhao, Z.: Exploring temporal representations by leveraging attention-based bidirectional LSTM-RNNs for multi-modal emotion recognition. Inf. Process. Manage. 57(3), 102185 (2020)
Matsuda, Y., Fedotov, D., Takahashi, Y., Arakawa, Y., Yasumoto, K., Minker, W.: EmoTour: multimodal emotion recognition using physiological and audio-visual features. In: Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pp. 946–951 (2018)
Ranganathan, H., Chakraborty, S., Panchanathan, S.: Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9. IEEE (2016)
Ringeval, F., et al.: Prediction of asynchronous dimensional emotion ratings from audiovisual and physiological data. Pattern Recogn. Lett. 66, 22–30 (2015)
Ringeval, F., et al.: AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pp. 3–8. ACM (2015)
Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8. IEEE (2013)
Shu, L., et al.: A review of emotion recognition using physiological signals. Sensors 18(7), 2074 (2018)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Trigeorgis, G., et al.: Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204. IEEE (2016)
Tzirakis, P., Trigeorgis, G., Nicolaou, M.A., Schuller, B.W., Zafeiriou, S.: End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Sig. Process. 11(8), 1301–1309 (2017)
Tzirakis, P., Zhang, J., Schuller, B.W.: End-to-end speech emotion recognition using deep neural networks. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5089–5093. IEEE (2018)
Yang, Z., Hirschberg, J.: Predicting arousal and valence from waveforms and spectrograms using deep neural networks. In: INTERSPEECH, pp. 3092–3096 (2018)
Yin, Z., Zhao, M., Wang, Y., Yang, J., Zhang, J.: Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Comput. Methods Programs Biomed. 140, 93–110 (2017)
Zhang, S., Zhang, S., Huang, T., Gao, W.: Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimedia 20(6), 1576–1590 (2017)
Zhao, J., Mao, X., Chen, L.: Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 47, 312–323 (2019)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Bakhshi, A., Chalup, S. (2021). Multimodal Emotion Recognition Based on Speech and Physiological Signals Using Deep Neural Networks. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science(), vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_25
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9