Abstract
Advances in machine learning and deep learning make it possible to detect and analyse emotion and sentiment from textual and audio-visual information with increasing effectiveness. Recently, interest has emerged in applying these techniques to the assessment of mental health as well, including the detection of stress and depression. In this paper, we introduce an approach that predicts stress (emotional valence and arousal) in a time-continuous manner from audio-visual recordings, testing the effectiveness of different deep learning techniques and various features. Specifically, apart from adopting popular features (e.g., BERT, BPM, ECG, and VGGFace), we explore the use of new features, both engineered and learned, across different modalities to improve the effectiveness of time-continuous stress prediction: for video, we study the use of ResNet-50 features and of body and pose features obtained through OpenPose, whereas for audio, we primarily investigate the use of Integrated Linear Prediction Residual (ILPR) features. Our best result was a combined CCC value of 0.7595 on the development set and 0.3379 on the test set of MuSe-Stress 2021.
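As context for the reported scores, the Concordance Correlation Coefficient (CCC) can be computed as in the minimal NumPy sketch below. The array names are placeholders, and the combined score is assumed here to be the mean of the valence and arousal CCCs, which is how MuSe-Stress aggregates the two dimensions.

```python
import numpy as np

def ccc(preds, labels):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    mean_p, mean_l = preds.mean(), labels.mean()
    var_p, var_l = preds.var(), labels.var()  # population variance
    cov = ((preds - mean_p) * (labels - mean_l)).mean()
    return 2.0 * cov / (var_p + var_l + (mean_p - mean_l) ** 2)

# Hypothetical usage: valence_pred, valence_gold, arousal_pred, arousal_gold
# are time-continuous prediction/label sequences for one partition.
# combined = 0.5 * (ccc(valence_pred, valence_gold) + ccc(arousal_pred, arousal_gold))
```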
This research was supported under the India-Korea Joint Programme of Cooperation in Science & Technology by the National Research Foundation (NRF) Korea (2020K1A3A1A68093469), the Ministry of Science and ICT (MSIT) Korea and by the Department of Biotechnology (India) (DBT/IC-12031(22)-ICD-DBT).
Notes
1. The 25 OBF keypoints are Nose, Neck, R/L Shoulders, R/L Elbows, R/L Wrists, MidHip, R/L Hips, R/L Knees, R/L Ankles, R/L Eyes, R/L Ears, R/L BigToes, R/L SmallToes, R/L Heels, and Background (R/L stands for Right/Left); a sketch of how these keypoints can be read is given below.
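For illustration, the following minimal Python sketch reads these keypoints from OpenPose's per-frame JSON output (produced with `--write_json`). The BODY_25 index order and the helper name `load_pose` are assumptions made for this sketch rather than details taken from the paper, and only the first detected person is kept.

```python
import json
import numpy as np

# Assumed index map for the OpenPose BODY_25 output format; verify the order
# against the OpenPose version actually used.
BODY_25 = {
    0: "Nose", 1: "Neck", 2: "RShoulder", 3: "RElbow", 4: "RWrist",
    5: "LShoulder", 6: "LElbow", 7: "LWrist", 8: "MidHip", 9: "RHip",
    10: "RKnee", 11: "RAnkle", 12: "LHip", 13: "LKnee", 14: "LAnkle",
    15: "REye", 16: "LEye", 17: "REar", 18: "LEar", 19: "LBigToe",
    20: "LSmallToe", 21: "LHeel", 22: "RBigToe", 23: "RSmallToe", 24: "RHeel",
}

def load_pose(json_path):
    """Return a (25, 3) array of (x, y, confidence) for the first person in
    one frame's OpenPose JSON file, or NaNs if nobody was detected."""
    with open(json_path) as f:
        frame = json.load(f)
    if not frame["people"]:
        return np.full((25, 3), np.nan)
    kp = np.asarray(frame["people"][0]["pose_keypoints_2d"], dtype=float)
    return kp.reshape(25, 3)
```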
References
Stappen, L., et al.: The MuSe 2021 multimodal sentiment analysis challenge: sentiment, emotion, physiological-emotion, and stress. In: Proceedings of the 2nd International on Multimodal Sentiment Analysis Challenge and Workshop. Association for Computing Machinery, New York (2021)
Stappen, L., Baird, A., Schumann, L., Schuller, B.: The multimodal sentiment analysis in car reviews (MuSe-car) dataset: collection, insights and improvements. IEEE Trans. Affect. Comput. (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2019)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
Baghel, S., Prasanna, S.R.M., Guha, P.: Classification of multi speaker shouted speech and single speaker normal speech. In: TENCON 2017–2017 IEEE Region 10 Conference, pp. 2388–2392. IEEE (2017)
Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2015)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Degottex, G.: Glottal source and vocal-tract separation. Ph.D. thesis, Université Pierre et Marie Curie-Paris VI (2010)
Rothenberg, M.: Acoustic interaction between the glottal source and the vocal tract. Vocal Fold Physiol. 1, 305–323 (1981)
Loweimi, E., Barker, J., Saz-Torralba, O., Hain, T.: Robust source-filter separation of speech signal in the phase domain. In: Interspeech, pp. 414–418 (2017)
Prasanna, S.R.M., Gupta, C.S., Yegnanarayana, B.: Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Commun. 48(10), 1243–1261 (2006)
Baghel, S., Prasanna, S.R.M., Guha, P.: Exploration of excitation source information for shouted and normal speech classification. J. Acoust. Soc. Am. 147(2), 1250–1261 (2020)
Makhoul, J.: Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016)
Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (2018)
Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition, pp. 1–12. British Machine Vision Association (2015)
Stappen, L., et al.: MuSe 2020 challenge and workshop: multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media: emotional car reviews in-the-wild. In: Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, pp. 35–44 (2020)
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008 (2018)
Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4645–4653 (2017)
Qin, S., Kim, S., Manduchi, R.: Automatic skin and hair masking using fully convolutional networks. In: 2017 IEEE International Conference on Multimedia and Expo (ICME) (2017)
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)
Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16(6), 345–379 (2010)
Zhang, Q., Xiao, T., Huang, N., Zhang, D., Han, J.: Revisiting feature fusion for RGB-T salient object detection. IEEE Trans. Circ. Syst. Video Technol. 31(5), 1804–1818 (2020)
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Kumar, A. et al. (2022). Exploring Multimodal Features and Fusion for Time-Continuous Prediction of Emotional Valence and Arousal. In: Kim, JH., Singh, M., Khan, J., Tiwary, U.S., Sur, M., Singh, D. (eds) Intelligent Human Computer Interaction. IHCI 2021. Lecture Notes in Computer Science, vol 13184. Springer, Cham. https://doi.org/10.1007/978-3-030-98404-5_65
DOI: https://doi.org/10.1007/978-3-030-98404-5_65
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98403-8
Online ISBN: 978-3-030-98404-5