Abstract
As a core technology in human-computer interaction, emotion recognition aims to simulate how humans perceive and understand emotion, and it is widely applied in medicine, education, daily life, transportation and other fields. Accurate emotion recognition nevertheless remains a challenging problem. This paper investigates the accuracy of multimodal emotion recognition: distinct emotional features are extracted from speech, video and motion capture (MoCAP) using deep learning methods, and a matching model, the facial motion speech emotion recognition (FM-SER) model, is designed. In the audio modality, dual spectrograms are designed to capture both local and global information of speech in the time and frequency domains, and a combination of convolutional neural networks (CNN), gated recurrent units (GRU) and an attention model performs speech emotion recognition. In the video modality, a 3D CNN with an attention mechanism captures latent emotional expressions. Sequential features of hand and head movements are extracted from MoCAP and fed into a bidirectional three-layer long short-term memory (LSTM) network with an attention mechanism. Exploiting the complementarity of the modalities, a decision-level fusion scheme is designed that achieves higher accuracy and stronger generalization. Extensive experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus compared the proposed method with several popular emotion recognition models. The results show that the proposed method achieves higher recognition accuracy in both single-modality and multimodal settings, improving the average single-modality and multimodal accuracies by 16.3% and 9%, respectively, which demonstrates the effectiveness of the FM-SER model.
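The attention pooling used in the speech and MoCAP branches can be illustrated with a minimal NumPy sketch: a scoring vector `w` (a stand-in for the learned attention parameters, which the abstract does not specify) collapses a sequence of recurrent hidden states into a single utterance-level vector. This is a generic attention-pooling sketch, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_pool(hidden, w):
    """Collapse a (T, D) sequence of GRU/LSTM hidden states into one
    D-dim utterance vector via a scoring vector w of shape (D,)."""
    scores = hidden @ w          # (T,) unnormalized relevance per frame
    alpha = softmax(scores)      # attention weights, sum to 1
    return alpha @ hidden, alpha # weighted sum over time

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 16))    # e.g. 50 frames of 16-dim hidden states
w = rng.normal(size=16)
vec, alpha = attention_pool(H, w)
```

Frames with higher scores dominate the pooled vector, which is why attention pooling is preferred over plain mean pooling when only a few frames carry the emotional cue.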
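The decision-level fusion the abstract describes can be sketched as a weighted average of each modality's class posteriors. The modality weights and the four-class label set below are illustrative assumptions (IEMOCAP is commonly evaluated on four emotions), not the paper's actual fusion rule.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # assumed 4-class setup

def fuse_decisions(probs_by_modality, weights=None):
    """Decision-level fusion: weighted average of per-modality
    class posteriors, then argmax over the fused distribution."""
    P = np.stack(probs_by_modality)          # (num_modalities, num_classes)
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))
    w = np.asarray(weights, dtype=float)
    fused = w @ P / w.sum()                  # (num_classes,) fused posterior
    return fused, EMOTIONS[int(np.argmax(fused))]

# Hypothetical softmax outputs of the speech, video and MoCAP branches
speech = np.array([0.10, 0.60, 0.20, 0.10])
video  = np.array([0.20, 0.50, 0.20, 0.10])
mocap  = np.array([0.15, 0.35, 0.35, 0.15])
fused, label = fuse_decisions([speech, video, mocap],
                              weights=[0.4, 0.35, 0.25])
```

Because the modalities are complementary, a modality that is uncertain (here MoCAP, split between two classes) is outvoted by the more confident speech and video branches.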
References
Ahmed F, Bari ASMH, Gavrilova ML (2020) Emotion recognition from body movement. IEEE Access 8:11761–11781
Ajili I, Mallem M, Didier JY (2019) Human motions and emotions recognition inspired by LMA qualities. Vis Comput 35(10):1411–1426
Bertero D, Siddique FB, Wu CS et al (2016) Real-time speech emotion and sentiment recognition for interactive dialogue systems. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, pp 1042–1047
Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335–359
Ding IJ, Hsieh MC (2020) A hand gesture action-based emotion recognition system by 3D image sensor information derived from leap motion sensors for the specific group with restlessness emotion problems. Microsyst Technol 3
Gupta S et al (2016) Cross modal distillation for supervision transfer. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2827–2836
Hazarika D, Poria S, Mihalcea R et al (2018) ICON: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, pp 2594–2604
Huang L, Xie F, Shen S et al (2020) Human emotion recognition based on face and facial expression detection using deep belief network under complicated backgrounds. Int J Pattern Recognit Artif Intell 1
Jiahui PAN, Zhipeng HE, Zina LI et al (2020) A review of multimodal emotion recognition. CAAI Trans Intell Syst 15(4):1–13
Kan W, Longlong M (2020) Research on design innovation method based on multimodal perception and recognition technology. J Phys Conf Ser 1607(1):012107
Latif S, Rana R, Khalifa S (2019) Direct modelling of speech emotion from raw speech. In: Interspeech 2019
Li J, Mi Y, Li G, Ju Z (2019) CNN-based facial expression recognition from annotated RGB-D images for human–robot interaction. Int J Humanoid Robot 16(04):504–505
Lin M, Chen C, Lai C (2019) Object detection algorithm based AdaBoost residual correction fast R-CNN on network. In: The 2019 3rd international conference
Luo Y, Ye J, Adams RB et al (2019) ARBEE: towards automated recognition of bodily expression of emotion in the wild. Int J Comput Vis:1–25
Mohammed SN, Karim A (2020) Speech emotion recognition using MELBP variants of spectrogram image. Int J Intell Eng Syst 13(5):257–266
Nie W, Yan Y, Song D et al (2020) Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition. Multimed Tools Appl 4
Pan Z, Luo Z, Yang J et al (2020) Multi-modal attention for speech emotion recognition. In: Interspeech 2020
Poria S, Cambria E, Hazarika D, Majumder N, Zadeh A, Morency L-P (2017) Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 873–883
Poria S, Majumder N, Hazarika D, Cambria E, Gelbukh A, Hussain A (2018) Multimodal sentiment analysis: addressing key issues and setting up the baselines. IEEE Intell Syst 33(6):17–25
Ramanarayanan V, Pugh R, Qian Y, Suendermann-Oeft D (2018) Automatic turn-level language identification for code-switched Spanish-English dialog. In: Proc. of IWSDS 2018, International Workshop on Spoken Dialog Systems, Singapore
Ren M, Nie W, Liu A et al (2019) Multi-modal correlated network for emotion recognition in speech. Vis Inform 3(3)
Sahu G (2019) Multimodal speech emotion recognition and ambiguity resolution
Salama ES et al (2020) A 3D-convolutional neural network framework with ensemble learning techniques for multi-modal emotion recognition. Egypt Inform J
Satt A et al (2017) Efficient emotion recognition from speech using deep learning on spectrograms. In: Interspeech 2017, pp 1089–1093
Tripathi S, Tripathi S, Beigi H (2018) Multi-modal emotion recognition on IEMOCAP dataset using deep learning
Wang W, Enescu V, Sahli H (2015) Adaptive real-time emotion recognition from body movements. ACM Trans Interact Intell Syst 5(4):1–21
Wu S, Li F, Zhang P (2019) Weighted feature fusion based emotional recognition for variable-length speech using DNN. In: 2019 15th International Wireless Communications and Mobile Computing Conference (IWCMC)
Xu Y, Liu J, Zhai Y, Gan J, Zeng J, Cao H, Scotti F, Piuri V, Labati RD (2020) Weakly supervised facial expression recognition via transferred DAL-CNN and active incremental learning. Soft Comput 24(8):5971–5985
Zadeh A, Liang P, Mazumder N et al (2018) Memory fusion network for multi-view sequential learning. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, pp 5634–5641
Zhang L, Wang L, Dang J et al (2018) Convolutional neural network with spectrogram and perceptual features for speech emotion recognition. In: International Conference on Neural Information Processing. Springer, Cham
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed Signal Process Control 47:312–323
Acknowledgements
This paper was funded by the Liaoning Province Key Laboratory for the Application Research of Big Data, the Dalian Science and Technology Star Project, grant number 2019RQ120, and the Intercollegiate cooperation projects of Liaoning Provincial Department of Education, grant number 86896244.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Jia, N., Zheng, C. & Sun, W. A multimodal emotion recognition model integrating speech, video and MoCAP. Multimed Tools Appl 81, 32265–32286 (2022). https://doi.org/10.1007/s11042-022-13091-9