Abstract
In the medical field, analyzing and understanding human emotions is a key approach to studying mental disorders. Many psychological and psychiatric disorders present inconsistent and often subtle symptoms, which makes predicting human emotions from any single trait unreliable. This study therefore integrates a range of modal cues and proposes THRMM, a Transformer-based network for temporal modeling that leverages multiple contextual cues. The THRMM architecture extracts global video features, character traits, and dialogue cues to monitor emotional shifts, capturing emotional dynamics for timely and accurate emotion prediction. Ablation and comparative studies confirm the effectiveness of THRMM in temporal context modeling and underscore the importance of scene, character, and dialogue information in interpreting emotions.
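The abstract describes the general pattern of fusing per-timestep scene, character, and dialogue features and modeling their evolution with a Transformer. The paper's actual THRMM architecture is not reproduced here; the following is a minimal illustrative sketch of that pattern in PyTorch, assuming the three modality embeddings have already been extracted per timestep. All class, parameter, and dimension names (MultimodalTemporalSketch, d_model, the feature sizes, the seven-class output) are our illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultimodalTemporalSketch(nn.Module):
    """Illustrative sketch (not the authors' THRMM implementation):
    fuse per-timestep scene, character, and dialogue embeddings,
    then model temporal context with a Transformer encoder."""

    def __init__(self, scene_dim=2048, char_dim=512, text_dim=768,
                 d_model=256, n_heads=4, n_layers=2, n_emotions=7):
        super().__init__()
        # Project each modality into a shared d_model space.
        self.scene_proj = nn.Linear(scene_dim, d_model)
        self.char_proj = nn.Linear(char_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Fuse the three cues per timestep by concatenation + projection.
        self.fuse = nn.Linear(3 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer,
                                              num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, scene, char, text):
        # scene/char/text: (batch, time, feature_dim) sequences.
        x = torch.cat([self.scene_proj(scene),
                       self.char_proj(char),
                       self.text_proj(text)], dim=-1)
        x = self.fuse(x)       # (batch, time, d_model)
        x = self.temporal(x)   # temporal context via self-attention
        return self.classifier(x)  # per-timestep emotion logits

# Usage example: 2 clips, 32 timesteps each.
model = MultimodalTemporalSketch()
logits = model(torch.randn(2, 32, 2048),
               torch.randn(2, 32, 512),
               torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 32, 7])
```

Emitting a prediction at every timestep, rather than one label per clip, is what allows a model of this kind to track emotional shifts within a scene rather than only its overall tone.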



Data availability
No datasets were generated or analysed during the current study.
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Zhang, X., Zhou, J. & Qi, G. Multimodal temporal context network for tracking dynamic changes in emotion. J Supercomput 81, 71 (2025). https://doi.org/10.1007/s11227-024-06484-0