Abstract
In the medical field, analyzing and understanding human emotions is a key approach to studying mental disorders. Many psychological and psychiatric disorders present inconsistent and often subtle symptoms, which makes predicting human emotions from any single trait unreliable. This study therefore integrates a range of modal cues and proposes THRMM, a Transformer-based network for temporal modeling that leverages multiple contextual cues. The THRMM architecture extracts global video features, character traits, and dialogue cues to monitor emotional shifts, capturing emotional dynamics for timely and accurate emotion prediction. Ablation and comparative studies confirm the effectiveness of THRMM in temporal context modeling and underscore the importance of scene, character, and dialogue information in interpreting emotions.
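The abstract describes the general pattern of fusing per-timestep scene, character, and dialogue features and modeling their evolution with a Transformer. The paper's actual THRMM architecture is not reproduced here; the following is a minimal illustrative sketch of that pattern in PyTorch, assuming the three modality embeddings have already been extracted per timestep. All class, parameter, and dimension names (MultimodalTemporalSketch, d_model, the feature sizes, the seven-class output) are our illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultimodalTemporalSketch(nn.Module):
    """Illustrative sketch (not the authors' THRMM implementation):
    fuse per-timestep scene, character, and dialogue embeddings,
    then model temporal context with a Transformer encoder."""

    def __init__(self, scene_dim=2048, char_dim=512, text_dim=768,
                 d_model=256, n_heads=4, n_layers=2, n_emotions=7):
        super().__init__()
        # Project each modality into a shared d_model space.
        self.scene_proj = nn.Linear(scene_dim, d_model)
        self.char_proj = nn.Linear(char_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Fuse the three cues per timestep by concatenation + projection.
        self.fuse = nn.Linear(3 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer,
                                              num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, scene, char, text):
        # scene/char/text: (batch, time, feature_dim) sequences.
        x = torch.cat([self.scene_proj(scene),
                       self.char_proj(char),
                       self.text_proj(text)], dim=-1)
        x = self.fuse(x)       # (batch, time, d_model)
        x = self.temporal(x)   # temporal context via self-attention
        return self.classifier(x)  # per-timestep emotion logits

# Usage example: 2 clips, 32 timesteps each.
model = MultimodalTemporalSketch()
logits = model(torch.randn(2, 32, 2048),
               torch.randn(2, 32, 512),
               torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 32, 7])
```

Emitting a prediction at every timestep, rather than one label per clip, is what allows a model of this kind to track emotional shifts within a scene rather than only its overall tone.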



Data availability
No datasets were generated or analysed during the current study.
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Zhang, X., Zhou, J. & Qi, G. Multimodal temporal context network for tracking dynamic changes in emotion. J Supercomput 81, 71 (2025). https://doi.org/10.1007/s11227-024-06484-0