
Multimodal temporal context network for tracking dynamic changes in emotion

The Journal of Supercomputing

Abstract

In the medical field, analyzing and understanding human emotions is a key approach to studying mental illness. Many psychological and psychiatric disorders present inconsistent and often subtle symptoms, which makes predicting human emotions from any single cue unreliable. This study therefore integrates a range of modal cues and proposes THRMM, a Transformer-based temporal modeling network that leverages multiple contextual cues. The THRMM architecture extracts global video features, character traits, and dialogue cues to monitor emotional shifts, capturing emotional dynamics for timely and accurate emotion prediction. Ablation and comparative studies confirm the effectiveness of THRMM in temporal context modeling and highlight the importance of scene, character, and dialogue information in interpreting emotions.
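The full article is behind a paywall here, so the following is only a minimal, hypothetical sketch of what a multimodal temporal context model of this kind could look like in PyTorch. The module name, feature dimensions, additive fusion of the three modality streams, and the per-timestep classification head are all illustrative assumptions, not the authors' actual THRMM implementation.

```python
# Hypothetical sketch: fuse scene, character (face), and dialogue features per
# video segment, then model their temporal evolution with a Transformer encoder.
# Dimensions and fusion strategy are assumptions, not the published THRMM design.
import torch
import torch.nn as nn


class MultimodalTemporalContextNet(nn.Module):
    def __init__(self, scene_dim=2048, face_dim=512, text_dim=768,
                 d_model=256, n_heads=4, n_layers=2, n_emotions=7, max_len=512):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.scene_proj = nn.Linear(scene_dim, d_model)
        self.face_proj = nn.Linear(face_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Learned positional embeddings over the temporal axis.
        self.pos_emb = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Per-timestep emotion prediction head.
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, scene_feats, face_feats, text_feats):
        # scene_feats: (B, T, scene_dim), face_feats: (B, T, face_dim),
        # text_feats: (B, T, text_dim) -- one feature vector per video segment.
        fused = (self.scene_proj(scene_feats)
                 + self.face_proj(face_feats)
                 + self.text_proj(text_feats))
        positions = torch.arange(fused.size(1), device=fused.device)
        fused = fused + self.pos_emb(positions)
        context = self.temporal_encoder(fused)   # (B, T, d_model)
        return self.classifier(context)          # (B, T, n_emotions)


if __name__ == "__main__":
    model = MultimodalTemporalContextNet()
    B, T = 2, 16
    logits = model(torch.randn(B, T, 2048),
                   torch.randn(B, T, 512),
                   torch.randn(B, T, 768))
    print(logits.shape)  # torch.Size([2, 16, 7])
```

Emitting one prediction per timestep, as in this sketch, is one plausible way to track dynamic emotional changes across a video rather than assigning a single label per clip.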


Data availability

No datasets were generated or analysed during the current study.


Author information

Contributions

J wrote the main manuscript text and prepared Figures 1–3. J and G prepared Tables 1–7. G handled data collection and preprocessing. J built the trunk code. J and G carried out the ablation and comparison experiments. X verified and revised the article.

Corresponding author

Correspondence to Jinwei Zhou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, X., Zhou, J. & Qi, G. Multimodal temporal context network for tracking dynamic changes in emotion. J Supercomput 81, 71 (2025). https://doi.org/10.1007/s11227-024-06484-0
