In medicine, analyzing and understanding human emotions is a key approach to the study of mental disorders. Many psychological and psychiatric disorders present inconsistent and often subtle symptoms, which makes it difficult to predict emotions from any single cue. This study therefore integrates multiple modal cues and proposes THRMM, a Transformer-based network for temporal modeling that leverages multiple contextual cues. THRMM extracts global video features, character traits, and dialogue cues to track emotional shifts, capturing emotional dynamics for timely and accurate emotion prediction. Ablation and comparative studies confirm the effectiveness of THRMM in temporal context modeling and underscore the importance of scene, character, and dialogue information in interpreting emotions.
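
To make the idea of temporal modeling over fused cues concrete, the following is a minimal illustrative sketch and not the THRMM implementation. It assumes that each time step already has a scene, a character, and a dialogue feature vector, concatenates them into one token per time step, and applies single-head scaled dot-product self-attention across time. The function names (`fuse_cues`, `temporal_self_attention`) and the use of identity projections in place of learned weights are assumptions made purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cues(scene, character, dialogue):
    """Concatenate the three per-time-step cue vectors into one token.

    Hypothetical helper: each argument is a list of feature vectors,
    one vector per time step, all of the same length in time.
    """
    return [s + c + d for s, c, d in zip(scene, character, dialogue)]


def temporal_self_attention(tokens):
    """Single-head scaled dot-product self-attention over time steps.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), so no learned parameters are involved.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Similarity of this time step to every time step in the sequence.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all time-step tokens.
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


# Toy input: two time steps with made-up feature values.
scene = [[1.0, 0.0], [0.0, 1.0]]
character = [[0.5], [0.5]]
dialogue = [[0.0], [1.0]]

tokens = fuse_cues(scene, character, dialogue)
context = temporal_self_attention(tokens)
```

In this sketch every output token is a convex combination of all time-step tokens, which is the mechanism that lets each time step draw on its temporal context. A full model would add learned projections, multiple heads, positional information, and a prediction head.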

Data availability
No datasets were generated or analysed during the current study.
