Abstract
Depression is projected to become the most common mental disorder worldwide by 2030. A number of deep-learning models have been proposed to help clinicians assess the severity of depression. However, two issues remain unresolved: (1) few studies encode multi-scale facial behaviors, and (2) current approaches have high computational complexity, which hinders their deployment in clinical applications. To mitigate these issues, we propose an end-to-end, lightweight, multi-scale transformer-based architecture, termed LMTformer, for sequential video-based depression analysis (SVDA). LMTformer consists of three modules: a coarse-grained feature extraction (CFE) block, a light multi-scale transformer (LMST), and a final Beck Depression Inventory–II (BDI–II) predictor (FBP). The CFE block extracts coarse-grained features for the LMST, in which a multi-scale transformer models potential local and global features at different receptive fields; a multi-scale global feature aggregation (MSGFA) module is also proposed to model the global features. The FBP consists of two fully connected layers. LMTformer is evaluated on the AVEC2013 and AVEC2014 depression databases, achieving a root mean square error (RMSE) of 7.75 and a mean absolute error (MAE) of 6.12 on AVEC2013, and an RMSE of 7.97 and an MAE of 6.05 on AVEC2014. On the LMVD dataset, it obtains the best performance with an F1-score of 82.74%. In addition, the model has low computational complexity, requiring only 0.95M parameters and 1.1G floating-point operations (FLOPs). Code will be available at: https://github.com/helang818/LMTformer/.
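As a point of reference for the RMSE and MAE figures reported above, the sketch below shows how these two standard regression metrics are computed over BDI–II score predictions. The scores used here are hypothetical and only illustrate the metric definitions; they are not taken from the paper's experiments.

```python
import math

def rmse(y_true, y_pred):
    # Root mean square error: sqrt of the mean squared prediction error
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean absolute error: mean of the absolute prediction errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ground-truth and predicted BDI-II scores (range 0-63)
y_true = [10, 25, 3, 40]
y_pred = [12, 20, 5, 35]
print(round(rmse(y_true, y_pred), 3))  # → 3.808
print(round(mae(y_true, y_pred), 3))   # → 3.5
```

Because RMSE squares each error before averaging, it penalizes large misestimates of the BDI–II score more heavily than MAE, which is why papers in this area typically report both.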
Data Availability
In this work, we utilized the AVEC2013 and AVEC2014 datasets, which are public. They are available at: http://avec2013-db.sspnet.eu/. Due to official requirements, we are unable to provide the data; please contact the data owner if required.
Code Availability
Code is available at: https://github.com/helang818/LMTformer/.
Materials Availability
Materials sharing not applicable.
Funding
This work is supported by the National Natural Science Foundation of China (grants 62376215 and 62276210), the Open Fund of the National Engineering Laboratory for Big Data System Computing Technology (Grant No. SZU-BDSC-OF2024-16), the Humanities and Social Sciences Program of the Ministry of Education (22YJCZH048), the Key Research and Development Project of Shaanxi Province (2024GX-YBXM-137), the Open Fund of the Key Laboratory of Modern Teaching Technology, Ministry of Education, the Shaanxi Provincial Social Science Foundation (grant 2021K015), the key project of the Natural Science Basic Research Program of Shaanxi Province (2024JC-ZDXM-37, 2023-YBSF-434), the Shaanxi Province Qinchuangyuan "Scientist + Engineer" Team Construction Project (grant 2023KXJ-241), and the Young Talent Fund of the Xi'an Association for Science and Technology (959202413083).
Author information
Authors and Affiliations
Contributions
Lang He: Conceptualization, Methodology, Data curation, Writing-Original draft, Writing-Review & editing, Validation, Project administration. Junnan Zhao: Methodology, Software, Visualization, Data curation, Writing-Original draft. Prayag Tiwari: Data curation, Visualization, Funding acquisition. Jie Zhang: Conceptualization, Project administration. Jiewei Jiang: Funding acquisition, Writing-Review & editing. Di Wu: Formal analysis, Validation. Senqing Qi: Supervision, Investigation. Zhongmin Wang: Supervision, Project administration.
Corresponding author
Ethics declarations
Conflicts of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, L., Zhao, J., Zhang, J. et al. LMTformer: facial depression recognition with lightweight multi-scale transformer from videos. Appl Intell 55, 195 (2025). https://doi.org/10.1007/s10489-024-05908-x