An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism

  • Conference paper
  • Pattern Recognition and Computer Vision (PRCV 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14429)

Abstract

Audio-visual multimodal depression detection has gained significant attention as a computer-aided detection tool owing to its efficiency and convenience, and has achieved promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extraction Block (LTF), the Cross-Modal Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF extracts features and encodes temporal information for the audio and video modalities separately; the CFB enables complementary learning between the two modalities; and the MTB models the temporal influence of all modalities on each unimodal branch. Together, these three modules allow CMAFN to refine inter-modality complementarity and intra-modality temporal dependencies, achieving interaction between the unimodal branches and an adaptive balance between modalities. Evaluation results on the widely used depression datasets AVEC2013 and AVEC2014 demonstrate that CMAFN outperforms state-of-the-art approaches on depression recognition tasks, highlighting its potential as an effective tool for the early detection and diagnosis of depression.

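The abstract describes bidirectional cross-modal fusion built from multi-head attention and residual connections. The following minimal PyTorch sketch illustrates that general idea only; it is not the authors' CMAFN implementation, and every name, dimension, and layer choice below is an illustrative assumption. Each modality queries the other with multi-head attention, and a residual connection preserves the unimodal stream:

```python
# A minimal sketch (not the authors' code) of cross-modal fusion with
# multi-head attention and residual connections, as described in the abstract.
# All module names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Bidirectional cross-attention between audio and video feature sequences."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, time, dim) sequences from unimodal encoders.
        # Audio branch queries the video sequence; the residual keeps the audio stream.
        a_ctx, _ = self.audio_to_video(query=audio, key=video, value=video)
        audio = self.norm_a(audio + a_ctx)
        # Video branch queries the audio sequence, symmetrically.
        v_ctx, _ = self.video_to_audio(query=video, key=audio, value=audio)
        video = self.norm_v(video + v_ctx)
        return audio, video


if __name__ == "__main__":
    block = CrossModalFusionBlock()
    a = torch.randn(2, 100, 256)  # e.g. 100 audio frames
    v = torch.randn(2, 100, 256)  # e.g. 100 video frames
    a_fused, v_fused = block(a, v)
    print(a_fused.shape, v_fused.shape)  # torch.Size([2, 100, 256]) each
```

In the full model, the fused unimodal streams would additionally pass through the temporal attention stage (the MTB) before score regression; this sketch covers only the fusion step.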


Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2019YFA0706200) and in part by the National Natural Science Foundation of China (Grant Nos. 62227807 and 62372217).

Author information

Correspondence to Juan Wang, Zhenyu Liu, or Bin Hu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, Y., et al. (2024). An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14429. Springer, Singapore. https://doi.org/10.1007/978-981-99-8469-5_20

  • DOI: https://doi.org/10.1007/978-981-99-8469-5_20

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8468-8

  • Online ISBN: 978-981-99-8469-5

  • eBook Packages: Computer Science, Computer Science (R0)
