An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism

  • Conference paper
  • Pattern Recognition and Computer Vision (PRCV 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14429)

Abstract

Audio-visual multimodal depression detection has gained significant attention as a computer-aided detection tool owing to its efficiency and convenience, and has achieved promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extraction Block (LTF), the Cross-Modal Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF extracts features and encodes temporal information for the audio and video modalities separately; the CFB enables complementary learning between the two modalities; and the MTB models the temporal influence of all modalities on each unimodal branch. Together, these three modules allow CMAFN to refine inter-modality complementarity and intra-modality temporal dependencies, achieving interaction between the unimodal branches and an adaptive balance between modalities. Evaluation results on the widely used depression datasets AVEC2013 and AVEC2014 demonstrate that CMAFN outperforms state-of-the-art approaches on depression recognition tasks, highlighting its potential as an effective tool for the early detection and diagnosis of depression.

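The abstract describes bidirectional cross-modal fusion built from multi-head attention and residual connections. The following minimal PyTorch sketch illustrates that general idea only; it is not the authors' CMAFN implementation, and every name, dimension, and layer choice below is an illustrative assumption. Each modality queries the other with multi-head attention, and a residual connection preserves the unimodal stream:

```python
# A minimal sketch (not the authors' code) of cross-modal fusion with
# multi-head attention and residual connections, as described in the abstract.
# All module names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Bidirectional cross-attention between audio and video feature sequences."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, time, dim) sequences from unimodal encoders.
        # Audio branch queries the video sequence; the residual keeps the audio stream.
        a_ctx, _ = self.audio_to_video(query=audio, key=video, value=video)
        audio = self.norm_a(audio + a_ctx)
        # Video branch queries the audio sequence, symmetrically.
        v_ctx, _ = self.video_to_audio(query=video, key=audio, value=audio)
        video = self.norm_v(video + v_ctx)
        return audio, video


if __name__ == "__main__":
    block = CrossModalFusionBlock()
    a = torch.randn(2, 100, 256)  # e.g. 100 audio frames
    v = torch.randn(2, 100, 256)  # e.g. 100 video frames
    a_fused, v_fused = block(a, v)
    print(a_fused.shape, v_fused.shape)  # torch.Size([2, 100, 256]) each
```

In the full model, the fused unimodal streams would additionally pass through the temporal attention stage (the MTB) before score regression; this sketch covers only the fusion step.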


Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (Grant No. 2019YFA0706200) and in part by the National Natural Science Foundation of China (Grant Nos. 62227807 and 62372217).

Author information

Correspondence to Juan Wang, Zhenyu Liu, or Bin Hu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, Y., et al. (2024). An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14429. Springer, Singapore. https://doi.org/10.1007/978-981-99-8469-5_20

  • DOI: https://doi.org/10.1007/978-981-99-8469-5_20

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8468-8

  • Online ISBN: 978-981-99-8469-5

  • eBook Packages: Computer Science, Computer Science (R0)
