Audio-Based Detection of Anxiety and Depression via Vocal Biomarkers

  • Conference paper

Advances in Information and Communication (FICC 2024)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 919)

Abstract

We compare various model/feature combinations on the task of detecting anxiety and depression from audio signals of spontaneous speech. The models comprise several advanced deep neural networks, including CNN, LSTM, and attention networks, and are compared against traditional, shallow machine-learning models. As input features we compare supra-segmental paralinguistic feature sets against classical Mel-frequency cepstral coefficients (MFCCs) and pre-trained X-vector and Wav2Vec2 features. Our models are trained on self-assessment scores: GAD-7 for anxiety and PHQ-8 for depression. We present binary classification results for anxiety and depression separately and show that, despite the noisy self-assessment labels, our best model achieves an unweighted average recall (UAR) of 0.60 on the anxiety task and 0.63 on the depression task. The result on the anxiety task nearly reaches the reported self-scored GAD-7 screening reliability of 0.64, indicating that our best audio-based model can serve as an anxiety and depression screening tool.
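
For context, UAR is the unweighted mean of the per-class recalls, which makes it insensitive to class imbalance; it is equivalent to macro-averaged recall in scikit-learn [26]. Below is a minimal sketch of computing it, using hypothetical scores and predictions and assuming the common GAD-7 cutoff of 10 for the positive class (the paper's exact binarization threshold is not stated in this preview):

    import numpy as np
    from sklearn.metrics import recall_score

    # Hypothetical GAD-7 self-assessment scores (range 0-21) and classifier outputs.
    gad7_scores = np.array([3, 12, 7, 15, 10, 2, 18, 5])
    y_true = (gad7_scores >= 10).astype(int)      # assumed cutoff: GAD-7 >= 10 -> anxious
    y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])   # hypothetical model predictions

    # UAR = unweighted mean of per-class recalls = macro-averaged recall.
    uar = recall_score(y_true, y_pred, average="macro")
    print(f"UAR: {uar:.2f}")  # 0.75 on this toy example

With two classes, a UAR of 0.50 corresponds to chance level, so the reported 0.60 and 0.63 are meaningful improvements over a majority-class baseline.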

Notes

  1. https://www.who.int/news-room/fact-sheets/detail/mental-disorders.

  2. One could, for example, assign a person to the depressed class based on independent depression screening tests.

  3. Paralinguistics is the study of paralanguage, which connotes "alongside language" and generally describes the non-verbal elements of human communication, i.e. all meta-information that accompanies and complements language [6].

  4. https://kaldi-asr.org/models/8/0008_sitw_v2_1a.tar.gz. See the embedding-extraction sketch after these notes.

  5. Note that this data set was labeled by trained clinical assessors, not relying on self-assessment labels.
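
Note 4 above links the pre-trained Kaldi x-vector extractor; for the Wav2Vec2 features mentioned in the abstract, the sketch below shows one way to obtain a fixed-size utterance embedding with the Hugging Face transformers library. The checkpoint (facebook/wav2vec2-base) and mean pooling over time are assumptions for illustration, since the preview does not specify the authors' exact configuration:

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # Assumed checkpoint; the paper's exact Wav2Vec2 model is not specified here.
    CKPT = "facebook/wav2vec2-base"
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
    model = Wav2Vec2Model.from_pretrained(CKPT)
    model.eval()

    def utterance_embedding(waveform, sr=16_000):
        """Embed a mono 16 kHz waveform (1-D float array) as a single vector."""
        inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            frames = model(**inputs).last_hidden_state  # (1, n_frames, 768)
        return frames.mean(dim=1).squeeze(0)            # mean-pool over time -> (768,)

Such a fixed-length vector can then be fed to either the shallow or the deep classifiers compared in the paper.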

References

  1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org

  2. Arroll, B., et al.: Validation of PHQ-2 and PHQ-9 to screen for major depression in the primary care population. Ann. Family Med. 8(4), 348 (2010)

  3. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems, vol. 33 (NeurIPS 2020). Curran Associates Inc., Red Hook, NY, USA (2020)

  4. Bandelow, B., Michaelis, S.: Epidemiology of anxiety disorders in the 21st century. Dialogues Clin. Neurosci. 17, 327–335 (2015)

  5. Beard, C., Björgvinsson, T.: Beyond generalized anxiety disorder: psychometric properties of the GAD-7 in a heterogeneous psychiatric sample. J. Anxiety Disord. 28(6), 547–552 (2014)

  6. Brueckner, R.: Application of Deep Learning Methods in Computational Paralinguistics. Ph.D. thesis, Technische Universität München (2020)

  7. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), pp. 785–794. ACM, New York (2016)

  8. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Sig. Process. 28(4), 357–366 (1980)

  9. De Angel, V., et al.: Digital health tools for the passive monitoring of depression: a systematic review of methods. NPJ Digit. Med. 5(1), 3 (2022)

  10. Endler, N.S., Kocovski, N.L.: State and trait anxiety revisited. J. Anxiety Disord. 15(3), 231–245 (2001)

  11. Eyben, F.: Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27299-3

  12. Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2016)

  13. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE – the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM, Florence, Italy (2010)

  14. Huang, Z., Epps, J., Joachim, D.: Investigation of speech landmark patterns for depression detection. IEEE Trans. Affect. Comput. 13(2), 666–679 (2022)

  15. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)

  16. Jeancolas, L., et al.: X-vectors: new quantitative biomarkers for early Parkinson's disease detection from speech. Front. Neuroinform. 15, 578369 (2021)

  17. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  18. Kroenke, K., Spitzer, R.L., Williams, J.B.W.: The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16(9), 606–613 (2001)

  19. Ma, X., Yang, H., Chen, Q., Huang, D., Wang, Y.: DepAudioNet: an efficient deep model for audio based depression classification. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 35–42 (2016)

  20. Moro-Velazquez, L., Villalba, J., Dehak, N.: Using X-vectors to automatically detect Parkinson's disease from speech. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1155–1159. IEEE (2020)

  21. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (ICML) 2010, pp. 807–814 (2010)

  22. Nirjhar, E.H., Behzadan, A., Chaspari, T.: Exploring bio-behavioral signal trajectories of state anxiety during public speaking. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1294–1298. IEEE (2020)

  23. Pappagari, R., Cho, J., Moro-Velazquez, L., Dehak, N.: Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer's disease and assess its severity. In: INTERSPEECH, pp. 2177–2181 (2020)

  24. Pappagari, R., Wang, T., Villalba, J., Chen, N., Dehak, N.: X-vectors meet emotions: a study on dependencies between emotion and speaker recognition. In: ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7169–7173. IEEE (2020)

  25. Parra-Gallego, L.F., Orozco-Arroyave, J.R.: Classification of emotions and evaluation of customer satisfaction from speech in real world acoustic environments. Digit. Sig. Process. 120, 103286 (2022)

  26. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  27. Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society (2011)

  28. Raj, D., Snyder, D., Povey, D., Khudanpur, S.: Probing the information encoded in x-vectors. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 726–733. IEEE (2019)

  29. Ringeval, F., et al.: AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pp. 3–8. ACM, Brisbane, Australia (2015)

  30. Sakib, Md.N., Nirjhar, E.H., Feng, K., Behzadan, A., Chaspari, T.: Exploring individual differences of public speaking anxiety in real-life and virtual presentations. IEEE Trans. Affect. Comput. (2021)

  31. Salekin, A., Eberle, J.W., Glenn, J.J., Teachman, B.A., Stankovic, J.A.: A weakly supervised learning framework for detecting social anxiety and depression. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 2(2), 1–26 (2018)

  32. Schuller, B.: Intelligent Audio Analysis – Speech, Music, and Sound Recognition in Real-Life Conditions. Habilitation thesis, Technische Universität München, Munich, Germany (2012)

  33. Schuller, B., Batliner, A.: Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley, Chichester (2014)

  34. Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, Brighton, UK (2009)

  35. Schuller, B., et al.: The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore (2014)

  36. Schuller, B., et al.: Affective and behavioural computing: lessons learnt from the first computational paralinguistics challenge. Comput. Speech Lang. 53, 156–180 (2019)

  37. Schuller, B.W., et al.: The INTERSPEECH 2016 computational paralinguistics challenge: deception, sincerity & native language. In: Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2001–2005. ISCA, San Francisco, CA, USA (2016)

  38. Schuller, B.W., et al.: The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates. In: Proceedings of the 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 431–435 (2021)

  39. Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Povey, D., Khudanpur, S.: Spoken language recognition using x-vectors. In: Odyssey, pp. 105–111 (2018)

  40. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S.: X-vectors: robust DNN embeddings for speaker recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. IEEE (2018)

  41. Spitzer, R.L., Kroenke, K., Williams, J.B.W., Löwe, B.: A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch. Intern. Med. 166(10), 1092–1097 (2006)

  42. Ting, K.M.: Precision and recall. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, p. 781. Springer, Boston (2010). https://doi.org/10.1007/978-0-387-30164-8_652

  43. Valstar, M.F., Gratch, J., Schuller, B.W., Ringeval, F., Cowie, R., Pantic, M. (eds.): Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, AVEC@MM 2016. ACM, Amsterdam (2016)

  44. Valstar, M.F., et al.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Schuller, B.W., Valstar, M.F., Cowie, R., Krajewski, J., Pantic, M. (eds.) Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (AVEC@ACM Multimedia 2013), Barcelona, Spain, pp. 3–10. ACM (2013)

  45. Waibel, A., Hanazawa, T., Hinton, G.E., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Sig. Process. 37, 328–339 (1989)

  46. Weninger, F., Eyben, F., Schuller, B.W., Mortillaro, M., Scherer, K.R.: On the acoustics of emotion in audio: what speech, music, and sound have in common. Front. Psychol. 4 (2013)

  47. Werneck, A.O., Silva, D.R.: Population density, depressive symptoms, and suicidal thoughts. Revista Brasileira de Psiquiatria (2020)

  48. Yin, W., Levis, B., Riehm, K.E., et al.: Equivalency of the diagnostic accuracy of the PHQ-8 and PHQ-9: a systematic review and individual participant data meta-analysis. Psychol. Med. 50(8), 1368–1380 (2020)

Author information

Corresponding author: Raymond Brueckner

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Brueckner, R., Kwon, N., Subramanian, V., Blaylock, N., O'Connell, H. (2024). Audio-Based Detection of Anxiety and Depression via Vocal Biomarkers. In: Arai, K. (ed.) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol. 919. Springer, Cham. https://doi.org/10.1007/978-3-031-53960-2_9
