Abstract
We present a comparison of various model/feature combinations for the task of detecting anxiety and depression from audio recordings of spontaneous speech. The adopted models comprise several advanced deep neural networks, including CNN, LSTM, and attention networks, and are compared against traditional, shallow machine learning models. As input features, we compare supra-segmental paralinguistic feature sets against classical Mel-frequency cepstral coefficients (MFCCs) and pre-trained X-vector and Wav2Vec2 embeddings. Our models are trained on self-assessment scores: GAD-7 for anxiety and PHQ-8 for depression. We present binary classification results for anxiety and depression separately and show that, despite the noisy self-assessment labels, our best model achieves an unweighted average recall (UAR) of 0.60 for anxiety and 0.63 for depression. The anxiety result almost reaches the reported self-scored GAD-7 screening reliability of 0.64. This shows that our best audio-based model can be deployed as an anxiety and depression screening tool.
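For readers unfamiliar with the evaluation setup, the short Python sketch below illustrates how binary labels and the UAR reported above are typically computed. It is not the authors' implementation: the cutoff of 10 (the commonly published screening threshold for both GAD-7 and PHQ-8) and the example scores are assumptions for illustration, and macro-averaged recall from scikit-learn is used as the UAR.

# A minimal sketch, assuming NumPy and scikit-learn; not the authors' code.
import numpy as np
from sklearn.metrics import recall_score

def binarize_scores(scores, cutoff=10):
    # Map GAD-7 or PHQ-8 self-assessment scores to binary class labels.
    # A cutoff of 10 is the commonly published screening threshold; the
    # paper's exact binarization may differ.
    return (np.asarray(scores) >= cutoff).astype(int)

def unweighted_average_recall(y_true, y_pred):
    # UAR is the mean of per-class recalls, i.e. macro-averaged recall,
    # which makes the metric robust to class imbalance.
    return recall_score(y_true, y_pred, average="macro")

# Hypothetical GAD-7 scores and model predictions, for illustration only.
y_true = binarize_scores([3, 12, 8, 15, 10, 2])  # -> [0, 1, 0, 1, 1, 0]
y_pred = np.array([0, 1, 1, 1, 0, 0])
print(f"UAR: {unweighted_average_recall(y_true, y_pred):.2f}")  # UAR: 0.67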
Notes
1.
2. One could, for example, assign a person to the depressed class based on independent depression screening tests.
3. Paralinguistics is the study of paralanguage, which connotes “alongside language” and generally describes the non-verbal elements of human communication, i.e. all meta-information that accompanies and complements language [6].
4.
5. Note that this data set was labeled by trained clinical assessors, not relying on self-assessment labels.
References
Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org
Arroll, B., et al.: Validation of PHQ-2 and PHQ-9 to screen for major depression in the primary care population. Ann. Fam. Med. 8(4), 348 (2010)
Baevski, A., Zhou, H., Mohamed, A., Auli, M.: Wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems, vol. 33 (NeurIPS 2020). Curran Associates Inc., Red Hook, NY, USA (2020)
Bandelow, B., Michaelis, S.: Epidemiology of anxiety disorders in the 21st century. Dialogues Clin. Neurosci. 17, 327–335 (2015)
Beard, C., Björgvinsson, T.: Beyond generalized anxiety disorder: psychometric properties of the GAD-7 in a heterogeneous psychiatric sample. J. Anxiety Disord. 28(6), 547–552 (2014)
Brueckner, R.: Application of Deep Learning Methods in Computational Paralinguistics. Ph.D. thesis, Technische Universität München (2020)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16, pp. 785–794. ACM, New York (2016)
Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Sig. Process. 28(4), 357–366 (1980)
De Angel, V., et al.: Digital health tools for the passive monitoring of depression: a systematic review of methods. NPJ Digit. Med. 5(1), 3 (2022)
Endler, N.S., Kocovski, N.L.: State and trait anxiety revisited. J. Anxiety Disord. 15(3), 231–245 (2001)
Eyben, F.: Real-Time Speech and Music Classification by Large Audio Feature Space Extraction. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27299-3
Eyben, F., et al.: The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2016)
Eyben, F., Wöllmer, M., Schuller, B.: openSMILE – The Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462. ACM, Florence, Italy (2010)
Huang, Z., Epps, J., Joachim, D.: Investigation of speech landmark patterns for depression detection. IEEE Trans. Affect. Comput. 13(2), 666–679 (2022)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Jeancolas, L., et al.: X-vectors: new quantitative biomarkers for early Parkinson’s disease detection from speech. Front. Neuroinform. 15, 578369 (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kroenke, K., Spitzer, R.L., Williams, J.B.W.: The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16(9), 606–613 (2001)
Ma, X., Yang, H., Chen, Q., Huang, D., Wang, Y.: DepAudioNet: an efficient deep model for audio-based depression classification. In: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pp. 35–42 (2016)
Moro-Velazquez, L., Villalba, J., Dehak, N.: Using X-vectors to automatically detect Parkinson’s disease from speech. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1155–1159. IEEE (2020)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning (ICML) 2010, pp. 807–814 (2010)
Nirjhar, E.H., Behzadan, A., Chaspari, T.: Exploring bio-behavioral signal trajectories of state anxiety during public speaking. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1294–1298. IEEE (2020)
Pappagari, R., Cho, J., Moro-Velazquez, L., Dehak, N.: Using state of the art speaker recognition and natural language processing technologies to detect Alzheimer’s disease and assess its severity. In: INTERSPEECH, pp. 2177–2181 (2020)
Pappagari, R., Wang, T., Villalba, J., Chen, N., Dehak, N.: X-vectors meet emotions: a study on dependencies between emotion and speaker recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7169–7173. IEEE (2020)
Parra-Gallego, L.F., Orozco-Arroyave, J.R.: Classification of emotions and evaluation of customer satisfaction from speech in real world acoustic environments. Digit. Sig. Process. 120, 103286 (2022)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Povey, D., et al.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society (2011)
Raj, D., Snyder, D., Povey, D., Khudanpur, S.: Probing the information encoded in x-vectors. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 726–733. IEEE (2019)
Ringeval, F., et al.: AV+EC 2015: the first affect recognition challenge bridging across audio, video, and physiological data. In: Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pp. 3–8. ACM, Brisbane, Australia (2015)
Sakib, Md.N., Nirjhar, E.H., Feng, K., Behzadan, A., Chaspari, T.: Exploring individual differences of public speaking anxiety in real-life and virtual presentations. IEEE Trans. Affect. Comput. 1 (2021)
Salekin, A., Eberle, J.W., Glenn, J.J., Teachman, B.A., Stankovic, J.A.: A weakly supervised learning framework for detecting social anxiety and depression. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 2(2), 1–26 (2018)
Schuller, B.: Intelligent Audio Analysis – Speech, Music, and Sound Recognition in Real-Life Conditions. Habilitation thesis, Technische Universität München, Munich, Germany (2012)
Schuller, B., Batliner, A.: Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing. Wiley, Chichester (2014)
Schuller, B., Steidl, S., Batliner, A.: The INTERSPEECH 2009 emotion challenge. In: Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH). ISCA, Brighton, UK (2009)
Schuller, B., et al.: The INTERSPEECH 2014 computational paralinguistics challenge: cognitive & physical load. In: Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH), Singapore (2014)
Schuller, B., et al.: Affective and behavioural computing: lessons learnt from the first computational paralinguistics challenge. Comput. Speech Lang. 53, 156–180 (2019)
Schuller, B.W., et al.: The INTERSPEECH 2016 computational paralinguistics challenge: deception, sincerity & native language. In: Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH), vol. 2016, pp. 2001–2005. ISCA, San Francisco, CA, USA (2016)
Schuller, B.W., et al.: The INTERSPEECH 2021 computational paralinguistics challenge: COVID-19 cough, COVID-19 speech, escalation & primates. In: Proceedings of the 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 431–435 (2021)
Snyder, D., Garcia-Romero, D., McCree, A., Sell, G., Povey, D., Khudanpur, S.: Spoken language recognition using x-vectors. In: Odyssey, pp. 105–111 (2018)
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S.: X-vectors: robust DNN embeddings for speaker recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. IEEE (2018)
Spitzer, R.L., Kroenke, K., Williams, J.B.W., Löwe, B.: A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch. Intern. Med. 166(10), 1092–1097 (2006)
Ting, K.M.: Precision and recall. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning, p. 781. Springer, Boston (2010). https://doi.org/10.1007/978-0-387-30164-8_652
Valstar, M.F., Gratch, J., Schuller, B.W., Ringeval, F., Cowie, R., Pantic, M. (eds.): Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, AVEC@MM 2016. ACM, Amsterdam (2016)
Valstar, M.F., et al.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Schuller, B.W., Valstar, M.F., Cowie, R., Krajewski, J., Pantic, M. (eds.) Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, AVEC@ACM Multimedia 2013, Barcelona, Spain, 21 October 2013, pp. 3–10. ACM (2013)
Waibel, A.H., Hanazawa, T., Hinton, G.E., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech Sig. Process. 37, 328–339 (1989)
Weninger, F., Eyben, F., Schuller, B.W., Mortillaro, M., Scherer, K.R.: On the acoustics of emotion in audio: what speech, music, and sound have in common. Front. Psychol. 4 (2013)
Werneck, A.O., Silva, D.R.: Population density, depressive symptoms, and suicidal thoughts. Revista Brasileira de Psiquiatria (2020)
Yin, W., Levis, B., Riehm, K.E., et al.: Equivalency of the diagnostic accuracy of the PHQ-8 and PHQ-9: a systematic review and individual participant data meta-analysis. Psychol. Med. 50(8), 1368–1380 (2020)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Brueckner, R., Kwon, N., Subramanian, V., Blaylock, N., O’Connell, H. (2024). Audio-Based Detection of Anxiety and Depression via Vocal Biomarkers. In: Arai, K. (eds) Advances in Information and Communication. FICC 2024. Lecture Notes in Networks and Systems, vol 919. Springer, Cham. https://doi.org/10.1007/978-3-031-53960-2_9
DOI: https://doi.org/10.1007/978-3-031-53960-2_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53959-6
Online ISBN: 978-3-031-53960-2
eBook Packages: Intelligent Technologies and Robotics