DOI: 10.1145/2988257.2988267

DepAudioNet: An Efficient Deep Model for Audio based Depression Classification

Published: 16 October 2016

Abstract

This paper presents a novel and effective audio-based method for depression classification. It focuses on two important issues, i.e., data representation and sample imbalance, which are not well addressed in the literature. For the former, in contrast to traditional shallow hand-crafted features, we propose a deep model, DepAudioNet, that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to encode depression-related characteristics in the vocal channel and deliver a more comprehensive audio representation. For the latter, we introduce a random sampling strategy in the model training phase to balance the positive and negative samples, which largely alleviates the bias caused by the uneven sample distribution. Evaluations are carried out on the DAIC-WOZ dataset for the Depression Classification Sub-Challenge (DCC) of the 2016 Audio-Visual Emotion Challenge (AVEC), and the experimental results clearly demonstrate the effectiveness of the proposed approach.
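To make the data-representation side concrete, here is a minimal sketch of a CNN-plus-LSTM audio classifier in PyTorch. The Mel-spectrogram input, layer widths, and single convolutional stage are illustrative assumptions, not the exact DepAudioNet configuration described in the paper.

```python
# A hedged sketch of a CNN+LSTM audio classifier in the spirit of DepAudioNet.
# The 64-bin Mel input, channel widths, and kernel sizes are assumptions.
import torch
import torch.nn as nn

class DepAudioNetSketch(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        # 1-D convolution over time captures local spectro-temporal patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3),
        )
        # The LSTM aggregates convolutional features over longer time spans.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)  # one logit: depressed vs. not depressed

    def forward(self, x):                    # x: (batch, n_mels, frames)
        h = self.conv(x)                     # (batch, hidden, frames')
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, frames', hidden)
        return self.fc(h[:, -1])             # logit from the last time step

# Smoke test on a random 300-frame "spectrogram" batch.
logits = DepAudioNetSketch()(torch.randn(4, 64, 300))  # shape: (4, 1)
```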

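The sample-balancing side can likewise be sketched in a few lines: at the start of every epoch, the majority class is randomly subsampled to the size of the minority class, so each epoch sees a balanced, freshly resampled training set. The helper below is an illustration under that reading; its name and index-based interface are invented here, and the paper's exact segment-level scheme may differ.

```python
# A minimal sketch of per-epoch random majority-class subsampling.
import random

def balanced_epoch_indices(labels, seed=None):
    """Return sample indices with the majority class randomly
    subsampled to match the minority class size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    epoch = minority + rng.sample(majority, len(minority))  # redraw each epoch
    rng.shuffle(epoch)
    return epoch

# Example: 2 positives and 5 negatives yield 4 balanced indices per epoch.
print(balanced_epoch_indices([0, 0, 1, 0, 1, 0, 0], seed=0))
```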




    Published In

    AVEC '16: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge
    October 2016
    114 pages
    ISBN: 9781450345163
    DOI: 10.1145/2988257
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CNN
    2. LSTM
    3. audio representation
    4. depression recognition

    Qualifiers

    • Research-article

    Conference

    MM '16: ACM Multimedia Conference
    October 16, 2016
    Amsterdam, The Netherlands

    Acceptance Rates

    AVEC '16 Paper Acceptance Rate: 12 of 14 submissions, 86%
    Overall Acceptance Rate: 52 of 98 submissions, 53%


