DOI: 10.1145/2988257.2988267

DepAudioNet: An Efficient Deep Model for Audio based Depression Classification

Published: 16 October 2016

Abstract

This paper presents a novel and effective audio-based method for depression classification. It focuses on two important issues, i.e., data representation and sample imbalance, which are not well addressed in the literature. For the former, in contrast to traditional shallow hand-crafted features, we propose a deep model, DepAudioNet, that combines a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to encode depression-related characteristics in the vocal channel and deliver a more comprehensive audio representation. For the latter, we introduce a random sampling strategy in the model training phase to balance the positive and negative samples, which largely alleviates the bias caused by the uneven sample distribution. Evaluations are carried out on the DAIC-WOZ dataset for the Depression Classification Sub-Challenge (DCC) of the 2016 Audio-Visual Emotion Challenge (AVEC), and the experimental results clearly demonstrate the effectiveness of the proposed approach.
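To make the data-representation side concrete, here is a minimal sketch of a CNN-plus-LSTM audio classifier in PyTorch. The Mel-spectrogram input, layer widths, and single convolutional stage are illustrative assumptions, not the exact DepAudioNet configuration described in the paper.

```python
# A hedged sketch of a CNN+LSTM audio classifier in the spirit of DepAudioNet.
# The 64-bin Mel input, channel widths, and kernel sizes are assumptions.
import torch
import torch.nn as nn

class DepAudioNetSketch(nn.Module):
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        # 1-D convolution over time captures local spectro-temporal patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3),
        )
        # The LSTM aggregates convolutional features over longer time spans.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)  # one logit: depressed vs. not depressed

    def forward(self, x):                    # x: (batch, n_mels, frames)
        h = self.conv(x)                     # (batch, hidden, frames')
        h, _ = self.lstm(h.transpose(1, 2))  # (batch, frames', hidden)
        return self.fc(h[:, -1])             # logit from the last time step

# Smoke test on a random 300-frame "spectrogram" batch.
logits = DepAudioNetSketch()(torch.randn(4, 64, 300))  # shape: (4, 1)
```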

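The sample-balancing side can likewise be sketched in a few lines: at the start of every epoch, the majority class is randomly subsampled to the size of the minority class, so each epoch sees a balanced, freshly resampled training set. The helper below is an illustration under that reading; its name and index-based interface are invented here, and the paper's exact segment-level scheme may differ.

```python
# A minimal sketch of per-epoch random majority-class subsampling.
import random

def balanced_epoch_indices(labels, seed=None):
    """Return sample indices with the majority class randomly
    subsampled to match the minority class size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    epoch = minority + rng.sample(majority, len(minority))  # redraw each epoch
    rng.shuffle(epoch)
    return epoch

# Example: 2 positives and 5 negatives yield 4 balanced indices per epoch.
print(balanced_epoch_indices([0, 0, 1, 0, 1, 0, 0], seed=0))
```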




    Published In

    AVEC '16: Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge
    October 2016
    114 pages
    ISBN: 9781450345163
    DOI: 10.1145/2988257
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. CNN
    2. LSTM
    3. audio representation
    4. depression recognition

    Qualifiers

    • Research-article

    Conference

    MM '16: ACM Multimedia Conference
    October 16, 2016
    Amsterdam, The Netherlands

    Acceptance Rates

    AVEC '16 Paper Acceptance Rate: 12 of 14 submissions, 86%
    Overall Acceptance Rate: 52 of 98 submissions, 53%


