research-article
DOI: 10.1145/3347320.3357695

Multi-modality Depression Detection via Multi-scale Temporal Dilated CNNs

Published: 15 October 2019

Abstract

Depression is a prevalent mental illness that negatively impacts both individuals and society. This paper targets the Detecting Depression with AI Sub-challenge (DDS) of the Audio/Visual Emotion Challenge (AVEC) 2019. First, two task-specific features are proposed: 1) deep contextual text features, which combine global text features with sentiment scores estimated by a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model; and 2) span-wise dense temporal statistical features, in which multiple statistical functions are applied within each continuous time span. Furthermore, we propose a multi-scale temporal dilated CNN that captures the hidden temporal dependencies in the data for automatic multi-modality depression detection. Our proposed framework achieves competitive performance, with a Concordance Correlation Coefficient (CCC) of 0.466 on the development set and 0.430 on the test set, markedly higher than the baseline results of 0.269 and 0.120, respectively.
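
For readers who want to prototype the text branch: the deep contextual text features pair a global sentence embedding with a BERT-derived sentiment score. Below is a minimal, hypothetical sketch using the Hugging Face transformers library; the `bert-base-uncased` checkpoint and two-label head stand in for the authors' fine-tuned sentiment model, and the exact pooling they use is not specified here.

```python
# Hypothetical sketch of the deep contextual text features: a pooled BERT
# embedding as the global text feature plus a sentiment probability from a
# classification head. Checkpoint and head are placeholders, not the
# authors' fine-tuned model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # placeholder; assume a sentiment-fine-tuned copy
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

inputs = tokenizer("I have not been sleeping well.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

global_text_feature = out.hidden_states[-1][:, 0]   # [CLS] embedding, (1, 768)
sentiment_score = out.logits.softmax(dim=-1)[:, 1]  # positive-class probability
print(global_text_feature.shape, sentiment_score.item())
```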
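The span-wise dense temporal statistical features apply several statistical functionals within each continuous time span of frame-level descriptors. A minimal NumPy sketch follows, assuming fixed-length overlapping spans and a mean/std/max/min functional set; span length, hop size, and the exact functionals are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of span-wise dense temporal statistical features:
# frame-level features are cut into fixed-length, overlapping time spans and
# several statistical functionals are applied per span and per dimension.
import numpy as np

def span_statistics(frames: np.ndarray, span: int, hop: int) -> np.ndarray:
    """frames: (T, D) frame-level features -> (num_spans, 4 * D) span features."""
    spans = []
    for start in range(0, len(frames) - span + 1, hop):
        window = frames[start:start + span]        # (span, D)
        spans.append(np.concatenate([
            window.mean(axis=0),                   # mean
            window.std(axis=0),                    # standard deviation
            window.max(axis=0),                    # maximum
            window.min(axis=0),                    # minimum
        ]))
    return np.stack(spans)

# Example: 1000 frames of 23-dim descriptors, 100-frame spans, 50-frame hop.
feats = span_statistics(np.random.randn(1000, 23), span=100, hop=50)
print(feats.shape)  # (19, 92)
```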
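A multi-scale temporal dilated CNN can be pictured as parallel temporal convolutions whose dilation rates give each branch a different receptive field. Here is a minimal PyTorch sketch of one such block, with channel counts and dilation rates chosen for illustration rather than taken from the paper.

```python
# A minimal sketch of one multi-scale temporal dilated block: parallel 1-D
# convolutions with different dilation rates, concatenated channel-wise.
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    def __init__(self, in_ch: int, branch_ch: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One temporal conv branch per dilation rate; padding by the dilation
        # keeps the sequence length unchanged so branches can be concatenated.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

# Example: a batch of 4 sequences, 92 feature channels, 19 time steps.
block = MultiScaleDilatedBlock(in_ch=92, branch_ch=32)
out = block(torch.randn(4, 92, 19))
print(out.shape)  # torch.Size([4, 128, 19])
```

Concatenating branches with growing dilation rates mirrors multi-scale designs such as dilated context aggregation and Inception-style blocks: small dilations capture local dynamics while large ones cover long-range temporal dependencies.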
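The evaluation metric, the Concordance Correlation Coefficient, rewards predictions that are both correlated with and calibrated to the labels: it equals 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), so it penalises mean and scale shifts as well as decorrelation. A direct NumPy implementation of this standard definition:

```python
# Concordance Correlation Coefficient (CCC), the metric reported above.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Toy example with made-up scores.
print(ccc(np.array([4., 9., 13., 20.]), np.array([5., 8., 14., 18.])))
```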




    Published In

    AVEC '19: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop
    October 2019
    96 pages
ISBN: 978-1-4503-6913-8
DOI: 10.1145/3347320

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. depression detection
    2. multi-modality
    3. multi-scale temporal dilated CNNs
    4. multi-scale temporal pooling


    Conference

    MM '19

    Acceptance Rates

    Overall Acceptance Rate 52 of 98 submissions, 53%

