research-article
DOI: 10.1145/3347320.3357695

Multi-modality Depression Detection via Multi-scale Temporal Dilated CNNs

Published: 15 October 2019

Abstract

Depression is a prevalent mental illness that negatively impacts both individuals and society. This paper targets the Detecting Depression with AI Sub-challenge (DDS) of the Audio/Visual Emotion Challenge (AVEC) 2019. First, two task-specific features are proposed: 1) deep contextual text features, which combine global text features with sentiment scores estimated by a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model; and 2) span-wise dense temporal statistical features, in which multiple statistical functions are applied within each continuous time span. Furthermore, we propose a multi-scale temporal dilated CNN that captures the hidden temporal dependencies in the data for automatic multi-modality depression detection. Our proposed framework achieves competitive performance, with a Concordance Correlation Coefficient (CCC) of 0.466 on the development set and 0.430 on the test set, markedly higher than the baseline results of 0.269 and 0.120, respectively.
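
For readers who want to prototype the text branch: the deep contextual text features pair a global sentence embedding with a BERT-derived sentiment score. Below is a minimal, hypothetical sketch using the Hugging Face transformers library; the `bert-base-uncased` checkpoint and two-label head stand in for the authors' fine-tuned sentiment model, and the exact pooling they use is not specified here.

```python
# Hypothetical sketch of the deep contextual text features: a pooled BERT
# embedding as the global text feature plus a sentiment probability from a
# classification head. Checkpoint and head are placeholders, not the
# authors' fine-tuned model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # placeholder; assume a sentiment-fine-tuned copy
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
model.eval()

inputs = tokenizer("I have not been sleeping well.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

global_text_feature = out.hidden_states[-1][:, 0]   # [CLS] embedding, (1, 768)
sentiment_score = out.logits.softmax(dim=-1)[:, 1]  # positive-class probability
print(global_text_feature.shape, sentiment_score.item())
```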
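The span-wise dense temporal statistical features apply several statistical functionals within each continuous time span of frame-level descriptors. A minimal NumPy sketch follows, assuming fixed-length overlapping spans and a mean/std/max/min functional set; span length, hop size, and the exact functionals are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of span-wise dense temporal statistical features:
# frame-level features are cut into fixed-length, overlapping time spans and
# several statistical functionals are applied per span and per dimension.
import numpy as np

def span_statistics(frames: np.ndarray, span: int, hop: int) -> np.ndarray:
    """frames: (T, D) frame-level features -> (num_spans, 4 * D) span features."""
    spans = []
    for start in range(0, len(frames) - span + 1, hop):
        window = frames[start:start + span]        # (span, D)
        spans.append(np.concatenate([
            window.mean(axis=0),                   # mean
            window.std(axis=0),                    # standard deviation
            window.max(axis=0),                    # maximum
            window.min(axis=0),                    # minimum
        ]))
    return np.stack(spans)

# Example: 1000 frames of 23-dim descriptors, 100-frame spans, 50-frame hop.
feats = span_statistics(np.random.randn(1000, 23), span=100, hop=50)
print(feats.shape)  # (19, 92)
```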
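A multi-scale temporal dilated CNN can be pictured as parallel temporal convolutions whose dilation rates give each branch a different receptive field. Here is a minimal PyTorch sketch of one such block, with channel counts and dilation rates chosen for illustration rather than taken from the paper.

```python
# A minimal sketch of one multi-scale temporal dilated block: parallel 1-D
# convolutions with different dilation rates, concatenated channel-wise.
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    def __init__(self, in_ch: int, branch_ch: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One temporal conv branch per dilation rate; padding by the dilation
        # keeps the sequence length unchanged so branches can be concatenated.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))

# Example: a batch of 4 sequences, 92 feature channels, 19 time steps.
block = MultiScaleDilatedBlock(in_ch=92, branch_ch=32)
out = block(torch.randn(4, 92, 19))
print(out.shape)  # torch.Size([4, 128, 19])
```

Concatenating branches with growing dilation rates mirrors multi-scale designs such as dilated context aggregation and Inception-style blocks: small dilations capture local dynamics while large ones cover long-range temporal dependencies.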
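The evaluation metric, the Concordance Correlation Coefficient, rewards predictions that are both correlated with and calibrated to the labels: it equals 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), so it penalises mean and scale shifts as well as decorrelation. A direct NumPy implementation of this standard definition:

```python
# Concordance Correlation Coefficient (CCC), the metric reported above.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Toy example with made-up scores.
print(ccc(np.array([4., 9., 13., 20.]), np.array([5., 8., 14., 18.])))
```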




    Published In

    AVEC '19: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop
    October 2019
    96 pages
ISBN: 978-1-4503-6913-8
DOI: 10.1145/3347320

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. depression detection
    2. multi-modality
    3. multi-scale temporal dilated CNNs
    4. multi-scale temporal pooling


    Conference

    MM '19

    Acceptance Rates

    Overall Acceptance Rate 52 of 98 submissions, 53%

