DOI: 10.1145/3347320.3357696

A Multi-Modal Hierarchical Recurrent Neural Network for Depression Detection

Published: 15 October 2019

Abstract

We propose a multi-modal method with a hierarchical recurrent neural structure that integrates vision, audio, and text features for depression detection. The method comprises two hierarchies of bidirectional long short-term memory (BiLSTM) networks that fuse the multi-modal features and predict the severity of depression. An adaptive sample weighting mechanism is introduced to accommodate the diversity of training samples. Experiments on the test set of a depression detection challenge demonstrate the effectiveness of the proposed method.
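The two-level structure described above can be sketched in PyTorch as follows: a lower BiLSTM fuses the concatenated per-frame modality features within each segment, and an upper BiLSTM operates over the resulting segment embeddings to regress a severity score. All dimensions, the mean-pooling between levels, and the module names here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HierarchicalBiLSTM(nn.Module):
    """Hypothetical sketch of a two-level BiLSTM for multi-modal fusion."""

    def __init__(self, feat_dims=(64, 32, 16), hidden=32):
        super().__init__()
        fused = sum(feat_dims)  # vision + audio + text feature dims
        # Level 1: BiLSTM over frames within each segment (fused modalities)
        self.frame_lstm = nn.LSTM(fused, hidden, batch_first=True,
                                  bidirectional=True)
        # Level 2: BiLSTM over segment-level embeddings
        self.seg_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # regress severity score

    def forward(self, vision, audio, text):
        # Each modality: (batch, segments, frames, dim); fuse on feature axis
        x = torch.cat([vision, audio, text], dim=-1)
        b, s, t, d = x.shape
        frame_out, _ = self.frame_lstm(x.view(b * s, t, d))
        seg_emb = frame_out.mean(dim=1).view(b, s, -1)  # pool frames
        seg_out, _ = self.seg_lstm(seg_emb)
        return self.head(seg_out.mean(dim=1)).squeeze(-1)
```

A forward pass on a batch of 2 sessions, each with 4 segments of 10 frames, yields one severity prediction per session. The adaptive sample weighting mentioned in the abstract would then reweight each session's loss term during training.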




    Published In

AVEC '19: Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop
    October 2019
    96 pages
    ISBN:9781450369138
    DOI:10.1145/3347320

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. depression detection
2. multi-modal learning
    3. neural networks

    Qualifiers

    • Research-article

    Funding Sources

    • National Science Foundation of China
    • National Key R&D Program of China

Conference

MM '19

    Acceptance Rates

    Overall Acceptance Rate 52 of 98 submissions, 53%
