Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis
Introduction
During mental health evaluations, it is standard practice for a clinician to evaluate a patient's spoken language behavior. Individuals with depressive disorders often exhibit psychogenic voice disturbances accompanying adverse changes to the autonomic system and personality (Perepa, 2017). Recorded exemplars of speech-language disruption in clinically depressed patients include disfluent speech patterns, abandonment of phrases, and unusually long response latencies (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985). Patients with clinical depression also exhibit more speech hesitations (i.e. pauses, repeats, false starts) during communication than non-depressed populations, owing to psychomotor agitation/retardation and cognitive processing delays (Alpert et al., 2001; Cannizzaro et al., 2004; Darby et al., 1984; Duffy, 2008; Ellgring & Scherer, 1996; Fossati et al., 2003; Hartlage et al., 1993; Nilsonne, 1987; Nilsonne et al., 1988; Szabadi et al., 1976).
Speech-based depression studies (Alghowinem et al., 2012; Alpert et al., 2001; Esposito et al., 2016; Mundt, 2012; Nilsonne et al., 1988; Stassen et al., 1998; Szabadi et al., 1976) have evaluated pause durations and frequency ratios (e.g. filled-pause rate, empty-pause rate) with varied success. For example, Alghowinem et al. (2012) and Esposito et al. (2016) observed that average pause durations in spontaneous speech were significantly longer for depressed speakers than for non-depressed speakers. Because spontaneous speech is unconstrained in length and content, pause and rate ratios (i.e. the total number of pauses divided by the total number of words; the total pause time divided by the total recording time) have often been used to compare utterances of different lengths (Liu et al., 2017).
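The two length-normalized ratios above can be sketched directly. The `Recording` container and its values below are purely illustrative, not drawn from any study data:

```python
from dataclasses import dataclass

# Hypothetical annotation of one recording: word count, pause events,
# and durations (values are illustrative only).
@dataclass
class Recording:
    num_words: int
    pause_durations_s: list   # durations of detected pauses, in seconds
    total_duration_s: float   # total recording length, in seconds

def pause_rate_ratio(rec: Recording) -> float:
    """Total number of pauses divided by the total number of words."""
    return len(rec.pause_durations_s) / rec.num_words

def pause_time_ratio(rec: Recording) -> float:
    """Total pause time divided by the total recording time."""
    return sum(rec.pause_durations_s) / rec.total_duration_s

rec = Recording(num_words=40, pause_durations_s=[0.5, 0.8, 0.7],
                total_duration_s=20.0)
rate_ratio = pause_rate_ratio(rec)   # 3 pauses over 40 words
time_ratio = pause_time_ratio(rec)   # 2.0 s of pausing over 20.0 s of speech
```

Because both measures are normalized, utterances of different lengths become directly comparable, which is the point made by Liu et al. (2017).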
Concerning speech errors, Rubino et al. (2011) found that depressed speakers produced significantly more referential failures (including word replacement errors, such as malapropisms) than non-depressed speakers during spontaneous tasks. A malapropism is the incorrect substitution of one word for an intended word; by definition, it is unrelated in meaning to the intended word but shares a similar pronunciation, grammatical category, word stress, and syllable length (Fay & Cutler, 1977). Since Rubino et al. (2011), no automatic speech-based depression studies have pursued speech errors as prospective discriminative depression features – surprisingly, not even for predetermined read tasks.
Diagnostic applications based on automatic speech-based depression classification are often reliant on the uniformity of the patient elicitation procedure. Each elicitation method can include a wide range of linguistic structure, affect information, and task requirements (Howe et al., 2014; Stasak et al., 2017, 2018b). To date, there is no clinically approved speech-depression automatic diagnosis protocol for widespread use, and there is not even consensus on elicitation methods among researchers. Many exploratory speech-based depression techniques have been systematically investigated using a combination of speech-related features (i.e. acoustic, text-based), machine learning techniques, and speech elicitation protocols (see e.g.: Cummins et al., 2015; Jiang et al., 2017; Liu et al., 2017; Stasak et al., 2017; Valstar et al., 2016). However, still to date, no specific system to aid depression diagnosis holds dominance over all others.
The aim of the experiments herein is to investigate speech disfluency behaviors in non-depressed/depressed speakers using read text containing specific affective-linguistic criteria. We hypothesize that, unlike spontaneous speech, sentences containing target words within specific valence ranges will provide more accurate ground-truth for disfluency analysis as a result of their phonetic and affective constraints; these constraints further afford directly comparable speech data between different speakers. While natural disfluencies are common in spontaneous speech (Johnson et al., 2004; Gósy, 2003), due to the abnormal effects of depression on cognitive-motor skills (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985), we hypothesize that during simple sentence reading tasks depressed speakers will demonstrate greater numbers of abnormal pauses and speech errors than non-depressed speakers.
According to structural affect theory (Brewer & Lichtenstein, 1982) and information structure studies (Arnold et al., 2013; Dahan, 2015), the emotions of a reader can be systematically manipulated by the order in which information is presented in a text. Based on these studies, it is hypothesized that positioning the affective target word at the beginning of a sentence, rather than in the middle or at the end, will give a speaker earlier cognitive cues for mood processing and appropriate prosodic phrasing. We suspect that healthy speakers will utilize sentence-initial affective keyword information differently than depressed speakers (i.e. the latter showing less prosodic range or appropriateness (Cummins et al., 2015)).
With regard to the reading of texts, Salem et al. (2017) found that direct discourse (e.g. first-person) narratives elicited a stronger sense of taking on the character's perspective than indirect discourse (e.g. third-person) narratives. Therefore, it is anticipated that features extracted from first-person read narrative sentences will generate better classification results than third-person narrative sentences, due to greater emotional attachment. We hypothesize that depressed speakers will exhibit less dynamic emotional vocal range than healthy speakers, on account of their negative fixation (Goeleven et al., 2006; Gotlib & McCann, 1984) and/or reduced vocal control due to psychomotor agitation/retardation (Flint et al., 1993; Hoffman et al., 1985). Given the passive avoidance strategies exhibited by people with depression (Holahan & Moos, 1987; Holahan et al., 2005), it is also anticipated that depressed speakers will attempt fewer self-corrections after speech errors than non-depressed speakers.
Due to the constraints of text-dependent stimuli, disfluency attributes derived from both manual annotation and automatic speech recognition (ASR) output are evaluated. It is anticipated that text-dependent constraints will improve the precision of the ASR output, since the acoustic-phonetic variability in the elicited speech is smaller than that of, for example, spontaneous speech.
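As a minimal sketch of how text-dependent constraints simplify disfluency scoring (this is not the study's annotation pipeline), an ASR transcript of a read sentence can be aligned against its known prompt, tallying insertions (e.g. repeats, false starts), deletions (skipped words), and replacements (e.g. malapropism-like substitutions). The helper and example sentences below are hypothetical:

```python
import difflib

def count_read_errors(reference: str, hypothesis: str) -> dict:
    """Align a transcript of a read sentence against its known prompt and
    tally word-level deviations, a rough proxy for speech errors."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    errors = {"replace": 0, "delete": 0, "insert": 0}
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if tag == "replace":          # substituted word(s)
            errors["replace"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":         # prompt word that was skipped
            errors["delete"] += i2 - i1
        elif tag == "insert":         # extra word, e.g. a repeat or false start
            errors["insert"] += j2 - j1
    return errors

# Prompt vs. a reading with one repeat and one substitution
errs = count_read_errors("the bright morning felt hopeful",
                         "the the bright morning felt hopeless")
```

Because the prompt is fixed, no reference transcription effort is needed per speaker, which is precisely the advantage of text-dependent stimuli over spontaneous speech.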
Database
For all experiments herein, the Black Dog Institute Affective Sentences (BDAS) corpus, a data-collection extension of that found in Alghowinem et al. (2012, 2013a, 2013b, 2015), Cummins et al. (2011), and Joshi et al. (2013a, 2013b), was used on account of its clinically validated depression diagnoses. Furthermore, the BDAS corpus had a controlled speech elicitation mode, which comprised read sentences with deliberately designed affective target words (see Section 3.6, Table 1). The speakers
Acoustic feature extraction
For the experiments herein, the openSMILE speech toolkit was used to extract 88 eGeMAPS (Eyben et al., 2015) acoustic speech features (i.e. derived from fundamental frequency, loudness, formants, mel-cepstral coefficients) from all 20 sentences in the BDAS corpus. The eGeMAPS features were calculated from 20 ms frames with 50% frame overlap, wherein an aggregated mean functional was computed per eGeMAPS feature. The eGeMAPS feature set was chosen because it has been used
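The frame-level extraction with a mean functional described above can be illustrated as follows. This numpy sketch uses a simple RMS-energy stand-in for the actual eGeMAPS low-level descriptors, which openSMILE computes internally; the function names are our own:

```python
import numpy as np

def frame_signal(x: np.ndarray, sr: int,
                 frame_ms: float = 20.0, overlap: float = 0.5) -> np.ndarray:
    """Slice a waveform into fixed-length frames with the given overlap."""
    frame_len = int(sr * frame_ms / 1000)     # samples per frame (20 ms)
    hop = int(frame_len * (1.0 - overlap))    # 50% overlap -> half-frame hop
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def mean_functional(per_frame_feature: np.ndarray) -> float:
    """Aggregate a frame-level feature into one utterance-level statistic."""
    return float(np.mean(per_frame_feature))

sr = 16000
x = np.random.randn(sr)                       # one second of noise as a stand-in
frames = frame_signal(x, sr)                  # 20 ms frames, 50% overlap
rms = np.sqrt(np.mean(frames ** 2, axis=1))   # a simple per-frame energy feature
utterance_level = mean_functional(rms)        # one value per feature, per utterance
```

Applying a mean functional per feature yields a fixed-length vector per sentence regardless of its duration, which is what allows an 88-dimensional representation of every BDAS sentence.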
Acoustic analysis
The accuracy of the baseline eGeMAPS feature depression classification was 65% using all sentences, with F1 scores of 0.68 and 0.63 for the depressed and non-depressed classes, respectively. The accuracies and F1 scores for individual sentences are shown in Table 2.
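Accuracy and per-class F1 of the kind reported above can be computed as follows; the toy labels are illustrative only, not the study's predictions:

```python
def per_class_f1(y_true, y_pred, positive) -> float:
    """F1 score with one class treated as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (D = depressed, N = non-depressed); values are illustrative only.
y_true = ["D", "D", "D", "N", "N", "N"]
y_pred = ["D", "D", "N", "N", "N", "D"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_dep = per_class_f1(y_true, y_pred, "D")
f1_non = per_class_f1(y_true, y_pred, "N")
```

Reporting F1 per class, as above, exposes asymmetric performance between the depressed and non-depressed classes that a single accuracy figure would hide.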
Generally, single sentence depression classification performance for eGeMAPS features was relatively low when compared with the all-sentence eGeMAPS baseline (i.e. this accuracy decrease was attributed to the reduction in training
Conclusion
In this study, a read sentence protocol was explored as an evaluation method for automatic speech-based depression classification. In comparison to spontaneous speech, text-dependent speech has advantages because the read linguistic and affective content can be explicitly designed to observe behaviors in a repeatable, controlled manner. Further, in a clinical context, text-dependent speech does not rely on the individual interviewer's expertise, bias, and skill level, which has been previously
CRediT authorship contribution statement
Brian Stasak: Formal analysis. Julien Epps: Formal analysis. Roland Goecke: Formal analysis.
Declaration of Competing Interest
None.
Acknowledgements
The work of Brian Stasak and Julien Epps was partly supported by ARC Discovery Project DP130101094 led by Roland Goecke and partly supported by ARC Linkage Project LP160101360, Data61-CSIRO. The Black Dog Institute (Sydney, Australia) provided the clinical depression speaker database.
References (87)
- et al., Reflections of depression in acoustic measures of the patient's speech, J. Affect. Disord. (2001)
- et al., Voice acoustical measurement of severity of major depression, Brain Cogn. (2004)
- et al., A review of depression and suicide risk assessment using speech analysis, Speech Commun. (2015)
- et al., Speech and voice parameters of depression: a pilot study, J. Commun. Disord. (1984)
- et al., Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression, J. Psych. (1993)
- et al., Qualitative analysis of verbal fluency in depression, Psych. Res. (2003)
- et al., Deficient inhibition of emotion information in depression, J. Affect. Disord. (2006)
- et al., Comparison of prosodic properties between read and spontaneous speech material, Speech Commun. (1991)
- et al., Investigation of different speech types and emotions for detecting depression using different classifiers, Speech Commun. (2017)
- et al., Multilingual processing of speech via web services, Comput. Speech Lang. (2017)
- Depression and the interpretation of ambiguity, Behav. Res. Ther.
- Updating emotional content in recovering depressed individuals: evaluating deficits in emotion processing following a depressive episode, J. Behav. Ther. Exp. Psych.
- Vocal acoustic biomarkers of depression severity and treatment response, Biol. Psych.
- The 16-item quick inventory of depressive symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression, Biol. Psych.
- The speech analysis approach to determining onset of improvement under antidepressants, Eur. Neuropsychopharmacol.
- Analysis of paired dichotomous data: a gentle introduction to the McNemar test in SPSS, J. MultiDiscip. Eval.
- Detecting depression: a comparison between spontaneous and read speech
- From joyous to clinically depressed: mood detection using spontaneous speech
- Characterising depressed speech for classification
- Multimodal Analysis of Verbal and Nonverbal Behavior on the Example of Clinical Depression
- Information structure: linguistic, cognitive, and processing approaches, Wiley Interdiscip. Rev. Cogn. Sci.
- Affect-induced changes in speech production, Exp. Brain Res.
- Stories are to entertain: a structural-affect theory of stories, J. Pragmat.
- Speech patterning of natural discourse of well and depressed mothers and their young children, Child Dev.
- Emotional memory for words: separating content and context, Cognit. Emot.
- Speech and psychopathology, Lang. Speech
- The timing of accentual phrases in read and spontaneous speech: data from Acadian French, J. Can. Acoust. Assoc.
- Reading errors as clues to the nature of reading
- Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social order analysis, Behav. Res. Meth.
- An investigation of depressed speech detection: features and normalization
- Prosody and language comprehension, WIREs Cogn. Sci.
- COVAREP – a collaborative voice analysis repository for speech technologies
- Voice activity detection: merging source and filter-based information, IEEE Signal Process. Lett.
- Smart Language: Readers, Readability, and the Grading of Text
- Psychogenic speech disorders in people with suspected neurologic disease: diagnosis and management
- Vocal indicators of mood change in depression, J. Nonverbal Behav.
- On the significance of speech pauses in depressive disorders: results on read and spontaneous narratives
- The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing, IEEE Trans. Affect. Comp.
- Recent developments in openSMILE, the Munich open-source multimedia feature extractor
- Malapropisms and the structure of the mental lexicon, Linguist. Inq.
- Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Trans. Biomed. Eng.
- Psycholinguistics
- The significance of changes in the rate of articulation, Lang. Speech