Speech Communication

Volume 115, December 2019, Pages 1-14

Automatic depression classification based on affective read sentences: Opportunities for text-dependent analysis

https://doi.org/10.1016/j.specom.2019.10.003

Highlights

  • During read-aloud tasks, speech disfluency analysis indicates that speakers with depression exhibit more hesitations and speech errors than a non-depressed population.

  • This study shows that speech samples spanning the full affective valence range (e.g. negative, neutral, positive) collectively contribute to the speech-based analysis of depression.

  • By fusing multi-valence features from specific sentence groups, significant improvements in automatic depression classification were recorded relative to the affect-agnostic feature baseline.

Abstract

In the future, automatic speech-based analysis of mental health could become widely available to help augment conventional healthcare evaluation methods. For speech-based patient evaluations of this kind, protocol design is a key consideration. Read speech provides an advantage over other verbal modes (e.g. automatic, spontaneous) by providing a clinically stable and repeatable protocol. Further, text-dependent speech helps to reduce phonetic variability and delivers controllable linguistic/affective stimuli, thereby allowing more precise analysis of deviations from the recorded stimuli. The purpose of this study is to investigate speech disfluency behaviors in non-depressed/depressed speakers using read-aloud text containing constrained affective-linguistic criteria. Herein, using the Black Dog Institute Affective Sentences (BDAS) corpus, analysis demonstrates statistically significant feature differences in speech disfluencies: compared with non-depressed speakers, depressed speakers show relatively higher recorded frequencies of hesitations (a 55% increase) and speech errors (a 71% increase). Our study examines both manually and automatically labeled speech disfluency features, demonstrating that detailed disfluency analysis leads to considerable gains in depression classification accuracy, reaching up to 100% absolute accuracy, especially with affective considerations, when compared with the affect-agnostic acoustic baseline (65%).

Introduction

During mental health evaluations, it is standard practice for a clinician to evaluate a patient's spoken language behavior. Individuals suffering from depressive disorders often exhibit psychogenic voice disturbances that adversely change their autonomic system and personality (Perepa, 2017). Recorded exemplars of speech-language disruptions in clinically depressed patients include disfluent speech patterns, abandonment of phrases, and unusually long response latencies (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985). Patients with clinical depression also exhibit a greater number of speech hesitations (i.e. pauses, repeats, false starts) than non-depressed populations during communication, due to psychomotor agitation/retardation and cognitive processing delays (Alpert et al., 2001; Cannizzaro et al., 2004; Darby et al., 1984; Duffy, 2008; Ellgring & Scherer, 1996; Fossati et al., 2003; Hartlage et al., 1993; Nilsonne, 1987; Nilsonne et al., 1988; Szabadi et al., 1976).

Speech-based depression studies (Alghowinem et al., 2012; Alpert et al., 2001; Esposito et al., 2016; Mundt, 2012; Nilsonne et al., 1988; Stassen et al., 1998; Szabadi et al., 1976) have evaluated pause durations and frequency ratios (e.g. filled-pause rate, empty-pause rate) with varied success. For example, Alghowinem et al. (2012) and Esposito et al. (2016) observed that average spontaneous speech pause durations were significantly longer in depressed speakers than in non-depressed speakers. Because spontaneous speech is unconstrained in length and content, pause and rate ratios (i.e. the total number of pauses divided by the total number of words; the total pause time divided by the total recording time) have often been used to help compare utterances of different lengths (Liu et al., 2017).
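To make the two ratios concrete, the sketch below computes both quantities from a list of detected pause intervals; the function name and inputs are illustrative conveniences, not drawn from any of the cited studies.

```python
# Illustrative helper for the two ratios described above (hypothetical
# names/inputs, not taken from the cited papers).
def pause_ratios(pause_intervals, num_words, total_duration_s):
    """pause_intervals: list of (start_s, end_s) tuples marking detected pauses."""
    num_pauses = len(pause_intervals)
    total_pause_time = sum(end - start for start, end in pause_intervals)
    pause_per_word = num_pauses / num_words                  # pauses / words
    pause_time_ratio = total_pause_time / total_duration_s   # pause time / recording time
    return pause_per_word, pause_time_ratio

# Example: 4 pauses totalling 2.1 s in a 12-word, 9.5 s utterance
print(pause_ratios([(0.8, 1.3), (2.9, 3.6), (5.0, 5.5), (7.2, 7.6)], 12, 9.5))
# -> (0.333..., 0.221...)
```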

Concerning speech errors, Rubino et al. (2011) discovered that depressed speakers exhibited significantly greater numbers of referential failures (i.e. including word replacement errors, such as malapropisms) than non-depressed speakers during spontaneous tasks. A malapropism is the unintended substitution of one word for an intended word; by definition, it is unrelated in meaning but shares a similar pronunciation, grammatical category, word stress, and syllable length (Fay & Cutler, 1977), as when a speaker produces "flamingo" in place of the intended "flamenco". Since Rubino et al. (2011), no automatic speech-based depression studies have pursued speech errors as prospective discriminative depression features, surprisingly not even for predetermined read tasks.

Diagnostic applications based on automatic speech-based depression classification are often reliant on the uniformity of the patient elicitation procedure. Each elicitation method can include a wide range of linguistic structure, affect information, and task requirements (Howe et al., 2014; Stasak et al., 2017, 2018b). To date, there is no clinically approved automatic speech-based depression diagnosis protocol in widespread use, nor is there consensus on elicitation methods among researchers. Many exploratory speech-based depression techniques have been systematically investigated using combinations of speech-related features (i.e. acoustic, text-based), machine learning techniques, and speech elicitation protocols (see e.g. Cummins et al., 2015; Jiang et al., 2017; Liu et al., 2017; Stasak et al., 2017; Valstar et al., 2016). However, to date, no specific system for aiding depression diagnosis has emerged as dominant over the others.

The aim of the experiments herein is to investigate speech disfluency behaviors in non-depressed/depressed speakers using read text containing specific affective-linguistic criteria. We hypothesize that, unlike spontaneous speech, sentences containing target words within specific valence ranges will provide more accurate ground truth for disfluency analysis as a result of their phonetic and affective constraints, further affording directly comparable speech data between different speakers. While natural speech disfluencies are common in spontaneous speech (Johnson et al., 2004; Gósy, 2003), due to the abnormal effects of depression on cognitive-motor skills (Breznitz & Sherman, 1987; Greden & Carroll, 1980; Hoffman et al., 1985), we hypothesize that during simple sentence-reading tasks depressed speakers will demonstrate greater numbers of abnormal pauses and speech errors than non-depressed speakers.

According to structural affect theory (Brewer & Lichtenstein, 1982) and information structure studies (Arnold et al., 2013; Dahan, 2015), the emotions of a reader can be systematically manipulated by the order in which information is presented in a text. Based on these studies, it is hypothesized that positioning the affective target word at the beginning of a sentence, rather than in the middle or at the end, will give a speaker earlier cognitive cues for mood processing and appropriate prosodic phrasing. We suspect that healthy speakers will utilize sentence-initial affective keyword information differently from depressed speakers, who may exhibit less prosodic range or appropriateness (Cummins et al., 2015).

With regard to the reading of texts, Salem et al. (2017) found that direct discourse (e.g. first-person) narratives elicited a stronger feeling of taking on the character's perspective than indirect discourse (e.g. third-person) narratives. Therefore, it is anticipated that features extracted from first-person read narrative sentences will generate better classification results than third-person narrative sentences due to greater emotional attachment. We hypothesize that depressed speakers will exhibit less dynamic emotional vocal range behaviors than healthy speakers on account of their negative fixation (Goeleven et al., 2006; Gotlib & McCann, 1984) and/or reduced vocal control due to psychomotor agitation/retardation (Flint et al., 1993; Hoffman et al., 1985). Given the passive avoidance strategies exhibited by people with depression (Holahan & Moos, 1987; Holahan et al., 2005), it is also anticipated that depressed speakers will attempt fewer self-corrections after speech errors than non-depressed speakers.

Due to the constraints of text-dependent stimuli, both manually annotated and automatic speech recognition (ASR) derived disfluency attributes are evaluated. It is anticipated that text-dependent constraints will improve the precision of the ASR output, since the acoustic-phonetic variability in the elicited speech is smaller than that of, for example, spontaneous speech.
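As a concrete illustration of how text-dependent constraints simplify automatic disfluency labeling, the sketch below aligns an ASR word sequence against the known prompt text and flags insertions (candidate hesitations/repeats), substitutions (candidate speech errors), and deletions (omissions). This is our own minimal sketch using a generic sequence aligner, not the authors' pipeline, and the example sentence is invented.

```python
# Minimal sketch (not the authors' pipeline): align an ASR hypothesis
# against the known prompt of a read sentence and flag disfluency events.
from difflib import SequenceMatcher

def flag_disfluencies(prompt: str, asr_hypothesis: str):
    ref = prompt.lower().split()
    hyp = asr_hypothesis.lower().split()
    events = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "insert":       # extra words spoken: hesitations, fillers, repeats
            events.append(("hesitation/repeat", hyp[j1:j2]))
        elif op == "replace":    # one word swapped for another: speech error
            events.append(("speech error", ref[i1:i2], hyp[j1:j2]))
        elif op == "delete":     # prompt words skipped entirely: omission
            events.append(("omission", ref[i1:i2]))
    return events

print(flag_disfluencies("the calm lake reflected the sky",
                        "the the calm lake reflected a sky"))
# [('hesitation/repeat', ['the']), ('speech error', ['the'], ['a'])]
```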

Section snippets

Database

For all experiments herein, the Black Dog Institute Affective Sentences (BDAS) corpus, an extension of the data collected in Alghowinem et al. (2012, 2013a, 2013b, 2015), Cummins et al. (2011), and Joshi et al. (2013a, 2013b), was used on account of its clinically validated depression diagnoses. Furthermore, the BDAS corpus had a controlled speech elicitation mode, which comprised read sentences with deliberately designed affective target words (see Section 3.6, Table 1). The speakers …

Acoustic feature extraction

For the experiments herein, the openSMILE speech toolkit was used to extract 88 eGeMAPS (Eyben et al., 2015) acoustic speech features (i.e. derived from fundamental frequency, loudness, formants, mel-cepstral coefficients) from all 20 sentences in the BDAS corpus. The eGeMAPS features were calculated by extracting features from 20 ms frames with 50% frame overlap, wherein an aggregated mean functional was computed per eGeMAPS feature. The eGeMAPS feature set was chosen because it has been used …
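As a rough sketch of this extraction step, audEERING's opensmile Python wrapper exposes the eGeMAPS functionals directly. This is a convenience layer over openSMILE and not necessarily the exact toolchain or configuration used in the paper (internal frame/functional settings may differ from the 20 ms, 50%-overlap mean functionals described above); the filename is a placeholder.

```python
# Sketch of eGeMAPS functional extraction via the opensmile Python wrapper.
# 'sentence_01.wav' is a placeholder file, not from the BDAS corpus.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 acoustic features
    feature_level=opensmile.FeatureLevel.Functionals,  # one vector per file
)
features = smile.process_file("sentence_01.wav")  # pandas DataFrame, 1 x 88
print(features.shape)
```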

Acoustic analysis

Using all sentences, the baseline eGeMAPS depression classification accuracy was 65%, with F1 scores of 0.68 and 0.63 for the depressed and non-depressed classes, respectively. The accuracies and F1 scores for individual sentences are shown in Table 2.

Generally, single-sentence depression classification performance for eGeMAPS features was relatively low when compared with the all-sentence eGeMAPS baseline (i.e. this accuracy decrease was attributed to the reduction in training …
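The snippet above does not name the classifier or validation protocol. Purely as an illustration of how such an acoustic baseline is typically evaluated, the sketch below runs a linear SVM over eGeMAPS-sized feature vectors with leave-one-speaker-out cross-validation on synthetic data; the classifier choice, speaker grouping, and all variable names are our assumptions, not details from the paper.

```python
# Hypothetical baseline evaluation: linear SVM + leave-one-speaker-out CV.
# Data is synthetic; classifier and protocol are illustrative assumptions.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 88))           # 40 recordings x 88 eGeMAPS features
y = rng.integers(0, 2, size=40)         # 0 = non-depressed, 1 = depressed
speakers = np.repeat(np.arange(20), 2)  # two recordings per speaker

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pred = cross_val_predict(clf, X, y, cv=LeaveOneGroupOut(), groups=speakers)
print("accuracy:", accuracy_score(y, pred))
print("F1 (depressed):", f1_score(y, pred, pos_label=1))
print("F1 (non-depressed):", f1_score(y, pred, pos_label=0))
```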

Conclusion

In this study, a read sentence protocol was explored as an evaluation method for automatic speech-based depression classification. In comparison to spontaneous speech, text-dependent speech has advantages because the read linguistic and affective content can be explicitly designed to observe behaviors in a repeatable, controlled manner. Further, in a clinical context, text-dependent speech does not rely on the individual interviewer's expertise, bias, and skill level, which has been previously …

CRediT authorship contribution statement

Brian Stasak: Formal analysis. Julien Epps: Formal analysis. Roland Goecke: Formal analysis.

Declaration of Competing Interest

None.

Acknowledgements

The work of Brian Stasak and Julien Epps was partly supported by ARC Discovery Project DP130101094 led by Roland Goecke and partly supported by ARC Linkage Project LP160101360, Data61-CSIRO. The Black Dog Institute (Sydney, Australia) provided the clinical depression speaker database.

References (87)

  • C. Lawson et al.

    Depression and the interpretation of ambiguity

    Behav. Res. Ther.

    (1999)
  • S.M. Levens et al.

    Updating emotional content in recovering depressed individuals: evaluating deficits in emotion processing following a depressive episode

    J. Behav. Ther. Exp. Psych.

    (2015)
  • J.C. Mundt et al.

    Vocal acoustic biomarkers of depression severity and treatment response

    Biol. Psych.

    (2012)
  • A.J. Rush et al.

    The 16-item quick inventory of depressive symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): a psychometric evaluation in patients with chronic major depression

    Biol. Psych.

    (2003)
  • H.H. Stassen et al.

    The speech analysis approach to determining onset of improvement under antidepressants

    Eur. Neuropsychopharmacol.

    (1998)
  • O.A. Adedokun et al.

    Analysis of paired dichotomous data: a gentle introduction to the McNemar test in SPSS

    J. MultiDiscip. Eval.

    (2012)
  • S. Alghowinem et al.

    Detecting depression: a comparison between spontaneous and read speech

  • S. Alghowinem et al.

    From joyous to clinically depressed: mood detection using spontaneous speech

  • S. Alghowinem et al.

    Characterising depressed speech for classification

  • S. Alghowinem

Multimodal Analysis of Verbal and Nonverbal Behavior on the Example of Clinical Depression

    (2015)
  • J.E. Arnold et al.

    Information structure: linguistic, cognitive, and processing approaches

    Wiley Interdiscip. Rev. Cogn. Sci.

    (2013)
  • J. Barrett et al.

    Affect-induced changes in speech production

    Exp. Brain Res.

    (2002)
  • W.F. Brewer et al.

    Stories are to entertain: a structural-affect theory of stories

    J. Pragmat.

    (1982)
  • Z. Breznitz et al.

    Speech patterning of natural discourse of well and depressed mothers and their young children

    Child Dev.

    (1987)
  • B. Brierley et al.

    Emotional memory for words: separating content and context

    Cognit. Emot.

    (2007)
  • C. Chevrie-Muller et al.

    Speech and psychopathology

    Lang. Speech

    (1985)
  • W. Cichocki

    The timing of accentual phrases in read and spontaneous speech: data from Acadian French

    J. Can. Acoust. Assoc.

    (2015)
  • R. Cowie

    Reading errors as clues to the nature of reading

  • S.A. Crossley et al.

    Sentiment analysis and social cognition engine (SEANCE): an automatic tool for sentiment, social cognition, and social order analysis

    Behav. Res. Meth.

    (2017)
  • N. Cummins et al.

    An investigation of depressed speech detection: features and normalization

  • D. Dahan

    Prosody and language comprehension

    WIREs Cogn. Sci.

    (2015)
  • G. Degottex et al.

    COVAREP – A collaborative voice analysis repository for speech technologies

  • T. Drugman et al.

    Voice activity detection: merging source and filter-based information

    IEEE Signal Process. Lett.

    (2016)
  • W.H. DuBay

    Smart Language: Readers, Readability, and the Grading of Text

    (2006)
  • J. Duffy

    Psychogenic speech disorders in people with suspected neurologic disease: diagnosis and management

  • H. Ellgring et al.

    Vocal indicators of mood change in depression

    J. Nonverbal Behav.

    (1996)
  • A. Esposito et al.

    On the significance of speech pauses in depressive disorders: results on read and spontaneous narratives

  • F. Eyben et al.

    The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing

    IEEE Trans. Affect. Comp.

    (2015)
  • F. Eyben et al.

Recent developments in openSMILE, the Munich open-source multimedia feature extractor

  • D. Fay et al.

    Malapropisms and the structure of the mental lexicon

    Linguist. Inq.

    (1977)
  • D.J. France et al.

    Acoustical properties of speech as indicators of depression and suicidal risk

    IEEE Trans. Biomed. Eng.

    (2000)
  • M. Garman

    Psycholinguistics

    (1990)
  • F. Goldman-Eisler

    The significance of changes in the rate of articulation

    Lang. Speech

    (1961)