1 Introduction

Most of the acoustic differences identified between depressed and healthy speech have been attributed to changes in F0 values and F0-related measures, formant frequencies, power spectral densities, Mel Frequency Cepstral Coefficients (MFCC), speech rate, and glottal parameters such as jitter and shimmer [1, 4, 20, 21, 22, 23, 24]. Recently, it has been observed that depressed speech exhibits a “slow” auditory dimension [19] and is perceived as sluggish with respect to the corresponding healthy samples. Esposito et al. [4] reported that this effect can be primarily attributed to lengthened empty pauses (no significant differences were found for filled pause durations and consonant/vowel lengthening) and shortened phonation time (fewer and shorter clauses), whose distribution in a dialogue differs significantly between depressed and healthy subjects. Such measures can be used to develop algorithms for automatic diagnostic tools capable of detecting different degrees of depressive states and, more generally, for embedding socially believable and contextual information in ICT interfaces [5]. For this reason, our goal is to extend the earlier work of Esposito et al. [4] and propose a way to automate the detection of such features and measures so that depressed speech can be distinguished from healthy speech. The proposed method is based on the MFCC speech processing algorithm [16, 22, 26] and on the Self-Organizing Map (SOM) [9, 27, 28] clustering method, adopted for the accuracy analysis. This approach is motivated by the fact that MFCC encoding captures subtle variations in speech [14], structuring the data according to their properties, while the SOM algorithm is able to identify such structures by clustering similar items together. The authors have previously tested this algorithm on seismic signals, obtaining valuable performance [8, 9].

The paper is structured as follows. The database used is described in Sect. 29.2; detailed descriptions of the dataset processing using MFCC and PCA are presented in Sects. 29.3 and 29.4 respectively. Section 29.5 reports our clustering results, followed by conclusions in Sect. 29.6.

2 Database

Read and spontaneous speech narratives, collected from healthy and depressed Italian participants, were used for the proposed research. For the read narratives, participants were asked to read the tale “The North Wind and the Sun”, a standard phonetically balanced short folk tale composed of approximately six sentences. For the spontaneous narratives, participants were asked to describe the daily activities they had performed during the past week. The depressed patients were recruited with the help of psychiatrists in the Department of Mental Health (Dipartimento di Salute Mentale) at the Caserta (Italy) general hospital, the Institute for Mental Health (Istituto di Igiene Mentale) at the Santa Maria Capua Vetere (Italy) general hospital, the Centre for Psychological Listening (Centro di Ascolto Psicologico) in Aversa (Italy), and a private psychiatric office. Consent forms were collected from all participants, who were then administered the Beck Depression Inventory, Second Edition (BDI-II) [4], adapted for the Italian language by Ghisi et al. [11]. BDI scores were calculated for both depressed and healthy subjects. A total of 24 sets of recordings were collected, 12 from healthy (typical) speakers and 12 from depressed patients. Each set contains two types of recordings, i.e. read and spontaneous narratives. The duration of each set ranges approximately from 150 to 300 s. Therefore, 150 s from every set were selected and divided into 15 speech waveforms of 10 s each. In this selection, the first 130 s belong to the spontaneous speech and the last 20 s to the read-narrative speech.

The recordings were made using a clip-on microphone (Audio-Technica ATR3350) connected to an external USB sound card. Speech was sampled at 16 kHz and quantized at 16 bits. For each subject, the recording procedure did not last more than 15 min. The demographic description of each subject involved in the experiment is reported in Esposito et al. [4].

2.1 Analysis of the Database

BDI-II scores in the range 0–12 correspond to control subjects. Table 29.1 illustrates the BDI-II score distributions of the depressed subjects, which are significantly higher than those of the matched control group, ranging from mild/moderate to severe degrees of depression. A Student's t-test for independent samples (one- and two-tailed hypothesis testing; a minimal sketch of such a test is given after the list below) suggested that the factors contributing to the discrimination between healthy and depressed speech [4] are:

  • The total duration of speech pauses (empty, filled and vowel lengthening taken all together) is significantly longer for depressed subjects compared to healthy ones

  • The total duration of empty pauses is significantly longer for depressed subjects compared to healthy ones

  • The clause duration is significantly shorter for depressed subjects compared to healthy ones

  • The clause frequency is significantly lower for depressed subjects compared to healthy ones
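As an illustration of the statistical comparison reported above, the following is a minimal sketch of an independent-samples Student's t-test in Python. It is not the authors' original analysis: the arrays, variable names and values are hypothetical placeholders for per-subject pause-duration totals.

```python
# Minimal sketch of an independent-samples Student's t-test (one- and two-tailed),
# assuming hypothetical per-subject totals of empty-pause duration in seconds.
import numpy as np
from scipy import stats

depressed_pause_totals = np.array([48.2, 51.7, 44.9, 55.3, 49.8, 53.1])  # hypothetical values
healthy_pause_totals = np.array([32.4, 29.8, 35.1, 31.0, 28.6, 33.7])    # hypothetical values

# Two-tailed test for independent samples (classic Student's t-test, equal variances assumed)
t_stat, p_two_tailed = stats.ttest_ind(depressed_pause_totals, healthy_pause_totals)

# One-tailed p-value for the directional hypothesis "depressed pauses are longer"
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

print(f"t = {t_stat:.3f}, two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
```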

Table 29.1 BDI-II score distributions of depressed patients with respect to age groups

3 Mel Frequency Cepstral Coefficients (MFCC)

Based on the above information, it was decided to process the speech data through an MFCC pre-processing algorithm, since this kind of pre-processing has been shown to be among the most accurate for extracting perceptual features from speech [14, 22, 26]. The length of the recordings obtained from the participants ranged from approximately 200 to 360 s. In order to account for the same amount of data for each participant, only the first 150 s of speech were selected from each recording. Each 150 s speech wave was divided into 15 segments of 10 s each, giving a total of T = 360 speech signals ((12 depressed + 12 healthy subjects) × 15 segments). Mel Frequency Cepstral Coefficients (MFCC) are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a Mel-frequency scale. The MFCC algorithm is based on human auditory perception, which resolves fine spectral details at frequencies below 1000 Hz, whereas higher frequency ranges are grouped more coarsely. In other words, MFCC processing is based on the known variation of the human ear's critical bandwidth and is generally used to obtain a parametric perceptual representation of acoustic signals [26, 30]. The algorithm exploits filters that are spaced linearly below 1000 Hz and logarithmically above it [22]. The Mel frequency scale represents the subjective pitch of pure tones, capturing significant characteristics of speech perception: for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on the so-called Mel scale, which produces a linear frequency spacing below 1 kHz and a logarithmic spacing above. The extraction of the MFCC coefficients follows the steps illustrated in Fig. 29.1. For the sake of clarity, these steps are briefly described in the following.

Fig. 29.1
figure 1

MFCC processing

3.1 Pre-processing and FFT

The speech signal is passed through a first-order filter that increases the energy of high frequencies, according to the equation S′(α) = S(α) − A × S(α − 1), where S′(α) is the output of the filter applied to the speech signal S(α), A = 0.97 is the pre-emphasis coefficient, and α is the sample index. The pre-emphasised speech signal is segmented into small frames of length TW, with a shift of TS (both in ms). In our case, adjacent frames overlap by M = 640 samples (M < N) [16, 22]. A Hamming window [10] of 100 ms was applied according to Eqs. (29.1) and (29.2):

$$ S_{w}'(\alpha) = S'(\alpha) \times W(\alpha) $$
(29.1)
$$ W(\alpha) = 0.54 - 0.46\cos\left(\frac{2\pi\alpha}{N - 1}\right), \quad 0 \le \alpha \le N - 1 $$
(29.2)

where N is the number of samples in each window, \( S_{w}'(\alpha) \) is the output, and \( S'(\alpha) \) the input of the windowing process. The windowing was applied to each of the T = 360 speech segments of 10 s. The windowed signal is then Fast Fourier Transformed (FFT) [22].
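The following is a minimal NumPy sketch of this pre-processing chain (pre-emphasis, framing, Hamming windowing as in Eqs. (29.1)–(29.2), and FFT). It is an illustration rather than the authors' implementation; the 100 ms window is taken from the text, while the 60 ms shift is an assumption chosen so that adjacent 1600-sample frames overlap by the M = 640 samples mentioned above.

```python
# Minimal sketch of the pre-processing and FFT stage described above (illustrative, not the
# authors' exact implementation). The frame shift is an assumption consistent with M = 640.
import numpy as np

def preprocess_frames(signal, fs=16000, frame_len_ms=100, frame_shift_ms=60, A=0.97):
    # First-order pre-emphasis: S'(a) = S(a) - A * S(a - 1)
    emphasized = np.append(signal[0], signal[1:] - A * signal[:-1])

    N = int(fs * frame_len_ms / 1000)        # samples per window (1600 at 16 kHz)
    shift = int(fs * frame_shift_ms / 1000)  # frame shift (960 samples -> 640-sample overlap)

    # Split into overlapping frames (assumes the signal is at least one frame long)
    num_frames = 1 + (len(emphasized) - N) // shift
    frames = np.stack([emphasized[i * shift:i * shift + N] for i in range(num_frames)])

    # Hamming window, Eq. (29.2): W(a) = 0.54 - 0.46 * cos(2*pi*a / (N - 1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    windowed = frames * window               # Eq. (29.1)

    # Magnitude spectrum of each windowed frame
    return np.abs(np.fft.rfft(windowed, axis=1))
```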

3.2 Mel Frequency Warping and DCT

A set of triangular filters is used to compute a weighted sum of the FFT spectral components, so that the output approximates the spectrum on a Mel-frequency scale (Eq. (29.3))

$$ F(Mel) = 2595\,\log_{10}\left(1 + \frac{f}{700}\right) $$
(29.3)

The amplitude of a given filter over the Mel scale is represented as mj, where 1 ≤ j ≤ NC and NC is the number of filterbank channels (30 in our case). The cepstral parameters (cτ) are calculated from the filterbank amplitudes mj using Eq. (29.4):

$$ c_{\tau } = \sqrt {\left( {\frac{2}{{N_{C} }}} \right)} \sum\limits_{j = 1}^{{N_{C} }} {m_{j} \cos \left( {\left( {\frac{\pi \tau }{{N_{C} }}} \right)\left( {j - 0.5} \right)} \right)} $$
(29.4)

where τ is the index of the cepstral coefficients, 1 ≤ τ ≤ x, and x is the number of cepstral coefficients (5 in this case). Finally, the MFCCs are calculated using the discrete cosine transform (DCT) and a cepstral liftering routine based on Linear Prediction Analysis [30]. Through a trial-and-error process, it was observed that there was no significant difference in the classification accuracy of a SOM trained over a dataset of 12 MFCCs versus one of 5 MFCCs (for each 10 s of speech), while with fewer than 5 MFCCs the SOM classification accuracy decreased.
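As an illustration of the Mel warping of Eq. (29.3) and the cepstral computation of Eq. (29.4), the following sketch builds 30 triangular Mel filters and applies the cosine transform to the log filterbank outputs. It omits the liftering and Linear Prediction steps used by the authors, and the parameter names are our own.

```python
# Minimal sketch of Mel filterbank warping (Eq. 29.3) and cepstral computation (Eq. 29.4).
# Liftering and Linear Prediction Analysis are omitted; this is not the authors' exact code.
import numpy as np

def hz_to_mel(f):
    # Eq. (29.3): F(Mel) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power_frames, fs=16000, n_filters=30, n_ceps=5):
    n_fft = 2 * (power_frames.shape[1] - 1)                # FFT size implied by the spectrum width

    # Filter centre frequencies equally spaced on the Mel scale between 0 Hz and fs/2
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

    # Triangular filters, spread linearly below 1 kHz and logarithmically above (via the Mel scale)
    fbank = np.zeros((n_filters, power_frames.shape[1]))
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[j - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    # Log filterbank amplitudes m_j ("real logarithm of the short-term energy spectrum")
    m = np.log(power_frames @ fbank.T + 1e-10)             # shape (frames, n_filters)

    # Eq. (29.4): c_tau = sqrt(2/N_C) * sum_j m_j * cos(pi * tau / N_C * (j - 0.5))
    tau = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(1, n_filters + 1)[None, :]
    dct_basis = np.sqrt(2.0 / n_filters) * np.cos(np.pi * tau / n_filters * (j - 0.5))
    return m @ dct_basis.T                                 # shape (frames, n_ceps)
```

Under these assumptions, the two sketches compose as `mfcc = mfcc_from_power_spectrum(preprocess_frames(wave) ** 2)`.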

Figures 29.2 and 29.3 report the MFCC processing of a 10 s speech wave for a depressed (Fig. 29.2) and a healthy (Fig. 29.3) subject, respectively. The figures are intended to show that such processing is able to capture the frequency and duration of clauses and empty pauses. Indeed, it can be seen that empty pauses are clearly more frequent in the depressed speech, producing a different MFCC encoding. In each figure, the topmost subplot is the original 10 s speech wave. The middle one displays the energy of the same speech after a 30-channel Mel-frequency processing: the x-axis represents time, the y-axis the Mel-frequency channel number, and the different colours indicate the amount of energy for a given sample at a given Mel-frequency filterbank. The bottommost subplot represents the MFCC encoding. When comparing the middle and bottom subplots of Figs. 29.2 and 29.3, it can clearly be seen that the energy of the depressed speech is lower than that of the typical speech over the given time frame.

Fig. 29.2
figure 2

MFCC processing on a 10 s speech sample of a depressed subject

Fig. 29.3
figure 3

MFCC processing on a 10 s speech sample of a typical person

3.3 Principal Component Analysis (PCA)

Principal Component Analysis is a common dimensionality reduction method applied for feature extraction in speech recognition [13]. PCA maps m-dimensional input data to an n-dimensional representation, where n ≤ m. The method assumes that the features that best describe the data lie along the directions in which the variation of the data is largest [12, 29]. Given F feature vectors, each of H cepstral coefficients, represented as \( x_{ij}, 1 \le i \le H, 1 \le j \le F \), the PCA processing is given by Eqs. (29.5) and (29.6):

$$ \nu_{ij} = x_{ij} - \overline{{x_{i} }} , 1 \le i \le H,1 \le j \le F $$
(29.5)
$$ \overline{{x_{i} }} = \frac{1}{F}\sum\nolimits_{j = 1}^{F} {x_{ij} } $$
(29.6)

where \( \nu_{ij} \) is the jth mean-centered data vector used for PCA and \( \overline{x_{i}} \) is the mean of the ith MFCC over the original dataset. Usually, PCA involves only a single covariance matrix. Here, however, we computed P covariance matrices, one for each 10 s of speech, as given by Eqs. (29.7) and (29.8).

$$ P = \frac{\text{Total sample inputs of PCA}}{\text{Cepstral coefficients per speech sample}} = \frac{H}{x} $$
(29.7)
$$ Cov_{i} = \frac{1}{F}\sum\limits_{j = 1}^{F} \nu_{ij}\nu_{ij}^{T}, \quad 1 \le i \le P $$
(29.8)

The principal components are obtained by solving the equation:

$$ Cov_{i}\, y_{i} = \lambda_{i}\, y_{i}, \quad 1 \le i \le P $$
(29.9)

where \( \lambda_{i} \ge 0 \) is an eigenvalue and \( y_{i} \) the corresponding eigenvector of \( Cov_{i} \). The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the K largest eigenvalues (K ≤ P). The resulting eigenvectors are stored in the matrix YK = [y1 y2 … yK], where y1, …, yK are the eigenvectors and \( \lambda_{1}, \ldots, \lambda_{K} \) the eigenvalues of the covariance matrix Covr (r ∈ [1, K]). The reduced PCA transformation is obtained through Eq. (29.10)

$$ z_{r} = Y_{K}^{T}\nu_{rj}, \quad 1 \le r \le K,\; 1 \le j \le F $$
(29.10)

where zr denotes the transformed vector.
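For illustration, the following is a minimal sketch of the PCA steps of Eqs. (29.5)–(29.10) in the standard single-covariance case; it does not reproduce the per-10 s blocking into P covariance matrices described above, and the function and variable names are our own.

```python
# Minimal sketch of Eqs. (29.5)-(29.10) in the single-covariance case (illustrative only).
import numpy as np

def pca_reduce(X, K):
    """X: array of shape (H, F), with H cepstral-coefficient rows and F feature-vector columns.
    Returns the K-dimensional projection Z, the basis Y_K and the kept eigenvalues."""
    # Eqs. (29.5)-(29.6): centre each row on its mean
    x_bar = X.mean(axis=1, keepdims=True)
    V = X - x_bar

    # Eq. (29.8): covariance of the centred data
    Cov = (V @ V.T) / X.shape[1]

    # Eq. (29.9): eigendecomposition; keep the eigenvectors of the K largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]
    Y_K = eigvecs[:, order]                       # columns y_1 ... y_K

    # Eq. (29.10): project the centred data onto the reduced basis
    Z = Y_K.T @ V                                 # shape (K, F)
    return Z, Y_K, eigvals[order]
```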

The first subplot of Fig. 29.4 shows the distribution of the MFCC coefficients before applying the PCA algorithm; the data points have their largest variation along the x-axis. The second subplot shows the reduced dataset, with data points related to the corresponding mean values of the original MFCC coefficients. The dataset was reduced from P = 360 vectors (each of 248 features) to K = 75 vectors. Both the depressed and the healthy data, represented as principal MFCC coefficients, are plotted together. As mentioned above, the features discriminating between depressed and typical speech are the total duration of speech pauses (empty, filled and vowel lengthening taken all together), the total clause duration, and the clause frequency. These features were not used directly for our clustering with SOM, since our speech samples were processed through the MFCC algorithm. However, it is possible that the MFCC coefficients encode these parameters. Figures 29.2 and 29.3 support this hypothesis, since the MFCC coefficients extracted from depressed speech (Fig. 29.2) display lower energy than those extracted from healthy speech waves (Fig. 29.3), indirectly suggesting more silences and longer empty pauses.

Fig. 29.4
figure 4

Plots of the obtained (top) and PCA reduced MFCC coefficients (bottom)

4 Self Organizing Map (SOM) Clustering

A SOM carries out a nonlinear projection of the input data space onto a set of units (neural network nodes) arranged on a two-dimensional grid. The grid contains µ neurons, with µ = R × C, where R and C are the number of rows and columns of the SOM grid, respectively. Each neuron has an associated weight vector, which is randomly initialized. During training, the weight vector associated with each neuron tends to become the centre of a cluster of input vectors. In addition, the weight vectors of adjacent (neighbouring) neurons move close to each other, fitting the high-dimensional input space into the two dimensions of the network topology [8, 9, 28].
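To make the training dynamics concrete, the following is a minimal from-scratch SOM sketch in Python. It is not the MATLAB Neural Network Toolbox implementation used for the experiments, and the learning-rate and neighbourhood schedules are illustrative assumptions.

```python
# Minimal from-scratch SOM training sketch (illustrative; not the toolbox used by the authors).
import numpy as np

def train_som(data, R=6, C=6, epochs=600, lr0=0.5, sigma0=3.0, seed=0):
    """data: array of shape (n_samples, n_features). Returns weights of shape (R, C, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(R, C, data.shape[1]))      # random initialisation
    grid = np.stack(np.meshgrid(np.arange(R), np.arange(C), indexing="ij"), axis=-1)

    n_iter = epochs * len(data)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]

        # Best-matching unit: the neuron whose weight vector is closest to the input
        bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (R, C))

        # Learning rate and neighbourhood radius decay linearly over training
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3

        # Gaussian neighbourhood around the BMU pulls nearby weight vectors toward the input
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights
```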

5 Results

The main goal of this research was to automatically discriminate between depressed and healthy speech. To this aim, the final MFCC dataset, after PCA reduction, was fed into an R × C SOM using the MATLAB Neural Network Toolbox [2]. R and C were both set to 6, giving a grid of 36 neurons. After training the SOM for 600 epochs, clusters of input vectors with similar MFCC-PCA reduced coefficients are formed on the grid, as illustrated in Fig. 29.5. Figure 29.5 represents the resulting coefficient hits per neuron, i.e. the number of coefficients that cluster in a given neuron of the SOM. The class of a node corresponds to the majority class of the samples clustered in it. Generally, a cluster centre is a neuron that holds a high density of coefficient hits and is closer to the remaining neurons of that cluster than to neurons belonging to other clusters. The centre of a cluster of neurons collecting the majority of hits from a class (in our case two classes: depressed and typical speech) is chosen as the neuron containing the maximum number of hits for that class whose neighbouring neurons also attract the majority of hits from the same class. According to this criterion, neuron 13 is a practical choice for depressed speech and neuron 24 for typical speech.
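The majority-class rule described above can be sketched as follows; this is an illustrative reconstruction that reuses the weights returned by the SOM sketch given earlier (train_som), not the toolbox routine actually used.

```python
# Sketch of majority-class node labelling and the resulting clustering accuracy
# (illustrative reconstruction; labels: 0 = typical, 1 = depressed).
import numpy as np

def node_majority_accuracy(weights, data, labels):
    labels = np.asarray(labels)
    R, C, _ = weights.shape
    flat = weights.reshape(R * C, -1)

    # Winning (best-matching) neuron index for each input vector
    winners = np.array([np.argmin(np.linalg.norm(flat - x, axis=1)) for x in data])

    correct = 0
    for node in range(R * C):
        hits = labels[winners == node]
        if hits.size:
            # The node takes the majority class of its hits; minority hits count as errors
            correct += max(np.sum(hits == 0), np.sum(hits == 1))
    return correct / len(data)
```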

Fig. 29.5
figure 5

The resulting SOM. The healthy (orange) and depressed (blue) MFCC features group into two different clusters along the matrix diagonal. Neuron numbers on the grid must be read from left to right, bottom to top

Figure 29.5 represents a statistical analysis of the SOM clustering of the entire dataset, containing 50% depressed and 50% control/typical speech feature coefficient hits. The x-axis represents the µ neurons (in this case µ = 36) of the SOM grid and the y-axis the number of hits for the healthy (red line) and depressed (blue line) speech. The neurons in Fig. 29.5 are not pure classes containing only one type of hit (as in the ideal case of 100% accuracy): they contain hits from both typical and depressed speech, as can be seen in Fig. 29.6. In real-life scenarios it is quite possible that a small number of typical speech coefficients fall into a cluster holding a majority of depressed coefficients, and vice versa. For example, neuron 13 in Fig. 29.5 collects 95 coefficient hits in total. However, when the SOM output is analysed through the MATLAB routine “nctool”, which quantifies the hits in each neuron, it appears that, out of these 95 hits, 91 belong to the depressed speech (true hits) and 4 to the typical speech (false hits). This is illustrated in Fig. 29.6, where neuron 13 shows 91 rather than 95 hits. The same reasoning applies to neuron 24 and to all the remaining neurons. To obtain a realistic clustering accuracy, the testing procedure was repeated three times, using the Rand measure [18] (a minimal sketch of its computation is given after the list below) and exploiting three different sets of input vectors, randomly chosen with different proportions of depressed and healthy feature coefficients. The three sets of input vectors were defined as follows:

  1. Two sets of input vectors (extracted from 60% of the original dataset), containing 75% of depressed speech features and 25% of healthy ones;

  2. Two sets of input vectors (extracted from 40% of the original database), containing 12.5% of depressed speech features and 87.5% of healthy ones;

  3. Two sets of input vectors (the entire dataset), containing equal amounts of depressed and healthy speech (50% each).
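The Rand measure mentioned above can be computed as sketched below: the fraction of sample pairs on which the ground-truth classes and the cluster assignments agree (both samples in the same group or both in different groups). This is a generic illustration, not the exact evaluation script used in the experiments.

```python
# Minimal sketch of the Rand index between true classes and cluster assignments.
from itertools import combinations

def rand_index(labels_true, labels_pred):
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```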

Fig. 29.6
figure 6

Analysis of the result after clustering of the entire dataset. The x axis represents neurons and the y axis the total number of coefficient hits

The mean performance accuracy of the resulting SOM clustering on each of the three sets, computed from the numbers of SOM hits, is reported in Table 29.2.

Table 29.2 SOM classification results on three different sets of input data

6 Discussion and Conclusion

Many parameters in the speech of depressed people show significant differences with respect to a healthy reference group [1, 4, 20, 22, 23, 24]. In this study, such parameters were automatically extracted from a dataset of healthy and depressed speech waves by using the MFCC speech processing algorithm [10, 16]. The processed speech was then reduced in dimensionality using the PCA algorithm. The findings in the literature suggest that the features discriminating depressed from healthy speech are the durations of speech pauses, which are elongated, and the duration and frequency of speech clauses, which are shortened and less frequent for depressed subjects [4, 15]. It is possible that these features are captured by processing the speech waves through the MFCC algorithm and by using PCA to select, among the MFCC coefficients, those that show the greatest variability with respect to the variance of the data.

In this context, it was found that the combination of MFCC and PCA is a powerful technique for the automatic extraction of depressed speech features, since a SOM clustering algorithm applied to the processed data reached a discrimination accuracy of 80.67% (see Table 29.2). The clustering was performed on a small database of 24 recordings (12 depressed and 12 healthy subjects). Despite this limitation, the discrimination accuracy was far above chance, suggesting that the automatically extracted features (Sects. 29.2.1 and 29.5) are quite descriptive of depressed and healthy speech, notwithstanding the limited amount of data used for the automatic feature extraction. With more such data, an improvement in the discrimination accuracy is expected. Therefore, the combination of MFCC and PCA is a robust process for extracting features from speech, and SOMs provide a good platform for clustering. Similar results were obtained by Kiss et al. [15] using a Support Vector Machine classification algorithm trained on a larger Hungarian database and tested on the same Italian data. The method presented in this paper, applied to the same Italian database, improved the discrimination accuracy with respect to the classification accuracy reported in Kiss et al. [15]. This study can be extended to a multilingual speech database with a larger dataset in order to detect depression in a language-independent way.

Currently there is a huge demand for complex autonomous systems able to assist people with several needs, ranging from the long-term support of disordered health states (including the care of elders with motor-skill restrictions) to mood and communicative disorders. Support has been provided both through the monitoring and detection of changes in physical, cognitive, and social daily functional activities, and through the offer of therapeutic interventions [3, 17, 25]. According to the World Health Organization (WHO), at least 25% of people visiting family doctors live with depression, as reported on the WHO website (http://www.euro.who.int/en/health-topics/noncommunicable-diseases/mental-health/news/news/2012/10/depression-in-europe). This number is projected to increase and to place considerable burdens on national health care institutions in terms of the medical and social care costs associated with the assistance of such people. Voice Assistive Computer Interfaces able to detect depressive states from speech can be a solution to this problem, because they can provide automated on-demand health assistance, reducing the abovementioned costs. However, speech is intrinsically complex, and emotional speech even more so [7], requiring a holistic approach that accounts for several factors, including personality traits [27], social and contextual information, and cultural diversity [5]. “The goal is to provide experimental and theoretical models of behaviors for developing a computational paradigm that should produce [ICT interfaces] equipped with a human level [of] automaton intelligence” ([6], p. 48).