1 Introduction

Most of the acoustic differences identified between depressed and healthy speech have been attributed to changes in F0 values and F0-related measures, formant frequencies, power spectral densities, Mel Frequency Cepstral Coefficients (MFCC), speech rate, and glottal parameters such as jitter and shimmer [1, 4, 20, 21, 22, 23, 24]. Recently, it has been observed that depressed speech exhibits a “slow” auditory dimension [19] and is perceived as sluggish with respect to the corresponding healthy samples. Esposito et al. [4] reported that this effect can be primarily attributed to lengthened empty pauses (no significant differences were found for filled pause durations and consonant/vowel lengthening) and shortened phonation time (fewer and shorter clauses), whose distribution in a dialogue differs significantly between depressed and healthy subjects. Such measures can be used to develop algorithms for automatic diagnostic tools capable of detecting different degrees of depressive states and, more generally, for embedding socially believable and contextual information in ICT interfaces [5]. For this reason, our goal is to extend the earlier work of Esposito et al. [4] and propose a way to automate the detection of such features and measures so that depressed speech can be distinguished from healthy speech. The proposed method is based on the MFCC speech processing algorithm [16, 22, 26] and on the Self-Organizing Map (SOM) [9, 27, 28] clustering method, adopted for the accuracy analysis. This approach is motivated by the fact that MFCC encoding captures subtle variations in speech [14], structuring the data according to their properties, while the SOM algorithm is able to identify such structures by clustering similar items together. The authors have previously tested this algorithm on seismic signals, obtaining valuable performance [8, 9].

The paper is structured as follows. The database used is described in Sect. 29.2; detailed descriptions of the dataset processing using MFCC and PCA are presented in Sects. 29.3 and 29.4 respectively. Section 29.5 reports our clustering results, followed by conclusions in Sect. 29.6.

2 Database

Read and spontaneous speech narratives, collected from healthy and depressed Italian participants, were used for the proposed research. For the read narratives, participants were asked to read the tale “The North Wind and the Sun”, a standard phonetically balanced short folk tale composed of approximately six sentences. For the spontaneous narratives, participants were asked to describe the daily activities they had performed during the past week. The depressed patients were recruited with the help of psychiatrists in the Department of Mental Health (Dipartimento di Salute Mentale) at the Caserta (Italy) general hospital, the Institute for Mental Health (Istituto di Igiene Mentale) at the Santa Maria Capua Vetere (Italy) general hospital, the Centre for Psychological Listening (Centro di Ascolto Psicologico) in Aversa (Italy), and a private psychiatric office. Consent forms were collected from all participants, who were then administered the Beck Depression Inventory, Second Edition (BDI-II) [4], adapted for the Italian language by Ghisi et al. [11]. BDI scores were calculated for both depressed and healthy subjects. A total of 24 sets of recordings were collected, 12 from healthy (typical) speakers and 12 from depressed patients. Each set contains two types of recordings, i.e. read and spontaneous narratives. The duration of each set ranges approximately from 150 to 300 s. Therefore, 150 s from every set were selected and divided into 15 speech waveforms of 10 s each. In this selection, the first 130 s belong to the spontaneous speech and the last 20 s to the read-narrative speech.

The recordings were made using a clip-on microphone (Audio-Technica ATR3350) connected to an external USB sound card. Speech was sampled at 16 kHz and quantized at 16 bits. For each subject, the recording procedure did not last more than 15 min. The demographic description of each subject involved in the experiment is reported in Esposito et al. [4].

2.1 Analysis of the Database

BDI-II scores in the range 0–12 correspond to control subjects. Table 29.1 illustrates the BDI-II score distributions of the depressed subjects, which are significantly higher than those of the matched control group, ranging from mild/moderate to severe degrees of depression. A Student's t-test for independent samples (one- and two-tailed hypothesis testing; a minimal sketch of such a test is given after the list below) suggested that the factors contributing to the discrimination between healthy and depressed speech [4] are:

  • The total duration of speech pauses (empty, filled and vowel lengthening taken all together) is significantly longer for depressed subjects compared to healthy ones

  • The total duration of empty pauses is significantly longer for depressed subjects compared to healthy ones

  • The clause duration is significantly shorter for depressed subjects compared to healthy ones

  • The clause frequency is significantly lower for depressed subjects compared to healthy ones
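As an illustration of the statistical comparison reported above, the following is a minimal sketch of an independent-samples Student's t-test in Python. It is not the authors' original analysis: the arrays, variable names and values are hypothetical placeholders for per-subject pause-duration totals.

```python
# Minimal sketch of an independent-samples Student's t-test (one- and two-tailed),
# assuming hypothetical per-subject totals of empty-pause duration in seconds.
import numpy as np
from scipy import stats

depressed_pause_totals = np.array([48.2, 51.7, 44.9, 55.3, 49.8, 53.1])  # hypothetical values
healthy_pause_totals = np.array([32.4, 29.8, 35.1, 31.0, 28.6, 33.7])    # hypothetical values

# Two-tailed test for independent samples (classic Student's t-test, equal variances assumed)
t_stat, p_two_tailed = stats.ttest_ind(depressed_pause_totals, healthy_pause_totals)

# One-tailed p-value for the directional hypothesis "depressed pauses are longer"
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

print(f"t = {t_stat:.3f}, two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
```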

Table 29.1 BDI-II score distributions of depressed patients with respect to age groups

3 Mel Frequency Cepstral Coefficients (MFCC)

Based on the above information, it was decided to process the speech data through an MFCC pre-processing algorithm, since this kind of pre-processing has been shown to be among the most accurate for extracting perceptual features from speech [14, 22, 26]. The length of the recordings obtained from the participants ranged from approximately 200 to 360 s. In order to account for the same amount of data for each participant, only the first 150 s of speech were selected from each recording. Each 150 s speech wave was divided into 15 segments of 10 s each, giving a total of T = 360 speech signals ((12 depressed + 12 healthy subjects) × 15 segments). Mel Frequency Cepstral Coefficients (MFCC) are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a Mel-frequency scale. The MFCC algorithm is based on human auditory perception, which resolves fine spectral details at frequencies below 1000 Hz, whereas higher frequency ranges are grouped more coarsely. In other words, MFCC processing is based on the known variation of the human ear's critical bandwidth and is generally used to obtain a parametric perceptual representation of acoustic signals [26, 30]. The algorithm exploits filters that are spaced linearly below 1000 Hz and logarithmically above it [22]. The Mel frequency scale represents the subjective pitch of pure tones, capturing significant characteristics of speech perception: for each tone with an actual frequency f measured in Hz, a subjective pitch is measured on the so-called Mel scale, which produces a linear frequency spacing below 1 kHz and a logarithmic spacing above. The extraction of the MFCC coefficients follows the steps illustrated in Fig. 29.1. For the sake of clarity, these steps are briefly described in the following.

Fig. 29.1
figure 1

MFCC processing

3.1 Pre-processing and FFT

The speech signal is passed through a first-order filter that increases the energy of high frequencies, according to the equation S′(α) = S(α) − A × S(α − 1), where S′(α) is the output of the filter applied to the speech signal S(α), A = 0.97 is the pre-emphasis coefficient, and α is the sample index. The pre-emphasised speech signal is segmented into small frames of length TW, with a shift of TS (both in ms). In our case, adjacent frames overlap by M = 640 samples (M < N) [16, 22]. A Hamming window [10] of 100 ms was applied according to Eqs. (29.1) and (29.2):

$$ S_{w}'(\alpha) = S'(\alpha) \times W(\alpha) $$
(29.1)
$$ W(\alpha) = 0.54 - 0.46\cos\left(\frac{2\pi\alpha}{N - 1}\right), \quad 0 \le \alpha \le N - 1 $$
(29.2)

where N is the number of samples in each window, \( S_{w}'(\alpha) \) is the output, and \( S'(\alpha) \) the input of the windowing process. The windowing was applied to each of the T = 360 speech segments of 10 s. The windowed signal is then Fast Fourier Transformed (FFT) [22].
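The following is a minimal NumPy sketch of this pre-processing chain (pre-emphasis, framing, Hamming windowing as in Eqs. (29.1)–(29.2), and FFT). It is an illustration rather than the authors' implementation; the 100 ms window is taken from the text, while the 60 ms shift is an assumption chosen so that adjacent 1600-sample frames overlap by the M = 640 samples mentioned above.

```python
# Minimal sketch of the pre-processing and FFT stage described above (illustrative, not the
# authors' exact implementation). The frame shift is an assumption consistent with M = 640.
import numpy as np

def preprocess_frames(signal, fs=16000, frame_len_ms=100, frame_shift_ms=60, A=0.97):
    # First-order pre-emphasis: S'(a) = S(a) - A * S(a - 1)
    emphasized = np.append(signal[0], signal[1:] - A * signal[:-1])

    N = int(fs * frame_len_ms / 1000)        # samples per window (1600 at 16 kHz)
    shift = int(fs * frame_shift_ms / 1000)  # frame shift (960 samples -> 640-sample overlap)

    # Split into overlapping frames (assumes the signal is at least one frame long)
    num_frames = 1 + (len(emphasized) - N) // shift
    frames = np.stack([emphasized[i * shift:i * shift + N] for i in range(num_frames)])

    # Hamming window, Eq. (29.2): W(a) = 0.54 - 0.46 * cos(2*pi*a / (N - 1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    windowed = frames * window               # Eq. (29.1)

    # Magnitude spectrum of each windowed frame
    return np.abs(np.fft.rfft(windowed, axis=1))
```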

3.2 Mel Frequency Warping and DCT

A set of triangular filters is used to compute a weighted sum of the FFT spectral components, so that the output approximates the spectrum on a Mel-frequency scale (Eq. (29.3))

$$ F(Mel) = 2595\,\log_{10}\left(1 + \frac{f}{700}\right) $$
(29.3)

The amplitude of a given filter over the Mel scale is represented as mj, where 1 ≤ j ≤ NC and NC is the number of filterbank channels (30 in our case). The cepstral parameters (cτ) are calculated from the filterbank amplitudes mj using Eq. (29.4):

$$ c_{\tau } = \sqrt {\left( {\frac{2}{{N_{C} }}} \right)} \sum\limits_{j = 1}^{{N_{C} }} {m_{j} \cos \left( {\left( {\frac{\pi \tau }{{N_{C} }}} \right)\left( {j - 0.5} \right)} \right)} $$
(29.4)

where τ is the index of the cepstral coefficients, 1 ≤ τ ≤ x, and x is the number of cepstral coefficients (5 in this case). Finally, the MFCCs are calculated using the discrete cosine transform (DCT) and a cepstral liftering routine based on Linear Prediction Analysis [30]. Through a trial-and-error process, it was observed that there was no significant difference in the classification accuracy of a SOM trained over a dataset of 12 MFCCs versus one of 5 MFCCs (for each 10 s of speech), while with fewer than 5 MFCCs the SOM classification accuracy decreased.
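As an illustration of the Mel warping of Eq. (29.3) and the cepstral computation of Eq. (29.4), the following sketch builds 30 triangular Mel filters and applies the cosine transform to the log filterbank outputs. It omits the liftering and Linear Prediction steps used by the authors, and the parameter names are our own.

```python
# Minimal sketch of Mel filterbank warping (Eq. 29.3) and cepstral computation (Eq. 29.4).
# Liftering and Linear Prediction Analysis are omitted; this is not the authors' exact code.
import numpy as np

def hz_to_mel(f):
    # Eq. (29.3): F(Mel) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_from_power_spectrum(power_frames, fs=16000, n_filters=30, n_ceps=5):
    n_fft = 2 * (power_frames.shape[1] - 1)                # FFT size implied by the spectrum width

    # Filter centre frequencies equally spaced on the Mel scale between 0 Hz and fs/2
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)

    # Triangular filters, spread linearly below 1 kHz and logarithmically above (via the Mel scale)
    fbank = np.zeros((n_filters, power_frames.shape[1]))
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        fbank[j - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[j - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    # Log filterbank amplitudes m_j ("real logarithm of the short-term energy spectrum")
    m = np.log(power_frames @ fbank.T + 1e-10)             # shape (frames, n_filters)

    # Eq. (29.4): c_tau = sqrt(2/N_C) * sum_j m_j * cos(pi * tau / N_C * (j - 0.5))
    tau = np.arange(1, n_ceps + 1)[:, None]
    j = np.arange(1, n_filters + 1)[None, :]
    dct_basis = np.sqrt(2.0 / n_filters) * np.cos(np.pi * tau / n_filters * (j - 0.5))
    return m @ dct_basis.T                                 # shape (frames, n_ceps)
```

Under these assumptions, the two sketches compose as `mfcc = mfcc_from_power_spectrum(preprocess_frames(wave) ** 2)`.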

Figures 29.2 and 29.3 report the MFCC processing of a 10 s speech wave for a depressed (Fig. 29.2) and a healthy (Fig. 29.3) subject, respectively. The figures are intended to show that such processing is able to capture the frequency and duration of clauses and empty pauses. Indeed, it can be seen that empty pauses are clearly more frequent in the depressed speech, producing a different MFCC encoding. In each figure, the topmost subplot is the original 10 s speech wave. The middle one displays the energy of the same speech after a 30-channel Mel-frequency processing: the x-axis represents time, the y-axis the Mel-frequency channel number, and the different colours indicate the amount of energy for a given sample at a given Mel-frequency filterbank. The bottommost subplot represents the MFCC encoding. When comparing the middle and bottom subplots of Figs. 29.2 and 29.3, it can clearly be seen that the energy of the depressed speech is lower than that of the typical speech over the given time frame.

Fig. 29.2
figure 2

MFCC processing on a 10 s speech sample of a depressed subject

Fig. 29.3
figure 3

MFCC processing on a 10 s speech sample of a typical person

3.3 Principal Component Analysis (PCA)

Principal Component Analysis is a common dimensionality reduction method applied for feature extraction in speech recognition [13]. PCA maps m-dimensional input data to an n-dimensional representation, where n ≤ m. The method assumes that the features that best describe the data lie along the directions in which the variation of the data is largest [12, 29]. Given F feature vectors, each of H cepstral coefficients, represented as \( x_{ij}, 1 \le i \le H, 1 \le j \le F \), the PCA processing is given by Eqs. (29.5) and (29.6):

$$ \nu_{ij} = x_{ij} - \overline{{x_{i} }} , 1 \le i \le H,1 \le j \le F $$
(29.5)
$$ \overline{{x_{i} }} = \frac{1}{F}\sum\nolimits_{j = 1}^{F} {x_{ij} } $$
(29.6)

where \( \nu_{ij} \) is the jth mean-centered data vector used for PCA and \( \overline{x_{i}} \) is the mean of the ith MFCC over the original dataset. Usually, PCA involves only a single covariance matrix. Here, however, we computed P covariance matrices, one for each 10 s of speech, as given by Eqs. (29.7) and (29.8).

$$ P = \frac{\text{Total sample inputs of PCA}}{\text{Cepstral coefficients per speech sample}} = \frac{H}{x} $$
(29.7)
$$ Cov_{i} = \frac{1}{F}\sum\limits_{j = 1}^{F} \nu_{ij}\nu_{ij}^{T}, \quad 1 \le i \le P $$
(29.8)

The principal components are obtained by solving the equation:

$$ Cov_{i}\, y_{i} = \lambda_{i}\, y_{i}, \quad 1 \le i \le P $$
(29.9)

where \( \lambda_{i} \ge 0 \) is an eigenvalue and \( y_{i} \) the corresponding eigenvector of \( Cov_{i} \). The dimensionality reduction step is performed by keeping only the eigenvectors corresponding to the K largest eigenvalues (K ≤ P). The resulting eigenvectors are stored in the matrix YK = [y1 y2 … yK], where y1, …, yK are the eigenvectors and \( \lambda_{1}, \ldots, \lambda_{K} \) the eigenvalues of the covariance matrix Covr (r ∈ [1, K]). The reduced PCA transformation is obtained through Eq. (29.10)

$$ z_{r} = Y_{K}^{T}\nu_{rj}, \quad 1 \le r \le K,\; 1 \le j \le F $$
(29.10)

where zr denotes the transformed vector.
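For illustration, the following is a minimal sketch of the PCA steps of Eqs. (29.5)–(29.10) in the standard single-covariance case; it does not reproduce the per-10 s blocking into P covariance matrices described above, and the function and variable names are our own.

```python
# Minimal sketch of Eqs. (29.5)-(29.10) in the single-covariance case (illustrative only).
import numpy as np

def pca_reduce(X, K):
    """X: array of shape (H, F), with H cepstral-coefficient rows and F feature-vector columns.
    Returns the K-dimensional projection Z, the basis Y_K and the kept eigenvalues."""
    # Eqs. (29.5)-(29.6): centre each row on its mean
    x_bar = X.mean(axis=1, keepdims=True)
    V = X - x_bar

    # Eq. (29.8): covariance of the centred data
    Cov = (V @ V.T) / X.shape[1]

    # Eq. (29.9): eigendecomposition; keep the eigenvectors of the K largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(Cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]
    Y_K = eigvecs[:, order]                       # columns y_1 ... y_K

    # Eq. (29.10): project the centred data onto the reduced basis
    Z = Y_K.T @ V                                 # shape (K, F)
    return Z, Y_K, eigvals[order]
```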

The first subplot of Fig. 29.4 shows the distribution of the MFCC coefficients before applying the PCA algorithm; the data points have their largest variation along the x-axis. The second subplot shows the reduced dataset, with data points related to the corresponding mean values of the original MFCC coefficients. The dataset was reduced from P = 360 vectors (each of 248 features) to K = 75 vectors. Both the depressed and the healthy data, represented as principal MFCC coefficients, are plotted together. As mentioned above, the features discriminating between depressed and typical speech are the total duration of speech pauses (empty, filled and vowel lengthening taken all together), the total clause duration, and the clause frequency. These features were not used directly for our clustering with SOM, since our speech samples were processed through the MFCC algorithm. However, it is possible that the MFCC coefficients encode these parameters. Figures 29.2 and 29.3 support this hypothesis, since the MFCC coefficients extracted from depressed speech (Fig. 29.2) display lower energy than those extracted from healthy speech waves (Fig. 29.3), indirectly suggesting more silences and longer empty pauses.

Fig. 29.4
figure 4

Plots of the obtained (top) and PCA reduced MFCC coefficients (bottom)

4 Self Organizing Map (SOM) Clustering

A SOM carries out a nonlinear projection of the input data space onto a set of units (neural network nodes) arranged on a two-dimensional grid. The grid contains µ neurons, with µ = R × C, where R and C are the number of rows and columns of the SOM grid, respectively. Each neuron has an associated weight vector, which is randomly initialized. During training, the weight vector associated with each neuron tends to become the centre of a cluster of input vectors. In addition, the weight vectors of adjacent (neighbouring) neurons move close to each other, fitting the high-dimensional input space into the two dimensions of the network topology [8, 9, 28].
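To make the training dynamics concrete, the following is a minimal from-scratch SOM sketch in Python. It is not the MATLAB Neural Network Toolbox implementation used for the experiments, and the learning-rate and neighbourhood schedules are illustrative assumptions.

```python
# Minimal from-scratch SOM training sketch (illustrative; not the toolbox used by the authors).
import numpy as np

def train_som(data, R=6, C=6, epochs=600, lr0=0.5, sigma0=3.0, seed=0):
    """data: array of shape (n_samples, n_features). Returns weights of shape (R, C, n_features)."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(R, C, data.shape[1]))      # random initialisation
    grid = np.stack(np.meshgrid(np.arange(R), np.arange(C), indexing="ij"), axis=-1)

    n_iter = epochs * len(data)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]

        # Best-matching unit: the neuron whose weight vector is closest to the input
        bmu = np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (R, C))

        # Learning rate and neighbourhood radius decay linearly over training
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3

        # Gaussian neighbourhood around the BMU pulls nearby weight vectors toward the input
        grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights
```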

5 Results

The main goal of this research was to automatically discriminate between depressed and healthy speech. To this aim, the final MFCC dataset, after PCA reduction, was fed into an R × C SOM using the MATLAB Neural Network Toolbox [2]. R and C were both set to 6, giving a grid of 36 neurons. After training the SOM for 600 epochs, clusters of input vectors with similar MFCC-PCA reduced coefficients are formed on the grid, as illustrated in Fig. 29.5. Figure 29.5 represents the resulting coefficient hits per neuron, i.e. the number of coefficients that cluster in a given neuron of the SOM. The class of a node corresponds to the majority class of the samples clustered in it. Generally, a cluster centre is a neuron that holds a high density of coefficient hits and is closer to the remaining neurons of that cluster than to neurons belonging to other clusters. The centre of a cluster of neurons collecting the majority of hits from a class (in our case two classes: depressed and typical speech) is chosen as the neuron containing the maximum number of hits for that class whose neighbouring neurons also attract the majority of hits from the same class. According to this criterion, neuron 13 is a practical choice for depressed speech and neuron 24 for typical speech.
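The majority-class rule described above can be sketched as follows; this is an illustrative reconstruction that reuses the weights returned by the SOM sketch given earlier (train_som), not the toolbox routine actually used.

```python
# Sketch of majority-class node labelling and the resulting clustering accuracy
# (illustrative reconstruction; labels: 0 = typical, 1 = depressed).
import numpy as np

def node_majority_accuracy(weights, data, labels):
    labels = np.asarray(labels)
    R, C, _ = weights.shape
    flat = weights.reshape(R * C, -1)

    # Winning (best-matching) neuron index for each input vector
    winners = np.array([np.argmin(np.linalg.norm(flat - x, axis=1)) for x in data])

    correct = 0
    for node in range(R * C):
        hits = labels[winners == node]
        if hits.size:
            # The node takes the majority class of its hits; minority hits count as errors
            correct += max(np.sum(hits == 0), np.sum(hits == 1))
    return correct / len(data)
```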

Fig. 29.5
figure 5

The resulting SOM. The healthy (orange) and depressed (blue) MFCC features group into two different clusters along the matrix diagonal. Neuron numbers on the grid must be read from left to right, bottom to top

Figure 29.5 represents a statistical analysis of the SOM clustering of the entire dataset, containing 50% depressed and 50% control/typical speech feature coefficient hits. The x-axis represents the µ neurons (in this case µ = 36) of the SOM grid and the y-axis the number of hits for the healthy (red line) and depressed (blue line) speech. The neurons in Fig. 29.5 are not pure classes containing only one type of hit (as in the ideal case of 100% accuracy): they contain hits from both typical and depressed speech, as can be seen in Fig. 29.6. In real-life scenarios it is quite possible that a small number of typical speech coefficients fall into a cluster holding a majority of depressed coefficients, and vice versa. For example, neuron 13 in Fig. 29.5 collects 95 coefficient hits in total. However, when the SOM output is analysed through the MATLAB routine “nctool”, which quantifies the hits in each neuron, it appears that, out of these 95 hits, 91 belong to the depressed speech (true hits) and 4 to the typical speech (false hits). This is illustrated in Fig. 29.6, where neuron 13 shows 91 rather than 95 hits. The same reasoning applies to neuron 24 and to all the remaining neurons. To obtain a realistic clustering accuracy, the testing procedure was repeated three times, using the Rand measure [18] (a minimal sketch of its computation is given after the list below) and exploiting three different sets of input vectors, randomly chosen with different proportions of depressed and healthy feature coefficients. The three sets of input vectors were defined as follows:

  1. Two sets of input vectors (extracted from 60% of the original dataset), containing 75% of depressed speech features and 25% of healthy ones;

  2. Two sets of input vectors (extracted from 40% of the original database), containing 12.5% of depressed speech features and 87.5% of healthy ones;

  3. Two sets of input vectors (the entire dataset), containing equal amounts of depressed and healthy speech (50% each).
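The Rand measure mentioned above can be computed as sketched below: the fraction of sample pairs on which the ground-truth classes and the cluster assignments agree (both samples in the same group or both in different groups). This is a generic illustration, not the exact evaluation script used in the experiments.

```python
# Minimal sketch of the Rand index between true classes and cluster assignments.
from itertools import combinations

def rand_index(labels_true, labels_pred):
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```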

Fig. 29.6
figure 6

Analysis of the result after clustering of the entire dataset. The x axis represents neurons and the y axis the total number of coefficient hits

The mean performance accuracy of the resulting SOM clustering on each of the three sets, computed from the numbers of SOM hits, is reported in Table 29.2.

Table 29.2 SOM classification results on three different sets of input data

6 Discussion and Conclusion

Many parameters in the speech of depressed people show significant differences with respect to a healthy reference group [1, 4, 20, 22, 23, 24]. In this study, such parameters were automatically extracted from a dataset of healthy and depressed speech waves by using the MFCC speech processing algorithm [10, 16]. The processed speech was then reduced in dimensionality using the PCA algorithm. The findings in the literature suggest that the features discriminating depressed from healthy speech are the durations of speech pauses, which are elongated, and the duration and frequency of speech clauses, which are shortened and less frequent for depressed subjects [4, 15]. It is possible that these features are captured by processing the speech waves through the MFCC algorithm and by using PCA to select, among the MFCC coefficients, those that show the greatest variability with respect to the variance of the data.

In this context, it was found that the combination of MFCC and PCA is a powerful technique for the automatic extraction of depressed speech features, since a SOM clustering algorithm applied to the processed data reached a discrimination accuracy of 80.67% (see Table 29.2). The clustering was performed on a small database of 24 recordings (12 depressed and 12 healthy subjects). Despite this limitation, the discrimination accuracy was far above chance, suggesting that the automatically extracted features (Sects. 29.2.1 and 29.5) are quite descriptive of depressed and healthy speech, notwithstanding the limited amount of data used for the automatic feature extraction. With more such data, an improvement in the discrimination accuracy is expected. Therefore, the combination of MFCC and PCA is a robust process for extracting features from speech, and SOMs provide a good platform for clustering. Similar results were obtained by Kiss et al. [15] using a Support Vector Machine classification algorithm trained on a larger Hungarian database and tested on the same Italian data. The method presented in this paper, applied to the same Italian database, improved the discrimination accuracy with respect to the classification accuracy reported in Kiss et al. [15]. This study can be extended to a multilingual speech database with a larger dataset in order to detect depression in a language-independent way.

Currently there is a huge demand for complex autonomous systems able to assist people with several needs, ranging from the long-term support of disordered health states (including the care of elders with motor-skill restrictions) to mood and communicative disorders. Support has been provided both through the monitoring and detection of changes in physical, cognitive, and social daily functional activities, and through the offer of therapeutic interventions [3, 17, 25]. According to the World Health Organization (WHO), at least 25% of people visiting family doctors live with depression, as reported on the WHO website (http://www.euro.who.int/en/health-topics/noncommunicable-diseases/mental-health/news/news/2012/10/depression-in-europe). This number is projected to increase and to place considerable burdens on national health care institutions in terms of the medical and social care costs associated with the assistance of such people. Voice Assistive Computer Interfaces able to detect depressive states from speech can be a solution to this problem, because they can provide automated on-demand health assistance, reducing the abovementioned costs. However, speech is intrinsically complex, and emotional speech even more so [7], requiring a holistic approach that accounts for several factors, including personality traits [27], social and contextual information, and cultural diversity [5]. “The goal is to provide experimental and theoretical models of behaviors for developing a computational paradigm that should produce [ICT interfaces] equipped with a human level [of] automaton intelligence” ([6], p. 48).