1 Introduction

Cleft Lip and Palate (CLP) is a craniofacial malformation that occurs in about one in every 700 live births [1]. It is characterized by the incomplete formation of the tissues that separate the oral and nasal cavities, which gives rise to several speech disorders such as hypernasality, hyponasality, glottal stops, and others. During speech production it is necessary to control the amount of air that comes out through the nasal cavity. This amount of air is controlled by the velum, which opens and closes the connection between the oral and nasal cavities depending on which sound the speaker intends to produce (e.g., nasal or non-nasal). When there is an excess of air coming out through the nasal cavity, the speech is perceived as hypernasal, which is the speech pathology suffered by the majority of children with CLP [2, 3].

The evaluation of CLP patients is subjective and time-consuming because it highly depends on the expertise of the speech and language therapist. For several decades the research community has been interested in the development of systems that allow the objective evaluation of speech in children with CLP. One of the most suitable approaches has been the use of features that reflect articulation deficits. The most common ones include Mel-frequency cepstral coefficients (MFCCs), the first two vocal formants, the vowel space area (VSA), and non-linear measures like the Teager Energy Operator (TEO). For instance, in [4] speech recordings of CLP patients are studied with the aim of discriminating the severity of hypernasality. The extracted features were based on the fundamental frequency of voice, energy content, and MFCCs. The authors reported accuracies of up to 80.4% when discriminating four different grades of hypernasality. In [5] the five Spanish vowels are considered to evaluate hypernasality in children with CLP using acoustic and noise features. The authors reported accuracies of up to 89%. In [6] the authors introduced a method for the classification of hypernasal patients and healthy control subjects using the VSA and 13 MFCCs with their corresponding first and second derivatives. According to the authors, the proposed approach achieves an accuracy of 86.89%. In [7] features based on acoustic, noise, cepstral analysis, nonlinear dynamics [8], and entropy measures are used for hypernasality detection. Accuracies of around 92% are reported when the five Spanish vowels and the words coco and gato are considered.

Although articulation-related features have been extensively used in the literature to evaluate hypernasality in the speech of CLP patients, all of the works report results on single databases. Thus, there is no evidence about the robustness of those features when they are used on different databases recorded under different acoustic conditions and with different recording settings. This study evaluates the suitability of classical articulation-related features to discriminate between hypernasal and healthy speech signals collected from children with CLP at two different clinics for children in Colombia, using different recording settings and under different acoustic conditions. The feature set extracted for this study includes 12 MFCCs and their first and second derivatives, the first two vocal formants (\(F_1\) and \(F_2\)) and their first and second derivatives, the TEO, the formant centralization ratio (FCR), and the VSA. Four statistical functionals are calculated upon each feature vector: mean, standard deviation, kurtosis, and skewness. Two different classifiers are evaluated: a support vector machine (SVM) and a Random Forest (RF).

2 Participants

2.1 CLP Manizales

This database was provided by the Grupo de Control y Procesamiento Digital de Señales (GCyPDS) at Universidad Nacional de Colombia, Manizales. It contains recordings of the five Spanish vowels pronounced by children between 5 and 15 years old. A total of 140 audio recordings were collected; 84 were labeled as hypernasal by a phoniatry expert and the remaining 56 were labeled as healthy. The signals were recorded in a quiet room but under non-controlled acoustic conditions, using a non-professional audio setting with a sampling frequency of 44100 Hz and 16-bit resolution.

2.2 CLP Clínica Noel

This data set was recorded at the Clínica Noel in Medellín and contains recordings of the five Spanish vowels pronounced by children between 5 and 15 years old. The data comprise 95 audio recordings: 53 children with CLP and 42 healthy controls (HC). All of the 53 children included in the CLP group were labeled as hypernasal by a phoniatry expert. The recordings were collected in a quiet room using a professional audio setting with a sampling frequency of 44100 Hz and 16-bit resolution. Further details of this corpus can be found in [9].

3 Methods

3.1 Feature Extraction

Mel-Frequency Cepstral Coefficients (MFCCs). For several decades these features have been widely used in applications of automatic speech recognition and speaker identification. For about a decade, however, MFCCs have also been used in pathological speech analysis, e.g., for laryngeal pathologies [10] and dysarthria in Parkinson’s disease [11]. The coefficients are based on the Mel scale, which relates the perceived frequency of a tone to its actual measured frequency and thereby emulates the frequency response of the human hearing system.

To estimate the MFCCs, let \(s_i(k)\) be the \(i^{th}\) frame of the speech signal and \(S_i(k)\) its Discrete Fourier Transform (DFT). The mel-spectrum is then [12]:

$$\begin{aligned} \mathrm {MF}_i[r] = \frac{1}{A_r}\sum _{k = L_r}^{U_r} \left| V_r[k]S_i[k] \right| ^2, \quad \quad r = 1, 2, \ldots , R \end{aligned}$$
(1)

where R is the number of filters, \(V_r[k]\) is the weighting function of the \(r^{th}\) mel filter, \(L_r\) and \(U_r\) are the lower and upper limits of the filter, and \(A_r\) is a normalization factor for the \(r^{th}\) filter. The MFCCs are finally computed as:

$$\begin{aligned} \mathrm {MFCC}_i[n] = \frac{1}{R}\sum _{r = 1}^R \log (\mathrm {MF}_i[r])\cos \left[ \frac{2\pi }{R}\left( r + \frac{1}{2}\right) n \right] \end{aligned}$$
(2)

\(\mathrm {MFCC}_i[n]\) is evaluated for \(n = 1,2, \ldots , N_{\mathrm {MFCC}}\), where \(N_{\mathrm {MFCC}}\) is the number of desired coefficients (\(N_{\mathrm {MFCC}} < R\)). The first and second derivatives of the MFCCs are also extracted before computing the four statistical functionals, forming a 144-dimensional feature vector.
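As an illustration only (not part of the original experimental setup), Eqs. (1) and (2) can be sketched in numpy as follows. Triangular mel filters, the FFT size, and the omission of pre-emphasis and windowing are simplifying assumptions of this sketch, and the cosine transform follows Eq. (2) exactly as written above:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(R, n_fft, fs):
    # Triangular weighting functions V_r[k] spanning 0 .. fs/2 (assumed shape)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), R + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    V = np.zeros((R, n_fft // 2 + 1))
    for r in range(R):
        lo, ce, hi = bins[r], bins[r + 1], bins[r + 2]
        V[r, lo:ce] = (np.arange(lo, ce) - lo) / max(ce - lo, 1)   # rising edge
        V[r, ce:hi] = (hi - np.arange(ce, hi)) / max(hi - ce, 1)   # falling edge
    return V

def mfcc_frame(frame, V, n_mfcc=12):
    # Eq. (1): weighted power spectrum of one frame summed per mel filter
    S = np.abs(np.fft.rfft(frame, n=2 * (V.shape[1] - 1))) ** 2
    MF = V @ S + 1e-10                       # small offset avoids log(0)
    # Eq. (2): cosine transform of the log mel-spectrum
    R = V.shape[0]
    n = np.arange(1, n_mfcc + 1)[:, None]
    r = np.arange(1, R + 1)[None, :]
    C = np.cos(2.0 * np.pi / R * (r + 0.5) * n)
    return (np.log(MF) @ C.T) / R
```

Calling `mfcc_frame` on each 40 ms frame and stacking the results gives the per-frame coefficient matrix from which the derivatives and functionals are computed.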

First and Second Vocal Formants (\({\varvec{F}}_{\mathbf{1}}\) and \({\varvec{F}}_{\mathbf{2}}\)). These two frequencies correspond to the first two peaks that appear when the envelope of a voice spectrum is calculated (typically by a linear predictive filter). \(F_1\) and \(F_2\) provide information about resonances that occur in the vocal tract during phonation. Thus they are related to the shape of organs and tissues in the vocal cavity, e.g., the tongue. Typically, the first two formants are used to evaluate the capability of a speaker to keep the tongue in a certain position during the phonation of a given vowel [6]. In this study, we calculate \(F_1\) and \(F_2\) from the spectral envelope found with a Linear Predictive Coding (LPC) filter. This method assumes that each sample of a frame \(s_i\) of the speech signal can be approximated as a linear combination of the past p samples:

$$\begin{aligned} \hat{s}_i(n) = \sum _{k = 1}^p a_k s_i(n - k) \end{aligned}$$
(3)

where \(\mathbf {a} = (a_1, \ldots , a_p)\) is a vector of p coefficients. The aim is to find the \(\mathbf {a}\) that minimizes the mean square error (MSE):

$$\begin{aligned} \mathbf {a} = \mathop {\text {arg min}}\limits _{\mathbf {a}} \frac{1}{N}\sum _{n=1}^N \left( \hat{s}_i(n) - s_i(n)\right) ^2 \end{aligned}$$
(4)

The optimal vector \(\mathbf {a}\) that minimizes the MSE characterizes the envelope of the speech spectrum, and the value of p determines the shape of that envelope. As in the case of the MFCCs, the first and second derivatives of \(F_1\) and \(F_2\) are also calculated before computing the four statistical functionals, forming a 24-dimensional feature vector.
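The procedure in Eqs. (3) and (4) can be sketched in numpy as below. This is a minimal illustration: the autocorrelation method is one standard way to solve Eq. (4), the LPC order `p` and the pole-selection heuristic are illustrative choices, and no pre-emphasis or windowing is applied:

```python
import numpy as np

def lpc_coeffs(frame, p):
    # Autocorrelation method: solve the normal equations R a = r
    # for the coefficients a_1 .. a_p of Eq. (3)/(4).
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:p + 1])

def first_formants(frame, fs, p, n_formants=2):
    # Roots of A(z) = 1 - sum_k a_k z^{-k}; poles above the real axis
    # give candidate resonance (formant) frequencies in Hz.
    a = lpc_coeffs(frame, p)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 1e-2]
    freqs = np.sort(np.angle(roots) * fs / (2.0 * np.pi))
    return freqs[:n_formants]
```

The two lowest pole frequencies are taken as \(F_1\) and \(F_2\); in practice, bandwidth thresholds are often added to discard spurious poles.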

Teager Energy Operator (TEO). Consider the signal x(n). Its associated TEO was defined in [13] as:

$$\begin{aligned} \varPsi [x(n)] = x^2(n) - x(n + 1)x(n - 1) \end{aligned}$$
(5)

One of the most important characteristics of the TEO is its sensitivity to composite signals. If we consider a composite signal \(x(n) = s(n) + g(n)\), its TEO is computed as:

$$\begin{aligned} \varPsi [x(n)] = [s(n) + g(n)]^2 - [s(n + 1) + g(n + 1)][s(n - 1) + g(n - 1)] \end{aligned}$$
(6)

If a cross-correlation term between the two signals is defined as \(\varPsi _{cross}[s(n), g(n)] = s(n)g(n) - s(n+1)g(n-1)\), we obtain:

$$\begin{aligned} \varPsi [x(n)] = \varPsi [s(n)] + \varPsi [g(n)] + \varPsi _{cross}[s(n), g(n)] + \varPsi _{cross}[g(n), s(n)] \end{aligned}$$
(7)

Equation 7 shows that the superposition principle does not apply to the TEO. This property is useful in cases where the original signal is composed of several components. A regular speech spectrum has a typical profile in which the peaks correspond to the vocal formants. In the spectrum of hypernasal speech, additional peaks (additional formants) and additional anti-formants (additional valleys) appear. The TEO seems to be a good strategy to model such additional components in the spectra of hypernasal signals [14]. The TEO is extracted from windows of 40 ms length; then the four statistical functionals are computed to create a 4-dimensional feature vector per speaker.
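Equation (5) amounts to a three-sample sliding operation, sketched below. For a pure sinusoid \(A\sin (\omega n + \phi )\) the operator returns the constant \(A^2\sin ^2(\omega )\), which is a convenient sanity check of an implementation:

```python
import numpy as np

def teager_energy(x):
    # Eq. (5): psi[x(n)] = x(n)^2 - x(n+1) * x(n-1), valid for 1 <= n <= N-2,
    # so the output is two samples shorter than the input.
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]
```

Because \(A^2\sin ^2(\omega )\) depends on both amplitude and frequency, the TEO jointly tracks changes in either, which is what makes it responsive to the extra spectral components of hypernasal speech.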

Formant Centralization Ratio (FCR). The FCR is an alternative measure to represent articulatory problems in speakers. It offers the advantage of maximizing the sensitivity to vowel centralization while minimizing the sensitivity to inter-speaker variability. Thus, it provides more robust and stable information about the vowel space produced by a speaker. The FCR was proposed in [15] as:

$$\begin{aligned} \mathrm {FCR} = \frac{\mathrm {F}_{2u} + \mathrm {F}_{2a} + \mathrm {F}_{1i} + \mathrm {F}_{1u}}{\mathrm {F}_{2i} + \mathrm {F}_{1a}} \end{aligned}$$
(8)

where \(\mathrm {F}_{1a}\), \(\mathrm {F}_{2a}\), \(\mathrm {F}_{1i}\), \(\mathrm {F}_{2i}\), \(\mathrm {F}_{1u}\), and \(\mathrm {F}_{2u}\) are the first and second formants of the corner vowels /a/, /i/, and /u/, respectively.

Vowel Space Area (VSA). The VSA is the most common way of measuring vowel centralization using \(F_1\) and \(F_2\) of the corner vowels (/a/, /i/, /u/). It is given by the area of the triangle formed by the vertices (\(F_1\), \(F_2\)) in the vowel space created by the three corner vowels. The VSA is computed as [15]:

$$\begin{aligned} \mathrm {VSA} = \left| {\frac{\mathrm {F}_{1i}(\mathrm {F}_{2a} - \mathrm {F}_{2u}) + \mathrm {F}_{1a}(\mathrm {F}_{2u} - \mathrm {F}_{2i}) + \mathrm {F}_{1u}(\mathrm {F}_{2i} - \mathrm {F}_{2a})}{2}} \right| \end{aligned}$$
(9)
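Equations (8) and (9) reduce to simple arithmetic on the six corner-vowel formants, as the following sketch shows (the formant values used in the usage note below are illustrative, not measured data):

```python
def fcr(F1a, F2a, F1i, F2i, F1u, F2u):
    # Eq. (8): formant centralization ratio (dimensionless)
    return (F2u + F2a + F1i + F1u) / (F2i + F1a)

def vsa(F1a, F2a, F1i, F2i, F1u, F2u):
    # Eq. (9): area of the /a/-/i/-/u/ triangle in the F1-F2 plane (Hz^2)
    return abs(F1i * (F2a - F2u) + F1a * (F2u - F2i)
               + F1u * (F2i - F2a)) / 2.0
```

Centralized vowels move the corner formants toward each other, which shrinks the VSA and increases the FCR, so the two measures move in opposite directions as articulation degrades.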
Fig. 1. Vocal triangle computed for two different children from the two databases.

Figure 1 shows the resulting vocal triangle for two different children from the two databases. Note that in both cases the triangle of the patient is compressed, which is a typical indicator of reduced articulation capability.

At the end of the feature extraction procedure, each vowel pronounced by each speaker is modeled by a 172-dimensional feature vector. Additionally, the FCR and VSA are calculated per speaker. Since the five Spanish vowels are considered together, each speaker is finally represented by an 862-dimensional feature vector.

3.2 Classification

With the aim of comparing the robustness of two different classification approaches, two classifiers are considered: a Support Vector Machine (SVM) with a Gaussian kernel and a Random Forest (RF). The parameters of the classifiers were optimized following a 5-fold cross-validation strategy, where 4 folds were used for training and the remaining one for testing. Within the 4 training folds, another 5-fold cross-validation was performed, with the training accuracy as optimization criterion. The parameters were optimized via grid search over the training folds: for the SVM, \(\mathbf {C}\), \(\mathbf {\gamma } \in \{10^{-6}, 10^{-5}, \ldots , 10^{4}\}\); for the RF, the number of trees \(\mathbf {N} \in \{5, 10, 20, \ldots , 100 \}\) and the depth of the decision trees \(\mathbf {D} \in \{ 2, 5, 10, 20, \ldots , 100\}\). The optimal parameters are selected as the mode across the 5 folds of the cross-validation procedure.
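The nested cross-validation described above can be sketched with scikit-learn as follows. The function name and synthetic data are illustrative; the SVM grid defaults to the ranges reported in this section, and selecting the parameter mode across folds is left out for brevity (GridSearchCV refits per outer fold instead):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def nested_cv_scores(X, y, grid=None):
    # Inner 5-fold grid search (parameter selection on the 4 training folds)
    # wrapped in an outer 5-fold CV, as described above.
    if grid is None:
        grid = {'C': np.logspace(-6, 4, 11), 'gamma': np.logspace(-6, 4, 11)}
    inner = GridSearchCV(SVC(kernel='rbf'), grid, cv=5, scoring='accuracy')
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(inner, X, y, cv=outer, scoring='accuracy')
```

The RF variant is analogous, replacing the estimator with `RandomForestClassifier` and the grid with the tree-count and depth ranges above.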

3.3 Statistical Analysis and Merging of Features

Apart from calculating the features and performing the classification between the two classes, we wanted to evaluate the possibility of merging features extracted from two datasets collected under different acoustic conditions. The decision of which features could be merged across datasets was made according to Kruskal-Wallis statistical tests. The null hypothesis was that a given feature has the same distribution in both databases; thus, if the p-value \(<0.05\) the null hypothesis is rejected. The test was applied to all features of both databases, and only the features that passed the test (p-value \(\ge 0.05\)) were included in the merging process. At the end of the procedure, a total of 508 features passed the test.
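The selection rule can be sketched with scipy as below; the function name and the samples-by-features matrix layout are assumptions of this sketch:

```python
import numpy as np
from scipy.stats import kruskal

def mergeable_features(A, B, alpha=0.05):
    # A, B: (n_samples, n_features) feature matrices from the two databases.
    # A feature is kept only when the Kruskal-Wallis test cannot reject that
    # it has the same distribution in both databases (p-value >= alpha).
    keep = []
    for j in range(A.shape[1]):
        _, p = kruskal(A[:, j], B[:, j])
        if p >= alpha:
            keep.append(j)
    return keep
```

Note that keeping features with \(p \ge 0.05\) inverts the usual use of the test: here a non-significant result is the desired outcome, since it indicates the feature behaves similarly under both recording conditions.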

4 Experiments and Results

In this study three experiments are performed: (1) classification of CLP vs. HC with the Manizales database, (2) classification of CLP vs. HC with the Clínica Noel database, and (3) classification of CLP vs. HC with the fusion of both databases, considering only those features that passed the statistical test.

4.1 Experiment 1 - Manizales DB

Results for the classification of CLP patients vs HC subjects are presented in Table 1. Note that the best results are always obtained using the SVM.

Figure 2 shows histograms of the scores obtained during the classification process, along with the Receiver Operating Characteristic (ROC) curves obtained with the two classifiers. Note that for an SVM the score is the distance of each sample to the separating hyperplane, while for an RF it is the probability of a sample belonging to the selected class. From the histograms it is possible to observe that most of the CLP patients are correctly classified, which is reflected in the high sensitivity obtained in the experiments (\(96.4\%\)). For the HC subjects the result is not as high, but it is still competitive, with a specificity of \(89.6\%\). The AUROC values allow a more compact analysis of the system’s performance because they consider the classification accuracy of both classes (sensitivity and specificity) at the same time.
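These quantities can be computed directly from the classifier scores. A minimal numpy sketch follows, assuming a decision threshold of zero (natural for SVM distances) and an AUROC estimate that, for simplicity, ignores tied scores:

```python
import numpy as np

def sens_spec(scores, labels, thr=0.0):
    # Sensitivity: fraction of patients (label 1) with score above threshold;
    # specificity: fraction of controls (label 0) at or below the threshold.
    pred = scores > thr
    return np.mean(pred[labels == 1]), np.mean(~pred[labels == 0])

def auroc(scores, labels):
    # Probability that a randomly chosen patient scores higher than a
    # randomly chosen control (equivalent to the area under the ROC curve).
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    return np.mean(pos > neg)
```

The pairwise formulation of the AUROC makes explicit why it summarizes both sensitivity and specificity: it penalizes any patient-control pair ranked in the wrong order, regardless of the threshold.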

Table 1. CLP vs. HC using the Manizales DB
Fig. 2. Histograms of the scores and ROC curves (Manizales DB)

4.2 Experiment 2 - Clínica Noel DB

The results obtained with the two classifiers on the data of the Clínica Noel are shown in Table 2. As in Experiment 1, the histograms of the scores and the ROC curves are presented in Fig. 3. Note that, as in the experiments with the other dataset, the best results are obtained with the SVM. This result confirms the robustness of this classifier, which has been extensively used in the literature on pathological speech processing. When these results are compared to those presented in Fig. 2, the histograms show a larger overlap, which reduces the performance of the classifiers.

Table 2. CLP vs. HC using the Clínica Noel DB
Fig. 3. Histograms of the scores and ROC curves (Clínica Noel DB)

4.3 Experiment 3 - Fusion of both DB

Only those features that passed the Kruskal-Wallis test explained in Sect. 3.3 are considered in this experiment. The results obtained with the resulting subset of features on the Clínica Noel and Manizales databases are included in Tables 3 and 4, and the results after merging both datasets are presented in Table 5. Note that the results after selecting the features that passed the statistical tests are lower than those obtained with the complete feature set. This can likely be explained by the fact that only features robust against different acoustic conditions were kept by the statistical test, and those features are not necessarily the most suitable to model articulation deficits. On the other hand, Table 5 indicates that the results improve when the two databases are merged. It seems that the difference in the acoustic conditions of the two databases allows the selected features to complement each other, which improves the classification accuracies compared to those obtained when the selected features are used on each database separately. Figure 4 shows the histograms of the scores obtained in the classification process and the ROC curves. As in the previous experiments, a high sensitivity is obtained with both classifiers. This result indicates that the proposed approach is suitable to detect CLP patients, since it seems to be sensitive to the articulation deficits exhibited by children with CLP. We think that it could be used in future studies to evaluate the degree of nasalization, and we are currently collecting more data with such labels with the aim of performing these kinds of experiments.

Table 3. CLP vs. HC using the Manizales DB with selected features
Table 4. CLP vs. HC using the Clinica Noel DB with selected features
Table 5. CLP vs. HC using the merged DB with selected features
Fig. 4. Histograms of the scores and ROC curves (merged DB)

5 Conclusions

The proposed approach, based on articulation measures, is effective for the classification of hypernasality in children. High accuracies (above \(90\%\)) were obtained with the SVM classifier on the two databases. When Kruskal-Wallis tests are applied as the criterion to select features before merging the two databases, accuracies of around \(80.0\%\) are obtained. The scores obtained with the Clínica Noel DB show less separability between CLP patients and HC subjects. Conversely, the results obtained with the Manizales DB are higher, and the associated histograms show less overlap between the two classes. This can likely be explained by the difference in the acoustic conditions of the two databases. The sensitivity of the model was consistently high across the three experiments presented in this paper. This result may indicate that the proposed approach is suitable to evaluate the degree of hypernasality; further research with more data and additional labels is required to confirm this hypothesis. Our team is currently collecting more speech samples that will allow the evaluation of different degrees of hypernasality considering sustained vowel phonations and continuous speech signals.