
1 Introduction

During any physical exercise we feel out of breath: the body demands more oxygen, which makes breathing faster and deeper. Speech produced in this state is perceptually different from speech produced under normal conditions, because breathing provides the driving force behind speech communication [3]. Fig. 1 shows a sample speech signal and its spectrogram for both the normal and the out-of-breath conditions. The out-of-breath case shows a reduction in signal duration, shorter pauses, an increase in signal amplitude, and weakened harmonics at higher frequencies. According to the source-filter theory of speech production, the vocal tract (VT) system is driven by a source signal to produce sound. The source is characterized by air passing through the glottis, an opening formed by the vibrating vocal folds situated inside the larynx. Given this close connection between respiration and phonation, we can expect the source signal to be influenced by the out-of-breath condition.

Fig. 1. Normal and out-of-breath speech signals in (a) and (b), and their respective spectrograms in (c) and (d).

There have been a few studies on the effect of the out-of-breath condition on the speech signal. Trouvain et al. [13] showed that under the out-of-breath condition there is an increase in subglottal pressure, which leads to a higher pitch frequency (F0). Godin et al. [8] analysed the formant frequencies F1 and F2 and the glottal open quotient using vowel and consonant-vowel utterances. Using both speech and electroglottograph (EGG) signals, they found that the formants F1 and F2 and the open quotient of the glottis are affected differently for different speakers. In [9], they also showed the influence of the out-of-breath condition on vocal fold vibration using six glottal features. Some authors studied the classification of normal and out-of-breath speech using features extracted from speech, such as MFCC and Teager energy based features [11] and Fourier model based harmonic features [5]. Few works, however, deal with the type of changes that occur in the vibration pattern of the vocal folds. It is also still unclear how much the source signal is affected under the out-of-breath condition. Hence, in this work, we focus on two tasks for analysing the source signal. In the first task, the EGG signal is analysed to find the changes in the glottal vibration pattern. In the second task, two source signals, namely the zero frequency filtered (ZFF) signal and the integrated linear prediction residual (ILPR), are analysed. In the literature, ZFF [10] and ILPR [12] are used as approximations to the source for speech production. Here, they are analysed to determine whether they are affected under the out-of-breath condition. The rest of the paper is organised as follows: Sect. 2 presents the methodology of our analysis, Sect. 3 describes the results and discussion, and Sect. 4 concludes the paper.

Fig. 2. Sample segments of speech, ZFF and ILPR signals for the vowel /a/.

2 Methodology

Under physical exertion, the speaker tries to suppress its effect by adjusting breathing and speaking durations. In this work, we consider vowel segments from sustained vowel phonations (SVPs) to analyse the effect of the out-of-breath condition. Both EGG and speech signals are taken into consideration. The analysis of these signals has three common sub-tasks: pre-processing, feature extraction and classification. The corresponding block diagram is shown in Fig. 3.

Fig. 3. Block diagram showing EGG and vowel speech processing and classification.

2.1 Pre-processing

The low-frequency trend in the EGG signal is removed with a third-order Butterworth high-pass filter whose cut-off frequency, \(f_c = 50\) Hz, was set experimentally. The amplitude is normalised to the range \(-1\) to \(+1\). Along with the EGG signal, its first difference (DEGG) is also considered, which indicates the instants as well as the rate of glottal opening and closing. Figure 4(a) shows a schematic EGG and DEGG signal. Similarly, the vowel speech signal undergoes mean removal and normalisation, followed by ZFF and ILPR extraction.
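The EGG pre-processing chain described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `preprocess_egg` is hypothetical, and the use of zero-phase filtering (`sosfiltfilt`), which applies the filter forwards and backwards, is an assumption that the text does not specify.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def preprocess_egg(egg, fs, cutoff_hz=50.0, order=3):
    """Return (normalised EGG, DEGG) for a 1-D EGG recording sampled at fs Hz."""
    egg = np.asarray(egg, dtype=float)
    # Third-order Butterworth high-pass filter at 50 Hz, in second-order sections.
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    # Assumption: zero-phase filtering, so glottal events are not shifted in time.
    filtered = sosfiltfilt(sos, egg)
    # Scale so that the largest absolute value is 1, i.e. the range is within [-1, +1].
    peak = np.max(np.abs(filtered))
    normalised = filtered / peak if peak > 0 else filtered
    # First difference of the EGG (DEGG), padded to keep the same length.
    degg = np.diff(normalised, prepend=normalised[0])
    return normalised, degg
```

A causal filter (`sosfilt`) would follow the stated filter order more literally, at the cost of a small frequency-dependent delay.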

ZFF Source. A zero frequency resonator is applied to the speech signal to remove the influence of the vocal tract [10]. The signal thus obtained is an approximation of the source signal. It is a sinusoid-like signal whose periodicity is inherited from the periodic vibration of the vocal folds. It shows negative-to-positive zero crossings at the instants of glottal closure [1]. Hence, it is used for detecting glottal closure instants and voiced regions in speech signals. Figure 2 shows a sample segment of speech and its corresponding ZFF signal.
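One common formulation of zero frequency filtering is sketched below: difference the speech, pass it through a cascade of two ideal zero-frequency resonators (each equivalent to a double running sum), and remove the resulting polynomial trend by repeatedly subtracting a local mean. The function names, the `mean_pitch_period_s` parameter (assumed to be known or estimated beforehand), the 1.5-period averaging window and the three trend-removal passes are assumptions for illustration, not details taken from this paper.

```python
import numpy as np


def zero_frequency_filter(speech, fs, mean_pitch_period_s=0.01, trend_passes=3):
    """Approximate ZFF signal of a 1-D speech segment (sketch, see assumptions above)."""
    x = np.asarray(speech, dtype=float)
    # Difference the signal to remove any DC offset.
    y = np.diff(x, prepend=x[0])
    # Each resonator y[n] = 2y[n-1] - y[n-2] + x[n] is a double running sum,
    # so a cascade of two resonators is four running sums.
    for _ in range(4):
        y = np.cumsum(y)
    # Local-mean window of about 1.5 mean pitch periods, forced to odd length.
    win = int(round(1.5 * mean_pitch_period_s * fs)) | 1
    kernel = np.ones(win) / win
    # Repeated local-mean subtraction removes the polynomial growth.
    for _ in range(trend_passes):
        y = y - np.convolve(y, kernel, mode="same")
    return y


def positive_zero_crossings(signal):
    """Indices where the signal goes from negative to non-negative."""
    s = np.asarray(signal)
    return np.flatnonzero((s[:-1] < 0) & (s[1:] >= 0)) + 1
```

The samples within a few window lengths of either end are affected by the zero padding of the moving average and are best discarded. For long segments at high sampling rates, the running sums grow quickly, so processing shorter segments or a downsampled signal is a reasonable precaution.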

ILPR Source. The ILPR is obtained by inverse filtering the speech signal. The inverse filter is derived from the pre-emphasized speech signal [12] and applied to the non-pre-emphasized speech. This choice gives the source signal more prominent peaks at glottal closure instants and smaller peaks at glottal opening instants. The ILPR resembles the DEGG signal more closely than the source obtained from pre-emphasized speech does, as it is more nearly unipolar [1]. It is used for detecting voicing and glottal closure instants. Figure 2 shows a sample segment of speech and its corresponding ILPR signal.
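The ILPR idea can be illustrated with a short sketch: estimate linear prediction (LP) coefficients from the pre-emphasized speech, then run the resulting inverse filter over the speech without pre-emphasis. The function names, the autocorrelation method, the pre-emphasis coefficient 0.97, the rule-of-thumb order `fs/1000 + 2` and the use of a single filter for the whole vowel segment are all assumptions made here for illustration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter


def lp_coefficients(segment, order):
    """Autocorrelation-method LP; returns the inverse filter [1, -a1, ..., -ap]."""
    w = np.asarray(segment, dtype=float) * np.hamming(len(segment))
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(w[: len(w) - k], w[k:]) for k in range(order + 1)])
    r[0] *= 1.0 + 1e-9  # tiny regularisation for numerical stability
    # Solve the symmetric Toeplitz normal equations R a = r[1:].
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    return np.concatenate(([1.0], -a))


def ilpr(speech, fs, order=None, preemph=0.97):
    """Sketch of ILPR: LP filter from pre-emphasized speech, applied to raw speech."""
    x = np.asarray(speech, dtype=float)
    if order is None:
        order = int(fs / 1000) + 2  # assumed rule of thumb
    pre_emphasized = lfilter([1.0, -preemph], [1.0], x)
    inverse_filter = lp_coefficients(pre_emphasized, order)
    # Inverse-filter the NON-pre-emphasized speech.
    return lfilter(inverse_filter, [1.0], x)
```

For running speech, the coefficients would normally be re-estimated frame by frame; a single filter is a simplification that is only plausible for a short, steady vowel.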

2.2 Feature Extraction

For the analysis of the EGG signal, a set of five EGG and DEGG based features is studied, which indicate changes in the vocal fold vibration pattern. These are the open quotient (OQ\(_{EGG}\)), the close quotient (CQ\(_{EGG}\)), the normalized amplitude quotient (NAQ), the DEGG strength at the glottal opening instant (A\(_{min\_{DEGG}}\)) and the skewness of the EGG waveform. For the analysis of the ZFF and ILPR source signals, the magnitude difference between the first two harmonics (H1-H2) is used as a feature. Features are computed for every 30 ms frame, with an overlap of 20 ms between consecutive frames.
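A 30 ms frame with a 20 ms overlap corresponds to a hop of 10 ms between frame starts. The framing step might look like the sketch below; `frame_signal` is a hypothetical helper, and dropping the incomplete frame at the end of the signal is an assumption.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=30.0, overlap_ms=20.0):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    x = np.asarray(x)
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = frame_len - int(round(fs * overlap_ms / 1000.0))
    if len(x) < frame_len:
        return np.empty((0, frame_len), dtype=x.dtype)
    # Only complete frames are kept; trailing samples are dropped.
    n_frames = 1 + (len(x) - frame_len) // hop
    starts = hop * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]
```

At a sampling frequency of 48 kHz, each frame holds 1440 samples and consecutive frames start 480 samples apart.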

Fig. 4. (a) Sample EGG and DEGG signal with time and amplitude parameters, (b) Boxplots of five features for vowel /a/ for one speaker.

The open quotient \(OQ_{EGG} = \frac{t_{op}}{T_0}\) and the close quotient \(CQ_{EGG} = \frac{t_{cl}}{T_0}\) define the fraction of a glottal cycle during which the vocal folds remain open or closed, respectively [14]. Here, \(t_{op}\) and \(t_{cl}\) are the durations of the open phase and the closed phase, and \(T_0\) is the time period of the cycle. Figure 4(a) shows a schematic diagram of EGG and DEGG along with the different timing intervals. The normalized amplitude quotient, given as \(NAQ = \frac{f_{ac}}{A_{max\_{DEGG}}T_0}\) [2], is related to the closing phase of the glottal cycle. A\(_{min\_{DEGG}}\) indicates the rate of opening of the vocal folds [14]. Skewness is a statistical parameter that measures the asymmetry of the distribution of a real-valued random variable about its mean. The harmonics-based feature H1-H2 is the magnitude difference, in dB, between the first two harmonics: \(H1-H2 = 20\log_{10}\left(\frac{H1}{H2}\right)\).
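As a rough illustration of the definitions above, the sketch below computes the two quotients from given phase durations and estimates H1-H2 from the magnitude spectrum of one frame. The function names, the Hann window, the zero-padded FFT length and the peak search within ±20% of each harmonic are assumptions; the fundamental frequency `f0` is assumed to be available from a separate pitch estimate.

```python
import numpy as np


def open_close_quotients(t_op, t_cl, T0):
    """OQ = t_op / T0 and CQ = t_cl / T0 for one glottal cycle."""
    return t_op / T0, t_cl / T0


def h1_h2_db(frame, fs, f0, nfft=8192, search=0.2):
    """H1-H2 in dB: 20*log10 of the ratio of the first two harmonic peaks."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)

    def harmonic_peak(centre_hz):
        # Largest spectral magnitude in a band around the expected harmonic.
        band = (freqs >= centre_hz * (1 - search)) & (freqs <= centre_hz * (1 + search))
        return spectrum[band].max()

    return 20.0 * np.log10(harmonic_peak(f0) / harmonic_peak(2.0 * f0))
```

A positive value means the first harmonic is stronger than the second. For example, a second harmonic at half the amplitude of the first gives \(20\log_{10}(2) \approx 6\) dB. The skewness of a frame can be obtained with `scipy.stats.skew`.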

2.3 Classifier

Support Vector Machine (SVM) Classifier. SVM is one of the most widely used binary classifiers. It determines a decision function that is maximally distant from the training data [4]; hence, it is also called the maximum margin classifier. Non-linear decision functions are obtained by using kernel functions, which enable the SVM to map the feature data into a higher dimensional space where the optimal hyperplane is determined. In this experiment, an SVM with a radial basis function (RBF) kernel is used for two-class classification. Five-fold cross-validation is used to optimize the SVM parameters: the training set is further divided into five sub-sets, of which four are used for training the SVM model and the remaining one is used for validation.
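A set-up of this kind can be sketched with scikit-learn as below. The function name `fit_rbf_svm`, the feature standardisation step and the grid of candidate `C` and `gamma` values are assumptions for illustration; the text does not state which parameter ranges were searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def fit_rbf_svm(X_train, y_train):
    """Tune an RBF-kernel SVM with 5-fold cross-validation on the training set."""
    # Assumption: features are standardised before the SVM.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # Assumption: illustrative grid, not the values used in the paper.
    param_grid = {
        "svc__C": [0.1, 1, 10, 100],
        "svc__gamma": [0.001, 0.01, 0.1, 1],
    }
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(np.asarray(X_train), np.asarray(y_train))
    return search  # refit on the full training set with the best parameters
```

The returned object can be used directly for prediction, since `GridSearchCV` refits the best model on the whole training set by default.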

K-Nearest Neighbour (KNN) Classifier. KNN is a non-parametric classification method in which the class of a test sample is decided by majority voting among its K nearest training samples [7]. In this work, K is set to 10 and distances are computed using the Euclidean measure.
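With scikit-learn, a classifier with these settings could be configured as in the sketch below; `make_knn` is a hypothetical helper name, and uniform (unweighted) voting is an assumption.

```python
from sklearn.neighbors import KNeighborsClassifier


def make_knn():
    """KNN classifier with K = 10 and Euclidean distance, uniform voting assumed."""
    return KNeighborsClassifier(n_neighbors=10, metric="euclidean", weights="uniform")
```

Because K is even, a 5-to-5 tie between the two classes is possible; how such ties are resolved depends on the implementation.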

3 Performance Analysis

The ability of a feature to indicate changes under out-of-breath condition is tested by Welch’s t-test using speech and EGG signals recorded under constant vowel phonation.
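Welch's t-test compares the means of two groups without assuming equal variances. A minimal sketch using SciPy is given below; the function name `welch_t_test` and the sign convention (out-of-breath minus normal) are assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind


def welch_t_test(feature_out_of_breath, feature_normal):
    """Return (t statistic, p-value) of Welch's unequal-variance t-test."""
    result = ttest_ind(
        np.asarray(feature_out_of_breath, dtype=float),
        np.asarray(feature_normal, dtype=float),
        equal_var=False,  # Welch's variant: variances are not pooled
    )
    return result.statistic, result.pvalue
```

The statistic equals \((\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}\), where \(\bar{x}_i\), \(s_i^2\) and \(n_i\) are the sample mean, sample variance and size of each group.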

3.1 Out-of-breath Data

A new database was created with speech and EGG signals recorded simultaneously for SVPs of the sounds /i/, /a/ and /u/. It has two classes of signals, out-of-breath and normal. The out-of-breath class was recorded after two minutes of a jump rope workout, whereas the normal class was recorded right before the workout. Five male speakers, all research scholars at the Indian Institute of Technology Guwahati and aged 25–30 years, participated in the recording. In total, 191 SVPs of 1 s duration each were collected: 105 in the normal class and 86 in the out-of-breath class. Speech was recorded with a Tascam DR-100MK II linear PCM recorder and EGG with a TechCadenza M2LU digital electroglottograph recorder. A sampling frequency of 48 kHz with 24-bit resolution was used.

Table 1. Welch’s t-test statistics for EGG.

3.2 Result and Discussion

Table 1 shows the t-test values for the five glottal features. As a representative example, boxplots for the vowel /a/ are shown in Fig. 4(b). For vowel phonations, the interquartile range (IQR) of OQ\(_{EGG}\) lies higher for the out-of-breath condition, while CQ\(_{EGG}\) shows the opposite behaviour, as expected. This indicates that the vocal folds remain open for a longer part of the glottal cycle when a person is out of breath. NAQ shows a higher t-value as well as a downward shift of the IQR for the out-of-breath case relative to the normal case. This implies that the rate at which the vocal folds close increases in the out-of-breath condition. Such behaviour is not observed in A\(_{max\_{DEGG}}\), which represents the strength of glottal closure. This may be because the level of exertion under the out-of-breath condition differs across speakers. Similarly, only a minor change is observed in the glottal opening strength A\(_{min\_{DEGG}}\). Skewness has a higher t-value, with a lower mean for the IQR in the normal condition. This suggests that the density function of the EGG waveform is positively skewed under the out-of-breath condition.

Fig. 5. Spectrum of a frame of vowel /u/ for source (a) ZFF, (b) ILPR.

Fig. 6. Averaged H1-H2 over all frames of utterance /u/.

For the ZFF and ILPR source signals, Fig. 5 shows the spectrum of a frame of the vowel /u/. For the ILPR signal, in the majority of out-of-breath cases the magnitude of the first harmonic peak (H1) is higher than that of the second harmonic peak (H2), which is the opposite of the normal condition. Thus, the harmonic magnitude difference H1-H2 is higher for the out-of-breath condition. A similar trend in H1-H2 is observed for the ZFF source signal. For ZFF, however, H2 is more suppressed in the out-of-breath condition, whereas H1 remains high in both conditions. This suggests that the ZFF signal contains more low-frequency components in the out-of-breath condition. Figure 6 shows the variation of the averaged H1-H2 values for the vowel /u/ uttered by all speakers. The Welch's t-test values are high for both source signals, as shown in Table 3. This suggests that these approximated source signals can carry information about physical exertion.

Table 2. Confusion matrix for classifiers SVM and KNN for the combined feature set of OQ\(_{EGG}\), CQ\(_{EGG}\), NAQ, A\(_{min\_{DEGG}}\) and skewness.

Classification results are obtained using a leave-one-speaker-out approach, in which the utterances of one speaker are used for testing and those of the remaining speakers for training. Table 2 shows the confusion matrices of the SVM and KNN classifiers for the EGG based features. The average binary classification rates are 73.40% and 71.24% for SVM and KNN, respectively. Of the two approximated source signals, ZFF gives the better classification result, with accuracies of 70.40% and 71.0% for the SVM and KNN classifiers, respectively. The ILPR source is not far behind, with accuracies of 63.60% and 68.60% for the same classifiers. Table 4 shows these classification results. In the literature, the highest classification rate, 91.90%, was obtained by Suman et al. [5] on a regular speech corpus. They used a combination of harmonic features, Teager energy based features, glottal features and mel frequency cepstral features for classification.
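A leave-one-speaker-out evaluation can be sketched with scikit-learn's `LeaveOneGroupOut`, using the speaker identity of each sample as the group label. The function names and the choice of KNN as the example model are assumptions for illustration; `X`, `y` and `speaker_ids` stand for a feature matrix, class labels and per-sample speaker labels.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier


def make_knn():
    """Example model factory: KNN with K = 10 and Euclidean distance."""
    return KNeighborsClassifier(n_neighbors=10, metric="euclidean")


def leave_one_speaker_out_accuracy(make_model, X, y, speaker_ids):
    """Per-speaker test accuracy, training each time on all other speakers."""
    X, y, speaker_ids = np.asarray(X), np.asarray(y), np.asarray(speaker_ids)
    accuracies = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speaker_ids):
        model = make_model()  # fresh model for every held-out speaker
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))
    return np.array(accuracies)
```

Averaging the returned array gives one overall classification rate, with every speaker weighted equally regardless of how many utterances that speaker contributed.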

 

Table 3. Welch’s t-test for ZFF and ILPR source signals.

Table 4. H1-H2 feature classification for ZFF and ILPR source signals.

4 Conclusion

In this work, we studied the source characteristics of the speech signal under the out-of-breath condition. Three source signals, EGG, ZFF and ILPR, were examined. Using sustained vowels, the analysis of EGG showed that the glottal opening and closing pattern is altered in the out-of-breath case. This is expected under physical exertion, as the lungs require more air and the breathing pattern therefore becomes faster and deeper. As recording EGG is not always possible, we also considered sources extracted from the speech signal, namely ZFF and ILPR. Spectral analysis of these sources showed an alteration in the harmonic structure, indicated by the H1 and H2 harmonic peaks. The statistical analysis and classification results indicate that the characteristics of the source signals differ between the normal and out-of-breath cases.