1 Introduction

Nasal sounds are produced when the glottal wave passes through the nasal cavity. This passage is controlled by the velum: when a speaker intends to utter a nasal sound, the velum lowers and allows the glottal wave to pass through the nasal cavity [10]. The fraction of the glottal wave that passes through the nasal cavity determines the degree of nasalization. Nasal sounds can be broadly divided into two categories. The first comprises nasal murmurs or nasal consonants, e.g. /m/ and /n/, which are produced by decoupling the oral tract. The second comprises nasalized vowels and nasalized semi-vowels, which are produced by coupling the oral tract and the nasal tract. In Indian languages, nasalized vowels can be uttered deliberately with the help of the ‘Matra’ present in the scripts; this is called phonemic nasalization. In English, by contrast, vowel nasalization occurs mostly through co-articulation. Co-articulatory nasalization is the phenomenon in which the velum lowers beforehand, in anticipation of a nasal consonant, and thus turns an oral vowel into a nasalized one. Similarly, when a nasal consonant precedes an oral vowel, the velum may remain open for some moments, which also contributes to co-articulatory nasalization. Another kind is functional nasalization, which occurs due to a functional disorder of the velopharyngeal mechanism.

Nasalized vowels contribute to the vocabulary of almost every language, and there are word pairs such as ‘dot’ and ‘don’t’ that differ by the introduction of vowel nasalization. However, vowel nasalization is difficult to detect, which makes it a challenging task for automatic speech recognition (ASR) systems. Pruthi [9] has shown that the accuracy of a Hidden Markov Model (HMM) based ASR system decreases if nasalized vowels are not detected, so detecting vowel nasalization is important for improving ASR performance. Nasalized vowels can be regarded as vowels with a higher degree of nasalization, and the degree of nasalization carries significant clinical information as well as information about speech intelligibility.

Many researchers have studied the spectral-domain properties of nasalized vowels and proposed acoustic parameters for them. Fant showed that nasalization decreases the amplitude of the first formant and increases its bandwidth [2]. House and Stevens [5] observed a spectral prominence around 1000 Hz and a reduction of the second formant. Fujimura and Lindqvist [3] studied the effects of nasalization on the vowels /aaa/, /ooo/ and /uuu/; they observed a movement of the first formant towards higher frequencies and the introduction of a pole-zero pair near the first and third formants. Glass and Zue carried out an extensive statistical analysis of the spectral-domain characteristics of nasalized vowels and proposed six features for their automatic detection [4]. Chen found the difference between the first formant and an extra peak to be a promising feature of nasality [1]. This property was exploited by Vijaylaxmi et al. for the detection of hypernasality [11]. They used a modified group delay based approach to resolve the first formant and the extra resonance that appears in hypernasal speech due to nasalization [8]. In their paper, they reported that the proposed feature has limitations when detecting nasalized vowels in the speech of healthy speakers. Pruthi analysed 37 acoustic parameters from the existing literature and selected nine knowledge-based acoustic parameters for the detection of nasalized vowels [9].

In this work, we propose an inverse filtering based feature that accounts for the amount of nasalisation present in a vowel. The rest of the paper is organised as follows. Section 2 describes the database. Section 3 proposes the inverse filtering based feature. Section 4 analyses different nasalised vowels. Section 5 summarises the findings of the analysis.

2 Database Description

In this study, speech data from 15 speakers have been collected. The data cover the vowels /e/, /u/ and /i/, their nasalized counterparts /en/, /un/ and /in/, respectively, and the word ‘summer’, which contains the nasal consonant /m/ [11]. The nasal consonant part is manually marked in all the speech files. All recordings were made in a speech recording studio, so the data are free from background noise. The recordings were made using Audacity software at a sampling frequency of 48000 Hz. However, since the speech signal carries little information above 5000 Hz, the data are resampled to 11025 Hz in this work.
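The resampling step can be sketched as follows. This is only an illustration of the rate conversion stated above; the function name `resampling_factors` is hypothetical, and the resampling method actually used is not given in the text. The sketch derives the integer up/down factors that a rational resampler (for example a polyphase one such as `scipy.signal.resample_poly(x, up, down)`) would need, and the resulting Nyquist frequency.

```python
from fractions import Fraction

ORIGINAL_RATE_HZ = 48000   # recording rate stated in the text
TARGET_RATE_HZ = 11025     # analysis rate stated in the text


def resampling_factors(source_rate, target_rate):
    """Return (up, down) with target_rate == source_rate * up / down.

    Hypothetical helper: Fraction reduces the ratio to lowest terms.
    """
    ratio = Fraction(target_rate, source_rate)
    return ratio.numerator, ratio.denominator


up, down = resampling_factors(ORIGINAL_RATE_HZ, TARGET_RATE_HZ)

# Highest frequency that can be represented after resampling.
nyquist_hz = TARGET_RATE_HZ / 2

print("upsample by", up, "then downsample by", down)
print("Nyquist frequency after resampling:", nyquist_hz, "Hz")
```

Because the ratio is not an integer, simple decimation is not possible; the signal has to be interpolated and decimated (or otherwise rationally resampled) with an anti-aliasing filter.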

3 Inverse Filtering Based Feature

Nasalized vowels can be viewed as the sum of oral sounds and nasal sounds. For different nasalized vowels, the oral filter changes its characteristics, while the nasal filter remains invariant [7]. This invariant nasal filter has characteristics similar to those of the filter that produces nasal murmur [7]. So a single nasal filter can be modeled for all nasalized vowels, and it can be estimated from the nasal murmur sound. The coupled oral and nasal tract can be modeled as in Fig. 1.

Fig. 1. Model of the speech production system

This understanding of our speech production system lets us model the speech sound as,

$$\begin{aligned} S(\omega )&=k \times G(\omega ) \times N(\omega )+(1-k) \times G(\omega ) \times O(\omega )\nonumber \\&=G(\omega ) \times N(\omega ) \times (k+(1-k) \times \frac{O(\omega )}{N(\omega )}) \end{aligned}$$
(1)

In Eq. 1, ‘k’ represents the fraction of the glottal wave passed through the invariant nasal filter. \(N(\omega )\) and \(O(\omega )\) represent the nasal filter and the oral filter, respectively, and \(G(\omega )\) and \(S(\omega )\) represent the glottal wave and the speech signal, respectively.

Let the nasal filter be written in terms of its Z zeros \(\omega _z\) and P poles \(\omega _p\) as,

$$\begin{aligned} N(\omega )=\frac{\prod _{z=1}^{Z} N(\omega -\omega _z)}{\prod _{p=1}^{P} N(\omega -\omega _p)} \end{aligned}$$
(2)

Substituting this expression for \(N(\omega )\) into Eq. 1 gives,

$$\begin{aligned} S(\omega )&=G(\omega ) \times N(\omega ) \times \left( k+(1-k) \times O(\omega ) \times \frac{\prod _{p=1}^{P} N(\omega -\omega _p)}{\prod _{z=1}^{Z} N(\omega -\omega _z)}\right) \nonumber \\ \implies S(\omega ) \times N(\omega )^{-1}&=G(\omega ) \times \left( k+(1-k) \times O(\omega ) \times \frac{\prod _{p=1}^{P} N(\omega -\omega _p)}{\prod _{z=1}^{Z} N(\omega -\omega _z)}\right) \end{aligned}$$
(3)

Now, if we evaluate Eq. 3 at a nasal pole \(\omega _p\) that has no oral pole nearby, we get,

$$\begin{aligned} S(\omega _p) \times N(\omega _p)^{-1}=G(\omega _p) \times (k)\nonumber \\ \implies k=\frac{S(\omega _p) \times N(\omega _p)^{-1}}{G(\omega _p)} \end{aligned}$$
(4)

In Eq. 4, ‘k’ represents the amount of nasalisation.
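The step from Eq. 3 to Eq. 4 can be spelled out as follows. It is a sketch under the stated conditions: \(\omega _p\) is a nasal pole, \(O(\omega _p)\) is finite because no oral pole lies nearby, and no nasal zero coincides with \(\omega _p\). The index \(p'\) is introduced here only to distinguish the running product index from the chosen pole. At \(\omega =\omega _p\) the factor with \(p'=p\) in the numerator vanishes, so the whole oral term drops out:

$$\begin{aligned} \frac{1}{N(\omega _p)}&=\frac{\prod _{p'=1}^{P} N(\omega _p-\omega _{p'})}{\prod _{z=1}^{Z} N(\omega _p-\omega _z)}=0 \\ \implies S(\omega _p) \times N(\omega _p)^{-1}&=G(\omega _p) \times \left( k+(1-k) \times O(\omega _p) \times 0\right) =G(\omega _p) \times k \end{aligned}$$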

Equation 4 suggests that, given the speech signal (\(S(\omega )\)), the glottal wave (\(G(\omega )\)) and a mathematical model of the nasal filter (\(N(\omega )\)), we can find the amount of nasalisation present in the corresponding speech signal. In any speech application \(S(\omega )\) is available, and \(G(\omega )\) can be estimated from it. The only extra information needed is the person-specific nasal filter.

In [6] we have shown that the first three formants of nasal murmur occur around 250 Hz, 1250 Hz and 2500 Hz. In [10] it is shown that /i/ and /e/ do not have any formant near 1250 Hz. So, to find the value of ‘k’ for /i/ and /e/, we use the pole corresponding to the formant location 1250 Hz, and for the vowel /u/ we use the pole near 2500 Hz. The expression for ‘k’ is evaluated on the unit circle rather than at the pole location itself.

The assumptions made in this study are,

  (i) The effect of evaluating ‘k’ on the unit circle instead of at the pole location is negligible. This assumption is made because the poles of the nasal filter lie very near the unit circle.

  (ii) There is no nasal zero near the chosen pole location. The zeros of the oral filter, if any, are also not taken into account in this study.
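The first assumption can be illustrated with a small numerical sketch. It evaluates the model of Eq. 1 on the unit circle at the frequency of a nasal pole and reports \(|S N^{-1}/G|\). Everything specific here is an assumption made for illustration: the all-pole filter shapes, the oral formants at 300 Hz and 2300 Hz, the pole radii, the true value k = 0.6, the use of the magnitude, and all function names. By the triangle inequality the deviation from k is bounded by \(|(1-k)\,O/N|\), which shrinks as the nasal pole approaches the unit circle.

```python
import cmath
import math


def all_pole_response(z, poles):
    """Evaluate 1 / prod(1 - p * z^-1) at the complex point z."""
    value = 1.0 + 0.0j
    for p in poles:
        value /= (1.0 - p / z)
    return value


def conjugate_pair(freq_hz, radius, fs_hz):
    """Return a complex-conjugate pole pair for a resonance at freq_hz."""
    angle = 2.0 * math.pi * freq_hz / fs_hz
    pole = cmath.rect(radius, angle)
    return [pole, pole.conjugate()]


def k_on_unit_circle(k_true, nasal_poles, oral_poles, freq_hz, fs_hz):
    """Return |S * N^-1 / G| from Eq. 1, evaluated on the unit circle.

    G cancels in the ratio, so it is set to 1 here.
    """
    z = cmath.exp(1j * 2.0 * math.pi * freq_hz / fs_hz)
    n = all_pole_response(z, nasal_poles)
    o = all_pole_response(z, oral_poles)
    s = k_true * n + (1.0 - k_true) * o   # Eq. 1 with G = 1
    return abs(s / n)


FS_HZ = 11025.0
# Hypothetical oral filter with two formants away from 1250 Hz.
ORAL_POLES = (conjugate_pair(300.0, 0.95, FS_HZ)
              + conjugate_pair(2300.0, 0.95, FS_HZ))

for radius in (0.90, 0.98, 0.999):
    nasal_poles = conjugate_pair(1250.0, radius, FS_HZ)
    estimate = k_on_unit_circle(0.6, nasal_poles, ORAL_POLES, 1250.0, FS_HZ)
    print("nasal pole radius", radius, "-> estimated k", round(estimate, 4))
```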

4 Analysis of Nasalised Vowels

The information needed to find the value of ‘k’ comprises the person-specific nasal filter (\(N(\omega )\)), the speech signal (\(S(\omega )\)) and the glottal wave (\(G(\omega )\)). The person-specific invariant nasal filter is estimated from the nasal murmur using linear prediction (LP) analysis. The coefficients of a 12th-order LP model are estimated for each 20 ms windowed segment of nasal murmur. From the LP coefficients, the poles of every frame are estimated and then averaged to obtain the invariant nasal filter for each person. In this analysis \(G(\omega )\) is taken as the residual of a 12th-order LP filter of the speech signal. Figure 2 shows the pole-zero plot of the nasal filter of one person. The cross marks show the pole locations, and the poles marked in green are those of the estimated invariant nasal filter.

Fig. 2. Pole zero plot of the nasal filter of a person (Color figure online)
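The LP analysis described above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration, not necessarily the authors' implementation: the text does not state the estimation method or the window type, so both are assumptions, and all function names are hypothetical. The demo uses a synthetic second-order resonance instead of recorded speech. Pole estimation (root finding on the LP polynomial, e.g. with `numpy.roots`) and the averaging of poles across frames are not shown.

```python
import math

FS_HZ = 11025                      # sampling rate after resampling
FRAME_LEN = int(0.020 * FS_HZ)     # 20 ms frame -> 220 samples
LP_ORDER = 12                      # order stated in the text


def autocorr(frame, max_lag):
    """Autocorrelation of a frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations; returns ([1, a1, ..., a_order], error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        refl = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + refl * a[i - j]
        new_a[i] = refl
        a = new_a
        err *= (1.0 - refl * refl)
    return a, err


def lp_residual(signal, a):
    """Inverse-filter the signal with A(z) to obtain the LP residual."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]


def synth_resonance(freq_hz, radius, fs_hz, n_samples):
    """Impulse response of a two-pole resonator (synthetic test signal)."""
    a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs_hz)
    a2 = radius * radius
    h = []
    for n in range(n_samples):
        y = 1.0 if n == 0 else 0.0
        if n >= 1:
            y -= a1 * h[n - 1]
        if n >= 2:
            y -= a2 * h[n - 2]
        h.append(y)
    return h, [1.0, a1, a2]


# Demo on a synthetic 1250 Hz resonance with a 2nd-order model.
signal, true_a = synth_resonance(1250.0, 0.9, FS_HZ, 400)
est_a, _ = levinson_durbin(autocorr(signal, 2), 2)
residual = lp_residual(signal, est_a)
print("true coefficients     :", true_a)
print("estimated coefficients:", est_a)
```

For real data the same functions would be called with `order=LP_ORDER` on each frame of `FRAME_LEN` samples, after applying whatever window is chosen.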

The value of ‘k’ is computed at five frequency locations and averaged to suppress any spurious peak that may arise from the division by the residual signal. The five locations are the selected frequency and the two adjacent frequencies on either side of it, spaced 5 Hz apart. Values of ‘k’ for the vowels /i/, /e/ and /u/ are calculated and their box plots obtained. Figures 3, 4 and 5 show the box plots for the vowels /i/, /e/ and /u/, respectively.
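The five-point averaging can be sketched as below. Several points are assumptions rather than details from the text: the five frequencies are read as f0, f0 ± 5 Hz and f0 ± 10 Hz; the magnitude of \(S N^{-1}/G\) is used; the inverse nasal filter is taken to be the LP polynomial of an all-pole nasal model; and the demo signal is synthetic, with an impulse standing in for the glottal residual. All function names are hypothetical.

```python
import cmath
import math


def dtft(x, freq_hz, fs_hz):
    """Discrete-time Fourier transform of sequence x at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    return sum(v * cmath.exp(-1j * w * n) for n, v in enumerate(x))


def five_frequencies(center_hz, step_hz=5.0):
    """Selected frequency plus two neighbours on each side, step_hz apart."""
    return [center_hz + m * step_hz for m in (-2, -1, 0, 1, 2)]


def estimate_k(speech, residual, nasal_lp, center_hz, fs_hz):
    """Average |S * N^-1 / G| over the five frequencies (Eq. 4)."""
    values = []
    for f in five_frequencies(center_hz):
        s = dtft(speech, f, fs_hz)
        g = dtft(residual, f, fs_hz)
        n_inv = dtft(nasal_lp, f, fs_hz)   # A_N = 1 / N for an all-pole N
        values.append(abs(s * n_inv / g))
    return sum(values) / len(values)


def resonator(freq_hz, radius, fs_hz):
    """LP polynomial [1, a1, a2] of a two-pole resonator."""
    return [1.0,
            -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs_hz),
            radius * radius]


def impulse_response(a, n_samples):
    """Impulse response of the all-pole filter 1 / A(z), with a[0] == 1."""
    h = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * h[n - k]
        h.append(acc)
    return h


# Synthetic demo: nasal resonance at 1250 Hz, oral resonance at 300 Hz.
FS_HZ = 11025.0
N_SAMPLES = 2000
K_TRUE = 0.6                                   # illustrative value only
nasal_lp = resonator(1250.0, 0.98, FS_HZ)
oral_lp = resonator(300.0, 0.95, FS_HZ)
h_nasal = impulse_response(nasal_lp, N_SAMPLES)
h_oral = impulse_response(oral_lp, N_SAMPLES)
speech = [K_TRUE * a + (1.0 - K_TRUE) * b for a, b in zip(h_nasal, h_oral)]
residual = [1.0] + [0.0] * (N_SAMPLES - 1)     # impulse excitation, G = 1

k_hat = estimate_k(speech, residual, nasal_lp, 1250.0, FS_HZ)
print("true k:", K_TRUE, "estimated k:", round(k_hat, 4))
```

In this synthetic case the estimate differs from the true value only by the residual oral contribution \((1-k)\,O/N\) on the unit circle, which is small because the nasal pole lies close to the unit circle.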

Fig. 3. Box plot of ‘k’ for vowel /i/

Fig. 4. Box plot of ‘k’ for vowel /e/

Fig. 5. Box plot of ‘k’ for vowel /u/

From the box plots, it is observed that the value of ‘k’ for nasalised vowels is higher than that for their oral counterparts. It is also observed that the value of ‘k’ lies within the range 0 to 1, as desired. The box plots also show that the proposed feature has high discriminatory capability, which is further validated using F-ratios. The median values of ‘k’ are tabulated in Table 1. Note that the value of ‘k’ is non-zero even for oral vowels. Possible reasons are the approximations made in this study and vibrations of the velum during the utterance of oral vowels.

Table 1. Median values of ‘k’
Table 2. ANOVA values

F-ratios and p values of ‘k’ are obtained using one-way ANOVA (Analysis Of Variance) for the different vowels. ANOVA indicates whether different groups belong to the same distribution or come from different distributions. A small p value and a high F-ratio for two groups signify that the two groups come from different distributions. From Table 2 it is observed that the F-ratios are high and the p values are very low. This shows that oral vowels and nasalised vowels are highly discriminable using the values of ‘k’.
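For reference, the F-ratio of a one-way ANOVA can be computed as in the sketch below. The function name is hypothetical and the two groups are arbitrary placeholder numbers, not measured values of ‘k’. The p value additionally requires the F distribution, which the Python standard library does not provide; a statistics package such as `scipy.stats.f_oneway` returns both quantities.

```python
def one_way_f_ratio(groups):
    """Return the one-way ANOVA F-ratio for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((v - m) ** 2 for v in g)
                    for g, m in zip(groups, means))

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Placeholder data only (not values of 'k' from this study).
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
f_ratio = one_way_f_ratio([group_a, group_b])
print("F-ratio:", f_ratio)   # 13.5 for these placeholder numbers
```

A large F-ratio means that the spread between the group means is large relative to the spread inside each group, which is the sense in which the oral and nasalised groups are called discriminable.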

5 Conclusions

In this study, we have proposed a simple inverse filtering based technique to obtain a feature that accounts for the amount of nasalisation present in a vowel. Nasalized vowels differ from oral vowels in containing a greater amount of nasalisation, so this feature is useful for the detection of nasalised vowels. Statistical analysis has shown that the feature takes values that are well separated for nasalised and oral vowels. As the mathematical basis of this feature is the degree of nasalisation, it may also correlate well with perceptual scores.