An engineering model of the masking for the noise-robust speech recognition
Introduction
Although tremendous efforts have been made to build machines that recognize human speech, the task remains very difficult in real-world environments with background noise. Modeling the human auditory system is one of the successful approaches that have been proposed to solve this problem. The critical-band filterbank and mel-scale frequency sampling are the most widely used properties adopted from psychoacoustics.
One useful aspect that has seldom been utilized is the “masking” effect of the human hearing system. Masking has been investigated for decades and used to quantify the frequency selectivity of the auditory system, yet few approaches have exploited it for recognition tasks. The nature of masking, in which spectral components of high intensity suppress adjacent spectral components, helps recognition performance: low-intensity signals are usually unimportant for speech recognition and may be noise to be suppressed. In this paper, time–frequency masking is modeled with lateral inhibition and incorporated into the current auditory model of mel-frequency cepstral coefficients (MFCC), the most widely used speech features, and tested on an isolated-word recognition task. The proposed algorithm does not require extensive computation and yields much better recognition performance, especially in noisy environments.
Masking is defined as a process in which the audible threshold for one sound is raised by the presence of another (masking) sound. In frequency masking, signals are masked by a masking sound occurring at the same time. In temporal masking, signals can also be masked by a preceding sound, called forward masking, or even by a following sound, called backward masking. Frequency masking helps to discriminate signals from one another by enhancing spectral resolution; by suppressing adjacent signals in the spectral domain, it reinforces the signal of critical interest so that unimportant signals are filtered out. Forward masking, a short-term adaptation process of the auditory system, improves discrimination between signals by emphasizing time-dependent variations.
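Concretely, the placement of the two masking stages in an MFCC-style front end can be sketched as follows. This is a minimal illustration only: `freq_mask` and `temp_mask` stand for the frequency- and temporal-masking filters, and applying them to the mel bands and to the cepstra, respectively, is an assumption of this sketch rather than the paper's exact specification.

```python
import numpy as np

def dct_ii(x, n_ceps):
    """Type-II DCT along the last axis -- the decorrelating step of MFCC."""
    n = x.shape[-1]
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None]
                   * (np.arange(n) + 0.5)[None, :] / n)
    return x @ basis.T

def masked_mfcc(mel_energies, freq_mask, temp_mask, n_ceps=13):
    """Sketch of an MFCC front end with two masking stages inserted.
    `freq_mask` and `temp_mask` are callables implementing the frequency-
    and temporal-masking filters; their placement here is illustrative."""
    mel = freq_mask(mel_energies)              # frequency masking across mel bands
    logmel = np.log(np.maximum(mel, 1e-10))    # compressive log nonlinearity
    ceps = dct_ii(logmel, n_ceps)              # cepstral coefficients
    return temp_mask(ceps)                     # temporal masking in the feature domain
```

With pass-through masks the function reduces to a plain (matrix-form) MFCC computation, which makes it easy to verify each masking stage in isolation.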
Section snippets
Frequency masking with lateral inhibition model
The essence of masking is to reinforce the dominant signal components and to suppress the adjacent ones. To implement this concept in the current auditory model, lateral inhibition is introduced with a simple Mexican-hat convolution filter, as shown in Fig. 1(a). The sharp peak at the center reinforces very close stimuli, and the negative values in the neighborhood inhibit stimuli within its range.
To apply the inhibition filtering in the spectral domain, the blocked speech signals
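The lateral-inhibition filtering described above can be sketched as a one-dimensional Mexican-hat convolution across the mel bands of each frame. The kernel radius, the width parameter, and the clipping of suppressed (negative) outputs to zero are assumptions of this sketch, not parameters taken from the paper.

```python
import numpy as np

def mexican_hat(radius=2, sigma=1.0):
    """Discrete Mexican-hat (Laplacian-of-Gaussian) kernel: a sharp positive
    peak at the center with negative side lobes, as in Fig. 1(a)."""
    n = np.arange(-radius, radius + 1)
    return (1 - (n / sigma) ** 2) * np.exp(-(n ** 2) / (2 * sigma ** 2))

def frequency_masking(mel_spectrum, kernel):
    """Convolve each frame's mel-band energies with the inhibition kernel.
    Dominant bands are reinforced while neighboring bands are suppressed;
    negative results are clipped to zero, treating suppressed components
    as fully masked (an assumption of this sketch)."""
    out = np.array([np.convolve(frame, kernel, mode="same")
                    for frame in np.atleast_2d(mel_spectrum)])
    return np.maximum(out, 0.0)
```

A single dominant band passed through this filter keeps its energy, while the bands inside the negative side lobes are driven to zero, which is exactly the reinforce-and-suppress behavior the text describes.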
Temporal masking
The short-term adaptation and the temporal integration [4] are possible mechanisms of temporal masking. Many studies have modeled temporal masking as temporal integration of the auditory-nerve response [1], [5]. If we assume that temporal masking is due to temporal integration of the auditory-nerve response, it can be modeled as an integration equation where x(n) is the output signal before temporal masking, y(n) the
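Under the temporal-integration assumption, one common realization is a first-order leaky integrator applied along the frame axis. The recursion below and the value of `lam` are assumptions of this sketch; the paper's exact formulation (truncated in this snippet) may differ in detail.

```python
import numpy as np

def temporal_masking(x, lam=0.85):
    """Leaky (first-order recursive) integration of a feature sequence,
    a common model of temporal integration of the auditory-nerve response:
        y(n) = (1 - lam) * x(n) + lam * y(n - 1)
    `lam` controls how long a preceding sound keeps masking the current
    frame (both the recursion and lam=0.85 are sketch assumptions)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    prev = np.zeros(x.shape[1:])  # scalar samples or per-frame feature vectors
    for n in range(len(x)):
        prev = (1.0 - lam) * x[n] + lam * prev
        y[n] = prev
    return y
```

An impulse fed into this integrator decays geometrically, mimicking how a loud preceding sound gradually releases its forward-masking effect.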
Experimental results
The temporal masking using the integration model in the feature domain, as well as the frequency masking, is applied to an isolated-word recognition task. For the simulation, 50 Korean words spoken three times each by 13 male speakers were tested with a nearest-neighbor classifier for simplicity. Fig. 4 compares the false recognition rates of the baseline without any masking, with frequency masking only, and with both frequency and temporal masking. Frequency masking reduces the misclassification
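As a rough illustration of the evaluation setup, a minimal nearest-neighbor word classifier might look like the following. Equal-length feature sequences and a plain Euclidean frame distance are assumptions of this sketch; the paper's exact distance measure and any time alignment are not specified in the snippet above.

```python
import numpy as np

def nearest_neighbor_word(test_feat, templates, labels):
    """Label a test utterance with the word of its closest stored template.
    `test_feat` and each template are (n_frames, n_ceps) arrays of equal
    length; the distance is the mean per-frame Euclidean distance."""
    dists = [np.mean(np.linalg.norm(test_feat - t, axis=-1))
             for t in templates]
    return labels[int(np.argmin(dists))]
```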
Conclusions and discussions
In this paper, the psychoacoustic phenomenon of masking is modeled by simple convolutional filtering in both the spectral and time domains. The concepts of lateral inhibition in the spectral domain and unilateral inhibition in the time domain model frequency masking and temporal masking, respectively. The proposed model yields efficient features for speech recognition and provides much better performance than the popular MFCC features, especially in noisy environments.
Acknowledgements
This research was supported by the Korean Ministry of Science and Technology through the Brain Neuroinformatics Research Program.
References (5)
H. Hermansky, Should recognizers have ears?, Speech Commun. (1998)
T. Dau et al., A quantitative model of the “effective” signal processing in the auditory system. I. Model structure, J. Acoust. Soc. Am. (1996)