Neurocomputing, Volumes 52–54, June 2003, Pages 615–620

An engineering model of the masking for the noise-robust speech recognition

https://doi.org/10.1016/S0925-2312(02)00791-9

Abstract

The masking effect of human hearing is modeled by lateral and unilateral inhibition and tested on isolated word recognition tasks. Frequency masking suppresses unwanted signals close to the dominant signal of interest in the frequency domain, while temporal forward masking suppresses weak signals that follow dominant ones in time. The masking effect filters out unimportant signals, which may improve the performance of speech recognition systems. With parameters derived from psychological observations, the proposed model shows good agreement with psychoacoustic masking effects as well as superior recognition performance.

Introduction

Although tremendous efforts have been made to build machines that recognize human speech, the task remains very difficult in real-world environments with background noise. Modeling the human auditory system is one of the successful approaches proposed to address this problem. The critical-band filterbank and mel-scale frequency sampling are the most widely used properties borrowed from psychoacoustics.

One useful aspect that has seldom been utilized is the “masking” effect of the human hearing system. Masking has been investigated for ages and used to quantify the frequency selectivity of the auditory system. However, few approaches have exploited masking for recognition tasks. The nature of masking, in which spectral components of high intensity suppress adjacent spectral components, helps recognition performance: low-intensity signals are usually unimportant for speech recognition and may be noise to be suppressed. In this paper, time–frequency masking is modeled with lateral inhibition and incorporated into the current auditory model based on mel-frequency cepstral coefficients (MFCC), the most widely used speech features, and tested on an isolated word recognition task. The proposed algorithm does not require extensive computation and yields much better recognition performance, especially in noisy environments.

Masking is defined as a process in which the audible threshold for one sound is raised by the presence of another (masking) sound. In frequency masking, signals are masked by a masking sound occurring at the same time. In temporal masking, signals can also be masked by a preceding sound, called forward masking, or even by a following sound, called backward masking. Frequency masking helps to discriminate one signal from another by enhancing spectral resolution; by suppressing adjacent signals in the spectral domain, it reinforces the signal of critical interest so that unimportant signals are filtered out. Forward masking, a short-term adaptation process of the auditory system, improves discrimination between signals by emphasizing time-dependent variations.

Section snippets

Frequency masking with lateral inhibition model

The essence of masking is to reinforce the dominant signal components and to suppress the adjacent components. To implement this concept in the current auditory system, lateral inhibition is introduced with a simple Mexican-hat convolutional filter, as shown in Fig. 1(a). The sharp peak at the center reinforces very close stimuli, and the negative values in the neighborhood inhibit stimuli within that range.
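The lateral inhibition described above can be sketched as a one-dimensional convolution of the filterbank spectrum with a Mexican-hat-shaped kernel. This is a minimal illustration: the kernel width, peak, and inhibition values below are assumptions, not the exact filter of Fig. 1(a).

```python
import numpy as np

def mexican_hat_kernel(half_width=3, peak=1.0, inhibition=-0.2):
    """Illustrative Mexican-hat kernel: a sharp positive center that
    reinforces the dominant component, flanked by negative lobes that
    inhibit neighboring frequency bands. Values are assumptions."""
    k = np.full(2 * half_width + 1, inhibition)
    k[half_width] = peak
    return k

def frequency_mask(spectrum, kernel):
    """Convolve filterbank energies along the frequency axis; the
    symmetric kernel makes convolution equal to correlation here."""
    return np.convolve(spectrum, kernel, mode="same")

# Synthetic spectrum with one dominant band (index 2).
spectrum = np.array([1.0, 2.0, 8.0, 3.0, 1.0, 0.5, 0.2])
masked = frequency_mask(spectrum, mexican_hat_kernel())
```

After filtering, the dominant band remains dominant while its weaker neighbors are pushed down, mirroring the suppression behavior described in the text.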

To apply the inhibition filtering in spectral domain, the blocked speech signals

Temporal masking

Short-term adaptation and temporal integration [4] are possible mechanisms of temporal masking. Many studies have modeled temporal masking as the temporal integration of the response of the auditory nerve [1], [5]. If we assume that temporal masking is due to the temporal integration of the auditory nerve response, it can be modeled as
$$y(n) = x(n) + A\sum_{k\ge 1}\alpha^{-k}\,x(n-k) - B\sum_{k\ge 1}\beta^{-k}\,x(n-k),$$
where x(n) is the output signal before temporal masking and y(n) the signal after temporal masking.
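The integration model above can be sketched directly: each output sample adds an exponentially decaying sum of past inputs with weight A and subtracts a more slowly decaying sum with weight B. The parameter values and the truncation depth K below are assumptions for illustration, not the paper's fitted constants.

```python
import numpy as np

def temporal_mask(x, A=0.5, B=0.7, alpha=2.0, beta=1.5, K=10):
    """Sketch of y(n) = x(n) + A*sum(alpha^-k x(n-k)) - B*sum(beta^-k x(n-k)).
    The infinite sums are truncated at K past samples; with B/beta >
    A/alpha, recent strong inputs net-suppress the samples that follow
    them, which is the forward-masking behavior."""
    y = np.array(x, dtype=float)
    for n in range(len(x)):
        for k in range(1, min(K, n) + 1):
            y[n] += (A * alpha ** (-k) - B * beta ** (-k)) * x[n - k]
    return y

# A weak sample right after a strong onset gets suppressed.
x = np.array([0.0, 0.0, 1.0, 0.3, 0.3])
y = temporal_mask(x)
```

With these illustrative constants the net weight at lag k = 1 is negative (0.5/2 − 0.7/1.5 < 0), so the weak sample following the strong one is attenuated, as forward masking predicts.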

Experimental results

Temporal masking using the integration model in the feature domain, as well as frequency masking, is applied to the isolated word recognition task. For the simulation, 50 Korean words spoken three times each by 13 men were tested with a nearest-neighbor classifier for simplicity. Fig. 4 compares the false recognition rates of the baseline without any masking, with frequency masking only, and with both frequency and temporal masking. Frequency masking reduces the misclassification
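The nearest-neighbor classification used in the experiment can be sketched as follows. This is a simplified stand-in with synthetic fixed-length feature vectors and Euclidean distance; the paper's actual features are frame sequences, so real comparisons would need alignment (e.g. dynamic time warping).

```python
import numpy as np

def nearest_neighbor_classify(query, templates, labels):
    """1-NN classifier: return the label of the stored template whose
    feature vector is closest (Euclidean distance) to the query."""
    dists = [np.linalg.norm(query - t) for t in templates]
    return labels[int(np.argmin(dists))]

# Synthetic word templates (hypothetical labels, not the 50-word set).
templates = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
labels = ["word_a", "word_b"]
result = nearest_neighbor_classify(np.array([0.1, 0.9]), templates, labels)
```

A query near the first template is assigned its label; with 13 speakers, one would typically hold out the test speaker's own templates.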

Conclusions and discussions

In this paper, the psychoacoustic phenomenon of masking is modeled by simple convolutional filtering in both the spectral and time domains. The concepts of lateral inhibition in the spectral domain and unilateral inhibition in the time domain model frequency masking and temporal masking, respectively. The proposed model yields efficient features for speech recognition and provides much better performance than the popular MFCC features, especially in noisy environments.

Acknowledgements

This research was supported by the Korean Ministry of Science and Technology under the Brain Neuroinformatics Research Program.

References (5)

  • H. Hermansky, Should recognizers have ears?, Speech Commun. (1998)
  • T. Dau et al., A quantitative model of the “effective” signal processing in the auditory system. I. Model structure, J. Acoust. Soc. Am. (1996)

There are more references available in the full text version of this article.
