Perceptual speech enhancement exploiting temporal masking properties of human auditory system
Introduction
The problem of enhancing speech degraded by noise remains largely open, even though many significant techniques have been introduced over the past decades. This problem is severe when no independent information on the nature of noise degradation is available, in which case the enhancement technique must utilise only the specific properties of given speech and noise signals.
In this paper, we focus on single-channel speech enhancement. This is the most difficult setting, since the noise and the speech are in the same channel. Many approaches have been reported in the literature (Lim and Oppenheim, 1979; Vaseghi, 2000). The most popular method, with many variants, is spectral subtraction. Although this method reduces the noise and improves the signal-to-noise ratio (SNR), it tends to introduce speech distortion and a perceptually annoying residual noise usually called musical noise. Musical noise consists of short sinusoids (tones) randomly distributed over time and frequency. It arises from imperfections in the original spectral subtraction technique and from statistical inaccuracy in estimating the noise magnitude spectrum.
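To make the origin of musical noise concrete, the following is a minimal sketch of magnitude spectral subtraction for a single frame. It is an illustration only, not the method evaluated in this paper: the function name `spectral_subtract`, the `floor` parameter and the toy magnitude values are all assumptions introduced here.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.0):
    """Magnitude spectral subtraction for one frame (illustrative sketch).

    noisy_mag: magnitudes |Y(k)| of the noisy frame, one value per frequency bin
    noise_mag: estimated noise magnitudes |N(k)| for the same bins
    floor:     fraction of |Y(k)| kept as a lower bound (hypothetical parameter)
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for y, n in zip(noisy_mag, noise_mag):
        # Subtract the noise estimate and clamp from below, so that the
        # result never falls under floor * |Y(k)| (zero when floor = 0).
        enhanced.append(max(y - n, floor * y))
    return enhanced


# Toy noise-only frame: the instantaneous noise fluctuates around the
# long-term average estimate of 1.0 (values chosen by hand for illustration).
noise_only_frame = [0.7, 1.6, 0.9, 0.8, 1.5, 0.6]
noise_estimate = [1.0] * len(noise_only_frame)

residual = spectral_subtract(noise_only_frame, noise_estimate)
print(residual)
```

In this toy frame only the two bins whose instantaneous noise exceeds the average estimate survive the subtraction. Isolated spectral peaks of this kind, appearing in different bins from frame to frame, are what is perceived as musical noise.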
In order to reduce musical noise, various algorithms have been developed. Some recent noise reduction techniques have exploited the known properties of the human auditory system and have resulted in good speech quality with improved intelligibility and reduced levels of musical noise (Gunawan and Ambikairajah, 2004, 2006a; Gustafsson et al., 1998; Hu and Loizou, 2003; Lin et al., 2003; Ma et al., 2006; Tsoukalas et al., 1997; Virag, 1999). Psychoacoustic exploitation of this sort has so far utilised simultaneous masking only; temporal masking properties have not been exploited.
The human auditory system acts as an analysis filter bank with a perceptually relevant frequency resolution (such as the critical band scale or ERB scale). An appropriate choice for speech denoising, therefore, is just such an auditory frequency scale, instead of the uniform filter bank analysis of the Short Time Fourier Transform (STFT).
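As a rough illustration of what an auditory frequency scale looks like in practice, the sketch below places filter centre frequencies at equal steps on the ERB-rate scale. It assumes the Glasberg and Moore ERB formulas, which are not given in this text; the function names, the 100 Hz to 4 kHz range and the choice of 8 bands are hypothetical and are not the filter bank design used in this paper.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate (number of ERBs below f_hz), Glasberg and Moore formula (assumed)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def erb_spaced_centres(f_low, f_high, n_bands):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    if n_bands < 2:
        raise ValueError("need at least two bands")
    e_low = hz_to_erb_rate(f_low)
    e_high = hz_to_erb_rate(f_high)
    step = (e_high - e_low) / (n_bands - 1)
    return [erb_rate_to_hz(e_low + i * step) for i in range(n_bands)]


# Hypothetical example: 8 bands between 100 Hz and 4 kHz.
centres = erb_spaced_centres(100.0, 4000.0, 8)
for fc in centres:
    print(f"centre {fc:8.1f} Hz, ERB {erb_bandwidth(fc):6.1f} Hz")
```

The centres are densely packed at low frequencies and spread out at high frequencies, in contrast to the uniform bin spacing of the STFT.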
The objective of this paper is to develop a novel speech enhancement algorithm exploiting temporal masking properties in very noisy conditions (SNR < 10 dB). The rest of the paper is organised as follows: a review of speech enhancement techniques is given in Section 2; two forward masking models for speech enhancement applications are outlined in Section 3; a novel speech enhancement method exploiting temporal masking is presented in Section 4; Section 5 describes the effect of noisy conditions on the calculation of simultaneous and temporal masking thresholds and the performance evaluation of speech enhancement techniques based on masking properties. Finally, Section 6 summarises this paper.
Section snippets
Review of single channel speech enhancement techniques
For single-channel applications, only a single microphone is available. Enhancement is very difficult in this setting, since noise and speech are in the same channel and the noise must be estimated from the noisy speech itself. This discussion focuses on methods based on the assumptions that only one input channel is available, that the noise is additive, and that the noise and speech signals are uncorrelated.
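The additive, uncorrelated-noise assumption can be checked numerically with a short sketch. Writing the noisy signal as y = s + n, the mean power of y equals the power of s plus the power of n plus a cross term that is close to zero when the two signals are uncorrelated. The Gaussian stand-ins for speech and noise, the sample count and the variable names below are assumptions made purely for illustration.

```python
import random


def mean_power(x):
    """Average of the squared samples."""
    return sum(v * v for v in x) / len(x)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 20000

# Independent Gaussian sequences as stand-ins for speech and noise.
speech = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
noise = [rng.gauss(0.0, 0.5) for _ in range(n_samples)]
noisy = [s + n for s, n in zip(speech, noise)]  # additive noise model

# Cross term 2 * mean(s * n): close to zero when s and n are uncorrelated.
cross_term = 2.0 * sum(s * n for s, n in zip(speech, noise)) / n_samples

print("noisy power          :", mean_power(noisy))
print("speech + noise power :", mean_power(speech) + mean_power(noise))
print("cross term           :", cross_term)
```

Because the cross term is small, the noisy power is approximately the sum of the speech and noise powers, which is what allows a noise estimate to be subtracted from the noisy spectrum.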
The majority of single-channel enhancement techniques use the spectral weighting approach (Berouti et al., 1979,
Temporal masking models
Temporal masking is a time domain phenomenon in which two stimuli occur within a small interval of time, and plays an important role in human auditory perception. Forward temporal masking occurs when a masker precedes the signal (or maskee) in time, while backward masking occurs when the masker follows the signal in time. Forward masking is the more important effect since the duration of the masking effect can be much longer, depending on the duration of the masker. The forward masking
Proposed speech enhancement algorithm exploiting temporal masking
In this section, a novel speech enhancement algorithm that incorporates temporal masking is presented. The block diagram of the proposed algorithm is shown in Fig. 3. The analysis and synthesis filter bank used is also described in more detail.
Performance evaluation
In this section, the performance of the proposed speech enhancement algorithm is presented. First, the calculations of the simultaneous and temporal masking thresholds in noisy conditions are compared, to determine the susceptibility of both masking thresholds to corruption by noise. Then the objective evaluation using PESQ and the subjective evaluation conforming to ITU-T P.835 (ITU, 2003) are described.
In order to assess the performance of the new forward masking model in enhancing speech signals a
Conclusions
A new speech enhancement algorithm based on a short-term temporal masking threshold to noise ratio (MNR) has been presented in this paper. In the algorithm development phase, our proposed algorithm was compared with three other speech enhancement methods over six different noise types and three SNRs. PESQ results revealed that the proposed algorithm outperforms the other algorithms by 6–20% depending on the SNR. In the particularly demanding 0 dB SNR condition, the new technique achieves at
References (55)
- et al. Experiments with a Nonlinear spectral subtraction (NSS), Hidden Markov models and projection, for robust recognition in cars. Speech Comm. (1992)
- Noise suppression by spectral magnitude estimation – mechanism and theoretical limits. Signal Process. (1985)
- Ambikairajah, E., Tattersall, G.D., Davis, A., 1998. Wavelet transform based speech enhancement. In: Internat. Conf. on...
- Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: Internat. Conf. on...
- Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. (1979)
- Enhancement of noisy speech signals: application to mobile radio communications. Speech Comm. (1996)
- Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. (1994)
- et al. Perceptual speech coding and enhancement using frame-synchronized fast wavelet packet transform algorithms. IEEE Trans. Signal Process. (1999)
- Speech enhancement using a noncausal a priori SNR estimator. IEEE Signal Process. Lett. (2004)
- EBU, 1988. Sound Quality Assessment Material Recordings for Subjective Tests. European Broadcasting...
- Statistical-model-based speech enhancement systems. Proc. IEEE
- Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
- A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process.
- Maximum likelihood noise cancellation using the EM algorithm. IEEE Trans. Acoust. Speech Signal Process.
- Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking. J. Acoust. Soc. Amer.
- Speech enhancement using resonator filterbanks. Proc. Internat. Conf. Acoust. Speech Signal Process.
- A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics. Internat. Conf. Acoust. Speech Signal Process.
- Speech Enhancement
- A perceptually motivated approach for speech enhancement. IEEE Trans. Speech Audio Process.
- Incorporating a psychoacoustical model in frequency domain speech enhancement. IEEE Signal Process. Lett.
- Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process.
- Noise suppression using a time-varying, analysis/synthesis gammachirp filterbank. Proc. Internat. Conf. Acoust. Speech Signal Process.