Perceptual speech enhancement exploiting temporal masking properties of human auditory system

https://doi.org/10.1016/j.specom.2009.12.006

Abstract

The use of simultaneous masking in speech enhancement has shown promise for a range of noise types. In this paper, a new speech enhancement algorithm based on a short-term temporal masking threshold to noise ratio (MNR) is presented. A novel functional model for forward masking based on three parameters is incorporated into a speech enhancement framework based on speech boosting. The performance of the speech enhancement algorithm using the proposed forward masking model was compared with seven other speech enhancement methods over 12 different noise types and four SNRs. Objective evaluation using PESQ showed that, with the proposed forward masking model, the speech enhancement algorithm outperforms the other algorithms by 6–20%, depending on the SNR. Subjective evaluation with 16 listeners confirmed the objective test results.

Introduction

The problem of enhancing speech degraded by noise remains largely open, even though many significant techniques have been introduced over the past decades. This problem is severe when no independent information on the nature of noise degradation is available, in which case the enhancement technique must utilise only the specific properties of given speech and noise signals.

In this paper, we focus on single channel speech enhancement. This is the most difficult task, since the noise and the speech are in the same channel. Many approaches have been reported in the literature (Lim and Oppenheim, 1979; Vaseghi, 2000). The most popular method, with many variants, is spectral subtraction. Although this method reduces the noise and improves the signal-to-noise ratio (SNR), it tends to introduce speech distortion and a perceptually annoying residual noise, usually called musical noise. Musical noise consists of short sinusoids (tones) randomly distributed over time and frequency. It arises from imperfections in the original spectral subtraction technique and from statistical inaccuracy in the estimate of the noise magnitude spectrum.
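The origin of musical noise can be illustrated with a minimal sketch of magnitude spectral subtraction applied to a single frame. This is a generic textbook form, not the implementation evaluated in the paper; the function name `spectral_subtract`, the `floor` parameter, and the synthetic noise frame are assumptions made for illustration only.

```python
import random

def spectral_subtract(noisy_mag, noise_est, floor=0.0):
    """Subtract a noise magnitude estimate from one frame's magnitude spectrum.

    Negative differences are clipped to floor * noisy magnitude
    (plain half-wave rectification when floor == 0).
    """
    return [max(y - d, floor * y) for y, d in zip(noisy_mag, noise_est)]

# Hypothetical noise-only frame: the true magnitudes fluctuate around 1.0,
# while the estimate is the long-term average (1.0 in every bin).
rng = random.Random(0)
num_bins = 64
noise_est = [1.0] * num_bins
noise_frame = [rng.uniform(0.0, 2.0) for _ in range(num_bins)]

residual = spectral_subtract(noise_frame, noise_est)

# Bins where the instantaneous noise exceeds the average survive as
# isolated spectral peaks; across frames these are heard as random tones.
surviving = sum(1 for r in residual if r > 0.0)
print(f"{surviving} of {num_bins} noise-only bins survive subtraction")
```

Because the estimate only matches the noise on average, some bins are left with a positive residual in every frame, and their positions change from frame to frame. Raising `floor` above zero is one common way to mask these peaks, at the cost of leaving more broadband noise in the output.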

In order to reduce musical noise, various algorithms have been developed. Some recent noise reduction techniques have exploited the known properties of the human auditory system and have resulted in good speech quality with improved intelligibility and reduced levels of musical noise (Gunawan and Ambikairajah, 2004; Gunawan and Ambikairajah, 2006a; Gustafsson et al., 1998; Hu and Loizou, 2003; Lin et al., 2003; Ma et al., 2006; Tsoukalas et al., 1997; Virag, 1999). Psychoacoustic exploitation of this sort has so far utilised simultaneous masking only; temporal masking properties have not been exploited.

The human auditory system acts as an analysis filter bank with a perceptually relevant frequency resolution (such as the critical band scale or ERB scale). An appropriate choice for speech denoising, therefore, is just such an auditory frequency scale, instead of the uniform filter bank analysis of the Short Time Fourier Transform (STFT).
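To make the difference between a uniform and an auditory frequency scale concrete, the sketch below uses two widely quoted approximations: the Zwicker and Terhardt formula for the critical band (Bark) scale and the Glasberg and Moore formulas for the ERB scale. Which approximation the paper itself uses is not stated in this excerpt, so the function names and the choice of formulas are assumptions.

```python
import math

def hz_to_bark(f_hz):
    """Critical band rate in Bark (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def erb_bandwidth_hz(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_rate(f_hz):
    """Position on the ERB-rate scale (number of ERBs below f_hz)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

for f in (100, 500, 1000, 4000):
    print(f"{f:5d} Hz  Bark={hz_to_bark(f):5.2f}  "
          f"ERB={erb_bandwidth_hz(f):6.1f} Hz  ERB-rate={hz_to_erb_rate(f):5.2f}")
```

On both scales the analysis bandwidth grows with centre frequency, whereas every STFT bin has the same width. An auditory filter bank therefore resolves low frequencies more finely and high frequencies more coarsely than a uniform one.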

The objective of this paper is to develop a novel speech enhancement algorithm exploiting temporal masking properties in very noisy conditions (SNR < 10 dB). The rest of the paper is organised as follows: a review of speech enhancement techniques is given in Section 2; two forward masking models for speech enhancement applications are outlined in Section 3; a novel speech enhancement method exploiting temporal masking is presented in Section 4; Section 5 describes the effect of noisy conditions on the calculation of simultaneous and temporal masking thresholds and the performance evaluation of speech enhancement techniques based on masking properties. Finally, Section 6 summarises this paper.

Section snippets

Review of single channel speech enhancement techniques

For single-channel applications, only a single microphone is available. This is a very difficult task since noise and speech are in the same channel, and noise needs to be estimated from the noisy speech. This discussion will focus on methods based on the assumption that only one input channel is available, the noise is additive, and the noise and speech signals are uncorrelated.
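The three assumptions above can be written out as follows. The symbols are chosen here for illustration and are not taken from the paper: y(n) is the noisy signal, s(n) the clean speech, d(n) the noise, and Y(k), S(k), D(k) their short-time spectra in frequency bin k. The last line additionally assumes that the noise is zero-mean.

```latex
\begin{align}
  y(n) &= s(n) + d(n) \\
  Y(k) &= S(k) + D(k) \\
  E\{|Y(k)|^2\} &= E\{|S(k)|^2\} + E\{|D(k)|^2\} + 2\,\mathrm{Re}\,E\{S(k)\,D^{*}(k)\} \\
                &= E\{|S(k)|^2\} + E\{|D(k)|^2\}
\end{align}
```

The cross term vanishes only in expectation. Within a single short frame it is generally non-zero, which is one reason a noise estimate obtained from the noisy speech never matches the instantaneous noise exactly.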

The majority of single-channel enhancement techniques use the spectral weighting approach (Berouti et al., 1979,

Temporal masking models

Temporal masking is a time-domain phenomenon that arises when two stimuli occur within a short interval of time, and it plays an important role in human auditory perception. Forward temporal masking occurs when a masker precedes the signal (or maskee) in time, while backward masking occurs when the masker follows the signal in time. Forward masking is the more important effect since the duration of the masking effect can be much longer, depending on the duration of the masker. The forward masking
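As a rough illustration of how forward masking decays after the masker ends, the sketch below uses a three-parameter form often seen in the psychoacoustics literature, in which the amount of masking falls linearly with the logarithm of the delay and grows linearly with masker level. It is not claimed to be the model proposed in the paper, and the parameter values `a`, `b` and `c` are hypothetical placeholders rather than fitted values.

```python
import math

def forward_masking_db(masker_level_db, delay_ms, a=0.2, b=2.3, c=20.0):
    """Amount of forward masking in dB: a * (b - log10(delay)) * (level - c).

    a, b and c are hypothetical illustration values. Each factor is clipped
    at zero so that long delays or weak maskers give no masking.
    """
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    time_term = max(b - math.log10(delay_ms), 0.0)   # decays with log delay
    level_term = max(masker_level_db - c, 0.0)       # grows with masker level
    return a * time_term * level_term

for delay in (5, 10, 50, 100, 200):
    print(f"delay {delay:3d} ms -> {forward_masking_db(80.0, delay):5.2f} dB of masking")
```

With these placeholder values, an 80 dB masker gives progressively less masking as the delay grows and none beyond roughly 200 ms (where log10 of the delay reaches `b`), which is the qualitative behaviour described above.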

Proposed speech enhancement algorithm exploiting temporal masking

In this section, a novel speech enhancement algorithm that incorporates temporal masking is presented. The block diagram of the proposed algorithm is shown in Fig. 3. Moreover, the analysis and synthesis filter bank used is described in more detail.

Performance evaluation

In this section, the performance of the proposed speech enhancement algorithm is presented. First, the simultaneous and temporal masking thresholds calculated in noisy conditions are compared, to determine how susceptible each threshold is to corruption by noise. The objective evaluation using PESQ and the subjective evaluation conforming to ITU-T P.835 (ITU, 2003) are then described.

In order to assess the performance of the new forward masking model in enhancing speech signals a

Conclusions

A new speech enhancement algorithm based on a short-term temporal masking threshold to noise ratio (MNR) has been presented in this paper. In the algorithm development phase, our proposed algorithm was compared with three other speech enhancement methods over six different noise types and three SNRs. PESQ results revealed that the proposed algorithm outperforms the other algorithms by 6–20% depending on the SNR. In the particularly demanding 0 dB SNR condition, the new technique achieves at

References (55)

  • Lockwood, P., et al., 1992. Experiments with a Nonlinear spectral subtraction (NSS), Hidden Markov models and projection, for robust recognition in cars. Speech Comm.
  • Vary, P., 1985. Noise suppression by spectral magnitude estimation – mechanism and theoretical limits. Signal Process.
  • Ambikairajah, E., Tattersall, G.D., Davis, A., 1998. Wavelet transform based speech enhancement. In: Internat. Conf. on...
  • Berouti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: Internat. Conf. on...
  • Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
  • Bouquin, R.L., 1996. Enhancement of noisy speech signals: application to mobile radio communications. Speech Comm.
  • Cappe, O., 1994. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process.
  • Carnero, B., et al., 1999. Perceptual speech coding and enhancement using frame-synchronized fast wavelet packet transform algorithms. IEEE Trans. Signal Process.
  • Cohen, I., 2004. Speech enhancement using a noncausal a priori SNR estimator. IEEE Signal Process. Lett.
  • EBU, 1988. Sound Quality Assessment Material Recordings for Subjective Tests. European Broadcasting...
  • Ephraim, Y., 1992. Statistical-model-based speech enhancement systems. Proc. IEEE
  • Ephraim, Y., et al., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process.
  • Ephraim, Y., et al., 1995. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process.
  • Feder, M., et al., 1989. Maximum likelihood noise cancellation using the EM algorithm. IEEE Trans. Acoust. Speech Signal Process.
  • Florentine, M., et al., 1988. Temporal integration in normal hearing, cochlear impairment, and impairment simulated by masking. J. Acoust. Soc. Amer.
  • Gagnon, L., et al., 1991. Speech enhancement using resonator filterbanks. Proc. Internat. Conf. Acoust. Speech Signal Process.
  • Gunawan, T.S., Ambikairajah, E., 2004. Speech enhancement using temporal masking and fractional bark gammatone filters....
  • Gunawan, T.S., Ambikairajah, E., 2006a. Subjective evaluation of speech enhancement algorithms using ITU-T P.835...
  • Gunawan, T.S., Ambikairajah, E., 2006b. A new forward masking model for speech enhancement. In: IEEE Internat. Conf. on...
  • Gustafsson, S., et al., 1998. A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics. Internat. Conf. Acoust. Speech Signal Process.
  • Hansen, J.H.L., 1999. Speech Enhancement
  • Hirsch, H., Pearce, D., 2000. The AURORA experimental framework for the performance evaluations of speech recognition...
  • Hu, Y., et al., 2003. A perceptually motivated approach for speech enhancement. IEEE Trans. Speech Audio Process.
  • Hu, Y., et al., 2004. Incorporating a psychoacoustical model in frequency domain speech enhancement. IEEE Signal Process. Lett.
  • Hu, Y., et al., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process.
  • Irino, T., 1999. Noise suppression using a time-varying, analysis/synthesis gammachirp filterbank. Proc. Internat. Conf. Acoust. Speech Signal Process.
  • ITU, 1996. ITU-T P.830, Subjective Performance Assessment of Telephone-band and Wideband Digital Codecs. International...