Elsevier

Speech Communication

Volume 48, Issue 1, January 2006, Pages 57-70

Masking-based β-order MMSE speech enhancement

https://doi.org/10.1016/j.specom.2005.05.012

Abstract

This paper considers an effective approach for attenuating acoustic noise and mitigating its effect in a speech signal. In this approach, the human auditory masking effect is incorporated into an adaptive β-order minimum mean-square error (MMSE) speech enhancement algorithm. The relationship between the value of β and the noise-masking threshold is introduced and analyzed. The algorithm is based on a criterion by which inaudible noise may be masked rather than suppressed, which reduces the chance of distortion being introduced into the speech by the enhancement process. To obtain an optimal estimate of the masking threshold, a modified way of measuring the relative threshold offset is described. The performance of the proposed masking-based β-order MMSE method has been evaluated through objective speech distortion measurement, spectrogram inspection and subjective listening tests. The results show that the proposed method achieves greater noise reduction and better spectral estimation than both the conventional adaptive β-order MMSE method and the conventional over-subtraction noise-masking method.

Introduction

Weak speech signals such as nasals, fricatives and affricates are often seriously contaminated by noise, resulting in reduced speech intelligibility. So far, speech enhancement research has mainly aimed to solve the problem in which the speech signal is degraded by uncorrelated additive noise and only the noisy speech signal is available. Many approaches in the time/frequency domain have been investigated to date. In terms of the methodology adopted, the most popular methods for speech enhancement can be broadly categorized as (i) spectral amplitude estimation such as Wiener filtering (Lim and Oppenheim, 1979), spectral subtraction (McAulay and Malpass, 1980), Ephraim and Malah (E-M) MMSE (Ephraim and Malah, 1984) and log spectral amplitude (LSA) estimation (Ephraim and Malah, 1985); (ii) speech production model-based methods (Gannot et al., 1998); (iii) hearing perceptual criteria-based enhancement (Virag, 1999; You et al., 2004b; Hansen and Nandkumar, 1995; Tsoukalas et al., 1997); (iv) text-directed non-real-time speech enhancement (Hansen and Pellom, 1997); (v) the hidden Markov model (HMM) method (Ephraim et al., 1989); and (vi) the eigen decomposition subspace method (Ephraim and Van Trees, 1995).

One of the main approaches of speech enhancement algorithms is to obtain the best possible estimate of the short-time spectral amplitude (STSA) of a speech signal from the given noisy speech. Most STSA estimators do not estimate the phase of the speech signal, as it has been well demonstrated that the human ear is insensitive to it (Lim and Wang, 1982; Vary, 1985). Many existing speech enhancement methods exploit the properties of the human auditory system. The main aim of these methods is to find an optimal trade-off between noise suppression, speech distortion and residual tonal noise level (Virag, 1999; Hansen and Nandkumar, 1995; Tsoukalas et al., 1993). In addition, most of them use the a posteriori SNR to achieve noise suppression. It is assumed that human listeners are unable to perceive additive noise so long as it remains below the masking threshold. In (Virag, 1999; Tsoukalas et al., 1997), masking properties are incorporated into generalized spectral subtraction, which can be expressed as $\hat{S}_k = [\,|X_k|^{\beta} - E\{|N_k|^{\beta}\}\,]^{1/\beta}$ for some constant β, where k is the frequency bin index and $S_k$, $X_k$ and $N_k$ are the Fourier transforms of a windowed segment of speech, noisy speech and noise, respectively. In essence, the generalized spectral subtraction method hinges on estimating the value of a spectral amplitude/power term from its expected value. For example, when β = 1, it is an amplitude spectral subtraction which directly uses $E\{|N_k|\}$ to replace $|N_k|$; when β = 2, it is a power spectral subtraction which not only directly uses $E\{|N_k|^2\}$ to replace $|N_k|^2$, but also uses $E\{S_k N_k^{*}\}$ and $E\{N_k S_k^{*}\}$ to replace $S_k N_k^{*}$ and $N_k S_k^{*}$. The expectations of $S_k N_k^{*}$ and $N_k S_k^{*}$ are equal to zero due to the statistical independence and zero-mean assumptions (Lim and Oppenheim, 1979, p. 11), where $N_k^{*}$ and $S_k^{*}$ represent the complex conjugates of $N_k$ and $S_k$. For the β = 2 case, it is also equivalent to estimating the square root of the maximum likelihood estimator of each signal spectral component variance based on a complex Gaussian model (McAulay and Malpass, 1980, p. 138). Consequently, an optimal estimate of a speech signal cannot be obtained using the mathematical model of the generalized spectral subtraction method, and the method does not lead to an improvement in the intelligibility of the processed speech (Lim, 1978). Therefore, the generalized spectral subtraction approach may be useful for those applications where perception of noise reduction, without any significant drop in speech intelligibility, is desired (Lim, 1978, p. 472). The presence of obvious and annoying musical tones in the processed speech, caused by the imperfect model of generalized spectral subtraction, is yet another of its drawbacks.
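The generalized spectral subtraction rule described above can be illustrated with a minimal sketch. It assumes the noisy amplitudes and the expected noise terms $E\{|N_k|^{\beta}\}$ are already available per frequency bin. The function name, the argument names and the clamping of negative differences to a floor value are illustrative assumptions, not part of the cited methods.

```python
def generalized_spectral_subtraction(noisy_mag, noise_mag_beta, beta=2.0, floor=0.0):
    """Estimate speech spectral amplitudes bin by bin.

    noisy_mag      : |X_k| for each frequency bin k
    noise_mag_beta : estimate of E{|N_k|^beta} for each bin k
    beta           : 1.0 gives amplitude subtraction, 2.0 power subtraction
    floor          : value used when the difference is negative
                     (assumption: simple clamping, added for illustration)
    """
    estimate = []
    for x, n_beta in zip(noisy_mag, noise_mag_beta):
        diff = x ** beta - n_beta
        # The subtraction can go negative when the noise term is
        # overestimated, so clamp before taking the 1/beta root.
        estimate.append(max(diff, floor) ** (1.0 / beta))
    return estimate


# Power subtraction (beta = 2): sqrt(5^2 - 9) = 4
print(generalized_spectral_subtraction([5.0], [9.0], beta=2.0))
# Amplitude subtraction (beta = 1): 5 - 2 = 3
print(generalized_spectral_subtraction([5.0], [2.0], beta=1.0))
```

Bins where the clamp is active in one frame but not in the next produce isolated spectral peaks, which is one way to picture the musical tones mentioned above.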

In contrast to many masking-based speech enhancement methods which are based on generalized spectral subtraction, we propose an enhancement method that incorporates the masking properties into β-order MMSE. The β-order MMSE speech enhancement method (You et al., 2003; You et al., 2005) is an optimal estimation method. It is derived by minimizing the mean-square error cost function $J = E\{(A_k^{\beta} - \hat{A}_k^{\beta})^2\}$ based on the complex Gaussian distribution model and the statistical independence assumption. Herein, $A_k$ and $\hat{A}_k$ are respectively the original and the estimated spectral amplitude of the speech signal at frequency bin k. E-M MMSE and E-M LSA can be seen as special cases of β-order MMSE, corresponding to β equal to one and β approaching zero, respectively. In (Cappé, 1994), the elimination of the musical noise phenomenon in the E-M MMSE method is analyzed; it is shown that the E-M MMSE noise suppressor is effective if a nonlinear smoothing procedure is used to obtain more consistent estimates of the a priori and a posteriori SNRs, which are used to control the gain function. The principle of musical noise elimination in (Cappé, 1994) can also be applied to adaptive β-order MMSE (You et al., 2005).
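As a rough illustration of the estimator just described, the sketch below evaluates a β-order MMSE gain for a single frequency bin. The closed form is not given in this excerpt. It is a reconstruction from the stated complex Gaussian and independence assumptions: minimizing J gives $\hat{A}_k = (E\{A_k^{\beta} \mid X_k\})^{1/\beta}$, and the resulting gain is $G = (\sqrt{v}/\gamma)\,[\Gamma(\beta/2+1)\,M(-\beta/2;1;-v)]^{1/\beta}$ with $v = \gamma\,\xi/(1+\xi)$, where ξ and γ denote the a priori and a posteriori SNRs and M is the confluent hypergeometric function. The function names `kummer_m` and `beta_order_mmse_gain` are hypothetical, and the series evaluation is a simple stand-in chosen to keep the example dependency-free.

```python
import math


def kummer_m(a, b, z, max_terms=5000, tol=1e-15):
    """Confluent hypergeometric function M(a; b; z) from its power series.

    For negative z the Kummer transformation
        M(a; b; z) = exp(z) * M(b - a; b; -z)
    is applied first. For the arguments used below this makes every
    series term positive, which avoids cancellation. Intended for
    moderate |z| only (up to a few hundred) because of exp() range.
    """
    if z < 0:
        return math.exp(z) * kummer_m(b - a, b, -z, max_terms, tol)
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total


def beta_order_mmse_gain(xi, gamma_post, beta):
    """Gain applied to the noisy amplitude |X_k| (assumed closed form).

    xi         : a priori SNR (> 0)
    gamma_post : a posteriori SNR (> 0)
    beta       : order of the estimator (> 0 in this sketch)
    """
    v = xi / (1.0 + xi) * gamma_post
    # Conditional moment E{A^beta | X}, normalized by its scale factor.
    moment = math.gamma(beta / 2.0 + 1.0) * kummer_m(-beta / 2.0, 1.0, -v)
    return math.sqrt(v) / gamma_post * moment ** (1.0 / beta)


# Compare a few orders at the same SNR pair.
for beta in (0.1, 1.0, 2.0):
    print(beta, beta_order_mmse_gain(xi=1.0, gamma_post=8.0, beta=beta))
```

Under these assumptions, β = 1 corresponds to the E-M MMSE gain and a small positive β approximates the E-M LSA behaviour. At high SNR the gain approaches the Wiener gain ξ/(1 + ξ).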

In this paper, β-order MMSE is modified so as to incorporate the noise-masking threshold (You et al., 2004a). Specifically, the value of β is made to vary according to the frame SNR and masking threshold. Simulation results indicate that the masking-based β-order MMSE estimator outperforms many existing spectral suppression methods in terms of both objective and subjective measures. The remainder of this paper is organized as follows. The masking-based β-order MMSE speech enhancement method is introduced in Section 2. The performance of the masking-based β-order MMSE estimation is investigated in Section 3. Section 4 gives the conclusion.

Section snippets

Masking-based β-order MMSE speech enhancement

Human auditory modelling has been widely used in acoustic signal processing, especially in audio and speech coding. This model is based on the masking phenomenon and related to the concept of critical band analysis, which is a central analysis mechanism of the inner ear. The masking properties are modelled by the noise-masking threshold. Masking is present because the auditory system is incapable of distinguishing two signals which are close to one another in spectral spacing. Masking effects

Performance evaluation

To evaluate the performance of the proposed masking-based β-order MMSE speech enhancement method, five different types of noise taken from the NOISEX-92 database (Varga and Steeneken, 1993) are used in our simulation experiments. They are white Gaussian noise, interior Volvo car noise, F16 cockpit noise, Babble noise (100 people speaking in a canteen) and Leopard (military vehicle) noise. A total of 30 phonetically balanced speech utterances from the TIMIT database (Garofolo, 1988) are used in

Conclusion

The focus of our study is to develop an optimal speech enhancement algorithm that would maximize noise reduction while minimizing speech distortion. In this paper, we propose an adaptive β-order STSA-MMSE speech enhancement method which incorporates the perceptual properties of the human auditory system. The proposed method leads to an improvement in performance over the conventional adaptive β-order MMSE method. The improvement is achieved due to the effectiveness of adapting the β value

References (27)

  • J.H.L. Hansen et al. Robust estimation of speech in noisy backgrounds based on aspects of the auditory process. J. Acoust. Soc. Amer. (1995)
  • R.P. Hellman. Asymmetry of masking between noise and tone. Perception Psychophys. (1972)
  • J.D. Johnston. Transform coding of audio signal using perceptual noise criteria. IEEE J. Select. Areas Comm. (1988)