Masking-based β-order MMSE speech enhancement

doi:10.1016/j.specom.2005.05.012

Speech Communication

Volume 48, Issue 1, January 2006, Pages 57-70

https://doi.org/10.1016/j.specom.2005.05.012

Abstract

This paper considers an effective approach for attenuating acoustic noise and mitigating its effect on a speech signal. In this approach, the masking effect of human auditory perception is incorporated into an adaptive β-order minimum mean-square error (MMSE) speech enhancement algorithm. The relationship between the value of β and the noise-masking threshold is introduced and analyzed. The algorithm is based on a criterion by which inaudible noise may be masked rather than suppressed, which reduces the chance of distortion being introduced into the speech by the enhancement process. To obtain an optimal estimate of the masking threshold, a modified way of measuring the relative threshold offset is described. The performance of the proposed masking-based β-order MMSE method has been evaluated through objective speech distortion measurement, spectrogram inspection and subjective listening tests. It is shown that the proposed method achieves greater noise reduction and better spectral estimation than the conventional adaptive β-order MMSE method and the conventional over-subtraction noise-masking method.

Introduction

Weak speech signals such as nasals, fricatives and affricates are often seriously contaminated by noise, which reduces speech intelligibility. So far, speech enhancement research has mainly addressed the problem in which the speech signal is degraded by uncorrelated additive noise and only the noisy speech signal is available. Many approaches in the time and frequency domains have been investigated to date. In terms of the methodology adopted, the most popular methods for speech enhancement can be broadly categorized as (i) spectral amplitude estimation, such as Wiener filtering (Lim and Oppenheim, 1979), spectral subtraction (McAulay and Malpass, 1980), Ephraim and Malah (E-M) MMSE (Ephraim and Malah, 1984) and log spectral amplitude (LSA) estimation (Ephraim and Malah, 1985); (ii) speech production model-based methods (Gannot et al., 1998); (iii) enhancement based on perceptual hearing criteria (Virag, 1999; You et al., 2004b; Hansen and Nandkumar, 1995; Tsoukalas et al., 1997); (iv) text-directed non-real-time speech enhancement (Hansen and Pellom, 1997); (v) hidden Markov model (HMM) methods (Ephraim et al., 1989); and (vi) eigen-decomposition subspace methods (Ephraim and Van Trees, 1995).

One of the main approaches in speech enhancement is to obtain the best possible estimate of the short-time spectral amplitude (STSA) of a speech signal from the given noisy speech. Most STSA estimators do not estimate the phase of the speech signal, as it has been well demonstrated that the human ear is insensitive to it (Lim and Wang, 1982; Vary, 1985). Many existing speech enhancement methods exploit the properties of the human auditory system. Their main aim is to find an optimal trade-off between noise suppression, speech distortion and residual tonal noise level (Virag, 1999; Hansen and Nandkumar, 1995; Tsoukalas et al., 1993). In addition, most of them use the a posteriori SNR to achieve noise suppression. It is assumed that human listeners are unable to perceive additive noise so long as it remains below the masking threshold.

In (Virag, 1999; Tsoukalas et al., 1997), masking properties are incorporated into generalized spectral subtraction, which can be expressed as $\hat{S}_{k} = \left[ |X_{k}|^{\beta} - E\{|N_{k}|^{\beta}\} \right]^{1/\beta}$ for some constant β, where $k$ is the frequency bin index and $S_{k}$, $X_{k}$ and $N_{k}$ are the Fourier transforms of a windowed segment of speech, noisy speech and noise, respectively. In essence, the generalized spectral subtraction method hinges on replacing a spectral amplitude or power term with its expected value. For example, when β = 1, it is an amplitude spectral subtraction which directly uses $E\{|N_{k}|\}$ to replace $|N_{k}|$; when β = 2, it is a power spectral subtraction which not only directly uses $E\{|N_{k}|^{2}\}$ to replace $|N_{k}|^{2}$, but also uses $E\{S_{k} N_{k}^{*}\}$ and $E\{N_{k} S_{k}^{*}\}$ to replace $S_{k} N_{k}^{*}$ and $N_{k} S_{k}^{*}$, where $N_{k}^{*}$ and $S_{k}^{*}$ denote the complex conjugates of $N_{k}$ and $S_{k}$. The expectations of $S_{k} N_{k}^{*}$ and $N_{k} S_{k}^{*}$ are equal to zero due to the statistical independence and zero-mean assumptions (Lim and Oppenheim, 1979, p. 11). The β = 2 case is also equivalent to taking the square root of the maximum likelihood estimator of each signal spectral component variance based on a complex Gaussian model (McAulay and Malpass, 1980, p. 138). Consequently, the mathematical model of the generalized spectral subtraction method cannot yield an optimal estimate of the speech signal, and the method does not improve the intelligibility of the processed speech (Lim, 1978). The generalized spectral subtraction approach may therefore be useful for applications where a perceived reduction in noise, without any significant drop in speech intelligibility, is desired (Lim, 1978, p. 472). A further drawback is the presence of obvious and annoying musical tones in the processed speech, caused by the imperfect model of generalized spectral subtraction.
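As a rough illustration of the generalized spectral subtraction rule described above, the following Python sketch applies $\left[ |X_{k}|^{\beta} - E\{|N_{k}|^{\beta}\} \right]^{1/\beta}$ to a handful of frequency bins. The function name, the argument names and the sample values are hypothetical. Clamping negative differences to zero is an assumption added here so that the root is always defined; the text does not specify how such bins are handled.

```python
def generalized_spectral_subtraction(noisy_mag, noise_moment, beta):
    """Per-bin estimate (|X_k|^beta - E{|N_k|^beta})^(1/beta).

    noisy_mag    -- |X_k| for each frequency bin k
    noise_moment -- an estimate of E{|N_k|^beta} for each bin
    beta         -- exponent (1: amplitude subtraction, 2: power subtraction)
    """
    if beta <= 0:
        raise ValueError("beta must be positive in this sketch")
    estimate = []
    for x, n in zip(noisy_mag, noise_moment):
        diff = x ** beta - n
        # Assumption: negative differences are clamped to zero
        # (half-wave rectification); this step is not taken from the text.
        estimate.append(max(diff, 0.0) ** (1.0 / beta))
    return estimate


# Hypothetical magnitudes for three bins of one frame.
noisy = [3.0, 1.0, 2.0]
amplitude_case = generalized_spectral_subtraction(noisy, [1.0, 1.5, 0.5], beta=1)
power_case = generalized_spectral_subtraction(noisy, [4.0, 4.0, 0.0], beta=2)
print(amplitude_case)
print(power_case)
```

In the second bin the expected noise term exceeds the noisy observation, so the estimate is forced to zero. Isolated bins that survive or vanish in this way from frame to frame are one common explanation for the musical tones mentioned above.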

In contrast to the many masking-based speech enhancement methods which are based on generalized spectral subtraction, we propose an enhancement method that incorporates the masking properties into β-order MMSE. The β-order MMSE speech enhancement method (You et al., 2003; You et al., 2005) is an optimal estimation method. It is derived by minimizing the mean-square error cost function $J = E\{(A_{k}^{\beta} - \hat{A}_{k}^{\beta})^{2}\}$ based on the complex Gaussian distribution model and the statistical independence assumption. Here, $A_{k}$ and $\hat{A}_{k}$ are respectively the original and the estimated spectral amplitude of the speech signal at frequency bin $k$. E-M MMSE and E-M LSA can be seen as special cases of β-order MMSE, corresponding to β equal to one and β approaching zero, respectively. In (Cappé, 1994), the elimination of the musical noise phenomenon by the E-M MMSE method is analyzed; it is shown that the E-M MMSE noise suppressor is effective if a nonlinear smoothing procedure is used to obtain more consistent estimates of the a priori and a posteriori SNRs, which control the gain function. The principle of musical noise elimination in (Cappé, 1994) can also be applied to adaptive β-order MMSE (You et al., 2005).
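To make the role of β in the cost function $J$ concrete, the sketch below replaces the expectation with an average over a few amplitude samples. Minimizing $J$ over the estimate then gives $(\text{mean of } A^{\beta})^{1/\beta}$. The samples are hypothetical stand-ins for draws of $A_{k}$ given the noisy observation; the actual estimator evaluates this expectation in closed form under the complex Gaussian model, which is not reproduced here. The function names are illustrative only.

```python
def beta_order_estimate(amplitudes, beta):
    """Value c that minimizes mean((a**beta - c**beta)**2) over the samples."""
    if beta == 0:
        raise ValueError("beta must be non-zero; use a small value for the limit")
    mean_power = sum(a ** beta for a in amplitudes) / len(amplitudes)
    return mean_power ** (1.0 / beta)


def beta_order_cost(amplitudes, estimate, beta):
    """Sample version of J = E{(A^beta - Ahat^beta)^2}."""
    total = sum((a ** beta - estimate ** beta) ** 2 for a in amplitudes)
    return total / len(amplitudes)


# Hypothetical amplitude samples (assumption, for illustration only).
samples = [1.0, 4.0]
print(beta_order_estimate(samples, 1))      # arithmetic mean of the samples
print(beta_order_estimate(samples, 1e-6))   # close to the geometric mean
print(beta_order_estimate(samples, 2))      # root mean square
```

With β = 1 the estimate is the plain mean of the amplitudes, which is the MMSE amplitude criterion. As β approaches zero it tends to the geometric mean, which is the minimizer of the mean-square error of the log amplitude. This mirrors the statement that E-M MMSE and E-M LSA are special cases of β-order MMSE.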

In this paper, β-order MMSE is modified so as to incorporate the noise-masking threshold (You et al., 2004a). Specifically, the value of β is made to vary according to the frame SNR and masking threshold. Simulation results indicate that the masking-based β-order MMSE estimator outperforms many existing spectral suppression methods in terms of both objective and subjective measures. The remainder of this paper is organized as follows. The masking-based β-order MMSE speech enhancement method is introduced in Section 2. The performance of the masking-based β-order MMSE estimation is investigated in Section 3. Section 4 gives the conclusion.

Section snippets

Masking-based β-order MMSE speech enhancement

Human auditory modelling has been widely used in acoustic signal processing, especially in audio and speech coding. This model is based on the masking phenomenon and related to the concept of critical band analysis, which is a central analysis mechanism of the inner ear. The masking properties are modelled by the noise-masking threshold. Masking is present because the auditory system is incapable of distinguishing two signals which are close to one another in spectral spacing. Masking effects

Performance evaluation

To evaluate the performance of the proposed masking-based β-order MMSE speech enhancement method, five different types of noise taken from the NOISEX-92 database (Varga and Steeneken, 1993) are used in our simulation experiments. They are white Gaussian noise, interior Volvo car noise, F16 cockpit noise, Babble noise (100 people speaking in a canteen) and Leopard (military vehicle) noise. A total of 30 phonetically balanced speech utterances from the TIMIT database (Garofolo, 1988) are used in

Conclusion

The focus of our study is to develop an optimal speech enhancement algorithm that would maximize noise reduction while minimizing speech distortion. In this paper, we propose an adaptive β-order STSA-MMSE speech enhancement method which incorporates the perceptual properties of the human auditory system. The proposed method leads to an improvement in performance over the conventional adaptive β-order MMSE method. The improvement is achieved due to the effectiveness of adapting the β value

References (27)

J.H.L. Hansen et al. Text-directed speech enhancement using phoneme classification and feature map constrained vector quantization. Speech Comm. (1997)

A. Varga et al. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm. (1993)

P. Vary. Noise suppression by spectral magnitude estimation-mechanism and theoretical limits. Signal Process. (1985)

O. Cappé. Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor. IEEE Trans. Speech Audio Process. (1994)

Y. Ephraim et al. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. (1984)

Y. Ephraim et al. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. (1985)

Y. Ephraim et al. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. (1995)

Y. Ephraim et al. On the application of hidden Markov models for enhancing noisy speech. IEEE Trans. Acoust. Speech Signal Process. (1989)

S. Gannot et al. Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Trans. Speech Audio Process. (1998)

J.S. Garofolo. Getting Started with the DARPA TIMIT CD-ROM: An Acoustic Phonetic Continuous Speech Database. (1988)

J.H.L. Hansen et al. Robust estimation of speech in noisy backgrounds based on aspects of the auditory process. J. Acoust. Soc. Amer. (1995)

R.P. Hellman. Asymmetry of masking between noise and tone. Perception Psychophys. (1972)

J.D. Johnston. Transform coding of audio signal using perceptual noise criteria. IEEE J. Select. Areas Comm. (1988)
