Improved noise suppression filter using self-adaptive estimator of probability of speech absence

doi:10.1016/S0165-1684(98)00230-8

Signal Processing

Volume 75, Issue 2, June 1999, Pages 151-159

Abstract

In this paper, two estimators of the probability of speech absence are derived using the common assumption that the Fourier coefficients of a frame of speech and noise samples are statistically independent Gaussian random variables (Ephraim and Malah, 1984; McAulay and Malpass, 1980). The estimators are obtained directly from the noisy speech itself. The first estimator is obtained by binary classification of the received spectral amplitude into a speech-present or speech-absent state. The second estimator is obtained by deriving the conditional probability of speech absence given the received spectral amplitude. Each of the time-adaptive estimators produces an estimate of the probability of speech absence for each spectral frequency. The estimated probability will be lower during the speech period and higher during the silence period. The estimated probability can be fed directly to any filter which requires such an estimate, e.g. the Ephraim and Malah noise suppressor (Ephraim and Malah, 1984), and the modified power subtraction method (Scalart and Vieira Filho, 1996), with significant improvements for various noise types.

Introduction

There are two forms of speech absence which can be used to improve the noise removal filter. The first form of speech absence is due to the speaker pausing in his speech resulting in significant portions of silence. The second form of speech absence is that although the speaker is talking, the speech energy is not present in all the frequency components. For some frequency components with insignificant energy, speech can be considered to be absent in those components.

Such knowledge of speech absence can be used to improve a speech enhancement filter. The first attempt to exploit the uncertainty of speech absence was made by McAulay and Malpass [2], who derived a filter based on a fixed probability of speech absence of 0.5. The Ephraim and Malah noise removal filter [1] adopted a more flexible approach in which each spectral frequency component can be assigned its own probability of speech absence, ranging from zero to one. However, that paper did not address how the probability of speech absence can be estimated, and for performance evaluation the probability of speech absence was set empirically to 0.2. Intuitively, we expect the probability of speech absence to be a function of time and frequency.

In this paper, we formulate two approaches which adaptively estimate the probabilities of speech absence in different frequency components from the noisy speech itself. In the first approach, the noisy spectral component is hard classified or binary classified into speech presence or speech absence. In the second approach, the noisy spectral component is soft classified or statistically classified as speech absence, e.g. for a certain spectral component, the probability of speech absence is 0.6. After the classification stage, the probability of speech absence is computed using a running average of the classification results.
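The running average described above can be sketched as a first-order recursive update applied to each frequency component. This is only an illustration: the recursive form, the function name `update_absence_probability` and the smoothing constant `alpha` are assumptions, not taken from the paper. A decision of 1 stands for speech absence and 0 for speech presence; soft decisions may take any value in between.

```python
def update_absence_probability(q_prev, decisions, alpha=0.9):
    """Update the per-bin probability of speech absence for one frame.

    q_prev    : previous estimates, one value per frequency bin
    decisions : this frame's classification results, one per bin
                (hard: 0 = speech present, 1 = speech absent;
                 soft: any value in [0, 1])
    alpha     : smoothing constant of the running average
                (hypothetical value, not specified in the source)
    """
    if len(q_prev) != len(decisions):
        raise ValueError("q_prev and decisions must have the same length")
    # Recursive running average: old estimate weighted by alpha,
    # new classification result weighted by (1 - alpha).
    return [alpha * q + (1.0 - alpha) * d for q, d in zip(q_prev, decisions)]


# Two bins, both starting from an uninformative 0.5.
q = [0.5, 0.5]
for _ in range(3):
    # Bin 0 is repeatedly classified as speech absent, bin 1 as speech present.
    q = update_absence_probability(q, [1, 0], alpha=0.5)
print(q)
```

With repeated "absent" decisions the estimate for a bin drifts towards one, and with repeated "present" decisions it drifts towards zero, which is the time- and frequency-dependent behaviour the text asks for.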

Both methods provide a much better estimate of speech absence which varies with time and frequency rather than a constant value. During the periods where speech is absent, the probability of speech absence will be close to one while during voiced speech the probability of speech absence will be close to zero at the pitch frequency component.

The probabilities obtained are then fed into a slightly modified form of the Ephraim and Malah filter, which takes into account the uncertainty of signal presence. The results show both an improvement in speech quality and better segmental SNR values. Similarly, the technique can be applied to other filters that require the probability of speech absence as an input, e.g. the modified power subtraction method using a priori SNR proposed in [3]. The results obtained there also show significant improvements.

Ephraim and Malah noise suppression filter

This section provides a brief description of the Ephraim and Malah noise suppression filter [1], which gives excellent results. Let the kth spectral magnitude of the speech signal, noise and noisy speech be denoted by $A_{k}$, $D_{k}$ and $R_{k}$, respectively. The probability of speech absence is denoted as $q_{k}$. The kth spectral output, $\hat{A}_{k}$, of the Ephraim and Malah noise suppression filter, taking into account the uncertainty of signal presence, is given by the equation $\hat{A}_{k} = G(q_{k})\, M(-0.5;1;-v_{k})\, R_{k}\, \frac{\sqrt{\pi v_{k}}}{2\gamma_{k}},$ where $v_{k} = \xi_{k} \ldots$

Modified power subtraction filter

This modified power subtraction filter is a speech enhancement filter proposed by Scalart and Vieira Filho [3] using the power subtraction technique together with the a priori SNR estimated by the decision-directed approach [1] in Eq. (8). Additional attenuation for silence periods can also be incorporated for better performance. Using the same notation as in the above section, the combined filter can be described as follows: $\hat{A}_{k} = G(q_{k}) \sqrt{\frac{\hat{\xi}_{k}}{\hat{\xi}_{k}+1}}\, R_{k}.$
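As a rough illustration of how such a filter could be computed for one frequency bin, the sketch below combines a decision-directed a priori SNR estimate in the style of [1] with a power subtraction gain. Several things here are assumptions rather than details from this excerpt: the function names, the weighting constant `alpha`, the square-root form of the gain (the usual power subtraction form), and the treatment of $G(q_{k})$ as a plain multiplicative factor `g_q` supplied by the caller.

```python
import math


def decision_directed_xi(prev_amp, noise_var, gamma, alpha=0.98):
    """Decision-directed estimate of the a priori SNR for one bin.

    prev_amp  : enhanced spectral amplitude of this bin in the previous frame
    noise_var : noise variance (lambda_d) of this bin
    gamma     : a posteriori SNR of this bin in the current frame
    alpha     : weighting constant (assumed value, commonly close to 1)
    """
    # Weighted sum of the SNR implied by the previous output and the
    # rectified instantaneous SNR of the current frame.
    instantaneous = max(gamma - 1.0, 0.0)
    return alpha * (prev_amp ** 2) / noise_var + (1.0 - alpha) * instantaneous


def power_subtraction_output(r, xi_hat, g_q=1.0):
    """Enhanced amplitude for one bin.

    r      : noisy spectral amplitude R_k
    xi_hat : estimated a priori SNR
    g_q    : extra attenuation G(q_k); its exact form is not given in this
             excerpt, so it is passed in as a number (1.0 = no attenuation)
    """
    gain = math.sqrt(xi_hat / (xi_hat + 1.0))  # assumed square-root form
    return g_q * gain * r


xi = decision_directed_xi(prev_amp=2.0, noise_var=1.0, gamma=5.0, alpha=0.5)
print(xi, power_subtraction_output(10.0, xi))
```

Passing `g_q` below 1.0 applies the additional attenuation for bins where speech is likely absent, while `g_q = 1.0` reduces the sketch to the plain gain.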

Hard decision estimator

The first method makes a hard decision, classifying a received noisy amplitude as one that contains speech or noise alone. The decision is binary, with 0 representing speech presence and 1 representing speech absence. Let the input be represented by two states, H₀ and H₁, where H₀: speech absence, H₁: speech presence. Using the Gaussian statistical model in [1,2], the conditional probability density function of receiving the noisy amplitude, $R_{k}$, given that speech is absent, is $P(R_{k}|H_{0}) = \frac{2R_{k}}{\lambda_{d}} \exp(-\gamma_{k}).$

Soft decision estimator

Unlike the hard decision method, which classifies the received amplitude into either speech presence or speech absence in a binary fashion, the soft decision method produces a value ranging from 0 to 1 that represents the probability that the received amplitude comes from a speech absence state. Using the conditional probabilities, the probability that speech is absent can be obtained from Bayes' theorem: $P(H_{0}|R_{k}) = \frac{P(R_{k}|H_{0})P(H_{0})}{P(R_{k}|H_{0})P(H_{0}) + P(R_{k}|H_{1})P(H_{1})}.$
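A minimal numerical sketch of this Bayes computation for a single bin follows. The density under H₁ is not shown in this excerpt, so the sketch assumes, purely for illustration, that the amplitude is Rayleigh distributed under both hypotheses, with variance `noise_var` under H₀ and `noise_var + speech_var` under H₁. The function name, the parameter names and the default prior `p_h0` are likewise hypothetical. The posterior is evaluated after dividing numerator and denominator by $P(R_{k}|H_{0})$, which is algebraically the same formula but avoids a 0/0 when both densities underflow.

```python
import math


def prob_speech_absent(r, noise_var, speech_var, p_h0=0.5):
    """Posterior probability of speech absence given the amplitude r.

    r          : received noisy spectral amplitude
    noise_var  : noise variance of the bin (used under H0)
    speech_var : speech variance of the bin (assumed known, used under H1)
    p_h0       : prior probability of speech absence (hypothetical default)
    """
    total_var = noise_var + speech_var
    # Log of the likelihood ratio P(r|H1) / P(r|H0) for two Rayleigh
    # densities, without the variance prefactor.
    exponent = r * r * (1.0 / noise_var - 1.0 / total_var)
    if exponent > 700.0:
        # exp() would overflow; the posterior is numerically zero here.
        return 0.0
    ratio = (noise_var / total_var) * math.exp(exponent)
    return p_h0 / (p_h0 + (1.0 - p_h0) * ratio)


for amplitude in (0.0, 1.0, 2.0, 4.0):
    print(amplitude, prob_speech_absent(amplitude, noise_var=1.0, speech_var=3.0))
```

Under these assumptions the posterior falls as the received amplitude grows relative to the noise level, so large amplitudes are attributed to speech and small ones to noise alone.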

However, values of P(H₀) and P(H₁) are

Results and discussions

A total of 10 different utterances, taken from the TIMIT database, were used in our evaluation. Half of the utterances were from male speakers and the rest were from female speakers. The speech data were sampled at 8 kHz and quantized linearly using 16 bits. For the additive noise, three different noise types were used: Gaussian white noise, recorded fan noise and the F16 (fighter jet) noise from the NOISEX database.

The noisy speech data were divided into frames, each of

Conclusions

This paper proposes two methods of estimating the probability of speech absence adaptively from the noisy speech itself. It shows that by using these two estimators of speech absence, the performance of the Ephraim and Malah noise filter and the power subtraction filter can be significantly improved. The technique should also be applicable in other filters incorporating the probability of speech absence.

References (6)

[1] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,...
[2] R.J. McAulay, M.L. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Trans. Acoust....
[3] P. Scalart, J. Vieira Filho, Speech enhancement based on a priori signal to noise estimation, in: Proc. ICASSP, Vol....
