Improved noise suppression filter using self-adaptive estimator of probability of speech absence
Introduction
Two forms of speech absence can be exploited to improve a noise removal filter. The first arises when the speaker pauses, producing significant stretches of silence. The second arises while the speaker is talking: speech energy is not present in all frequency components, and in components with insignificant energy, speech can be considered absent.
Such knowledge of speech absence can be used to improve a speech enhancement filter. The first attempt at utilizing the uncertainty of speech absence was made by McAulay and Malpass [2], who derived a filter based on a fixed probability of speech absence of 0.5. The Ephraim and Malah noise removal filter [1] adopted a more flexible approach in which each spectral component can be assigned its own probability of speech absence, ranging from zero to one. However, that paper did not address how the probability of speech absence can be estimated, and for performance evaluation the probability was set empirically to 0.2. Intuitively, we expect the probability of speech absence to be a function of time and frequency.
In this paper, we formulate two approaches which adaptively estimate the probabilities of speech absence in different frequency components from the noisy speech itself. In the first approach, each noisy spectral component is hard (binary) classified as either speech presence or speech absence. In the second approach, each noisy spectral component is soft (statistically) classified, i.e. it is assigned a probability of speech absence, e.g. 0.6 for a certain spectral component. After the classification stage, the probability of speech absence is computed using a running average of the classification results.
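The two classification schemes and the running average can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes Rayleigh-distributed amplitudes under both hypotheses, a maximum-likelihood threshold for the hard decision, and a first-order recursive average. The function names and the parameters `xi` (a priori SNR), `prior_absence` and `beta` (smoothing constant) are hypothetical choices made for this example.

```python
import numpy as np

def likelihood_ratio(R, noise_var, xi):
    """p(R | speech present) / p(R | speech absent), assuming Rayleigh
    amplitudes with variance noise_var (absent) or noise_var*(1+xi) (present)."""
    R = np.asarray(R, dtype=float)
    exponent = (R ** 2 / noise_var) * xi / (1.0 + xi)
    # Cap the exponent so np.exp cannot overflow for very large amplitudes.
    return np.exp(np.minimum(exponent, 700.0)) / (1.0 + xi)

def hard_decision(R, noise_var, xi):
    """Binary classification: 1.0 = speech absent, 0.0 = speech present.
    The threshold of 1 on the likelihood ratio is an assumption."""
    return (likelihood_ratio(R, noise_var, xi) <= 1.0).astype(float)

def soft_decision(R, noise_var, xi, prior_absence=0.5):
    """Posterior probability of speech absence via Bayes' theorem."""
    lam = likelihood_ratio(R, noise_var, xi)
    return 1.0 / (1.0 + (1.0 - prior_absence) / prior_absence * lam)

def update_q(q_prev, decision, beta=0.9):
    """Running average of classification results, per frequency component."""
    return beta * q_prev + (1.0 - beta) * decision

# Two spectral components: one weak, one strong (noise variance 1, xi = 4).
R = np.array([0.1, 3.0])
print(hard_decision(R, 1.0, 4.0))
print(soft_decision(R, 1.0, 4.0))
```

Feeding either decision into `update_q` frame after frame yields a probability of speech absence that tracks both time and frequency, since `q_prev` and `decision` are arrays indexed by spectral component.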
Both methods provide a better estimate of speech absence, one that varies with time and frequency rather than remaining constant. During periods where speech is absent, the probability of speech absence will be close to one, while during voiced speech it will be close to zero at the pitch frequency component.
The probabilities obtained are then fed into a slightly modified form of the Ephraim and Malah filter which takes into account the uncertainty of signal presence. The results show an improvement in both speech quality and segmental SNR. The technique can likewise be applied to other filters which take the probability of speech absence as an input, e.g. the modified power subtraction method using the a priori SNR proposed in [3]. The results obtained for that filter also show significant improvements.
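For readers unfamiliar with the metric, segmental SNR is commonly computed as the average of per-frame SNR values in dB. The sketch below follows that common definition; the frame length and the clamping range of -10 dB to 35 dB are assumptions for illustration and are not taken from this paper.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi].
    frame_len, lo and hi are illustrative assumptions."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    eps = 1e-12  # avoids division by zero and log of zero
    values = []
    for m in range(n_frames):
        s = clean[m * frame_len:(m + 1) * frame_len]
        err = s - enhanced[m * frame_len:(m + 1) * frame_len]
        snr_db = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps))
        values.append(np.clip(snr_db, lo, hi))
    return float(np.mean(values))

# A signal reproduced with a 10% amplitude error gives 20 dB in every frame.
clean = np.ones(512)
print(segmental_snr(clean, 0.9 * clean))
```

Clamping keeps silent or perfectly reconstructed frames from dominating the average.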
Section snippets
Ephraim and Malah noise suppression filter
This section provides a brief description of the Ephraim and Malah noise suppression filter [1], which gives excellent results. Let the kth spectral magnitude of the speech signal, noise and noisy speech be denoted by and Rk, respectively. The probability of speech absence is denoted as qk. The kth spectral output, , of the Ephraim and Malah noise suppression filter, taking into account the uncertainty of signal presence, is given by the equation where
Modified power subtraction filter
This modified power subtraction filter is a speech enhancement filter proposed by Scalart and Vieira Filho [3]. It uses the power subtraction technique together with the a priori SNR estimated by the decision-directed approach [1] in Eq. (8). Additional attenuation for silence periods can also be incorporated for better performance. Using the same notation as in the above section, the combined filter can be described as follows:
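As a rough illustration of the two ingredients named above, the sketch below combines the decision-directed a priori SNR estimate of [1] with the power subtraction gain expressed in terms of that SNR. It is not the combined filter itself: the additional silence attenuation is omitted, and the function names and the weighting factor `alpha=0.98` are assumptions made for this example.

```python
import numpy as np

def decision_directed_xi(prev_amp, noisy_amp, noise_var, alpha=0.98):
    """Decision-directed a priori SNR estimate for one frame.
    prev_amp: enhanced spectral amplitudes from the previous frame.
    noisy_amp: noisy spectral amplitudes of the current frame.
    alpha: weighting factor (an assumed, commonly used value)."""
    prev_amp = np.asarray(prev_amp, dtype=float)
    noisy_amp = np.asarray(noisy_amp, dtype=float)
    gamma = noisy_amp ** 2 / noise_var  # a posteriori SNR
    return (alpha * prev_amp ** 2 / noise_var
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))

def power_subtraction_gain(xi):
    """Power subtraction gain written in terms of the a priori SNR."""
    xi = np.asarray(xi, dtype=float)
    return np.sqrt(xi / (1.0 + xi))

xi = decision_directed_xi(prev_amp=2.0, noisy_amp=3.0, noise_var=1.0)
gain = power_subtraction_gain(xi)
print(float(xi), float(gain))
```

The enhanced amplitude for the current frame is the gain multiplied by the noisy amplitude, and it becomes `prev_amp` for the next frame.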
Hard decision estimator
The first method makes a hard decision, classifying a received noisy amplitude as either containing speech or consisting of noise alone. The decision is binary, with 0 representing speech presence and 1 representing speech absence. Let the input be represented by two states, H0 and H1, where H0: speech absence, H1: speech presence. Using the Gaussian statistical model in [1], [2], the conditional probability density function of receiving the noisy amplitude, Rk, given that speech is absent, is
Soft decision estimator
Unlike the hard decision method, which classifies the received amplitude as either speech presence or speech absence in a binary fashion, the soft decision method produces a value between 0 and 1 representing the probability that the received amplitude comes from the speech absence state. Using the conditional probabilities, the probability that speech is absent can be obtained from Bayes' theorem:
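For reference, the generic two-hypothesis form of Bayes' theorem is written out below. The notation $p(R_k \mid H_i)$ for the conditional densities and $P(H_i)$ for the prior probabilities is assumed here and may differ from the paper's own equation.

```latex
\[
P(H_0 \mid R_k) =
  \frac{p(R_k \mid H_0)\, P(H_0)}
       {p(R_k \mid H_0)\, P(H_0) + p(R_k \mid H_1)\, P(H_1)}
\]
```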
However, values of P(H0) and P(H1) are
Results and discussions
A total of 10 different utterances, taken from the TIMIT database, were used in our evaluation. Half of the utterances were from male speakers and the rest were from female speakers. The speech data were sampled at 8 kHz and quantized linearly using 16 bits. Three different types of additive noise were used: Gaussian white noise, recorded fan noise and the F16 (fighter jet) noise from the NOISEX database.
The noisy speech data were divided into frames, each of
Conclusions
This paper proposes two methods of estimating the probability of speech absence adaptively from the noisy speech itself. It shows that by using these two estimators of speech absence, the performance of the Ephraim and Malah noise filter and the power subtraction filter can be significantly improved. The technique should also be applicable in other filters incorporating the probability of speech absence.
References (6)
- Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,...
- R.J. McAulay, M.L. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Trans. Acoust....
- P. Scalart, J. Vieira Filho, Speech enhancement based on a priori signal to noise estimation, in: Proc. ICASSP, Vol....